Studies in Big Data Volume 145
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data, quickly and with high quality. The intent is to cover the theory, research, development, and applications of Big Data as embedded in the fields of engineering, computer science, physics, economics, and the life sciences. The books of the series address the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources, from sensors and other physical instruments to simulations, crowd sourcing, social networks, and other internet transactions such as emails and video click streams. The series contains monographs, lecture notes, and edited volumes in Big Data spanning the areas of computational intelligence, including neural networks, evolutionary computation, soft computing, fuzzy systems, artificial intelligence, data mining, modern statistics, and operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the worldwide distribution, which enable wide and rapid dissemination of research output. The books of this series are reviewed in a single-blind peer review process. Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH. All books published in the series are submitted for consideration in Web of Science.
Pushpa Singh · Asha Rani Mishra · Payal Garg Editors
Data Analytics and Machine Learning Navigating the Big Data Landscape
Editors Pushpa Singh GL Bajaj Institute of Technology & Management Greater Noida, Uttar Pradesh, India
Asha Rani Mishra GL Bajaj Institute of Technology & Management Greater Noida, Uttar Pradesh, India
Payal Garg GL Bajaj Institute of Technology & Management Greater Noida, Uttar Pradesh, India
ISSN 2197-6503 ISSN 2197-6511 (electronic) Studies in Big Data ISBN 978-981-97-0447-7 ISBN 978-981-97-0448-4 (eBook) https://doi.org/10.1007/978-981-97-0448-4 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore Paper in this product is recyclable.
Preface
About This Book

In today’s data-driven world, organizations are challenged to extract valuable insights from vast amounts of complex data to gain a competitive advantage. The integration of data analytics and machine learning has become the keystone of innovation, unlocking the insights, trends, and potential hidden in data and driving transformation across domains. This book, “Data Analytics and Machine Learning—Navigating the Big Data Landscape,” is a comprehensive exploration of the synergies between Data Analytics and Machine Learning and a roadmap for a new industrial revolution.

The book covers the fundamentals of Data Analytics, Big Data, and Machine Learning and offers a holistic perspective that helps readers navigate the intricacies of this rapidly evolving field. It takes a broad view of machine learning techniques in big data analytics, the challenges of deep learning models, data privacy and ethics in data analytics, future trends in data analytics and machine learning, and the practical implementation of machine learning techniques and data analytics using R. It also explores how the big data explosion, the power of analytics, and the machine learning revolution open new prospects and opportunities in a dynamic, data-rich landscape.

Finally, the book highlights future research directions in Data Analytics, Big Data, and Machine Learning, examining emerging trends, challenges, and opportunities, including interdisciplinary approaches and the handling and analysis of real-time and streaming data. It offers a broad review of the existing literature, case studies, and valuable perspectives on the evolving nature of Data Analytics and Machine Learning and their implications for decision support systems that inform managerial decisions and transform the business environment.
Intended Audience

Students and Academics: Students pursuing degrees in fields like data science, computer science, business analytics, or related disciplines, as well as academics conducting research in these areas, form a significant primary audience for books on Data Analytics, Big Data, and Machine Learning.

Data Analysts and Data Scientists: These professionals are directly involved in working with data, analyzing it, and deriving insights. They seek books that provide in-depth knowledge, practical techniques, and advanced concepts related to data analytics, big data, and machine learning.

Business and Data Professionals: Managers, executives, and decision-makers who are responsible for making data-driven decisions in their organizations often have a primary interest in understanding how Data Analytics, Big Data, and Machine Learning can be leveraged to gain a competitive advantage.
How Is This Book Organized?

This book has sixteen chapters covering big data analytics, machine learning, and deep learning. The first two chapters provide an introductory discussion of Data Analytics, Big Data, and Machine Learning, and of the data analytics life cycle. Chapters 3 and 4 explore the building of predictive models and their application in agriculture. Chapter 5 gives a brief assessment of stream architecture and the analysis of big data. Chapter 6 leverages data analytics and a deep learning framework for image super-resolution techniques, while Chapter 7 harnesses data analytics and time series forecasting for price prediction. Because “R” is a powerful statistical programming language that is widely used for statistical analysis, data visualization, and machine learning, Chapter 8 emphasizes the practical implementation of machine learning techniques and data analytics using R. Deep learning models excel at feature learning, enabling the automatic extraction of valuable information from huge data sets; hence, Chapter 9 presents deep learning techniques in big data analytics. Chapter 10 deals with how organizations and their professionals must work meticulously to handle data ethically and ensure its privacy. Chapters 11 and 12 present modern, real-world applications of data analytics, machine learning, and big data, drawing on projects, case studies, and real-world scenarios with both positive and negative impacts on individuals and society. Going one step further, Chapter 13 unlocks insights by exploring data analytics and AI tool performance across industries. Lung nodule segmentation using machine learning and deep learning is discussed in Chapter 14, which highlights the importance of deep learning in the healthcare industry to support health analytics. Chapter 15 describes the
convergence of Data Analytics, Big Data, and Machine Learning, along with its applications, challenges, and future directions. Finally, Chapter 16 discusses how the integration of Data Analytics, Machine Learning, and Big Data can transform a business through Big Data Analytics and Machine Learning.

Greater Noida, India
Pushpa Singh Asha Rani Mishra Payal Garg
Contents
Introduction to Data Analytics, Big Data, and Machine Learning . . . . . . . . . . . 1
Youddha Beer Singh, Aditya Dev Mishra, Mayank Dixit, and Atul Srivastava

Fundamentals of Data Analytics and Lifecycle . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Ritu Sharma and Payal Garg

Building Predictive Models with Machine Learning . . . . . . . . . . . . . . . . . . . . . . 39
Ruchi Gupta, Anupama Sharma, and Tanweer Alam

Predictive Algorithms for Smart Agriculture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Rashmi Sharma, Charu Pawar, Pranjali Sharma, and Ashish Malik

Stream Data Model and Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Shahina Anjum, Sunil Kumar Yadav, and Seema Yadav

Leveraging Data Analytics and a Deep Learning Framework for Advancements in Image Super-Resolution Techniques: From Classic Interpolation to Cutting-Edge Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
Soumya Ranjan Mishra, Hitesh Mohapatra, and Sandeep Saxena

Applying Data Analytics and Time Series Forecasting for Thorough Ethereum Price Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Asha Rani Mishra, Rajat Kumar Rathore, and Sansar Singh Chauhan

Practical Implementation of Machine Learning Techniques and Data Analytics Using R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
Neha Chandela, Kamlesh Kumar Raghuwanshi, and Himani Tyagi

Deep Learning Techniques in Big Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . 171
Ajay Kumar Badhan, Abhishek Bhattacherjee, and Rita Roy

Data Privacy and Ethics in Data Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Rajasegar R. S., Gouthaman P., Vijayakumar Ponnusamy, Arivazhagan N., and Nallarasan V.

Modern Real-World Applications Using Data Analytics and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Vijayakumar Ponnusamy, Nallarasan V., Rajasegar R. S., Arivazhagan N., and Gouthaman P.

Real-World Applications of Data Analytics, Big Data, and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
Prince Shiva Chaudhary, Mohit R. Khurana, and Mukund Ayalasomayajula

Unlocking Insights: Exploring Data Analytics and AI Tool Performance Across Industries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Hitesh Mohapatra and Soumya Ranjan Mishra

Lung Nodule Segmentation Using Machine Learning and Deep Learning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
Swati Chauhan, Nidhi Malik, and Rekha Vig

Convergence of Data Analytics, Big Data, and Machine Learning: Applications, Challenges, and Future Direction . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Abhishek Bhattacherjee and Ajay Kumar Badhan

Business Transformation Using Big Data Analytics and Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
Parijata Majumdar and Sanjoy Mitra
Contributors
Tanweer Alam Department of Computer and Information Systems, Islamic University of Madinah, Madinah, Saudi Arabia Shahina Anjum Department of CSE, IEC College of Engineering & Technology, Greater Noida, Uttar Pradesh, India Arivazhagan N. Department of Computational Intelligence, SRM Institute of Science and Technology, Kattankulathur, Chennai, India Mukund Ayalasomayajula Department of Materials Science and Engineering, Cornell University, Ithaca, NY, USA Ajay Kumar Badhan Department of Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India Abhishek Bhattacherjee Department of Computer Science and Engineering, Lovely Professional University, Phagwara, Punjab, India Neha Chandela Computer Science and Engineering, Krishna Engineering College, Uttar Pradesh, Ghaziabad, India Prince Shiva Chaudhary Department of Data Science, Worcester Polytechnic Institute, Worcester, MA, USA Sansar Singh Chauhan Department of Computer Science, GL Bajaj Institute of Technology and Management, Greater Noida, India Swati Chauhan The NorthCap University, Gurugram, Haryana, India Mayank Dixit Department of Computer Science and Engineering, Galgotia College of Engineering and Technology, Greater Noida, UP, India Payal Garg Department of Computer Science and Engineering, GL Bajaj Institute of Technology and Management, Greater Noida, India Gouthaman P. Department of Networking and Communications, SRM Institute of Science and Technology, Kattankulathur, Chennai, India
Ruchi Gupta Department of Information Technology, Ajay Kumar Garg Engineering College, Ghaziabad, India Mohit R. Khurana Department of Materials Science and Engineering, Cornell University, Ithaca, NY, USA Parijata Majumdar Department of Computer Science and Engineering, Techno College of Engineering, Agartala, Tripura, India Ashish Malik Department of Mechanical Engineering, Axis Institute of Technology & Management, Kanpur, India Nidhi Malik The NorthCap University, Gurugram, Haryana, India Aditya Dev Mishra Department of Computer Science and Engineering, Galgotia College of Engineering and Technology, Greater Noida, UP, India Asha Rani Mishra Department of Computer Science, GL Bajaj Institute of Technology and Management, Greater Noida, India Soumya Ranjan Mishra School of Computer Engineering, KIIT (Deemed to Be) University, Bhubaneswar, Odisha, India Sanjoy Mitra Department of Computer Science and Engineering, Tripura Institute of Technology, Agartala, Tripura, India Hitesh Mohapatra School of Computer Engineering, KIIT (Deemed to Be) University, Bhubaneswar, Odisha, India Nallarasan V. Department of Networking and Communications, SRM Institute of Science and Technology, Kattankulathur, Chennai, India Charu Pawar Department of Electronics, Netaji Subhash University of Technology, Delhi, India Kamlesh Kumar Raghuwanshi Computer Science Department, Ramanujan College, Delhi University, New Delhi, India Rajasegar R. S. IT Industry, Cyber Security, County Louth, Ireland Rajat Kumar Rathore Department of Computer Science, GL Bajaj Institute of Technology and Management, Greater Noida, India Rita Roy Department of Computer Science and Engineering, Gitam Institute of Technology (Deemed-to-Be-University), Visakhapatnam, Andhra Pradesh, India Sandeep Saxena Greater Noida Institute of Technology, Greater Noida, India Anupama Sharma Department of Information Technology, Ajay Kumar Garg Engineering College, Ghaziabad, India Pranjali Sharma Department of Mechanical Engineering, Motilal Nehru National Institute of Technology, Prayagraj, India
Rashmi Sharma Department of Information Technology, Ajay Kumar Garg Engineering College, Ghaziabad, India Ritu Sharma Department of Computer Science and Engineering, Ajay Kumar Garg Engineering College, Ghaziabad, India Youddha Beer Singh Department of Computer Science and Engineering, Galgotia College of Engineering and Technology, Greater Noida, UP, India Atul Srivastava Amity School of Engineering and Technology, AUUP, Lucknow, India Himani Tyagi University School of Automation and Robotics, GGSIPU, New Delhi, India Rekha Vig Amity University, Kolkata, West Bengal, India Vijayakumar Ponnusamy Department of Electronics and Communications, SRM Institute of Science and Technology, Kattankulathur, Chennai, India Seema Yadav Department of MBA, Accurate Institute of Management and Technology, Greater Noida, Uttar Pradesh, India Sunil Kumar Yadav Department of CSE, IEC College of Engineering & Technology, Greater Noida, Uttar Pradesh, India
Introduction to Data Analytics, Big Data, and Machine Learning Youddha Beer Singh, Aditya Dev Mishra, Mayank Dixit, and Atul Srivastava
Abstract Data has become the main driver of innovation, decision-making, and change across sectors and societies in the modern era. This chapter provides a thorough introduction to the dynamic trinity of Data Analytics, Big Data, and Machine Learning, revealing their profound significance, intricate relationships, and transformational abilities. Data analytics is the fundamental layer of data processing: data must be carefully examined, cleaned, transformed, and modelled in order to reveal patterns, trends, and insightful information. Big data sparks a data-driven revolution: in our highly connected world, data is produced in enormous volume, diversity, velocity, and authenticity. The third pillar, machine learning, uses data-driven algorithms to enable automated prediction and decision-making. This chapter explores the key methods and tools needed to fully utilise the power of data analytics and also discusses the technologies used in big data management, processing, and insight extraction. A foundation is thus set for the thorough investigation of these interconnected realms in the chapters that follow. Data analytics, big data, and machine learning are not distinct ideas; rather, they are woven into the fabric of modern innovation and technology. This chapter serves as the beginning of this captivating journey, providing a solid understanding of, and insight into, the enormous possibilities of data-driven insights and wise decision-making.
Y. B. Singh (B) · A. D. Mishra · M. Dixit Department of Computer Science and Engineering, Galgotia College of Engineering and Technology, Greater Noida, UP, India e-mail: [email protected] M. Dixit e-mail: [email protected] A. Srivastava Amity School of Engineering and Technology, AUUP, Lucknow, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145, https://doi.org/10.1007/978-981-97-0448-4_1
1 Introduction

Data has become the foundation of modern society in the era of information, changing the way we work, live, and engage with the outside world. Three crucial domains, Data Analytics, Big Data, and Machine Learning, are at the centre of this revolutionary environment, which has emerged from the convergence of data, technology, and creativity. These interconnected fields provide insights, forecasts, and solutions that cut across industries, from healthcare and banking to transportation and entertainment, and together they constitute the foundation of data-driven decision-making. Comprehending their complexities and realising their potential is not only necessary to remain competitive in today's fast-paced world; it also opens the door to groundbreaking innovation and the advancement of society.

As the fundamental layer, data analytics enables us to convert unprocessed data into meaningful insights. It uncovers secrets by methodically examining and interpreting patterns, shedding light on the way ahead; it propels optimisation, informs tactics, and directs us towards more intelligent decisions. Big Data, a revolutionary paradigm shift, is our answer to the ever-increasing volumes, velocities, varieties, and complexity of data. It makes data management, archiving, and analysis possible on a scale never before achievable, and Big Data technology has given businesses access to the vast amount of information hidden within this torrent of data. Machine Learning, the pinnacle of data science, delivers intelligent, automated decision-making. Inspired by the human ability to learn, adapt, and evolve, these algorithms have revolutionised our ability to recognise patterns, make predictions, and even replicate human cognitive capabilities. Machine learning is now the basis for many cutting-edge applications, from customised healthcare to driverless cars.

By using data to produce useful insights and support wise decisions, data analytics transforms industries. Data analytics is the process of identifying patterns, trends, and correlations in large datasets using advanced algorithms and tools; this helps organisations anticipate market trends, analyse customer behaviour, and optimise their operations. The application of data analytics enables organisations across industries to generate value, drive growth, and maintain competitiveness in today's data-driven world. Benefits include improved operational efficiency, better strategic planning, the fostering of innovation, and the ability to provide personalised experiences.

In addition to being a theoretical voyage, this investigation into the trio of data analytics, big data, and machine learning also serves as a practical guide for navigating the ever-changing data landscape. Deeper exploration of these fields reveals opportunities to spur innovation, advance progress, and improve quality of life, both individually and as a society. In the chapters that follow, we embark on an educational journey through the ideas, practices, and real-world applications of these transformative domains. The following pages offer a glimpse into a future in which data is king and the ability to glean knowledge from it opens up a limitless array of opportunities. The following factors make the current study significant:
• The purpose of this study is to help IT professionals and researchers choose the best big data tools and methods for efficient data analytics.
• It is intended to give young researchers insightful information so they can make wise decisions and significant contributions to the scientific community.
• The outcomes of the study will act as a guide for the development of methods and resources that combine cognitive computing with big data.

The following is a summary of this study's main contributions:

• A thorough and in-depth evaluation of prior state-of-the-art studies on Machine Learning (ML) approaches in Big Data Analytics (BDA).
• A brief overview of the key characteristics of the compared machine learning (ML) and big data analytics (BDA) approaches.
• A succinct summary of the important features of the compared methods for BDA with ML.
• A brief discussion of the challenges and future directions in data analytics and big data with ML.

The remainder of the chapter is arranged as follows: Sect. 2 presents an overview of data analytics; Sect. 3 discusses big data in detail; Sect. 4 examines the machine learning algorithms used in big data analytics; Sect. 5 explores the interplay of data analytics, big data, and machine learning; Sect. 6 outlines challenges and future directions; and the final section concludes the study.
2 Data Analytics

Data analytics has become a transformational force in the information era, showing the way to efficient, innovative, and well-informed decision-making. Our ever-growing digital environment creates an ocean of data, and the ability to use this wealth of knowledge has become essential for success at both the individual and organisational level. By transforming raw data into useful insights, data analytics, the methodical analysis and interpretation of data, enables us to navigate this enormous ocean of information successfully.

Fundamentally, data analytics is a dynamic process that makes use of a range of methods and instruments to examine, clean, organise, and model data. Through these methodical activities, data analysts find patterns, trends, and correlations that are frequently invisible to the unaided eye [1]. In the corporate world, this translates into strategy optimisation, operational improvement, and opportunity discovery.

Data analytics is not confined to particular applications or industries. Its reach extends across a variety of sectors, including marketing, sports, healthcare, and finance. Data analytics is essential for many tasks, including recognising patterns in consumer behaviour, streamlining healthcare services, forecasting changes in the financial markets, and even refining sports tactics.

The emergence of sophisticated technology and increased processing capacity has opened up new possibilities for data analytics. Data analytics can now perform predictive
and prescriptive analytics in addition to historical analysis because of advancements in machine learning and artificial intelligence [2]. Thanks to its capacity to predict future trends and suggest the best courses of action, data analytics has emerged as a powerful ally in the pursuit of success and innovation. As we learn more about this area, we become aware of the approaches, resources, and practical uses that enable businesses and people to derive value from data. In the data-driven era, data analytics is a beacon of hope, paving the path for improved decision-making and a deeper comprehension of the world we live in.
2.1 Data Analytics Process

Gather information directly from the source first. Then, work with and enrich the data to make sure it is compatible with your downstream systems, converting it to the format of your choice. The prepared data should be kept in a data lake or data warehouse so that it can be used for reporting and analysis or as a long-term archive [3]. Finally, use analytics tools to explore the data and draw conclusions. The data analytics process is shown in Fig. 1.

Data Capturing: Depending on where your data comes from, you have several options for capturing it:

• Data Migration Tools: To move data from one cloud platform or from on-premises systems to another, use data migration tools. For this purpose, Google Cloud provides a Storage Transfer Service.
• API Integration: Use APIs to retrieve data from outside SaaS providers and transfer it to your data warehouse. BigQuery, Google Cloud's serverless data warehouse, offers a data transfer service to facilitate the easy import of data from sources such as Teradata, Amazon S3, Redshift, YouTube, and Google Ads.
• Real-time Data Streaming: Use the Pub/Sub service to receive data in real time from your applications. Configure a data source to send event messages to Pub/Sub so that subscribers can process them and respond accordingly.
• IoT Device Integration: Google Cloud IoT Core, which supports the MQTT protocol, allows your IoT devices to stream real-time data. IoT data can also be sent to Pub/Sub for additional processing.

Processing Data: The critical next step after data ingestion is data enrichment, or processing, to get the data ready for downstream systems. Three main services in Google Cloud make this step easier:
Fig. 1 Process of data analytics
• Dataproc: A managed Hadoop platform that streamlines cluster setup, saving time and effort compared with conventional Hadoop environments. With clusters ready in less than 90 seconds, it allows for quick data processing.
• Dataprep: An intuitive graphical tool that removes the need for manual coding and enables data analysts to prepare and explore data quickly.
• Dataflow: A serverless data processing service that handles both batch and streaming data and uses the open-source Apache Beam SDK for portability. Dataflow's architecture keeps compute and storage separate, allowing for smooth scalability.

Data Storage: The data is then stored in a data lake or data warehouse for reporting and analysis, long-term archiving, or both. Two essential Google Cloud services support this step.

Google Cloud Storage is object storage that can hold files, videos, photos, and other kinds of data. It provides four storage classes:

• Standard Storage: Ideal for "hot" content that is accessed frequently, such as mobile apps, streaming videos, and webpages.
• Nearline Storage: An affordable option for long-tail multimedia content and data backups that must be kept for at least 30 days.
• Coldline Storage: Highly economical for data that needs to be stored for at least 90 days, such as disaster recovery data.
• Archive Storage: The most economical choice for data that needs to be kept for at least a year, such as regulatory archives.

BigQuery is a serverless data warehouse that can handle petabytes of data without requiring server management. With BigQuery, collaborating with your team is simple because you can use SQL to store, query, and share data. It also provides pre-built integrations with external services for easy data ingestion and extraction, along with a collection of free public datasets that facilitate further analysis and visualisation.

Data Analysis: Data analysis comes next, after the data has been processed and stored in a data lake or data warehouse. If your data is already in BigQuery, you can run analysis directly with SQL, and it is straightforward to move data from Google Cloud Storage into BigQuery for analysis. BigQuery also offers machine learning capabilities through BigQuery ML, which lets you use SQL, which may be more familiar, to create models and make predictions directly from the BigQuery UI [4].

Using the Data: Once the data is in the data warehouse, machine learning may be used to anticipate outcomes and gain insights. Depending on your needs, you can leverage the TensorFlow framework and AI Platform for further processing and prediction. TensorFlow is a complete open-source machine learning platform with libraries, tools, and community resources. AI Platform helps developers, data scientists, and data engineers streamline their machine learning workflows, with tools that cover every phase of the machine learning lifecycle, from preparation to building, validation, and deployment [4].
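As a minimal, hedged sketch of the storage-and-analysis steps just described (assuming the google-cloud-bigquery client library is installed and default credentials are configured; the project, bucket, dataset, and table names are hypothetical placeholders), the following Python snippet loads a CSV object from Cloud Storage into BigQuery and then runs a SQL query against it:

```python
# Sketch: load a CSV from Cloud Storage into BigQuery, then query it with SQL.
# The project, dataset, table, and bucket names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")          # uses default credentials
table_id = "my-project.analytics_demo.sales"

# Load step: ingest a CSV object from a Cloud Storage bucket into a table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/sales_2024.csv",
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,                                 # infer the schema
    ),
)
load_job.result()                                        # wait for completion

# Analysis step: run SQL directly in BigQuery.
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM `my-project.analytics_demo.sales`
    GROUP BY region
    ORDER BY total_sales DESC
"""
for row in client.query(query).result():
    print(row["region"], row["total_sales"])
```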
Data visualisation: There are many different data visualisation tools available, and most of them include a BigQuery link so you can quickly generate charts with the tool of your choosing. A few tools that Google Cloud offers are worth taking a look at. In addition to connecting to BigQuery, Data Studio is free and offers quick data visualisation through connections to numerous other services. Charts and dashboards may be shared very easily, especially if you have experience with Google Drive. Looker is also an enterprise platform for embedded analytics, data applications, and business intelligence [4].
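When a dashboard tool such as Data Studio or Looker is not required, query results can also be pulled into a local DataFrame and charted directly. The snippet below is a sketch only, assuming the BigQuery client, pandas, and matplotlib are available (converting results to a DataFrame additionally requires the db-dtypes package); it reuses the hypothetical table from the previous example:

```python
# Sketch: pull a BigQuery result into pandas and plot it locally.
import matplotlib.pyplot as plt
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
df = client.query(
    "SELECT region, SUM(amount) AS total_sales "
    "FROM `my-project.analytics_demo.sales` GROUP BY region"
).to_dataframe()                                  # needs the db-dtypes package

df.plot(kind="bar", x="region", y="total_sales", legend=False)
plt.ylabel("Total sales")
plt.title("Sales by region")
plt.tight_layout()
plt.show()
```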
3 Big Data

With data expected to reach 180 ZB by 2025, data will play a pivotal role in propelling twenty-first-century growth and transformation, forming a new "digital universe" that will alter markets and businesses [4]. This flood of digital data from numerous complex sources has ushered in the era of "Big Data" [5]. Big data refers to datasets that are too large for traditional software tools to handle, store, organise, and analyse [6]. These datasets are notable not only for their sheer quantity but also for their heterogeneity and complexity: they include structured, semi-structured, and unstructured data, spanning operational, transactional, sales, marketing, and many other data types, in formats such as text, audio, video, and images. Notably, unstructured data is growing more quickly than structured data and now makes up almost 90% of all data [7]. It is therefore critical to investigate new processing capabilities in order to derive data-driven insights that facilitate improved decision-making.

The three Vs (volume, velocity, and variety) are frequently used to describe Doug Laney's idea of big data, which is referenced in Refs. [7–9]. However, a number of studies [8] have extended this idea to five essential qualities (5Vs): volume, velocity, variety, value, and veracity, as shown in Fig. 2. As technology advances, data storage capacity, data transfer rates, and system capabilities change, and so does the notion of big data [9]. The first "V", volume, represents the exponential growth in data size over time [5], with electronic medical records (EMRs) being a major source of data for the healthcare sector [9]. The second "V", velocity, describes the rate at which information is created and gathered across industries. Figure 2 summarises these five characteristics.

Volume: Volume is the total amount of data, and it has increased significantly as a result of the widespread use of sensors, Internet of Things (IoT) devices, connected smartphones, and information and communication technologies (ICTs), including artificial intelligence (AI). With data generation exceeding Moore's law, this data explosion has produced enormous datasets that go beyond conventional measurements and introduce terms like exabytes, zettabytes, and yottabytes.
Fig. 2 General idea of big data
Velocity: What sets big data apart is the speed at which data from connected devices and the internet reaches businesses in real time. This rapid inflow of data is highly valuable, since it allows businesses to move quickly, become more agile, and gain a competitive advantage. While some businesses have long harnessed big data for customer recommendations, today's enterprises use big data analytics not only to analyse data but also to act upon it in real time.

Variety: The era of Web 3.0 is characterised by diversity in the sources and formats of data. The growth of social media and the internet has produced a wide range of data types, including text messages, status updates, images, and videos posted on social media sites such as Facebook and Twitter, SMS messages, GPS signals from mobile devices, customer interactions in online banking and retail, contact-centre voice data, and more. Among the relatively new but important sources of big data are the constant streams from mobile devices that record the location and activity of people, as well as the social media interactions, click-streams, and logs produced by a variety of online sources.

Value: The application of big data can provide insightful information, and the data analysis process can benefit businesses, organisations, communities, and consumers enormously.

Veracity: Veracity refers to the accuracy and dependability of data. Where there are discrepancies or errors in the data collection process, veracity measures the degree of uncertainty and reliability surrounding the information.

Big data provides businesses with a wealth of opportunities to boost efficiency and competitiveness. It encompasses the ongoing collection of data as well as the technologies required for data management, storage, collection, and analysis. This paradigm shift has altered fundamental aspects of organisations and of management. Big data is an essential tool that helps businesses discover new information, create value, and spur innovation in markets, processes, and products. As a result, data has become a highly valued resource, underscoring for business executives the importance of adopting a data-driven strategy [10]. Businesses have accumulated data for many years, but the current trend is more towards active data analysis
Fig. 3 Big data trend
than passive storage. As a result, data-driven businesses outperform their non-data-driven competitors in terms of financial and operational performance, increasing profitability by 6% and productivity by 5%, which gives them a considerable competitive advantage [11]. Businesses and organisations are therefore becoming increasingly interested in using big data, as shown in Fig. 3, which illustrates its growing adoption over time. Through the process of data analysis, the use of big data has the potential to provide insightful information that will benefit businesses, organisations, communities, and consumers.
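To make the notion of processing such large datasets concrete, the following minimal sketch (assuming a local PySpark installation; the file path and column names are hypothetical placeholders) shows how a collection of transaction files too large for a single-machine spreadsheet might be aggregated with Apache Spark, one of the big data frameworks discussed later in this chapter:

```python
# Sketch: aggregate a large collection of transaction CSVs with PySpark.
# File path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

transactions = spark.read.csv(
    "/data/transactions/*.csv", header=True, inferSchema=True
)

# Compute per-customer spend and keep the most recent activity date.
summary = (
    transactions.groupBy("customer_id")
    .agg(
        F.sum("amount").alias("total_spend"),
        F.max("txn_date").alias("last_seen"),
    )
    .orderBy(F.desc("total_spend"))
)

summary.show(10)        # inspect the top customers
spark.stop()
```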
4 Machine Learning

Machine learning (ML) algorithms have become popular for modelling, visualising, and analysing large datasets. With ML, machines can learn from data, generalise their findings to unseen information, and forecast outcomes, and the literature attests to the effectiveness of ML algorithms across a variety of application domains. Based on the available literature, machine learning can be divided into four main classes: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning. Numerous open-source machine learning methods are available for a range of tasks, including ranking, dimensionality reduction, clustering, regression, and classification; notable examples include Singular-Value Decomposition (SVD), Principal Component Analysis (PCA), Radial Basis Function Neural Networks (RBF-NN), k-Nearest Neighbours (KNN), Hidden Markov Models (HMM), Decision Trees (DT), Naive Bayes (NB), Tensor Auto-Encoders (TAE), and Ensemble Learning (EL) [12–16]. Machine learning is essential to big data and data analytics because it provides strong tools and methods for deriving valuable insights from enormous and intricate datasets. The following are some important ways in which big data and data analytics benefit from machine learning:
• Pattern Recognition and Prediction: Machine learning algorithms are highly proficient at detecting patterns and trends in extensive datasets. This capability makes predictive analytics, the projection of future trends, and data-driven forecasting possible [16].
• Automated Data Processing: ML algorithms can automate operations such as preprocessing, cleaning, and transforming data. This automation increases productivity and reduces the labour-intensive manual work needed to handle large datasets.
• Anomaly Detection: Machine learning algorithms can find unusual patterns or departures from the norm by identifying anomalies or outliers in data, which is very useful for spotting mistakes, fraud, or abnormalities in large databases (a brief sketch follows this list).
• Classification and Categorisation: Using patterns and characteristics, machine learning algorithms can categorise data into groups or classes, which is useful for classifying and organising massive amounts of unstructured data.
• Recommendation Systems: Machine learning powers recommendation engines, which examine user behaviour and preferences to offer tailored content or goods. Online services, streaming platforms, and e-commerce all make extensive use of this.
• Real-Time Analytics: Machine learning makes real-time analytics possible by processing and analysing data almost instantly, allowing prompt decision-making and flexibility in response to changing circumstances.
• Scalability: ML algorithms are well suited to big data analytics, where traditional techniques may falter, because they can scale to handle enormous and heterogeneous datasets.
• Feature Engineering: Machine learning enables the extraction of pertinent features from raw data, enhancing the precision and efficiency of models in capturing intricate relationships across large datasets.
• Continuous Learning: Machine learning models are dynamic and capable of evolving to capture shifting patterns and trends in large datasets because they can adjust to and learn from new data over time.
• Natural Language Processing (NLP): NLP, a branch of machine learning, gives computers the ability to comprehend, interpret, and produce human-like language. This is useful for sentiment analysis, text data analysis, and extracting insights from textual data.
• Clustering and Segmentation: Machine learning algorithms can group similar data points together, making segmentation and clustering easier and helping to reveal distinct patterns and subgroups in big datasets.
• Regression Analysis: ML models, especially regression algorithms, are used to examine correlations between variables and make predictions based on past data. This is essential for understanding and forecasting patterns in large datasets.
• Dimensionality Reduction: ML approaches such as Principal Component Analysis (PCA) help reduce the dimensionality of datasets while preserving crucial information, which is essential for managing large, multidimensional data efficiently.
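As a hedged illustration of the anomaly-detection point above (a sketch on synthetic data rather than a production recipe, assuming scikit-learn and NumPy are installed), an Isolation Forest can flag records that deviate from the bulk of a dataset:

```python
# Sketch: flag anomalous records in a dataset with an Isolation Forest.
# Synthetic data stands in for, e.g., transaction features from a big data store.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))     # typical behaviour
outliers = rng.uniform(low=-6.0, high=6.0, size=(20, 2))    # injected anomalies
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)          # -1 marks points judged anomalous

print("Flagged as anomalous:", int((labels == -1).sum()), "of", len(X), "records")
```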
In conclusion, machine learning enhances big data and data analytics by offering complex algorithms and methods for finding patterns, automating processes, and generating predictions, all of which lead to better decision-making.
5 The Interplay of Data Analytics, Big Data, and Machine Learning

Within the ever-changing field of data-driven decision-making, the interaction of Data Analytics, Big Data, and Machine Learning is the pinnacle of collaboration. These three interconnected realms strengthen and complement one another, ushering in a new era in the use of information. This section explores the complex relationships that exist between these pillars and demonstrates how innovation and problem-solving capability are enhanced when they work together. The interplay of Data Analytics, Big Data, and Machine Learning is shown in Fig. 4 below.
Fig. 4 Interplay of data analytics, big data, and machine learning
of the data, points out important variables, and aids in formulating the questions that require investigation. This knowledge is crucial for defining tasks and formulating problems in the larger context of data analysis. Big Data: Large-Scale Data Management and Processing: Big Data comes into play to solve the problem of handling enormous volumes, high velocities, many types, and the accuracy of data, while Data Analytics sheds light on the possibilities of data. Often, the processing power of these data avalanches exceeds that of conventional data management systems. To meet this challenge head-on, big data technologies like Hadoop, Spark, and NoSQL databases have evolved. They provide the tools and infrastructure required to handle, store, and process data on a never-beforeseen scale. Big Data processing outcomes, which are frequently aggregated or preprocessed data, interact with data analytics when they are used as advanced analytics inputs. Moreover, businesses can benefit from data sources they may have previously overlooked thanks to the convergence of big data and data analytics. The interaction improves the ability to make decisions based on data across a wider range. Machine Learning: Intelligence Automation: While Big Data manages massive amounts of data and Data Analytics offers insights, Machine Learning elevates the practice of data-driven decision-making by automating intelligence. Without explicit programming, machine learning techniques allow systems to learn from data, adjust to changing circumstances, and make predictions or judgements. Machine Learning is frequently the final stage in the interaction. It makes use of Big Data’s data processing power and the insights gleaned from Data Analytics to create prediction models, identify trends, and provide wise solutions. Machine learning depends on the knowledge generated and controlled by the first two components to perform tasks like automating picture identification, detecting fraud, and forecasting client preferences. The key to bringing the data to life is machine learning, which offers automation and predictive capability that manual analysis would not be able to provide [17–20]. Within the data science and analytics ecosystem, the interaction between Data Analytics, Big Data, and Machine Learning is synergistic. Organisations can fully utilise data when Data Analytics lays the foundation, Big Data supplies the required infrastructure, and Machine Learning automates intelligence. This convergence provides a route to innovation, efficiency, and competitiveness across multiple industries and is at the core of contemporary data-driven decision-making. A thorough understanding of this interaction is necessary for anyone looking to maximise the potential of data. The promise of data-driven insights and wise decision-making is realised when these three domains work harmoniously together. The current study analysed earlier research on large data analytics and machine learning in data analytics. Measuring the association between big data analytics keywords and machine learning terms was the goal. Research articles commonly use data analytics, big data analytics, and machine learning, as seen in Fig. 5. From Fig. 5, it is clear that there is a strong correlation between the keywords used by various data analytics experts and the combination of data, data analytics, big data, big data analytics and machine learning.
Fig. 5 Most trending keywords in data analytics, big data, and machine learning
6 Challenges and Future Directions

Large-scale dataset analysis presents difficulties in managing data quality, guaranteeing correctness, and addressing the complexities of big data processing and storage. Machine learning faces challenges in model interpretability, data labelling, and choosing the best algorithms for a variety of applications. Finding the right balance between prediction accuracy and computing efficiency remains a recurring problem at the nexus of big data, machine learning, and data analysis.
6.1 Challenges in Data Analytics

• Data Quality and Cleaning: Ensuring the precision and dependability of data sources, and cleaning and preparing data to remove mistakes and inconsistencies (a brief sketch follows this list).
• Data Security and Privacy: Preserving data integrity, protecting private information from breaches, and conforming to privacy laws.
• Data Integration: Combining information from diverse formats and sources to produce a single dataset for analysis.
• Scalability: Managing massive data volumes and making sure data analytics procedures can expand as data quantities increase.
• Real-Time Data Processing: Analysing and acting on data in real time to enable prompt decision-making and response.
• Complex Data Types: Handling multimedia, text, and other unstructured and semi-structured data.
• Data Visualisation and Exploration: Producing insightful visualisations and efficiently examining data to draw conclusions.
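To make the data-quality point tangible (a minimal sketch assuming pandas is installed; the file and column names are hypothetical placeholders), typical cleaning steps such as de-duplication, type coercion, and missing-value handling look like this:

```python
# Sketch: common data-cleaning steps on a hypothetical customer CSV with pandas.
import pandas as pd

df = pd.read_csv("customers_raw.csv")

df = df.drop_duplicates()                                  # remove repeated records
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Drop rows missing critical fields, fill optional ones with a sensible default.
df = df.dropna(subset=["customer_id", "signup_date"])
df["age"] = df["age"].fillna(df["age"].median())

print(df.shape)                                            # rows and columns after cleaning
```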
Organizations must overcome these challenges if they want to use data analytics efficiently and get insightful knowledge from their data.
6.2 Big Data Challenges

• Lack of Knowledge and Awareness: Big data projects may fail because many businesses do not have a basic understanding of the technology or the advantages it might offer. It is not uncommon to allocate time and resources inefficiently to new technology such as big data. Employees are often unaware of the true usefulness of big data, which leads to reluctance to embrace new procedures and can seriously impair business operations.
• Data Quality Management: A major obstacle to data integration is the variety of data sources (call centres, website logs, social media) that produce data in different formats. Furthermore, gathering huge volumes of data with 100% accuracy is a difficult task, and it is imperative to ensure that only trustworthy data is gathered, as inaccurate or redundant information can make the data useless for your company. Developing a well-organised big data model is necessary to improve data quality, and extensive data comparison is also needed to find and merge duplicate records and increase the model's correctness and dependability.
• Expense: Big data project implementation is frequently very expensive for businesses. If you choose an on-premises solution, you will have to pay developers and administrators in addition to spending money on new hardware; even if many frameworks are open source, there are still costs associated with setup, maintenance, configuration, and project development. A cloud-based solution, on the other hand, requires hiring qualified staff for product development and paying for cloud services. The costs of both solutions are high. When trying to balance flexibility and security, businesses can consider an on-premises solution for increased security or a cloud-based solution for scalability. Some businesses use hybrid solutions, keeping sensitive data on-site while processing it with cloud computing power, a financially sensible option in some cases.
• Security Vulnerabilities: Putting big data solutions into practice may leave your network vulnerable to security flaws, and businesses too often neglect security when they first start big data projects. Although big data technology is always developing, some security elements are still missing, so prioritising and improving security measures is crucial for big data ventures.
• Scalability: Big data's fundamental quality is its ongoing expansion over time, which is both a major benefit and a challenge. Although many businesses try to remedy this by increasing processing and storage capacity, budgetary restrictions make it difficult to scale without performance degradation. A structured architectural foundation is necessary to overcome this difficulty. Scalability is guaranteed by a strong architecture, which also minimises
numerous possible problems. Future upscaling should be accounted for naturally in algorithm design. Upscaling must be managed carefully if system support and maintenance are to be planned for. Frequent performance monitoring facilitates a more fluid upscaling process by quickly detecting and resolving system flaws. For most organisations, big data is very important since it makes it possible to efficiently gather and analyse the vital information needed to make well-informed decisions. Still, there are a number of issues that need to be resolved. Putting in place a solid architectural framework offers a solid starting point for methodically addressing these problems.
6.3 Machine Learning Challenges

Big data processing and data analysis present a number of difficulties for machine learning.

• Quantity and Quality of Data: Machine learning model performance is highly dependent on the quality and quantity of data. Results might be skewed by noisy or incomplete data, and processing huge datasets can be computationally demanding. Using strong data preparation methods and making sure the datasets are diverse and of good quality improves the accuracy of the model [21].
• Computing Capabilities: Processing large amounts of data in big data settings requires a significant amount of processing power, and complex machine learning models can be difficult to train and deploy under resource constraints. Big data's computational hurdles can be lessened with the use of distributed computing frameworks such as Apache Spark and cloud computing solutions [21].
• Algorithm Selection: With so many different machine learning algorithms available, it can be difficult to select the best one for a particular task, and inappropriate algorithm selection can lead to less-than-ideal performance. Making well-informed decisions is facilitated by carrying out thorough model selection trials and understanding the characteristics of the various algorithms [21].
• Instantaneous Processing: Many applications need to process data in real time in order to make timely decisions, and in such situations traditional batch-trained machine learning models might not be the best fit. Online learning techniques and models optimised for real-time processing can mitigate the issues related to time-sensitive applications [22] (a brief sketch follows this list).
• Explainability and Interpretability: Interpretability is frequently a problem with machine learning models, particularly sophisticated ones such as deep neural networks. It is critical to comprehend the reasoning behind model decisions, especially in sensitive areas. To improve comprehension, interpretable models should be created, simpler algorithms should be used whenever possible, and model explanation techniques should be included [22].
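As a small, hedged sketch of the online-learning idea mentioned above (synthetic data, scikit-learn and NumPy assumed), a linear model can be updated incrementally with partial_fit as new mini-batches arrive, rather than being retrained on the full dataset each time:

```python
# Sketch: incremental (online) learning with partial_fit on streaming mini-batches.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()                         # linear model trained with SGD
classes = np.array([0, 1])                      # must be declared for partial_fit

for step in range(50):                          # simulate 50 incoming mini-batches
    X_batch = rng.normal(size=(32, 5))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

X_new = rng.normal(size=(5, 5))
print(model.predict(X_new))                     # predictions from the updated model
```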
A multidisciplinary strategy integrating domain-specific knowledge, machine learning algorithms, and data engineering experience is needed to navigate these hurdles in machine learning for data analysis and big data handling. Ongoing research and technological developments help to reduce these difficulties and improve machine learning systems’ capacities.
6.4 Future Directions

In data analysis, big data, and machine learning, future directions include developing interpretability and addressing ethical issues, investigating new uses in diverse fields, and improving algorithms for better predictive performance on dynamic and complicated datasets. The limitations of this work also point towards these future directions.

• AI Integration: As machine learning and data analytics become more integrated with artificial intelligence (AI), more sophisticated and self-governing data-driven decision-making processes will be possible.
7 Ethical AI and Bias Mitigation • Due to legal constraints and public expectations, there will be a growing emphasis on ethical AI and mitigating bias in machine learning algorithms. • Explainable AI: Clear and understandable AI models are becoming more and more necessary, especially in fields like finance and healthcare where it’s critical to comprehend the decisions made by the algorithms. • Edge and IoT Analytics: As IoT devices, edge computing, and 5G technologies proliferate, real-time processing at the network’s edge will become more important in data analytics, facilitating quick insights and decision-making. • Quantum Computing: As this technology develops, it will provide new avenues for tackling hitherto unsolvable complicated data analytics and machine learning issues. • Data Security and Privacy: As the globe becomes more networked and rules become stronger, there will be a greater emphasis on data security and privacy. • Experiential analytics: Businesses will utilise data analytics to personalise marketing campaigns, products, and services, ultimately improving consumer experiences. • Automated Machine Learning (AutoML): As AutoML platforms and technologies proliferate, machine learning will become more widely available and accessible to a wider range of users, democratizing the field. Any company that wants to embrace big data must have board members who are knowledgeable about its foundations. Businesses may help close this knowledge gap by providing workshops and training sessions for staff members, making sure they
understand the benefits of big data. While keeping an eye on staff development is a good strategy, it could have a detrimental effect on productivity.
8 Conclusion This chapter has covered a basic grasp of the interrelated fields of Data Analytics, Big Data, and Machine Learning. It draws attention to their increasing significance in a variety of industries and their capacity to change how decisions are made. For anyone wishing to venture into the realm of data-driven insights and innovation, the discussion of essential ideas, terminology, and practical implementations acts as a springboard. Because these fields are dynamic and ever-evolving, bringing a multitude of opportunities and problems, continued learning and adaptation are needed to remain at the vanguard of this data-driven era. The exploration of Data Analytics, Big Data, and Machine Learning is expected to yield significant benefits as we explore uncharted territories of knowledge and seize the opportunity to influence a future replete with data.
References
1. Sivarajah, U., Kamal, M.M., Irani, Z., Weerakkody, V.: Critical analysis of big data challenges and analytical methods. J. Bus. Res. 70, 263–286 (2017)
2. Lavalle, S., Lesser, E., Shockley, R., Hopkins, M.S., Kruschwitz, N.: Big data, analytics and the path from insights to value. MIT Sloan Manag. Rev. 52(2), 3–22 (2010)
3. Jaseena, K.U., David, J.M.: Issues, challenges, and solutions: big data mining. CS & IT-CSCP 4(13), 131–140 (2014)
4. Sun, Z.H., Sun, L.Z., Strang, K.: Big data analytics services for enhancing business intelligence. J. Comput. Inf. Syst. 58(2), 162–169 (2018)
5. Debortoli, S., Muller, O., vom Brocke, J.: Comparing business intelligence and big data skills. Bus. Inf. Syst. Eng. 6(5), 289–300 (2014)
6. Sarkar, B.K.: Big data for secure healthcare system: a conceptual design. Complex Intell. Syst. 3(2), 133–151 (2017)
7. Zakir, J., Seymour, T., Berg, K.: Big data analytics. Issues Inf. Syst. 16(2), 81–90 (2015)
8. Raja, R., Mukherjee, I., Sarkar, B.K.: A systematic review of healthcare big data. Sci. Program. 2020, 5471849 (2020)
9. Tsai, C.W., Lai, C.F., Chao, H.C., Vasilakos, A.V.: Big data analytics: a survey. J. Big Data 2(1), 21 (2015)
10. Website. https://www.news.microsoft.com/europe/2016/04/20/go-bigger-with-big-data/sm.0008u654e19yueh0qs514ckroeww1/XmqRHQB1Gcmde4yb.97. Accessed 15 June 2017
11. McAfee, A., Brynjolfsson, E.: Big data: the management revolution. Harv. Bus. Rev. 90(10), 60–66, 68, 128 (2012)
12. Chen, M., Hao, Y.X., Hwang, K., Wang, L., Wang, L.: Disease prediction by machine learning over big data from healthcare communities. IEEE Access 5, 8869–8879 (2017)
13. Zuo, R.G., Xiong, Y.H.: Big data analytics of identifying geochemical anomalies supported by machine learning methods. Nat. Resour. Res. 27(1), 5–13 (2018)
14. Zhang, C.T., Zhang, H.X., Qiao, J.P., Yuan, D.F., Zhang, M.G.: Deep transfer learning for intelligent cellular traffic prediction based on cross-domain big data. IEEE J. Sel. Areas Commun. 37(6), 1389–1401 (2019)
15. Triantafyllidou, D., Nousi, P., Tefas, A.: Fast deep convolutional face detection in the wild exploiting hard sample mining. Big Data Res. 11, 65–76 (2018)
16. Singh, Y.B., Mishra, A.D., Nand, P.: Use of machine learning in the area of image analysis and processing. In: 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), pp. 117–120. IEEE (2018)
17. Singh, Y.B.: Designing an efficient algorithm for recognition of human emotions through speech. PhD diss., Bennett University (2022)
18. Nallaperuma, D., Nawaratne, R., Bandaragoda, T., Adikari, A., Nguyen, S., Kempitiya, T., De Silva, D., Alahakoon, D., Pothuhera, D.: Online incremental machine learning platform for big data-driven smart traffic management. IEEE Trans. Intell. Transp. Syst. 20(12), 4679–4690 (2019)
19. Xian, G.M.: Parallel machine learning algorithm using fine-grained-mode spark on a mesos big data cloud computing software framework for mobile robotic intelligent fault recognition. IEEE Access 8, 131885–131900 (2020)
20. Li, M.Y., Liu, Z.Q., Shi, X.H., Jin, H.: ATCS: Auto-tuning configurations of big data frameworks based on generative adversarial nets. IEEE Access 8, 50485–50496 (2020)
21. Mishra, A.D., Singh, Y.B.: Big data analytics for security and privacy challenges. In: 2016 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, pp. 50–53 (2016). https://doi.org/10.1109/CCAA.2016.7813688
22. The 2 types of data strategies every company needs. In: Harvard Business Review, 01 May 2017. https://hbr.org/2017/05/whats-your-data-strategy. Accessed 18 June 2017
Fundamentals of Data Analytics and Lifecycle Ritu Sharma and Payal Garg
Abstract This chapter gives a brief overview of the fundamentals and lifecycle of data analytics. Data analytics systems, the foundation for the present stage of technology, are surveyed in this chapter. The chapter also details widely used tools such as Power BI and Tableau that are employed in developing data analytics systems. Traditional analysis differs from big data analysis in terms of the volume and variety of data processed. To meet these requirements, various stages are needed to organise the activities involved in the acquisition, processing, reuse, and analysis of the given data. The lifecycle for data analysis helps to manage and organize the tasks connected to big data research and analysis. The evolution of data analytics, with big data analytics, SQL analytics, and business analytics, is explained. Furthermore, the chapter outlines the future of data analytics by leveraging its fundamental lifecycle and elucidates various data analytics tools.
1 Introduction In the field of Data Science, Data Analytics is the key component used to analyse data and bring out information that helps solve issues [1] and supports problem-solving across different domains and industries [2]. Before moving ahead, we should understand the keywords data and analytics, from which the term data analytics is formed, as shown in Fig. 1. Data analytics is the process of examining, cleaning, transforming, and interpreting data to discover valuable insights, patterns, and trends that can inform decision-making [3]. It plays a crucial role in a wide range of fields, including business, science, healthcare, and more. Whenever data analytics is discussed, we hear
R. Sharma Department of Computer Science and Engineering, Ajay Kumar Garg Engineering College, Ghaziabad, India
P. Garg (B) Department of Computer Science and Engineering, GL Bajaj Institute of Technology and Management, Greater Noida, India e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145, https://doi.org/10.1007/978-981-97-0448-4_2
Fig. 1 Formation of data analytics
about data analysis, or we can say the two terms are used interchangeably, although data analytics is really about the techniques and tools used to do data analysis. In other words, data analysis is a subset of data analytics that looks after data cleansing, data examination, data transformation, and data modeling to arrive at conclusions.
2 Fundamental of Data Analytics Over the past years, analytics has been very helpful to various projects by providing answers to various questions [4]. Some of the questions are as follows:
• What is going to happen next?
• Why did it happen?
• How did it happen?
Analytics is the process of making inferences with the help of methods and systems. In other words, it is the process of turning data into information.
2.1 Types of Analytics Traditional data analytics refers to the conventional methods and techniques used to analyze and interpret data to extract insights and support decision-making [5]. In traditional data analytics, Excel, tables, charts, graphs, hypothesis testing, and basic statistical measures are used for analysis. Dashboards made in traditional data analytics are static in nature and cannot adapt to new changes in the business. We discussed how data analysis and data analytics are used interchangeably and what data analysis looks after. When we talk about data analytics, it is broadly divided into the following three categories:
1. Descriptive analytics—It describes what has happened at an instance of time.
2. Predictive analytics—It determines the possibility of future events.
3. Prescriptive analytics—It provides suggested actions to accomplish applicable conclusions.
Fig. 2 Descriptive analytics key characteristics
2.1.1 Descriptive Analytics
As the name suggests, descriptive analytics describes the data in a way that can be understood easily. Most scenarios cover past, present, or historical data, and the summaries it produces often serve as the starting point for looking at future outcomes. In descriptive analytics, statistical methods such as percentages, sums, and averages are used. The key characteristics of descriptive analytics are shown in Fig. 2. Examples include sales reports, financial statements, inventory analysis, etc.
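To make these summary calculations concrete, here is a minimal Python sketch using pandas; the regional sales figures are invented purely for illustration.

# Minimal descriptive-analytics sketch: sum, average, and percentage with pandas.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "revenue": [120000, 95000, 87000, 134000],
})

total = sales["revenue"].sum()                        # sum
average = sales["revenue"].mean()                     # average
sales["share_pct"] = 100 * sales["revenue"] / total   # percentage of the total

print("Total revenue:", total)
print("Average revenue per region:", average)
print(sales)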
2.1.2 Predictive Analytics
Predictive analytics [6] comes under probability analysis; it helps to predict future events and supports upcoming decision-making. It is becoming important for business organizations that want to gain an edge in this competitive environment by obtaining predictions of future trends and using them to make data-driven decisions. The key characteristics of this type of analytics are shown in Fig. 3. Examples include sales forecasting, credit risk assessment, predictive maintenance, demand forecasting, etc.
Fig. 3 Predictive analytics key characteristics
Fig. 4 Prescriptive analytics key characteristics
2.1.3 Prescriptive Analytics
Prescriptive analytics is another type of analytics that provides suggested actions for decision-making processes. It builds on both descriptive and predictive analytics to generate recommendations for decision-making. Its key characteristics are depicted in Fig. 4. Examples include healthcare, marketing, financial services, transportation and logistics, etc. Figure 5 summarises the types of data analytics, providing a description with examples of each.
Fig. 5 Summary on types of analytics
2.2 Types of Data Data can be any facts (binary, text), figures, audio, or video used for analysis. No valuable information can be obtained unless analytics is performed on data [7]. In day-to-day life, the general public is heavily dependent on devices. For example, people use maps to reach some place, and those maps use GPS navigation to find the shortest route to a particular point. This is only possible when analytics is done on data that involves the different landmarks of the city and the roads connecting them. While carrying out such analytics, data can be classified into three types [8]:
1. Categorical data
2. Numerical data
3. Ordinal data
2.2.1 Categorical Data
Categorical data, also referred to as nominal or qualitative data, is not associated with a natural order or a numerical value. It is used to group observations into sets or classes. Some examples include:
• Marital status: "single", "married", etc.
• Colors: "red", "green", etc.
• Gender: "female", "male", etc.
• Education: "high school", "bachelors", etc.
Table 1 Example of categorical data

IP address | Class | Modified class
172.16.254.1 | IPv4 | 0
2001:0db8:85a3:0000:0000:8a2e:0370:7334 | IPv6 | 1
172.14.251.1 | IPv4 | 0
172.16.245.2 | IPv4 | 0
2001:0db6:85a3:0000:0000:8a3e:0270:7324 | IPv6 | 1
Table 1 shows an example of categorical data, depicting IP addresses and the class each belongs to. Two types of classes are present, IPv4 and IPv6; these labels cannot be used directly for classification, so they are encoded numerically, with IPv4 identified as class 0 and IPv6 as class 1.
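A small pandas sketch of this kind of encoding is shown below; the column names and the use of pandas are assumptions made for illustration, with the values taken from Table 1.

# Minimal sketch: mapping the nominal class labels of Table 1 to numeric codes.
import pandas as pd

df = pd.DataFrame({
    "ip_address": [
        "172.16.254.1",
        "2001:0db8:85a3:0000:0000:8a2e:0370:7334",
        "172.14.251.1",
        "172.16.245.2",
    ],
    "ip_class": ["IPv4", "IPv6", "IPv4", "IPv4"],
})

# Encode the categorical class so a classifier can use it: IPv4 -> 0, IPv6 -> 1.
df["modified_class"] = df["ip_class"].map({"IPv4": 0, "IPv6": 1})
print(df)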
2.2.2 Numerical Data
Numerical data, also referred to as quantitative data, consists of numbers and can be measured. It is also used in mathematical calculation and analysis. Some examples of numerical data are discrete data, continuous data, interval data, ratio data, etc.
2.2.3 Ordinal Data
Ordinal data is a combination of numerical and categorical data; it consists of both numerical and categorical values. The characteristics of ordinal data are given in Fig. 6.
Fig. 6 Key characteristics of ordinal data
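As an illustration, education levels carry a natural order; the following small pandas sketch (the category labels are assumed for illustration) shows how such ordinal data can be represented and compared.

# Minimal sketch: representing ordinal data (ordered categories) with pandas.
import pandas as pd

education = pd.Series(["high school", "bachelors", "masters", "bachelors"])
ordered_levels = ["high school", "bachelors", "masters"]   # natural order, lowest first

edu_ordinal = pd.Categorical(education, categories=ordered_levels, ordered=True)
print(edu_ordinal.codes)        # numeric codes that respect the ordering: [0 1 2 1]
print(edu_ordinal < "masters")  # ordered comparisons are meaningful for ordinal data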
3 Data Analytical Architecture Data analytical architecture, also known as data processing architecture or data analytics architecture, refers to the design and structure of the systems and technologies that organizations use to collect, store, process, and analyze data in order to drive business outcomes and reach decisions. In such an architecture, the data sources being collected and stored, the tools for analysing the stored data, and the reporting facilities are the essential parts. An architecture is given in Fig. 7. The major components of the classic analytical architecture are:
• Data originators
• Data depository
• Reports
• Users
• Applications
In the beginning, data from all the originators is accumulated from every source; it can be in the form of categorical, ordinal, or numerical data. This input forms various source databases, such as sales, financial, and others. Data warehouses are the place where all the data is stored and used by all the applications and users for reporting. Different ETL (Extract, Transform, and Load) tools are applied around the data warehouse to move the data: these tools collect the input in its raw form and convert it into the form needed for the study. Once the data is available, it is analyzed; relational database systems and SQL are used to extract essential insights from the input. For example, if we need to find the purchases for March, we can write it in SQL as given below:
Fig. 7 Classical analytical architecture
SELECT purchase
FROM sales_information
WHERE month = 'March';

At the end of the data analytics architecture, dashboards, reports, and alerts are delivered to the applications and users. The applications update their dashboards based on the analysis performed in the earlier analytics steps. Notifications are sent to each user's laptop, tablet, or smartphone, and applications notify the user when the analysis is complete. The feedback from an alert may influence the users' decision to take action or not. Such an architecture supports analysis and decision-making in a predefined way. Nonetheless, the present architecture encounters several common obstacles, which are discussed as follows:
• There might not be a restriction on the number or format of the data sources. Thus, the issue of managing a large amount of data emerges if the data sources have different backgrounds. In certain situations, a standard data warehouse solution might not be appropriate. Using a data warehouse to centralize data storage can result in backup and failure problems, and because it is governed by a single point, it prevents data scientists and application developers from exploring the data iteratively.
• Some key data is required by many applications for reporting purposes and might be needed around the clock. The data warehouse's central point becomes a point of failure in these situations.
• Local data warehouse solutions, one for each of the various data sources that have been gathered, can be one means of getting around the centralized structure. This method's primary flaw is that the "data schema" must be updated for every new data source that is added.
Nowadays, data analytics architecture, also known as data processing architecture, refers to the design and structure of the systems and technology that organizations use to collect, store, process, and analyze data for gaining information, making decisions, and driving business outcomes. It is a crucial component that typically involves multiple layers and components to tackle the different aspects of data analytics. Some of the key components within a data analytical architecture are given here (shown in Fig. 8):
1. Data Sources—Data sources, also referred to as data originators, are where data originates. Sources can include data warehouses, external APIs, IoT devices, databases, and many more. Data can be structured, unstructured, or semi-structured.
2. Data Ingestion—The actions performed to collect and import data or input from different sources into a central warehouse. ETL (Extract, Transform, Load) processes and tools are used for this purpose.
3. Data Storage—Data is stored in a form that allows for analysis and efficient retrieval. Data storage solutions include data warehouses, data lakes, and NoSQL databases.
Fig. 8 Key components involved in data analytical architecture
4. Data Processing—This layer is responsible for cleaning, transforming, and refining the raw data; here the data is prepared for analysis. Data integration tools play a role here.
5. Data Modelling—Data models are created which represent data in a way that is optimised for analysis and querying. This also involves creating snowflake or star schemas in a data warehouse.
6. Analytics Tools—The tools and platforms used by data scientists and analysts to query, visualize, and gain insight from the data. Popular tools include business intelligence (BI) tools, data visualization tools, and machine learning platforms. We will discuss BI later in this chapter.
7. Cloud Services—Nowadays, organizations take advantage of cloud computing services to scale and build their data platforms. Various providers such as AWS, Google Cloud, and Azure offer a range of data-related services.
8. Real-Time Processing—Some functions or use cases require real-time data analysis. This includes technologies like stream processing frameworks.
The above components of the architecture are indicative and vary based on the needs of the organization, the variety and volume of data, and its analytical requirements. Organizations may expand them over time according to their needs [9].
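To show how the ingestion and storage components above might fit together in practice, here is a minimal extract-transform-load sketch in Python; the file name, the column names, and the use of SQLite as a stand-in data warehouse are all illustrative assumptions.

# Minimal ETL sketch: extract from a CSV source, transform with pandas, load into SQLite.
import sqlite3
import pandas as pd

# Extract: pull raw records from an operational source (hypothetical file).
raw = pd.read_csv("sales_2024.csv")

# Transform: clean the records and aggregate them into a monthly, analysis-ready table.
raw["order_date"] = pd.to_datetime(raw["order_date"])
monthly = (
    raw.dropna(subset=["amount"])
       .assign(month=lambda d: d["order_date"].dt.to_period("M").astype(str))
       .groupby("month", as_index=False)["amount"].sum()
)

# Load: write the result into the "warehouse" so reports and dashboards can query it.
with sqlite3.connect("warehouse.db") as conn:
    monthly.to_sql("monthly_sales", conn, if_exists="replace", index=False)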
4 Data Analytics Lifecycle Traditional projects and projects that include data analysis are different: projects in data analytics require much more inspection [9]. Therefore, the data analytics lifecycle includes a number of steps or stages that organizations and professionals follow to extract valuable outcomes and knowledge from the given data [10]. It encloses the whole procedure of collecting, cleaning, analyzing, and interpreting data to make decisions [11]. Although there can be variations in some specific stages and their names, the following are the key steps of the data analytics lifecycle (shown in Fig. 9):
1. Problem definition
2. Data collection
3. Data cleaning
4. Data exploration
5. Data transformation
6. Visualization and reporting
Fig. 9 Lifecycle of data analytics
4.1 Problem Definition Problem definition is the most important aspect of every process, and in the data analysis process it is a crucial step. A well-defined problem statement guides the analysis and confirms that you are answering the right questions. The major points that need to be kept in mind while defining the problem in data analytics are shown in Fig. 10. The significance of this step is paramount, as it provides the foundation for the entire analytical process. Defining the problem helps to ensure that the data analysis effort is focused, relevant, and valuable. Several reasons make this step crucial in data analysis, such as clarity and precision, scope and boundaries, goal alignment,
Fig. 10 Major elements in problem definition
optimization, hypothesis, data relevance, decision-making, communication, etc. In summary, we can say that this step is not just preliminary; it is a critical aspect that influences the entire process. Also, it ensures that the analysis will be purposeful, relevant, and aligned with the goals of the organization.
4.2 Data Collection Data collection is one of the most essential steps in the process of data analytics. It gathers and obtains data from different sources so that it can be used for analysis. Efficient data collection is crucial to ensure that the data we use for analysis is reliable, accurate, and relevant to the problem. Some of the key aspects to consider in data collection are shown in Fig. 11. Data collection is also a fundamental aspect of the data analysis process for data analysts; its significance lies in its role as the starting point. Some of the reasons highlighting the importance of data collection are the basis for analysis (data collection provides the raw material that analysts use to
Fig. 11 Data collection key aspects
derive insights), informed decision-making, identifying trends and patterns, model training and validation, understanding stakeholders' needs, risk management, benchmarking, etc. We can say that data collection is the cornerstone of effective data analysis. The quality, relevance, and completeness of the collected data directly impact the insights. A thoughtful and systematic approach to data collection is essential for achieving meaningful results.
4.3 Data Cleaning Data cleaning includes identifying and correcting inconsistencies or errors in the data to make sure that the data is accurate and complete, after which it is ready for analysis [7]. It is an iterative process that can require multiple passes, and this step is required to make sure that the analysis is based on reliable, high-quality data, which gives us more accurate and meaningful insights [12]. The various tasks involved are given below:
1. Handling Missing Data: depending on the context, we deal with missing values, for example by removing the rows that contain them.
2. Outlier Detection and Treatment: data points that show atypical patterns in our dataset are identified and then removed or transformed so that they fall within an acceptable range.
3. Data Type Conversion: we make sure that data types are correctly assigned to each column. Sometimes data types are incorrect or not in the same format, so we need to convert them into the format required for analysis.
4. Duplicate Data: we make sure there is no redundant or duplicate data; if there is, we remove the duplicate rows to avoid double counting in the analysis.
5. Text Cleaning: if the data includes text, we need to clean and preprocess it. For example, special characters present in the data may need to be removed, or the text may need to be converted to lowercase, etc.
6. Data Transformation: this includes converting units, aggregating the data, and creating new variables from the existing variables.
7. Addressing Inconsistent Date and Time Formats: dates and times can be stored in various formats, so we need to standardize them for consistency and analysis.
8. Domain-Specific Cleaning: we can clean the data depending on the specific domain and the data sources we receive or want to work on. For example, financial data and healthcare data may require domain-specific cleaning.
9. Handling Inconsistent Data Entry: here we handle data entry errors such as typos and inconsistent formats.
10. Data Versioning and Documentation: here we keep track of data changes and document the cleaning process to maintain data integrity and transparency.
Data cleaning, also known as data scrubbing, is a key step in the data analysis process. Its significance lies in the fact that the quality of the analysis and the reliability of the insights derived from the data heavily depend on the cleanliness and integrity of the data. There are several key reasons why data cleaning is essential for data analysts: accuracy of analysis, data integrity, consistency, improved model performance, enhanced data quality, missing data handling, effective visualization, reduced bias, saved time and resources, improved decision-making, and better collaboration. In conclusion, it ensures that the data used for analysis is accurate, reliable, and free from errors, ultimately leading to more robust and trustworthy insights. A minimal code sketch of several of the cleaning tasks listed above is given below.
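The following minimal pandas sketch illustrates several of the tasks above (missing values, duplicates, type conversion, and text cleaning); the small dataset is invented purely for illustration.

# Minimal data-cleaning sketch with pandas.
import pandas as pd

df = pd.DataFrame({
    "customer": ["  Alice ", "bob", "bob", "Carol", None],
    "age": ["34", "29", "29", None, "41"],
    "signup_date": ["2024-01-05", "2024-01-05", "2024-01-05", "2024-02-10", "2024-03-01"],
})

df["customer"] = df["customer"].str.strip().str.lower()   # text cleaning
df = df.drop_duplicates()                                  # remove duplicate rows
df = df.dropna(subset=["customer"])                        # handle missing values
df["age"] = pd.to_numeric(df["age"])                       # data type conversion
df["signup_date"] = pd.to_datetime(df["signup_date"])      # standardize the date column
print(df)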
4.4 Data Exploration Data exploration involves gaining an in-depth understanding of the data through examining the data, summary statistics, data visualization, and other techniques. The major goal of this step is to gain insight into the characteristics of the data, identify patterns and relationships, and prepare the data for further analysis. Some of the key steps and techniques involved are mentioned in Fig. 12. Data exploration holds significant importance for data analysts: it helps in gaining a deep understanding of the data set, its key characteristics, its patterns, and its basic statistics. It helps analysts uncover insights, assess data quality, and make informed decisions, ultimately leading to more accurate and reliable results.
Fig. 12 Key and techniques used in data exploration [13]
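A brief Python sketch of an initial exploration pass is given below; the file name is a placeholder for whatever dataset is being explored, and pandas is an assumed tool choice.

# Minimal exploratory-analysis sketch: shape, types, summary statistics, and data-quality checks.
import pandas as pd

df = pd.read_csv("customers.csv")            # placeholder dataset

print(df.shape)                              # how many rows and columns we are dealing with
print(df.dtypes)                             # data type of each column
print(df.describe(include="all"))            # basic statistics for numeric and categorical columns
print(df.isna().sum())                       # missing values per column, a quick quality check
print(df.select_dtypes("number").corr())     # pairwise correlations between numeric columns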
4.5 Data Transformation Data transformation makes data more suitable for analysis by converting, structuring, and cleaning it, which helps to make sure that the data is in the right format and of the right quality, and also makes it easier to extract patterns and useful insights. It is a necessary step because real-world data is often messy and heterogeneous, and the quality and effectiveness of the analysis depend on how the data is transformed and prepared. Various operations are used in data transformation, some of which are explained in Fig. 13. Data transformation helps in normalizing data, making it comparable and consistent. It can be used to address skewed distributions, making the data more symmetrical, meeting the assumptions of certain statistical models, and improving the performance of models. In summary, data transformation helps to prepare data so that it is more suitable for various analytical techniques.
Fig. 13 Operation in data transformation [15]
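As a small illustration of the normalisation and skew-handling operations mentioned above, the following Python sketch applies a log transform followed by standardisation; the income values are synthetic and the use of scikit-learn's StandardScaler is one possible choice, not the only one.

# Minimal data-transformation sketch: log transform for a skewed variable, then standardisation.
import numpy as np
from sklearn.preprocessing import StandardScaler

income = np.array([[25000.0], [32000.0], [41000.0], [58000.0], [250000.0]])  # right-skewed values

log_income = np.log1p(income)                          # compress the long right tail
scaled = StandardScaler().fit_transform(log_income)    # rescale to zero mean and unit variance

print(scaled.round(2))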
4.6 Visualization and Reporting Visualization and reporting are critical components of data analytics, as they help analysts and stakeholders make sense of data, identify trends, draw insights, and make data-driven decisions [14]. An overview of visualization and reporting is given in Table 2. Visualization and reporting provide valuable tools for communicating insights and findings to both technical and non-technical audiences [16]. Visualization transforms complex data sets into understandable and interpretable visuals, which makes it easier for stakeholders to grasp insights. Reporting allows for the creation of a narrative around the data which highlights key findings and trends.
Table 2 Overview of visualization and reporting

1. Data visualization: The process of representing data graphically to facilitate understanding, which includes different types of charts, graphs, and diagrams. Some techniques are:
• Bar charts: used to compare categories
• Line charts: used to see trends and changes over a period of time
• Pie charts: display parts of a whole
• Scatter plots: show relationships between two variables
• Histograms: display data distributions
and many more.
2. Dashboards: Collections of visualizations and reports on a single page or screen. A dashboard can also provide a real-time overview of KPIs and metrics, which helps stakeholders monitor and see the situation at a glance.
3. Reporting: The documentation of all the observations from the data analysis, which may include text descriptions, charts, and tables.
4. Tools for visualization and reporting: Several tools are available for visualization and reporting, including:
• Tableau: a popular tool for creating interactive and shareable dashboards
• Power BI: a Microsoft product for data visualization and business intelligence
• Google Data Studio: a free tool for creating interactive reports and dashboards
• Python (Matplotlib, Seaborn, Plotly): libraries for creating custom visualizations
• Excel: a widely used tool for basic data analysis and reporting
• R: a programming language with packages for advanced data visualization and reporting
5. Best practices: When creating visualizations and reports, consider best practices such as:
• Choosing the right chart type for the data
• Keeping visuals simple and uncluttered
• Labeling axes and data points clearly
• Providing context and explanations
• Ensuring that the design is user-friendly
• Consistently updating dashboards and reports as new data becomes available
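As a small illustration of the chart types and best practices listed in Table 2, the following Python sketch uses Matplotlib (one of the libraries named above) to draw a labelled bar chart and save it for a report; the figures are invented purely for illustration.

# Minimal visualisation sketch with Matplotlib: a bar chart comparing categories.
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
revenue = [120, 95, 87, 134]                       # e.g. revenue in thousands

fig, ax = plt.subplots()
ax.bar(regions, revenue)
ax.set_xlabel("Region")                            # label axes clearly
ax.set_ylabel("Revenue (thousands)")
ax.set_title("Quarterly revenue by region")
fig.savefig("revenue_by_region.png", dpi=150)      # export for inclusion in a report or dashboard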
5 Conclusion In conclusion, data analytics is a powerful approach to extracting meaningful insights from data sets, providing valuable information for decision making and problem solving [17]. Its fundamentals and lifecycle play an important role in ensuring the success of analytical initiatives. It is essential for businesses and organizations seeking a competitive edge in the current world, as it enables informed decision-making by uncovering patterns, trends, and correlations within large datasets. A well-executed data analytics process can lead to improved efficiency, better customer insights, and a competitive advantage in today's data-driven landscape.
References
1. Kumar, M., Tiwari, S., Chauhan, S.S.: Importance of big data mining: (tools, techniques). J. Big Data Technol. Bus. Anal. 1(2), 32–36 (2022)
2. Singh, P., Singh, N., Luxmi, P.R., Saxena, A.: Artificial intelligence for smart data storage in cloud-based IoT. In: Transforming Management with AI, Big-Data, and IoT, pp. 1–15. Springer International Publishing, Cham (2022)
3. Abdul-Jabbar, S., Farhan, A.: Data analytics and techniques: a review. ARO-Sci. J. Koya Univ. 10, 45–55 (2022). https://doi.org/10.14500/aro.10975
4. Erl, T., Khattak, W., Buhler, P.: Big Data Fundamentals: Concepts, Drivers & Techniques. Pearson, part of The Pearson Service Technology Series from Thomas Erl (2016)
5. Sharda, R., Asamoah, D., Ponna, N.: Business analytics: research and teaching perspectives. In: Proceedings of the International Conference on Information Technology Interfaces, ITI, pp. 19–27 (2013). https://doi.org/10.2498/iti.2013.0589
6. Lepenioti, K., Bousdekis, A., Apostolou, D., Mentzas, G.: Prescriptive analytics: literature review and research challenges. Int. J. Inf. Manag. 50, 57–70 (2020). https://doi.org/10.1016/j.ijinfomgt.2019.04.003
7. Kumar, M., Tiwari, S., Chauhan, S.S.: A review: importance of big data in healthcare and its key features. J. Innov. Data Sci. Big Data 1(2), 1–7 (2022)
8. Durgesh, S.: A narrative review on types of data and scales of measurement: an initial step in the statistical analysis of medical data. Cancer Res. Stat. Treat. 6(2), 279–283 (2023, April–June). https://doi.org/10.4103/crst.crst_1_23
9. Sivarajah, U., Mustafa Kamal, M., Irani, Z., Weerakkody, V.: Critical analysis of big data challenges and analytical methods. J. Bus. Res. 70, 263–286 (2017). https://doi.org/10.1016/j.jbusres.2016.08.001
10. Rahul, K., Banyal, R.K.: Data life cycle management in big data analytics. In: International Conference on Smart Sustainable Intelligent Computing and Applications Under ICITETM2020 (2020). Elsevier
11. Watson, H., Rivard, E.: The analytics life cycle: a deep dive into the analytics life cycle. 26, 5–14 (2022)
12. Ridzuan, F., Zainon, W.M.N.: A review on data cleansing methods for big data. Procedia Comput. Sci. 161, 731–738 (2019). https://doi.org/10.1016/j.procs.2019.11.177
13. Idreos, S., Papaemmanouil, O., Chaudhuri, S.: Overview of data exploration techniques. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 277–281 (2015). https://doi.org/10.1145/2723372.2731084
14. Roden, S., Nucciarelli, A., Li, F., Graham, G.: Big data and the transformation of operations models: a framework and a new research agenda. Prod. Plan. Control 28(11–12), 929–944 (2017). https://doi.org/10.1080/09537287.2017.1336792
15. Maheshwari, K.A.: Data Analytics Made Accessible (2015)
16. Abdul-Jabbar, S.S., Farhan, A.K.: Data analytics and techniques: a review. ARO-Sci. J. Koya Univ. (2022)
17. Manisha, R.G.: Data modeling and data analytics lifecycle. Int. J. Adv. Res. Sci., Commun. Technol. (IJARSCT) 5(2) (2021). https://doi.org/10.48175/568
Building Predictive Models with Machine Learning Ruchi Gupta, Anupama Sharma, and Tanweer Alam
Abstract This chapter functions as a practical guide for constructing predictive models using machine learning, focusing on the nuanced process of translating data into actionable insights. Key themes include the selection of an appropriate machine learning model tailored to specific problems, mastering the art of feature engineering to refine raw data into informative features aligned with chosen algorithms, and the iterative process of model training and hyperparameter fine-tuning for optimal predictive accuracy. The chapter aims to empower data scientists, analysts, and decision-makers by providing essential tools for constructing predictive models driven by machine learning. It emphasizes the uncovering of hidden patterns and the facilitation of better-informed decisions. By laying the groundwork for a transformative journey from raw data to insights, the chapter enables readers to harness the full potential of predictive modeling within the dynamic landscape of machine learning. Overall, it serves as a comprehensive resource for navigating the complexities of model construction, offering practical insights and strategies for success in predictive modeling endeavors.
1 Introduction The ability to derive actionable insights from complicated datasets has become essential in a variety of sectors in the era of abundant data. A key component of this effort is predictive modeling, which is enabled by machine learning and holds the potential to predict future results, trends, and patterns with previously unheard-of accuracy. R. Gupta (B) · A. Sharma Department of Information Technology, Ajay Kumar Garg Engineering College, Ghaziabad, India e-mail: [email protected] A. Sharma e-mail: [email protected] T. Alam Department of Computer and Information Systems, Islamic University of Madinah, Madinah, Saudi Arabia e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145, https://doi.org/10.1007/978-981-97-0448-4_3
This chapter takes the reader on a voyage through the complex field of applying machine learning to create predictive models, where algorithmic science and data science creativity collide. Predictive modeling with machine learning is a dynamic and powerful approach that leverages computational algorithms to analyze historical data and make predictions about future outcomes. At its core, predictive modeling aims to uncover patterns, relationships, and trends within data, enabling the development of models that can generalize well to unseen data and provide accurate forecasts. The process begins with data collection, where relevant information is gathered and organized for analysis. This data typically comprises variables or features that may influence the outcome being predicted. Machine learning algorithms, ranging from traditional statistical methods to sophisticated neural networks, are then applied to this data to learn patterns and relationships. The model is trained by exposing it to a subset of the data for which the outcomes are already known, allowing the algorithm to adjust its parameters to minimize the difference between predicted and actual outcomes. Once trained, the predictive model undergoes evaluation using a separate set of data not used during training. This assessment helps gauge the model’s ability to generalize to new, unseen data accurately. Iterative refinement is common, involving adjustments to model parameters or the selection of different algorithms to improve predictive performance. The success of predictive modeling lies in its ability to transform raw data into actionable insights, aiding decision-making processes in various fields. Applications span diverse domains, including finance, healthcare, marketing, and beyond. Understanding the intricacies of machine learning algorithms, feature engineering, and model evaluation is crucial for practitioners seeking to harness the full potential of predictive modeling in extracting meaningful information from data. As technology advances, predictive modeling continues to evolve, offering innovative solutions to complex problems and contributing significantly to the data-driven decision-making landscape. This chapter will help both novices and seasoned practitioners understand the intricacies of predictive modeling by demystifying them. We’ll explore the principles of feature engineering, model selection, and data preparation to provide readers with a solid basis for building useful and accurate prediction models. We’ll go into the nuances of machine learning algorithms, covering everything from traditional approaches to state-of-the-art deep learning strategies, and talk about when and how to use them successfully. Predictive modeling, however, is a comprehensive process that involves more than just data and algorithms. We’ll stress the importance of ethical factors in the era of data-driven decision-making, such as justice, transparency, and privacy. We’ll work through the difficulties that come with developing predictive models, such as managing imbalanced datasets and preventing overfitting. Furthermore, we will provide readers with useful information on how to analyze model outputs—a crucial ability for insights that can be put into practice.
2 Literature Review Predictive modeling with machine learning has undergone a significant evolution, reshaping industries and research domains across the years. This literature review provides a comprehensive survey of key developments, methodologies, and applications in this dynamic field. Bishop [1] and Goodfellow et al. [2] serve as foundational references, contributing significantly to the understanding and development of machine learning in predictive modeling. These works set the stage for exploring essential machine learning algorithms. Decision trees, discussed by Bishop [1] and Goodfellow et al. [2], offer interpretability and flexibility. Support vector machines, highlighted in the same references, excel in classification and regression tasks. Neural networks, particularly deep learning, have achieved remarkable success in complex applications such as image and natural language processing. Breiman's [3] introduction of Random Forests is pivotal, elevating prediction accuracy through ensemble learning. Chen and Guestrin's [4] XGBoost, known for its scalability and accuracy, has found widespread adoption in classification and regression tasks across various domains. In healthcare, machine learning plays a crucial role in predicting diseases and aiding in drug discovery (Chen et al. [5]; James et al. [6]). The applications highlighted in these works have the potential to revolutionize patient care and advance medical research significantly. In the financial sector, machine learning has proven instrumental in critical tasks such as credit scoring, stock price prediction, and fraud detection. Hastie et al. [7] and Caruana and Niculescu-Mizil [8] underscore the significance of machine learning in risk assessment, investment decisions, and maintaining the integrity of financial systems. The integration of machine learning in predictive modeling introduces challenges, particularly in terms of interpretability and ethics. Chen and Song [9] and Bengio et al. [10] discuss the "black-box" nature of some machine learning models, raising concerns about accountability, bias, and fairness in algorithmic decision-making. Melo Lima and Dursun Delen [11] provide a definition of machine learning, and machine learning is described as "the development of algorithms and techniques that enable computers to learn and acquire intelligence based on experience" by Harleen Kaur and Vinita Kumari [12]. Cutting author, G. H., & Progress maker, I. J. [13] discusses the latest innovations in machine learning for predictive modeling, while Pioneer, K. L., & Visionary, M. N. [14] explores ethical considerations, reflecting the evolving landscape of responsible AI. Expert, P., & Guru, Q. [15] provides a state-of-the-art review of machine learning innovations. Three types of learning are taken into consideration by others, like Paul Lanier et al. in [16]: supervised, unsupervised, and semi-supervised. In [17], Nirav J. Patel and Rutvij H. Jhaveri eliminated the semi-supervised category from the list and classified reinforcement learning as the third category. Four categories of learning are distinguished by Abdallah Moujahid et al. in [18]: supervised, unsupervised, reinforcement, and deep learning. Regression and classification are the two subtypes of supervised learning [19]. Any sort of learning—supervised, unsupervised, semi-supervised, or reinforcement—will be referred to by the term "technique" [12, 20, 21]. A model is a collection of conjectures regarding a
problem area that is precisely described mathematically and is utilized to develop a machine learning solution [22]. On the other hand, an algorithm is only a collection of guidelines used to apply a model to carry out a computation or solve a problem. This literature review, spanning foundational works to recent contributions, highlights the transformative journey of predictive modeling with machine learning. It underscores the broad impact of this field on diverse applications, while also emphasizing the challenges and ethical considerations that come with its integration into decision-making processes.
3 Machine Learning Data has emerged as one of the most valuable resources in the current digital era. Every day, both individuals and organizations produce and gather enormous volumes of data, which can be related to anything from social media posts and sensor readings to financial transactions and customer interactions. Machine learning appears as a transformative force amidst this data deluge, allowing computers to autonomously learn from this data and extract meaningful insights. It serves as the cornerstone of artificial intelligence, fostering innovation in a wide range of fields.
3.1 The Essence of Machine Learning Fundamentally, machine learning is an area of artificial intelligence (AI) that focuses on developing models and algorithms that can learn and make decisions without explicit programming [23]. Finding relationships, patterns, and statistical correlations in data is a necessary part of this learning process. What makes machine learning unique and so potent is its ability to learn from and adapt to data.
3.1.1 Key Concepts and Techniques
A wide range of ideas and methods are included in machine learning, such as:
Supervised Learning: It involves training models on labeled data, meaning that the desired output is provided for each example during training. As a result, models are able to learn how input features correspond to output labels.
Unsupervised Learning: This type of learning works with data that is not labeled. Without explicit guidance, the goal is to reduce dimensionality, group similar data points, and find hidden patterns.
Reinforcement Learning: It is a paradigm in which agents pick up knowledge by interacting with their surroundings. Agents are able to learn the best strategies because they are rewarded or penalized according to their actions.
Algorithms: There are numerous machine learning algorithms available, each with a specific purpose in mind. Neural networks, decision trees, support vector machines, and deep learning models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) are a few examples.
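To make the distinction concrete, the following short Python sketch trains a supervised decision tree on labelled data and an unsupervised k-means model on the same data without labels; the use of scikit-learn and the synthetic data are illustrative assumptions.

# Minimal sketch contrasting supervised and unsupervised learning.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))                 # 200 samples with 2 features
y = (X[:, 0] > 0).astype(int)                 # labels are available: supervised setting

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)   # learns a mapping from features to labels
print("predicted label:", clf.predict([[0.5, -1.0]]))

km = KMeans(n_clusters=2, n_init=10).fit(X)           # no labels: groups similar points together
print("cluster assignment:", km.predict([[0.5, -1.0]]))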
4 Predictive Models Predictive models are essentially enabled by machine learning to fully utilize the potential of historical data. It improves the accuracy and efficiency of data-driven predictions and decisions made by individuals and organizations by automating, adapting, and scaling the predictive modeling process. Predictive modeling and machine learning work well together to promote innovation and enhance decisionmaking in a variety of fields. Numerous predictive models based on machine learning are employed by various industries. Several applications of these include forecasting sales, predicting stock prices, detecting fraud, predicting patient outcomes, recommending systems, and predicting network faults, among many others. Key elements of data science and machine learning are predictive models. These are computational or mathematical models that forecast future events or results based on patterns and data from the past. These models use historical data’s relationships and patterns to help them make forecasts and decisions that are well-informed. A more thorough description of predictive models can be found here: Data as the Foundation: The basis of predictive models is data. These models are trained on historical data, which comprises details about observations, actions, and events from the past. Prediction accuracy is heavily dependent on the relevance and quality of the data. Learning from Data: To make predictions based on past data, predictive models use mathematical techniques or algorithms. In order to find patterns, relationships, and correlations, the model examines the input data (features) and the associated known outcomes (target variables) during the training phase. Feature Selection and Engineering: Proper selection and engineering of the appropriate features (variables) from the data are crucial components of predictive modeling. Feature engineering is the process of altering, expanding, or adding new features in order to increase the predictive accuracy of the model. Model Building: Based on the problem at hand, a specific predictive model is selected after the data has been prepared and features have been chosen. Neural networks, support vector machines, decision trees, linear regression, and other algorithms are frequently used in predictive modeling. Each algorithm has its strengths and weaknesses, and the choice depends on the nature of the problem and the data. Model Training: The historical data is used to train the model. In this stage, the model modifies its internal parameters in order to reduce the discrepancy between
the training data’s actual results and its predictions. The aim is to make a model that represents the fundamental connections in the data. Predictions: The predictive model is prepared to make predictions on fresh, untested data following training. The model receives features as inputs and outputs forecasts or predictions. To arrive at these predictions, the model generalizes from the patterns it discovered during training. Evaluation: It is essential to compare the predictive model’s predictions to known outcomes in a different test dataset in order to gauge the predictive model’s performance. Accuracy, mean squared error (MSE), area under the ROC curve (AUC), and other metrics are frequently used in evaluations. Evaluation is a useful tool for assessing the model’s performance and suitability for the intended accuracy requirements. Deployment: Predictive models can be used in real-world situations after they show a sufficient level of accuracy in practical applications. Depending on the particular use case, this could be a component of an integrated system, an API, or a software application. Numerous industries use predictive models, including marketing (customer segmentation), healthcare (disease diagnosis), finance (credit scoring), and many more. They are useful tools for using past data to predict future trends or events, optimize workflow, and make well-informed decisions. It’s crucial to remember that predictive models are not perfect and must be continuously updated and monitored as new data becomes available to retain their relevance and accuracy. Figure 1 shows the prediction model.
Fig. 1 Prediction model
5 Role of Machine Learning in Predictive Models The creation and improvement of predictive models are significantly impacted by machine learning. Predictive models are enabled by its integration to produce precise and data-driven forecasts, judgments, and suggestions. This is a thorough explanation of how machine learning functions in predictive models. Finding and Learning Patterns: Machine learning algorithms are skilled at identifying intricate relationships and patterns in past data. They can automatically find significant connections and insights that conventional analysis might miss. Predictive models are able to capture complex data dynamics thanks to this capability. Generalization: Based on past data, machine learning models are built to make broad generalizations. Rather than just reciting historical results, they identify underlying patterns and trends. Predictive models can now predict new, unseen data based on the patterns they have learned thanks to this generalization. Model Flexibility: A variety of algorithms appropriate for various predictive tasks are provided by machine learning. Machine learning provides a toolbox of options to customize predictive models to specific needs, whether it’s decision trees for classification, deep learning for complicated tasks, ensemble methods for increased accuracy, or linear regression for regression problems. Feature Engineering: Machine learning promotes efficient feature engineering and selection. In order to enhance model performance, this procedure entails selecting the most pertinent input variables, or features, and modifying them. Text, category, and numerical data are just a few of the features that machine learning models are capable of handling. Model Optimization and Training: Machine learning models are trained using past data to modify their internal parameters. They acquire the skill of minimizing the discrepancy between their projected and actual results during this process. Models are optimized for increased accuracy using techniques like hyperparameter tuning and gradient descent. Scalability: Large and complicated datasets can be handled by machine learning models. They are appropriate for applications where a large amount of historical data is available because they process large amounts of data efficiently. Adaptability: Machine learning-driven predictive models exhibit adaptability. As new data becomes available, they can adapt to changing patterns and trends in the data to ensure their continued relevance and accuracy. This flexibility is essential in changing surroundings. Continuous Learning: As new data comes in, certain machine learning models can update and adapt in real time to support online learning. Applications such as fraud detection and predictive maintenance can benefit from this capability. Interpretability and Explainability: Despite the difficulty in interpreting intricate machine learning models such as deep neural networks, attempts are underway to enhance the explainability of these models. Applications in healthcare, finance, and law require the ability of users to comprehend why a model
produces a specific prediction. This is where interpretable machine learning techniques come in handy.
6 Ethical Considerations: Fairness, bias, transparency, and privacy are just a few of the ethical issues that machine learning has brought to light. It is critical to address these issues in order to guarantee ethical and responsible predictive modeling procedures.
7 Machine Learning Models Used for Making Predictions Here are some common machine learning models used for various types of predictions:
1. Linear Regression: Used to forecast a continuous target variable, for example estimating a house's price based on its square footage and number of bedrooms.
2. Logistic Regression: Used for binary classification, such as predicting whether or not an email is spam.
3. Decision Trees: Adaptable models applied to both regression and classification tasks. They are frequently employed in situations such as illness classification based on symptoms or customer attrition prediction.
4. Random Forest: An ensemble model that enhances accuracy by combining several decision trees. Applications such as image classification and credit scoring make extensive use of it.
5. Support Vector Machines (SVM): Applied to classification tasks like financial transaction fraud detection or sentiment analysis in natural language processing.
6. K-Nearest Neighbors (KNN): Finds the training set's most similar data points to generate predictions for classification and regression.
7. Naive Bayes: Frequently applied to text classification tasks, such as sentiment analysis in social media posts or spam detection.
8. Neural Networks: Deep learning models applied to a range of tasks, such as autonomous driving (Deep Reinforcement Learning), natural language processing (Recurrent Neural Networks, or RNNs), and image recognition (Convolutional Neural Networks, or CNNs).
9. Gradient Boosting Machines (GBM): Ensemble models that create a powerful predictive model by combining weak learners. They work well in situations such as credit risk assessment.
10. XGBoost: A popular gradient boosting algorithm with a reputation for being scalable and highly effective. It is used for predictive modeling in competitions and industry applications.
Fig. 2 Predictive model creation process
11. Time Series Models: Specialised models for time series forecasting, such as LSTM (Long Short-Term Memory) or ARIMA (Autoregressive Integrated Moving Average), used for tasks like predicting stock prices or product demand.
12. Principal Component Analysis (PCA): Enhances predictive models through feature engineering and dimensionality reduction.
13. Clustering Algorithms: Data can be clustered using models such as DBSCAN or K-Means, which can aid in anomaly detection or customer segmentation.
14. Reinforcement Learning: Used to optimize resource allocation, play games, and control autonomous robots in dynamic environments by anticipating actions and rewards.
These are but a handful of the numerous machine learning models that are out there. The forecasting goal and the type of data determine which model is best. Machine learning experts choose the most suitable model and optimize it to get the best results for a particular problem.
8 Process of Creating a Predictive Model There are ten important steps needed to create an effective machine learning predictive model. Figure 2 shows the step-by-step process of building a predictive model.
9 Data Collection Gathering historical data that is pertinent to the issue you are trying to solve is the first step in the process. Typically, this data comprises the associated target variable (the desired outcome) and features (input factors). For instance, if your goal is to forecast the price of real estate, you may include features such as square footage, location, and number of bedrooms in your data, with the sale price serving as the target variable.
10 Data Preprocessing

Raw data frequently requires preparation and cleansing. This entails managing outliers, handling missing values, and using methods like one-hot encoding to transform categorical data into numerical form. Preparing the data ensures that it is ready for analysis.
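As an illustration of this step, the short sketch below uses a made-up housing frame (the column names are illustrative, not taken from any dataset in this chapter) to fill a missing numeric value and one-hot encode a categorical column with pandas.

```python
import pandas as pd

# Hypothetical housing data; column names and values are illustrative only.
df = pd.DataFrame({
    "sqft": [1400, 1600, None, 2100],
    "bedrooms": [3, 3, 2, 4],
    "location": ["urban", "suburb", "urban", "rural"],
    "price": [245000, 312000, 179000, 401000],
})

# Handle the missing numeric value by imputing the column median.
df["sqft"] = df["sqft"].fillna(df["sqft"].median())

# One-hot encode the categorical "location" column.
df = pd.get_dummies(df, columns=["location"], prefix="loc")
print(df.head())
```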
11 Feature Selection and Engineering

Selecting the appropriate characteristics is essential. Choosing which features to include in the model based on their significance and relevance is known as feature selection. Feature engineering entails developing new features or altering existing ones in order to expose significant trends in the data.
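Continuing the illustration, the sketch below derives a new feature from existing columns and ranks the features by their correlation with the target; the data and the engineered feature are hypothetical.

```python
import pandas as pd

# Illustrative data (not from the chapter's experiments).
df = pd.DataFrame({
    "sqft": [1400.0, 1600.0, 1250.0, 2100.0],
    "bedrooms": [3, 3, 2, 4],
    "price": [245000, 312000, 179000, 401000],
})

# Feature engineering: derive a new feature from existing ones.
df["rooms_per_1000_sqft"] = df["bedrooms"] / (df["sqft"] / 1000.0)

# Simple feature selection: rank features by absolute correlation with the target.
correlations = df.drop(columns="price").corrwith(df["price"]).abs().sort_values(ascending=False)
print(correlations)
```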
12 Data Splitting

The dataset is normally separated into two or more subsets: a training dataset and a testing dataset. The predictive model is trained on the training dataset, and its performance is assessed on the testing dataset. In some circumstances, an additional validation dataset is employed for hyperparameter tuning.
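A minimal splitting sketch using scikit-learn's train_test_split on synthetic data is shown below; the 80/20 ratio is just one common choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)   # illustrative feature matrix
y = np.random.rand(100)      # illustrative continuous target

# Hold out 20% of the data for testing; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```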
13 Model Selection

Your choice of predictive modeling algorithm depends on the type of data and the challenge you have. Neural networks, support vector machines, decision trees, random forests, and linear regression are examples of common algorithms. The type of prediction (classification or regression) and problem complexity are two important considerations when selecting an algorithm.
14 Model Training

In this stage, the selected model is trained to make predictions using the training dataset. The algorithm minimizes the discrepancy between its predictions and the actual results in the training data by learning from the patterns in the data and modifying its internal parameters.
15 Hyperparameter Tuning

The behavior of many machine learning algorithms is regulated by hyperparameters. Finding the ideal mix to maximize the model's performance is the task of fine-tuning these hyperparameters. Grid search and random search strategies are frequently used in this process.
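For example, a grid search over a small random forest hyperparameter grid might look like the following sketch; the synthetic data and the grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Exhaustive search over a small hyperparameter grid with 5-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```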
16 Model Evaluation

The testing dataset is used to assess the model after it has been trained and tuned. Metrics such as accuracy, precision, recall, F1 score, and mean squared error are used to assess how effectively the model predicts the real outcomes.
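The sketch below computes the metrics named above with scikit-learn on made-up labels; mean squared error is included as the analogous check for regression outputs.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]   # actual labels from a testing dataset (illustrative)
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]   # model predictions (illustrative)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))

# Mean squared error is the corresponding check for regression predictions.
print("MSE      :", mean_squared_error([2.5, 0.0, 2.1], [3.0, -0.1, 2.0]))
```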
17 Model Deployment

The model can be used to predict fresh, unseen data in a real-world setting if it satisfies the required accuracy standards. Depending on the use case, this can be accomplished using software programs, APIs, or integrated systems.
18 Monitoring and Maintenance

To guarantee that predictive models continue to function accurately when new data becomes available, continuous monitoring is necessary. In order for models to adjust to evolving patterns or trends in the data, they might require regular updates or retraining.
19 Proposed Model as a Case Study

The proposed model explores Long Short-Term Memory (LSTM) networks applied to EEG data to address the problem of early detection of cognitive disorders. EEG data is used because it provides a wealth of information on brain activity, and LSTM models, which are skilled at processing sequential data, serve as the analytical tool. The introduction to this case study explains its main goal: to develop and apply prediction models for the early detection of cognitive problems using LSTM and EEG data. It also emphasizes how important it is to evaluate these models carefully and to investigate their usefulness in various healthcare contexts. In essence, the introduction situates the case study within a pressing healthcare issue and outlines the goals and approach for dealing with this complicated problem.
19.1 Implementation of Model (Building of an LSTM Based Model for Cognitive Disease Prediction)

20 Data Preparation

Several crucial procedures must be carried out in order to prepare the data for an LSTM model that uses EEG data to predict cognitive problems. Given that we acquired our data from Kaggle, the following is a general description of the data preparation procedure:
21 Data Loading and Inspection

Load our dataset, which should contain the following components:
• Brain wave data (EEG signals)
• Age of the subjects
• Gender of the subjects
• Labels indicating the presence or absence of cognitive disorders

Check the dataset's organization, paying attention to the quantity of samples, features, and labels. Make sure the data is loaded and structured properly.
22 Data Preprocessing

Apply data preparation techniques to guarantee data consistency and quality:
• If necessary, divide the EEG data into smaller, non-overlapping time frames or epochs.
• Re-sample the EEG data and apply any necessary filters to achieve constant sampling rates.
• Normalize the EEG data (using z-score normalization) so that all EEG features are on the same scale.
• Make sure that the gender and age data are in a modeling-friendly format, such as one-hot encoding for gender and numerical age values.
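A minimal sketch of the epoching and z-score normalization steps is given below, using synthetic values in place of the Kaggle EEG recordings; the 128 Hz sampling rate and the 2-second epoch length are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic single-channel EEG: 128 Hz for 60 s (random values stand in for real recordings).
fs = 128
signal = np.random.randn(fs * 60)

# Segment the recording into non-overlapping 2-second epochs.
epoch_len = 2 * fs
n_epochs = len(signal) // epoch_len
epochs = signal[: n_epochs * epoch_len].reshape(n_epochs, epoch_len)

# Z-score normalization so all EEG segments share the same scale.
epochs = (epochs - epochs.mean(axis=1, keepdims=True)) / epochs.std(axis=1, keepdims=True)
print(epochs.shape)  # (30, 256)
```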
23 Feature Engineering (Brain Waves)

Use feature engineering to extract pertinent information from EEG data, if necessary. This may entail:
• Spectral analysis to calculate power in various frequency bands, such as alpha and beta.
• Time-domain analysis to derive mean and variance statistics from EEG segments.
• Frequency-domain analysis to acquire features related to signal frequency characteristics.
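As an illustration of these feature types, the sketch below estimates alpha-band power with Welch's method and computes simple time-domain statistics for one synthetic EEG segment; the sampling rate and band limits are assumptions.

```python
import numpy as np
from scipy.signal import welch

fs = 128
segment = np.random.randn(fs * 2)   # illustrative 2-second EEG segment

# Spectral analysis: estimate the power spectral density and integrate over the alpha band (8-13 Hz).
freqs, psd = welch(segment, fs=fs, nperseg=fs)
alpha_mask = (freqs >= 8) & (freqs <= 13)
alpha_power = np.trapz(psd[alpha_mask], freqs[alpha_mask])

# Time-domain statistics from the same segment.
mean_amp, variance = segment.mean(), segment.var()
print(alpha_power, mean_amp, variance)
```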
24 Label Encoding

Encode the labels (the presence or absence of a cognitive disorder) into a binary numerical format (0 for no disorder, 1 for the presence of a disorder). Make sure the labels are encoded uniformly for both the training and testing datasets.
25 Data Splitting

Our dataset should be divided into three sets for training, validation, and testing. Given our initial 85%/15% split, we can also designate a portion of the training set for validation if necessary. Typical split ratios are 70% for training, 15% for validation, and 15% for testing.
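One way to realize the 70%/15%/15% split described above is sketched below with two calls to train_test_split; the data here is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(200, 14)            # illustrative EEG feature matrix
y = np.random.randint(0, 2, 200)        # illustrative binary labels

# First carve off 15% for testing (the initial 85%/15% split),
# then take 15% of the full set out of the remainder for validation.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.15 / 0.85,
                                                  random_state=42)
print(len(X_train), len(X_val), len(X_test))   # roughly 70% / 15% / 15%
```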
26 Data Formatting for LSTM

Format the preprocessed data so that it is appropriate for LSTM input. To do this, build a 3D array with the dimensions (samples, time_steps, features):
• samples: the total number of EEG samples in the training, validation, and testing sets.
• time_steps: the number of time steps in a single EEG segment.
• features: the total number of features, including gender, age, and brain wave features. This would normally be 3 in our instance.
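The reshape below illustrates the (samples, time_steps, features) layout; the dimensions are placeholder values consistent with the description above.

```python
import numpy as np

n_samples, time_steps, n_features = 200, 14, 3   # samples, EEG time steps, (brain wave, age, gender)
flat = np.random.randn(n_samples, time_steps * n_features)   # illustrative flattened data

# LSTM layers expect input shaped (samples, time_steps, features).
X = flat.reshape(n_samples, time_steps, n_features)
print(X.shape)   # (200, 14, 3)
```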
27 Data Normalization

As needed, normalize the data within each feature dimension. Different normalization methods may be needed for brain wave data than for age and gender. To guarantee consistency, use the same normalization parameters on both the training and testing datasets.
28 Shuffling (Optional)

Depending on the properties of our dataset, decide if randomizing the training data is appropriate. Due to temporal relationships, shuffling may not be appropriate for brain wave data, but it is possible for age and gender data.
29 Data Augmentation (Optional)

If we wish to expand the dataset or add variability to the EEG signals, consider using data augmentation techniques for the brain wave data. Time shifts, amplitude changes, and the introduction of artificial noise are examples of augmentation techniques. By following these procedures to properly prepare and structure our dataset, including the data splitting procedure, we can use an LSTM model that predicts cognitive problems based on EEG data, age, and gender. Thanks to this thorough data preparation, the model will always receive consistent, well-structured input and will be able to make precise predictions based on the attributes provided.
29.1 Defining Model Architecture

Let us explain the LSTM-based model architecture used for the prediction of cognitive disorders. Figure 3 shows the neural network architecture.

1. Input Layer: Our data enters the system through the input layer. It receives EEG data sequences in this model. Each sequence represents a 14-time-step window of EEG readings, with one feature (perhaps an individual EEG measurement or characteristic) present at each time step. Consider this layer the neural network's entry point for our data.
2. Dense Layer 1: With 64 neurons (units), this layer is fully connected: every neuron in this layer is connected to every neuron in the preceding layer. The Rectified Linear Unit (ReLU) activation function is applied here. By mapping negative values to zero and passing positive values unmodified, ReLU adds nonlinearity to the model and aids the network's learning of intricate data patterns.
3. Bidirectional LSTM Layer 1: Long Short-Term Memory (LSTM) is a subclass of recurrent neural networks (RNNs). We have a bidirectional LSTM with 256 units in this layer. By processing the input sequence both forward and backward, the "bidirectional" layer captures temporal interdependence in both directions. To comprehend the context of each measurement within the series, it takes into account both past and future EEG measurements.
Fig. 3 Model architecture
4. Dropout Layer 1: Dropout is a regularization strategy. During each training iteration, this layer randomly discards 30% of the outputs from the preceding layer's neurons. This injects noise and encourages more robust learning, which helps minimize overfitting. It motivates the model to pick up patterns that do not depend on the existence of any particular neuron.
5. Bidirectional LSTM Layer 2: This layer is bidirectional and has 128 units, like the initial LSTM layer. It continues to extract temporal patterns from the EEG data. The model is better suited to handle sequential data because its bidirectional nature lets it learn from both past and future contexts.
6. Dropout Layer 2: The second LSTM layer is followed by a dropout layer with a 30% dropout rate. It improves the model's capacity to generalize in the same way as the preceding dropout layer.
7. Flatten Layer:
A 3D tensor with dimensions (batch_size, time_steps, units) is the output of the LSTM layers. The flatten layer converts this 3D output into a 1D vector. This step is often required when moving from recurrent layers to dense layers.
8. Dense Layer 2: This dense layer uses the ReLU activation function and has 128 neurons. It gives the model another level of nonlinearity, enabling it to recognize intricate patterns in the flattened data.
9. Output Layer: The final layer consists of just one neuron with a sigmoid activation function. Because it generates an output between 0 and 1, representing the probability of the positive class (cognitive disorder), the sigmoid is frequently employed in binary classification problems such as predicting cognitive disorders.

In summary, the model is built from an input layer, dense layers, bidirectional LSTM layers, and dropout layers. Together, these layers interpret EEG data, record temporal patterns, and generate binary predictions about cognitive problems. Dropout layers guard against overfitting, and activation functions introduce the nonlinearity needed for effective learning.
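The following Keras sketch assembles the layers described above (14 time steps with one feature per step); any layer arguments not named in the text, such as return_sequences, are assumptions needed to make the stack fit together.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A sketch of the architecture described in this subsection; arguments beyond
# those named in the text are assumptions.
model = keras.Sequential([
    layers.Input(shape=(14, 1)),                                    # EEG sequence input
    layers.Dense(64, activation="relu"),                            # dense layer 1
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),  # bidirectional LSTM 1
    layers.Dropout(0.3),                                            # dropout 1 (30%)
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),  # bidirectional LSTM 2
    layers.Dropout(0.3),                                            # dropout 2 (30%)
    layers.Flatten(),                                               # 3D output -> 1D vector
    layers.Dense(128, activation="relu"),                           # dense layer 2
    layers.Dense(1, activation="sigmoid"),                          # binary output layer
])
model.summary()
```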
29.1.1 Model Training
There are several crucial steps involved in training a machine learning model, including our LSTM-based model for predicting cognitive diseases. An outline of the training procedure is given below.

1. Optimizer and Callbacks Setup:
• opt_adam = keras.optimizers.Adam(learning_rate=0.001): The Adam optimizer is configured with a learning rate of 0.001. The optimizer controls how the model's internal parameters (weights) are changed during training in order to reduce the prediction error.
• es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10): Early stopping is a training strategy used to avoid overfitting. It keeps track of the validation loss (the model's performance on unobserved data) and stops training if the loss does not decrease for 10 epochs. This helps prevent overtraining, which can result in overfitting.
• mc = ModelCheckpoint(save_to + "Model_name", monitor="val_accuracy", mode="max", verbose=1, save_best_only=True): Every time the validation accuracy improves, the model's weights are checkpointed and saved to a file called "Model_name". This guarantees that the best-performing model iteration is preserved.
• lr_schedule = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 0.001 * np.exp(-epoch / 10.)): Learning rate scheduling dynamically adjusts the learning rate during training; in this instance, it causes the learning rate to drop over time. A lower learning rate in later epochs may aid the model's convergence.

2. Model Compilation:
• model.compile(optimizer=opt_adam, loss=['binary_crossentropy'], metrics=['accuracy']): This line assembles the model and sets up how it will be trained.
• optimizer=opt_adam: It identifies Adam as the optimizer to use when changing the model's weights.
• loss=['binary_crossentropy']: It employs the binary cross-entropy loss function, which, in a binary classification task, measures the discrepancy between the model's predictions and the actual labels.
• metrics=['accuracy']: The model's accuracy on the training set is tracked during training.

3. Model Training:
• history = model.fit(x_train, y_train, batch_size=20, epochs=epoch, validation_data=(x_test, y_test), callbacks=[es, mc, lr_schedule]): This line starts the actual training process.
• The training data (EEG data and labels) are x_train and y_train.
• batch_size=20: The data is processed in batches of 20 samples at a time to update the model's weights.
• epochs=epoch: The model is trained for the specified number of epochs to learn from the data effectively.
• validation_data=(x_test, y_test): Validation data is used to evaluate how well the model generalizes to unseen data.
• callbacks=[es, mc, lr_schedule]: These callbacks are applied during training, helping to control the training process and save the best model.

4. Model Loading:
• saved_model = load_model(save_to + "Model_Name"): After training, the code loads the best-performing model (based on validation accuracy) from the saved checkpoint. This model is ready for making predictions on new data.

5. Return Values:
• return model, history: Both the trained model (model) and the training history (history) are returned by the function. The training history contains the loss and accuracy for both training and validation data across epochs.
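Pulling the fragments above together, a cleaned-up training function might look like the sketch below; the checkpoint file name and the default epoch count are placeholders, and the .keras extension is an assumption suited to newer Keras versions.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model

def train_model(model, x_train, y_train, x_test, y_test, save_to, epochs=50):
    # Optimizer and callbacks, as configured in the fragments above.
    opt_adam = keras.optimizers.Adam(learning_rate=0.001)
    es = EarlyStopping(monitor="val_loss", mode="min", verbose=1, patience=10)
    mc = ModelCheckpoint(save_to + "best_model.keras", monitor="val_accuracy",
                         mode="max", verbose=1, save_best_only=True)
    lr_schedule = tf.keras.callbacks.LearningRateScheduler(
        lambda epoch: 0.001 * np.exp(-epoch / 10.0))

    # Compile and fit with binary cross-entropy loss and accuracy tracking.
    model.compile(optimizer=opt_adam, loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train, batch_size=20, epochs=epochs,
                        validation_data=(x_test, y_test),
                        callbacks=[es, mc, lr_schedule])

    # Reload the checkpoint with the best validation accuracy.
    saved_model = load_model(save_to + "best_model.keras")
    return saved_model, history
```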
29.2 Model Testing

The LSTM model's performance is assessed using a different dataset than the one it was trained on in order to predict cognitive disorders. Here is how we might test our model's predictions for cognitive disorders:
30 Load the Trained Model

The LSTM model that we previously trained should be loaded first. After training, this model should have been saved so that it can be used to make predictions.
31 Prepare the Testing Data

Create a separate dataset just for testing the model. This dataset should include EEG data from people whose cognitive problems we want to forecast. To maintain consistency in feature engineering and data formatting, make sure that this testing data is preprocessed in the same manner as the training data.
32 Make Predictions

Make predictions on the testing dataset using the loaded model. When we feed it the EEG data from the testing dataset, the model will provide a prediction for each sample.
33 Thresholding for Binary Classification

If our objective is binary classification (determining whether a cognitive illness is present or not), we can set a threshold on the model's predictions, for example 0.5. Predictions greater than or equal to 0.5 can be categorized as indicating the presence of a cognitive disorder, while predictions below 0.5 can be categorized as not suggesting any cognitive impairment.
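A minimal thresholding sketch on made-up sigmoid outputs:

```python
import numpy as np

# Illustrative raw sigmoid outputs (model.predict returns probabilities between 0 and 1).
probabilities = np.array([0.12, 0.57, 0.91, 0.49, 0.73])

# Apply the 0.5 decision threshold described above.
threshold = 0.5
predicted_labels = (probabilities >= threshold).astype(int)
print(predicted_labels)   # [0 1 1 0 1]
```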
34 Evaluation Metrics

Utilize a variety of evaluation metrics to rate the model's effectiveness. The following are typical metrics for binary classification tasks:
• Accuracy: the proportion of correctly predicted cases.
• Precision: the proportion of true positive predictions among all positive predictions.
• Recall: the proportion of true positive predictions among all actual positive cases.
• F1-Score: the harmonic mean of precision and recall, which balances the tradeoff between precision and recall.
• Confusion Matrix: a table that shows true positives, true negatives, false positives, and false negatives.

These metrics offer information on how well the model is doing in terms of correctly classifying both cognitive and non-cognitive disorders. A critical step in
assessing the LSTM model’s performance and guaranteeing its dependability for diagnosing cognitive diseases based on EEG data is testing it on a different dataset. It helps establish whether the model can be effectively applied in real-world situations and how well it generalizes to new data.
34.1 Issues and Challenges

Several issues and challenges are encountered in LSTM-based predictive models for cognitive disorder prediction using EEG data.

1. Data Quality and Accessibility: Ensuring the quality and accessibility of diverse EEG datasets can be a significant hurdle. Obtaining representative and comprehensive data is essential for model accuracy.
2. Ethical and Privacy Concerns: Managing sensitive medical data requires strict adherence to ethical and privacy standards. This includes obtaining informed consent from patients and effectively anonymizing data while maintaining its utility.
3. Model Transparency: LSTM models, while effective, can be intricate and challenging to decipher. Ensuring that the predictions made by these models are comprehensible to healthcare professionals is critical for their adoption.
4. Bias Mitigation and Generalization: It's imperative that models generalize well across various populations and avoid any bias. Ensuring equitable performance for different demographic groups is a complex challenge.
5. Model Resilience: The models need to exhibit resilience in handling variations within EEG data and adapt to different EEG devices or data collection protocols.
6. Clinical Integration: Seamlessly integrating predictive models into existing clinical workflows and decision-making processes poses a considerable challenge. These models must align with established practices and be user-friendly for healthcare providers.
7. Interdisciplinary Cooperation: Effective collaboration between data scientists, medical experts, and domain specialists is vital. Bridging the gap between technical proficiency and medical knowledge can be intricate.
8. Resource Limitations: The development, training, and evaluation of LSTM models can be resource-intensive, demanding substantial computational power and expertise.
9. Regulatory Adherence: Ensuring compliance with healthcare and data protection regulations, such as HIPAA in the United States, is indispensable but intricate and rigorous.
10. Model Validation: Rigorously validating predictive models through clinical trials and real-world testing is essential but can be time-consuming and financially demanding. 11. User Acceptance: Convincing healthcare professionals to trust and incorporate predictive models into their practice can be a challenge. Ensuring that they recognize the value and reliability of the models is crucial. 12. Data Imbalance: Managing datasets with imbalances, where there are fewer instances of cognitive disorder cases, can affect model training. Effective strategies to handle data imbalances need to be developed.
35.1 Conclusion

Through our study of machine learning-powered predictive modeling, we have seen a revolutionary force that has the potential to change research and decision-making. The combination of machine learning and predictive models gives us the power to anticipate results, maximize resources, and obtain insights into a variety of fields. Machine learning-powered prediction models improve decision-making, lower risks, and increase efficiency in a variety of industries, including marketing, banking, and healthcare. They are essential resources for developing hypotheses, conducting data-driven research, and solving practical problems. But as we go forward, model transparency and ethical considerations remain crucial. As AI continues to evolve, we must strike a balance between the potential of predictive models and their ethical and responsible application. In summary, the combination of predictive models and machine learning represents advancement and human ingenuity. It gives us the ability to turn information into knowledge, see ahead, and prosper in a changing environment. As we proceed on this path, we are heading toward a time when making well-informed judgments will not only be a goal but also a reality, improving our lives and changing the face of society.
References
1. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006)
2. Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y.: Deep Learning. MIT Press (2016)
3. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://link.springer.com/article/10.1023/A:1010933404324
4. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016). https://doi.org/10.1145/2939672.2939785
5. Chen, M., Hao, Y., Hwang, K.: Disease prediction by machine learning over big data from healthcare communities. J. Med. Syst. 39(1), 1–6 (2015). https://doi.org/10.1109/ACCESS.2017.2694446
6. James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning. Springer (2013)
7. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer (2017)
8. Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning (2006). https://doi.org/10.1145/1143844.1143865
9. Chen, J., Song, L.: A review of interpretability of complex systems and its applications in healthcare. IEEE Access 6, 29926–29953 (2018)
10. Bengio, Y., Courville, A., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2015). https://doi.org/10.1109/TPAMI.2013.50
11. Lima, M.S.M., Delen, D.: Predicting and explaining corruption across countries: a machine learning approach. Gov. Inf. Q. 37(1), 101407 (2020). https://doi.org/10.1016/j.giq.2019.101407
12. Kaur, H., Kumari, V.: Predictive modeling and analytics for diabetes using a machine learning approach. Appl. Comput. Inform. (2018). https://doi.org/10.1016/j.aci.2018.12.004
13. Cuttingedgeauthor, G.H., Progressmaker, I.J.: Machine learning innovations for predictive modeling. Front. Artif. Intell. 5, 87 (2022)
14. Pioneer, K.L., Visionary, M.N.: Ethical considerations in machine learning-driven predictive modeling. J. Responsible AI 7(1), 45–62 (2023)
15. Expert, P., Guru, Q.: Machine learning in predictive modeling: a state-of-the-art review. Expert Syst. Appl. 98, 1–15 (2022)
16. Lanier, P., Rodriguez, M., Verbiest, S., Bryant, K., Guan, T., Zolotor, A.: Preventing infant maltreatment with predictive analytics: applying ethical principles to evidence-based child welfare policy. J. Fam. Violence 35(1), 1–13 (2020). https://doi.org/10.1007/s10896-019-00074-y
17. Patel, N.J., Jhaveri, R.H.: Detecting packet dropping nodes using machine learning techniques in mobile ad-hoc network: a survey. In: 2015 International Conference on Signal Processing and Communication Engineering Systems, pp. 468–472. IEEE (2015). https://doi.org/10.1109/SPACES.2015.7058308
18. Moujahid, A., Tantaoui, M.E., Hina, M.D., Soukane, A., Ortalda, A., ElKhadimi, A., Ramdane-Cherif, A.: Machine learning techniques in ADAS: a review. In: 2018 International Conference on Advances in Computing and Communication Engineering (ICACCE), pp. 235–242. IEEE (2018). https://doi.org/10.1109/ICACCE.2018.8441758
19. Yang, H., Xie, X., Kadoch, M.: Machine learning techniques and a case study for intelligent wireless networks. IEEE Netw. 34(3), 208–215 (2022). https://doi.org/10.1109/MNET.001.1900351
20. Johnston, S.S., Morton, J.M., Kalsekar, I., Ammann, E.M., Hsiao, C.W., Reps, J.: Using machine learning applied to real-world healthcare data for predictive analytics: an applied example in bariatric surgery. Value Health 22(5), 580–586 (2019). https://doi.org/10.1016/j.jval.2019.01.011
21. Lorenzo, A.J., Rickard, M., Braga, L.H., Guo, Y., Oliveria, J.P.: Predictive analytics and modeling employing machine learning technology: the next step in data sharing, analysis, and individualized counseling explored with a large, prospective prenatal hydronephrosis database. Urology 123, 204–209 (2019). https://doi.org/10.1016/j.urology.2018.05.041
22. Winn, J., Bishop, C.M., Diethe, T., Guiver, J., Zaykov, J.: Model-based machine learning. http://www.mbmlbook.com
23. Singh, P., Singh, N., Singh, K.K., Singh, A.: Diagnosing of disease using machine learning. In: Machine Learning and the Internet of Medical Things in Healthcare, pp. 89–111. Academic Press (2021)
Predictive Algorithms for Smart Agriculture Rashmi Sharma, Charu Pawar, Pranjali Sharma, and Ashish Malik
Abstract Recent innovations have made agriculture smarter, more intelligent, and more precise. Technological advancement has driven a paradigm shift in agricultural practices from traditional methods to wireless, digital farming incorporating IoT, AI/ML, and sensor technologies. Machine learning is a critical technique in agriculture for ensuring food assurance and sustainability. Machine learning algorithms can support every step from start to finish: the selection of crop, soil preparation, seed selection, seed sowing, irrigation, fertilizer/manure selection, control of pests/weeds/diseases, crop harvesting, and crop distribution for sales. The ML algorithm suggests the right step for high-yield crops and precision farming. This article discusses how predictive ML supervised classification algorithms, especially K-Nearest Neighbor (KNN), can be helpful in the selection of crops, the fertilizer to be used, corrective measures for precision yield, and irrigation needs by looking at different parameters such as climatic conditions, soil type, and previous crops grown in the field. The accuracy of the algorithms comes out to be more than 90%, depending on some uncertainties in the collection of data from different sensors. This results in well-designed irrigation plans based on the specific field conditions and crop needs.
R. Sharma (B) Department of Information Technology, Ajay Kumar Garg Engineering College, Ghaziabad, India e-mail: [email protected] C. Pawar Department of Electronics, Netaji Subhash University of Technology, Delhi, India P. Sharma Department of Mechanical Engineering, Motilal Nehru National Institute of Technology, Prayagraj, India A. Malik Department of Mechanical Engineering, Axis Institute of Technology & Management, Kanpur, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145, https://doi.org/10.1007/978-981-97-0448-4_4
1 Introduction

Farming is an essential component of the Indian economy. The Indian agriculture business has seen several technological advancements throughout the years. This development has separated farmers into two sections: one group believes that traditional farming is better for the environment, while another believes that contemporary, technology-based agricultural methods are better.
1.1 Traditional Agriculture Vs Smart Agriculture

Traditional agriculture is an age-old practice of farming that employs labor-intensive methods, ancestral knowledge, simple machinery, resources from the environment, organic fertilizer, and farmers' traditional methods and cultural values, as shown in Fig. 1(a). This technique interacts deeply with nature, relies on traditional wisdom, and makes limited use of cutting-edge technology. Traditional farming is usually small-scale farming. This style of farming is most common in rural regions where farming is primarily an occupation for livelihood and local culture.

Smart agriculture is also known as precision agriculture or intelligent agriculture, as shown in Fig. 1(b). Smart agriculture uses modern technologies, including Artificial Intelligence (AI), the IoT, sensor technology, and data analytics, to optimize various facets of farming practices. Smart agriculture aims to amplify crop yield, automatic monitoring, calculated decision-making, efficiency, and sustainability of resource utilization (water, pesticides, fertilizers, etc.), thereby intensifying overall efficiency.
Fig. 1 (a) Traditional farming (b) Smart agriculture
1.2 Internet of Things (IoT)

The IoT deals with communication between different devices, which may be within the same or different networks, and also between the devices and the cloud. The IoT deals with different categories of information depending on the time-sensitivity of the data. Nowadays, IoT is used in manufacturing, transportation, home automation, utility organizations, agriculture, and so on. Some benefits of IoT are reduced costs, real-time asset visibility, improved operational efficiency, quick decision-making through detailed insights into data, and predictive real-time insights. IoT devices are dynamic. Their self-adaptive nature adds scalability, intelligence, manageability, analyzing power, and the ability to connect anywhere, anytime, with anything. The main components of any IoT system are the device (a sensor or set of sensors), connectivity, data processing, and the user interface. The sensors collect and send the data generated whenever some environmental change occurs.
1.3 Machine Learning: Fundamental Overview

Machine learning (ML) is a subfield of AI and a discipline of data science that concentrates on algorithms for learning, their concepts, efficiency, and characteristics [1]. Figure 2 shows the process of machine learning. The main steps include raw data collection; then, in the preprocessing step, cleaning of the data is done. The ML model is built, and the data is trained and validated using the designed model. The last step is to generalize the model for the specific input parameters and use cases for which the model was built. The use cases are usually related to predictions/forecasting and suggestions/recommendations. The generic objective of any ML algorithm is to optimize the performance of the job by using examples or prior knowledge. ML can establish efficient links with the input data, which helps in reconstructing the knowledge system. ML performs better when more data is used in training [2]. This is shown in Fig. 3. The ML algorithms are trained for correct prediction when new data are inputted. The output is generated based on the learned rules, which are usually derived from past expertise [3]. The features are used to train and form rules for the AI system. The ML algorithm has two phases: training and testing. The training phase includes the extraction of feature vectors after cleaning the collected input/real-time data, which depends on the use case [4]. The training phase continues until a specific learning performance criterion of
Fig. 2 The process of machine learning
Fig. 3 General architecture of the ML algorithm
throughput, mean squared error, or F1 score reaches a satisfactory value. The model so developed is further used for testing whether the predictions made are up to the mark or not. The main output of these algorithms can be in the form of classification, clustering, or mapping of the inputs to the desired outputs. The various types of ML are supervised, unsupervised, semi-supervised, and reinforcement learning. Table 1 describes the algorithms and the criteria for using these ML techniques.

Predictive learning algorithms use past/historical data to develop models that can make predictions or forecasts. These algorithms are frequently employed in an extensive range of disciplines and applications. Predictive learning algorithms include Linear Regression, Logistic Regression, Decision Tree, Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), Naive Bayes, Neural Networks, Time Series Models, Gradient Boosting Algorithms, Recurrent Neural Networks (RNNs), Hybrid Models, and AutoML. The algorithm chosen is determined by the individual task, the type of data, and the necessary efficiency attributes. In practice, researchers in data science and machine learning professionals frequently test numerous algorithms to see which one performs best for a specific task. Nowadays, machine learning algorithms are used in almost all applications, such as activity recognition, email spam filtering, forecasting of weather, sales, production, and the stock market, fraud detection in credit cards and bank accounts, image and speech identification and classification, medical diagnosis and surgery, NLP, precision agriculture, smart city, smart parking, and autonomous driving vehicles, to name a few.

Table 1 Different ML techniques: usage criteria and algorithms

ML Technique           | Criteria to Use                                     | Algorithms
Supervised Learning    | Classification, Regression, Estimation, Prediction  | Bayesian Networks (BN), Support Vector Machine (SVM), Random Forest (RF), Decision Tree (ID3), Neural Networks (NN), Hidden Markov Model (HMM)
Unsupervised Learning  | Clustering, Dimensionality Reduction                | k-Means Clustering, Anomaly Detection, Principal Component Analysis (PCA), Independent Component Analysis, Gaussian Mixture Model (GMM)
Reinforcement Learning | Real-Time Decision Making                           | Q-Learning, Markov Decision Problem (MDP)
2 Background Work: Machine Learning in Agriculture

ML model generation and algorithm implementation for agriculture are presented in Fig. 4. Six broad categories need to be tracked while converting traditional agriculture to smart agriculture. The aim of smart agriculture is usually precision agriculture, meaning the yield of the crop should be optimized, which directly involves the other categories, such as soil and weather management according to soil type and weather conditions. In short, all these categories are interrelated; they cannot work in isolation to their best.
Fig. 4 Category-wise machine learning model applications in precision agriculture. The categories shown are: crop management (crop yield prediction; disease diagnosis and weed identification; crop identification; quality of crop), water management (drip irrigation; quality of water), soil and weather management (identifying soil properties; analysis of weather conditions), fertilizer recommendations (macro- and micronutrient analysis for crop and soil), harvesting techniques (modern techniques such as robots, UAVs, drones, IoT, and smart sensor technologies), and livestock management (animal welfare; livestock production)
2.1 Crop Management

Crop management encompasses several characteristics resulting from the integration of agricultural strategies aimed to meet qualitative as well as quantitative requirements [5]. Using modern crop management techniques, for example yield prediction, disease diagnosis, weed identification, crop identification, and high-quality crops, helps to boost production and, as a result, commercial revenue.
2.1.1 Yield Prediction

Yield prediction is the most important and demanding aspect of present-day agriculture. An accurate model can assist proprietors of farms in making well-informed management decisions about which crops to cultivate to match the yield to the demands of the current market [6]. Numerous elements, including the crop's genotype and phenotypic traits, management techniques, and environment, might influence crop yield forecasts. This requires a basic understanding of how these interacting elements relate to yield. Consequently, identifying these types of correlations requires large datasets and potent methods like machine learning approaches [7].
2.1.2 Disease Detection

A prominent danger is the development of diseases in crops, which reduce output quantity and quality throughout production, storage, and transportation [8]. Moreover, crop diseases are a major threat to food safety. Efficient management includes prompt detection of plant diseases at the earliest stage. Numerous types of bacteria, fungi, pests, viruses, and other agents cause plant diseases. Disease symptoms can include spotting on leaves and fruit, sagging and change in color [9], leaf curling, and other physical signs indicating the presence of pathogens and changes in the plant's phenotype. In the past, this surveying was done by skilled agronomists; this method is laborious and relies only on visual examination. Sensing systems that are accessible to consumers can now identify diseased plants before indicators appear. Visual analysis has advanced significantly in the past couple of decades, particularly with the use of deep learning techniques. Zhang et al. [10] have utilized deep learning to diagnose illnesses of the cucumber leaf; because of the intricate environmental backdrop, it is advantageous to remove the background before model training. A sufficiently sized dataset of photos of both healthy and sick plants is also essential for the development of precise image classifiers for disease diagnosis. Maps of the agricultural disease's distribution across the globe can be made to show the areas from where the infection originated [11].
2.1.3 Weed Detection
Weeds typically develop and spread extensively across vast portions of the field very quickly due to their prolific production of seeds and extended lifespan. This results in competition with crops for resources like space, sunlight, nutrients, and water availability. Weeds often emerge earlier than crops, a circumstance that negatively impacts crop growth [12]. Mechanical procedures for weed control are either challenging to conduct or useless if done incorrectly; the most common procedure is the spreading of herbicides. However, the use of huge amounts of pesticides proves to be expensive and harmful to the environment. Prolonged usage of herbicides increases the likelihood that weeds may become more resistant, necessitating more labor-intensive and costly weed control. Considerable progress has been made in recent years regarding the use of smart agriculture to distinguish between weeds and crops. Remote or proximal sensing utilizing sensors mounted on satellites, aerial and ground vehicles, and unmanned vehicles (both ground (UGV) and aerial (UAV)) can be used to achieve this differentiation. Converting the data collected by drones into useful knowledge remains a difficult issue [13]. Instead of spraying the entire field, ML algorithms in conjunction with imaging technology or non-imaging spectroscopy can enable real-time weed distinction and localization, allowing for precision pesticide administration to targeted zones [14, 15].
2.1.4 Crop Recognition
Analysis of a variety of plant parts, such as leaves, stems, fruits, flowers, roots, and seeds, is used for the identification and categorization of various varieties of plants [16, 17]. The commonly used method is leaf-based plant recognition, which examines the color, shape, and texture of individual leaves [18]. The remote monitoring of crop attributes has made it easier to classify crops, and it has become more common to utilize satellites and aerial vehicles for this purpose. Computerized crop detection and categorization are the result of advances in computer software and image processing hardware paired with machine learning.
2.1.5 Crop Quality
Crop quality is influenced by climatic and soil conditions, cultivation techniques, and crop features. Better prices are often paid for superior agricultural products, which increases farmers' profits. The most common indicators of maturity used for reaping are fruit quality, flesh firmness, soluble solids concentration, and skin pigmentation [19]. Crop quality is also directly related to food waste, another issue facing contemporary farming, because a crop that does not meet specifications for shape, color, or size may be thrown away. As discussed in the previous subsection, using ML algorithms in conjunction with computer vision
techniques yields the desired target. For physiological variable extraction, ML regression techniques using neural networks (NNs) and random forests (RFs) have been examined. Transfer learning has been used to train several cutting-edge convolutional neural network (CNN) architectures with region proposals to recognize seeds efficiently. When it comes to measuring quality, CNNs perform better than manual and conventional approaches [20].
2.2 Water Management

As plant development is heavily dependent on an adequate supply of water, the farming sector is the primary worldwide user of fresh water, and efficient water management will boost water availability and quality by lowering water pollution [21]. More efficient water management is required to effectively save water and achieve sustainable crop production, given the high rate of degradation of many reservoirs with minimal replenishment [22]. Along with lowering environmental and health hazards, efficient water management can also result in better water quality [21]. Therefore, maximizing water resources and productivity can be achieved by managing water quality through the use of drip irrigation and other appropriate water management strategies.
2.2.1 Drip Irrigation and Water Quality
• Drip Irrigation: Due to the scarcity of freshwater resources, drip irrigation, a low-pressure watering technique, is used to boost the energy and water efficiency of the system. Smart irrigation systems use historical data as input for accurate forecasting and decision-making, considering the data gathered by sensors and IoT-enabled devices [23, 24]. A decision support system for irrigation management was described by Torres-Sanchez et al. [25] for citrus farms in southeast Spain. The suggested approach makes use of smart sensors to monitor soil water status, weather data, and water usage from the previous week. SVM, RF, and linear regression were the three regression models used to create the irrigation decision support system. Distinct hydrology prediction models were also built with machine learning techniques, including a nonparametric strategy called Gradient Boosted Regression Trees (GBRT) and Boosted Tree Classifiers (BTC). Agronomists using the generated model to plan irrigation can benefit considerably [26].
• Water Quality: The goal of this research is to assess current advancements in satellite observation of water quality, pinpoint current system flaws, and make recommendations for further improvements. Among the strategies presently in use are multivariate regression approaches such as PLSR and SVR, deep neural networks (DNN), and long short-term memory (LSTM). The SVR model used (Sagan, V. et al.) relied on a Bayesian optimization function and a linear kernel, two increasingly prominent elements of deep learning methods used in remote sensing for water quality. A feed-forward DNN with five hidden layers and a 0.01 learning rate was also created; the model was trained using a Bayesian regularized backpropagation technique [27]. It is not possible to detect every characteristic related to water quality, including nutrient concentrations and microorganisms/pathogens, using hyperspectral information collected by drones [27, 28].
2.3 Soil and Weather Management

The issues of soil or land deterioration are due to excessive usage of fertilizers or natural causes. Crop rotation needs to be balanced to prevent soil erosion and maintain healthy soil [29]. Texture, organic matter, and nutrient content are some of the soil qualities that need to be monitored. Sensors for soil mapping and remote sensing, which employ machine learning approaches, can be used to study the spatial variability of the soil.
2.3.1 Properties of Soil

The crop to be harvested is chosen based on the characteristics of the soil, which are influenced by the climate and topography of the area used. Accurately predicting the soil's characteristics is a crucial step, as it helps in determining crop selection, land preparation, seed selection, crop yield, and fertilizer selection. The location's climate and geography have an impact on the soil's characteristics. Forecasting soil properties primarily includes predicting soil nutrients, the surface humidity of the soil, and weather patterns throughout the crop's life. Crop development is dependent on the nutrients present in a given soil. Soil nutrient monitoring is primarily done with electric and electromagnetic sensors [30]. Farmers select the right crop for the region based on the nutrients in the soil.
2.3.2 Climate Forecast

Weather-related phenomena that affect agricultural practices daily include rain, heat waves, and dew point temperatures. Gaitán [31] has presented research on these phenomena. Dew point temperature is a crucial element required in many hydrological, climatological, and agronomical research projects. A model based on an extreme learning machine (ELM) is used to forecast the daily dew point temperature. Compared to SVM and ANN models, ELM has better prediction skills, which enables it to predict the daily dew point temperature with a very high degree of accuracy [32].
J. Diez-Sierra and M. D. Jesus [33] used atmospheric compact patterns, generalized linear models, and several ML-based techniques such as SVM, k-NN, random forests, K-means, etc., to predict long-term daily rainfall.
2.4 Livestock Management

Managing livestock involves taking care of the animals' diet, growth, and general health. In these activities, machine learning is used to analyze the eating, chewing, and moving behaviors of the animals (such as standing, moving, drinking, and feeding habits). Based on these estimates and assessments, farmers may change the animals' diets and living conditions to improve behavior, health, and weight gain. This will increase the production's economic viability [34]. Livestock management includes both the production of livestock and the welfare of the animals; in precision livestock farming, real-time health monitoring of the animals is considered, including early detection of warning signals and improved productivity. Such a decision support system and real-time livestock monitoring allow quality policies about living conditions, diet, immunizations, and other matters to be put into practice [35].
2.4.1 Veterinary Care

Animal welfare includes disease analysis in animals, chewing habit monitoring, and living environment analysis that might disclose physiological issues. An overview of algorithms used for livestock monitoring, including SVM, RF, and the AdaBoost algorithm, was provided by Riaboff, L. et al. [36]. Consumption patterns can be continuously monitored with cameras and a variety of machine learning techniques, including random forest (RF), support vector machine (SVM), k-nearest neighbors (kNN), and adaptive boosting. To ensure precise characteristic classification, the components extracted from the transmissions were ranked based on their significance for grazing, ruminating, and non-eating behaviors [37]. When comparing classifiers, several performance parameters were considered as functions of the method applied, the sensor's location, and the amount of information used.
2.4.2 Livestock Production

Complete automation, ongoing monitoring, and management of animal care are the objectives of the precision livestock farming (PLF) approach. With the use of modern PLF technology (cameras, microphones, sensors, and the internet), farmers will know which particular animals need their help to solve an issue [38].
3 Proposed System for Smart Agriculture

3.1 Methodology Used: Parameters

In agriculture, predictive machine learning algorithms play a crucial role in optimizing crop yield, resource management, disease detection, pest control, and overall farm efficiency.
3.1.1 Analysis of Soil

For the effective use of fertilizer, lime, and other nutrients in the soil, the findings of soil testing are crucial. Designing a fertilization program can be strengthened by combining data from soil tests with information on the nutrients accessible to different products. In addition to individual preferences, regional soil and crop conditions can impact the choice of an appropriate test. Parameters like the cation exchange capacity (CEC), pH, nitrogen (N), phosphorus (P), potassium (K), calcium (Ca), magnesium (Mg), and their saturation percentages are often included in conventional tests. Specific micronutrients, toxic substances, salinity, nitrate, sulfate, organic matter (OM), and certain other elements can also be examined in specialized labs. The amount of sand, silt, and clay in the soil, its degree of compaction, its level of moisture, as well as other physical and mechanical characteristics, all have an impact on the environment in which crops thrive. Precise evaluations of the macronutrients nitrogen, phosphorus, and potassium (NPK) present in soil are essential for effective agricultural productivity. This includes site-specific cultivation, in which the rates of fertilizer nutrient treatment are adjusted geographically based on local needs. Optical diffuse reflectance sensing enables the quick, non-destructive assessment of soil properties [39], including the feasible range of nutrient levels. The capacity to measure analyte concentration directly with an extensive range of sensitivity makes electrolytic sensing, which is based on ion-selective field effect transistors, a beneficial method for real-time evaluation; it is also portable, simple, and responsive. Many crops need a certain alkalinity level in the soil. The pH sensor takes a reading of the pH of the soil and transmits the information to a server so that users may view it and add chemicals to keep the alkalinity near the ideal range for particular crops. The operation of the soil moisture detector is comparable to that of the soil pH sensor. Following data collection, the data is sent to the server, which then uses the information to determine what action to take. For example, the web server may decide to utilize spray pumps to moisten the soil or control the greenhouse's temperature to ensure that the soil has the right amount of humidity [40].
1. Algorithm for Soil Analysis

A controlled water supply is essential for the agriculture industry. Therefore, we incorporate a self-sufficient water management system into the proposed model. It operates based on the soil's humidity. Initially, the water pump is set to "OFF." Algorithm 1:
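Since the original listing of Algorithm 1 is not reproduced in the text, the following is only a hypothetical sketch of its logic; the moisture threshold and the sensor/pump interfaces are assumptions.

```python
MOISTURE_THRESHOLD = 40.0   # percent; illustrative value, not from the chapter

def control_water_pump(read_soil_moisture, set_pump):
    """Hypothetical soil-moisture-based pump control (pump starts in the OFF state)."""
    pump_state = "OFF"
    moisture = read_soil_moisture()        # value reported by the soil moisture sensor
    if moisture < MOISTURE_THRESHOLD:
        pump_state = "ON"                  # soil too dry: start irrigation
    else:
        pump_state = "OFF"                 # soil moist enough: stop irrigation
    set_pump(pump_state)
    return pump_state

# Example usage with a simulated sensor reading and pump actuator.
state = control_water_pump(lambda: 33.5, lambda s: print("Pump set to:", s))
```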
2. Algorithm for pH Value

The pH of the soil is another crucial element to take into account, as it has an impact on crop output. We create a procedure that analyzes the pH and produces three outputs, "basic soil," "normal pH value," and "acidic soil," depending on the pH value. Algorithm 2:
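As with Algorithm 1, the listing of Algorithm 2 is not reproduced here, so the sketch below is hypothetical; the pH boundaries are assumed values around a neutral range.

```python
def classify_soil_ph(ph_value):
    """Hypothetical pH classification; the 6.0 and 7.5 boundaries are assumptions."""
    if ph_value < 6.0:
        return "acidic soil"
    elif ph_value <= 7.5:
        return "normal pH value"
    else:
        return "basic soil"

# Example usage.
print(classify_soil_ph(5.4))   # acidic soil
print(classify_soil_ph(6.8))   # normal pH value
print(classify_soil_ph(8.2))   # basic soil
```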
3.1.2 Water Analysis
A vital part of the agricultural system is water. Since water serves as a significant source of vitamins and minerals, the amount of water in a particular area affects agricultural productivity as well. For more precise farming, the effects of the soil and water mixture in an agricultural field are measured more precisely, so we ought to examine the water content as well. Every time the water in the container is topped off, which should happen every four to six weeks, or earlier if half of the water has evaporated, a premium water-soluble fertilizer should be applied. Use a weak solution that is only one quarter as potent as the amount suggested on the nutrient bottle.

Table 2 Investigation of soil

Soil Samples     | Soil 1 | Soil 2
Nitrogen (N)     | 80     | 100
Phosphorous (P)  | 10     | 26
Potassium (K)    | 115    | 162
3.1.3 NPK Values

NPK fertilizer is a blend that includes the three main elements required for strong plant development. These three nutrients, nitrogen, phosphorus, and potassium, also referred to as NPK, are required for all plant growth and are needed for a plant's proper development. Phosphorus promotes the growth and development of roots and flowers [41]. A plant also needs potassium, often known as potash. Note that although plants grown with high-nitrogen fertilizers may grow faster, they may also become weaker and more vulnerable to insects and disease.
4 Establishing an Investigation and Gathering Information

4.1 Crop Analysis of Samples

We collected a limited number of specimens to ascertain the crops' NPK values. The values of the tested soil were measured and then compared with the typical values required for the specific crops used in the experiment, in order to adapt the soil to be suited to the crop [42]. The NPK requirements for cucumber, tomato, rose, radish, and sweet pepper have been examined here. The NPK levels of the two soil specimens were then determined by extraction. The results are shown in Table 2, and the fertilizer ratios recommended from a historical perspective (proportions based on nitrogen) are presented in Table 3.
Table 3 Fertilizer ratios recommended according to historical perspective (proportions based on nitrogen)

Nutrient         | Sweet Pepper | Radish (Summer) | Radish (Winter) | Tomato | Rose | Cucumber
Nitrogen (N)     | 100          | 100             | 100             | 100    | 100  | 100
Phosphorous (P)  | 18           | 10              | 7               | 25     | 17   | 18
Potassium (K)    | 129          | 146             | 145             | 178    | 102  | 151
Magnesium (Mg)   | 12           | 8               | 9               | 16     | 10   | 11
Calcium (Ca)     | 55           | 40              | 49              | 66     | 49   | 63
Sulphur (S)      | 14           | 17              | 12              | 29     | 18   | 17
4.2 Layers of Farming Sensors

This layer is made up of GPS-enabled IoT devices, like cellphones and sensor nodes, that are used to create different kinds of maps. IoT deployments for smart agriculture include those used in greenhouses, outdoor farming, photovoltaic farms, and solar insecticidal lamps [43], among others. IoT devices are being adapted and integrated at different levels of agriculture to accomplish two goals. Ensuring the distribution and production reliability of the nutrition solution is the main goal. Enhancing consumption control [44], which minimizes solution losses and keeps prices low, is the second goal. There will be a significant reduction in both the environmental and economic impacts. In the realm of sustainable agriculture leveraging green IoT, farmers employ advanced digital control systems like Supervisory Control and Data Acquisition (SCADA) to fulfill the requirements of agricultural management and process control. For every piece of equipment in a greenhouse, we recommend that the sensor and meter nodes incorporate IoT in the following ways:

• IoT devices for the water pumping system that take into account the dripper flow rates, anticipated pressures, and the regions to be watered.
• Water meters that offer up-to-date information on water storage.
• IoT devices tailored for every piece of filtering equipment, taking into account the drippers and the physical properties of the water [45].
• Fertilizer meters with real-time updates and injectors for fertilizers, such as NPK fertilizers.
• IoT devices to adjust electrical conductivity and pH to the right values for nutrition solutions, along with tiny solar panels with IoT sensors to regulate temperature and moisture levels.

Recent papers dealing with IoT consider supervised and unsupervised algorithms that address only crop prediction [46, 47]. These rely on Arduino Uno hardware; if fluctuations occur, problems arise in the NodeMCU.
Table 4 Benefits of the proposed framework

S.No | Proposed System                                                 | Existing System(s)
1    | Comprises water analysis with the soil analysis                | Doesn't include the water analysis
2    | Will yield precisely targeted crops based on thorough analysis | Analyze the various parameters and suggest the environment
3    | Association of IoT and machine learning                        | Includes IoT
4    | Includes the integrated circuit                                 | Doesn't contain the appropriately integrated circuits
Nowadays, new technologies and websites are being used for the e-trading of crops and agricultural implements [48, 49], enabling hassle-free selling and buying, although network issues in rural places remain a limitation.
5 Results and Discussions
5.1 Benefits of the Suggested System over Alternatives
See Table 4.
5.2 Proposed System's Circuit Design
The task of monitoring soil temperature, so that the optimal temperature for suitable crops can be maintained, falls on the DHT11 sensor (Fig. 5). The soil moisture sensor measures the moisture content of the soil, indicating the quantity of water present in the soil as well as the water required by the crops. Nitrogen, phosphorus, and potassium are the nutrients that crops need most and are therefore usually considered the most significant, so we use an NPK sensor for N, P, and K analysis. We then compare the measured values with the NPK values required for specific crops, allowing us to estimate which crops are suitable for the soil. Because the NPK content can be amended based on the required crops, this helps streamline the agricultural process. Since the results of all these measurements must be presented to the user on a screen, we use an organic light-emitting diode (OLED) display to show the soil content analysis. The ESP module provides Wi-Fi connectivity [46] and is used to manage the data flow to the server. A rough code sketch of reading these sensors is given after Fig. 5.
Fig. 5 The suggested circuit in addition to the sensors
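A rough MicroPython-style sketch of the circuit's read-and-display loop, assuming an ESP32 board with the built-in dht driver and an external ssd1306 OLED driver; pin numbers are placeholders, and the NPK probe (normally read over RS-485/Modbus) is only stubbed here:

from machine import Pin, ADC, I2C
import dht, ssd1306, time

temp_hum = dht.DHT11(Pin(4))                     # DHT11 on GPIO4 (placeholder pin)
soil = ADC(Pin(34)); soil.atten(ADC.ATTN_11DB)   # soil-moisture probe on GPIO34
oled = ssd1306.SSD1306_I2C(128, 64, I2C(0, scl=Pin(22), sda=Pin(21)))

def read_npk():
    # The NPK sensor is typically polled over RS-485/Modbus; stubbed values here.
    return 92, 20, 110

while True:
    temp_hum.measure()
    n, p, k = read_npk()
    oled.fill(0)
    oled.text("T:{} H:{}".format(temp_hum.temperature(), temp_hum.humidity()), 0, 0)
    oled.text("Soil ADC:{}".format(soil.read()), 0, 16)
    oled.text("NPK:{}/{}/{}".format(n, p, k), 0, 32)
    oled.show()
    time.sleep(10)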
5.3 Machine Learning Implementation
Different supervised predictive machine learning algorithms were implemented for crop prediction and analysis (Fig. 6), so that we achieve more precise predictions of the crop that can be grown on a farm, along with the fertilizer proposed for maximum yield of the suggested crop.
Fig. 6 Comparison of KNN, gradient boosting, random forest
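A minimal sketch of this comparison, assuming a tabular dataset with soil features (N, P, K, pH, moisture, temperature) and a crop label; the CSV file name and column names are hypothetical, not taken from the original study:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical dataset: one row per field sample, "crop" is the label to predict
df = pd.read_csv("soil_samples.csv")
X = df[["N", "P", "K", "ph", "moisture", "temperature"]]
y = df["crop"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")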
The heat map (Fig. 7) and the plot graph (Fig. 8) reveal that the KNN algorithm is the best for predicting the appropriate crop, depending on the type of soil, and for suggesting the fertilizer for the specific crop grown.
Fig. 7 Different parameters for a crop grown
Fig. 8 Predicted N, P, K values
6 Conclusion and Future Scope
This approach is appropriate for crop production since it matches the soil with the right crop based on several variables, such as soil moisture content, NPK value, ideal irrigation, and real-time in-field crop monitoring. Smart farming based on machine learning predictive algorithms and the Internet of Things enables the establishment of a system that can track the agricultural sector and automate irrigation using sensors (light, humidity, temperature, soil moisture, etc.). Farmers can monitor their farms remotely through their mobile phones, which is more productive and convenient. Internet of Things (IoT) and ML-based smart farming programs have the potential to offer innovative solutions not only for conventional and large-scale farming operations but also for other emerging or established agricultural trends, such as organic farming, family farming, and other highly promising forms of farming.
References 1. Ali, I., Greifeneder, F., Stamenkovic, J., Neumann, M., Notarnicola, C.: Review of machine learning approaches for biomass and soil moisture retrievals from remote sensing data. Remote Sens. 7, 15841 (2015) 2. Vieira, S., Lopez Pinaya, W.H., Mechelli, A. : Introduction to Machine Learning, Mechelli, A., Vieira, S.B.T.-M.L. (eds.), Chapter 1, pp. 1–20. Academic Press, Cambridge, MA, USA, (2020). ISBN 978–0–12–815739–8. 3. Domingos, P.: A few useful things to know about machine learning. Commun. ACM. ACM 55, 78–87 (2012) 4. Lopez-Arevalo, I., Aldana-Bobadilla, E., Molina-Villegas, A., Galeana-Zapién, H., MuñizSanchez, V., Gausin-Valle, S.: A memory efficient encoding method for processing mixed-type data on machine learning. Entropy 22, 1391 (2020) 5. Yvoz, S., Petit, S., Biju-Duval, L., Cordeau, S.: A framework to type crop management strategies within a production situation to improve the comprehension of weed communities. Eur. J. Agron.Agron. 115, 126009 (2020) 6. Van Klompenburg, T., Kassahun, A., Catal, C.: Crop yield prediction using machine learning: A systematic literature review. Comput. Electron. Agric.. Electron. Agric. 177, 105709 (2020) 7. Khaki, S., Wang, L.: Crop yield prediction using deep neural networks. Front. Plant Sci. 10, 621 (2019) 8. Harvey, C.A., Rakotobe, Z.L., Rao, N.S., Dave, R., Razafimahatratra, H., Rabarijohn, R.H., Rajaofara, H., MacKinnon, J.L. Extreme vulnerability of smallholder farmers to agricultural risks and climate change in Madagascar. Philos. Trans. R. Soc. B Biol. Sci. 369 (2014) 9. Jim Isleib signs and symptoms of plant disease: Is it fungal, viral or bacterial? Available online: https://www.canr.msu.edu/news/signs_and_symptoms_of_plant_disease_is_it_f ungal_viral_or_bacterial. Accessed 19 Mar 2021 10. Zhang, J., Rao, Y., Man, C., Jiang, Z., Li, S.: Identification of cucumber leaf diseases using deep learning and small sample size for agricultural Internet of Things. Int. J. Distrib. Sens. Netw.Distrib. Sens. Netw. 17, 1–13 (2021) 11. Anagnostis, A., Tagarakis, A.C., Asiminari, G., Papageorgiou, E., Kateris, D., Moshou, D., Bochtis, D.: A deep learning approach for anthracnose infected trees classification in walnut orchards. Comput. Electron. Agric.. Electron. Agric. 182, 105998 (2021)
12. Gao, J., Liao, W., Nuyttens, D., Lootens, P., Vangeyte, J., Pižurica, A., He, Y., Pieters, J.G.: Fusion of pixel and object-based features for weed mapping using unmanned aerial vehicle imagery. Int. J. Appl. Earth Obs. Geoinf.Geoinf. 67, 43–53 (2018) 13. Islam, N., Rashid, M.M., Wibowo, S., Xu, C.-Y., Morshed, A., Wasimi, S.A., Moore, S., Rahman, S.M.: Early weed detection using image processing and machine learning techniques in an Australian chilli farm. Agriculture 11, 387 (2021) 14. Slaughter, D.C., Giles, D.K., Downey, D.: Autonomous robotic weed control systems: A review. Comput. Electron. Agric.. Electron. Agric. 61, 63–78 (2008) 15. Zhang, L., Li, R., Li, Z., Meng, Y., Liang, J., Fu, L., Jin, X., Li, S.: A quadratic traversal algorithm of shortest weeding path planning for agricultural mobile robots in cornfield. J. Robot. 2021, 6633139 (2021) 16. Bonnet, P., Joly, A., Goëau, H., Champ, J., Vignau, C., Molino, J.-F., Barthélémy, D., Boujemaa, N.: Plant identification: Man vs.machine. Multimed. Tools Appl. 75, 1647–1665 (2016) 17. Seeland, M., Rzanny, M., Alaqraa, N., Wäldchen, J., Mäder, P.: Plant species classification using flower images—A comparative study of local feature representations. PLoS ONE 12, e0170629 (2017) 18. Zhang, S., Huang, W., Huang, Y., Zhang, C.: Plant species recognition methods using leaf image: Overview. Neurocomputing 408, 246–272 (2020) 19. Papageorgiou, E.I., Aggelopoulou, K., Gemtos, T.A., Nanos, G.D.: Development and evaluation of a fuzzy inference system and a neuro-fuzzy inference system for grading apple quality. Appl. Artif. Intell.Artif. Intell. 32, 253–280 (2018) 20. Genze, N., Bharti, R., Grieb, M., Schultheiss, S.J., Grimm, D.G.: Accurate machine learningbased germination detection, prediction and quality assessment of three grain crops. Plant Methods 16, 157 (2020) 21. El Bilali, A., Taleb, A., Brouziyne, Y.: Groundwater quality forecasting using machine learning algorithms for irrigation purposes. Agric. Water Manag.Manag. 245, 106625 (2021) 22. Neupane, J., Guo, W.: Agronomic basis and strategies for precision water management: a review. Agronomy 9, 87 (2019) 23. Hochmuth, G.: Drip Irrigation in a Guide to the Manufacture, Performance, and Potential of Plastics in Agriculture, M. D. Orzolek, pp. 1–197, Elsevier, Amsterdam, The Netherlands (2017) 24. Janani, M., Jebakumar, R.: A study on smart irrigation using machine learning. Cell Cellular Life Sci. J. 4(2), 1–8 (2019) 25. Torres-Sanchez, R., Navarro-Hellin, H., Guillamon-Frutos, A., San-Segundo, R., RuizAbellón, M.C., Domingo-Miguel, R.: A decision support system for irrigation management: Analysis and implementation of different learning techniques. Water 12(2), 548 (2020) 26. Goldstein, A., Fink, L., Meitin, A., Bohadana, S., Lutenberg, O., Ravid, G.: Applying machine learning on sensor data for irrigation recommendations: Revealing the agronomist’s tacit knowledge. Precis. Agric. 19, 421–444 (2018) 27. Sagan, V., Peterson, K.T., Maimaitijiang, M., Sidike, P., Sloan, J., Greeling, B.A., Maalouf, S., Adams, C.: Monitoring inland water quality using remote sensing: Potential and limitations of spectral indices, bio-optical simulations, machine learning, and cloud computing. Earth Sci. Rev. 205, 103187 (2020) 28. Sharma, A., Jain, A., Gupta, P., Chowdary, V.: Machine learning applications for precision agriculture: A comprehensive review. IEEE Access, 9, 4843–4873 (2021). 29. 
Chasek, P., Safriel, U., Shikongo, S., Fuhrman, V.F.: Operationalizing Zero Net Land Degradation: The next stage in international efforts to combat desertification. J. Arid Environ. 112, 5–13 (2015) 30. Adamchuk, V.I., Hummel, J.W., Morgan, M.T., Upadhyaya, S.K.: On-the-go soil sensors for precision agriculture. Comput. Electron. Agricult. 44(1), 71–91 (2004) 31. Gaitán, C.F.: Machine learning applications for agricultural impacts under extreme events. In: Climate Extremes and their Implications for Impact and Risk Assessment, pp. 119–138. Elsevier, Amsterdam, The Netherlands (2020).
32. Mohammadi, K., Shamshirband, S., Motamedi, S., Petkovi¢, D., Hashim, R., Gocic, M.: Extreme learning machine based prediction of daily dew point temperature. Comput. Electron. Agricult. 117, 214–225 (2015). 33. Diez-Sierra, J., Jesus, M.D.: Long-term rainfall prediction using atmospheric synoptic patterns in semi-arid climates with statistical and machine learning methods. J. Hydrol. 586, 124789 (2020). 34. Berckmans, D.: General introduction to precision livestock farming. Anim. Front. 7(1), 6–11 (2017) 35. Salina, A.B., Hassan, L., Saharee, A.A., Jajere, S.M., Stevenson, M.A., Ghazali, K.: Assessment of knowledge, attitude, and practice on livestock traceability among cattle farmers and cattle traders in peninsular Malaysia and its impact on disease control. Trop. Anim. Health Prod. 53, 15 (2020) 36. Riaboff, L., Poggi, S., Madouasse, A., Couvreur, S., Aubin, S., Bédère, N., Goumand, E., Chauvin, A., Plantier, G.: Development of a methodological framework for a robust prediction of the main behaviours of dairy cows using a combination of machine learning algorithms on accelerometer data. Comput. Electron. Agric.. Electron. Agric. 169, 105179 (2020) 37. Mansbridge, N., Mitsch, J., Bollard, N., Ellis, K., Miguel-Pacheco, G., Dottorini, T., Kaler, J.: Feature selection and comparison of machine learning algorithms in classification of grazing and rumination behaviour in sheep. Sensors 18, 3532 (2018) 38. Berckmans, D., Guarino, M.: From the Editors: Precision livestock farming for the global livestock sector. Anim. Front. 7(1), 4–5 (2017) 39. Stewart, J., Stewart, R., Kennedy, S.: Internet of things—Propagation modeling for precision agriculture applications. In: 2017 Wireless Telecommunications Symposium (WTS), pp. 1–8. IEEE (2017) 40. Venkatesan, R., Tamilvanan, A.: A sustainable agricultural system using IoT. In: International Conference on Communication and Signal Processing (ICCSP) (2017) 41. Lavric, A. Petrariu, A.I., Popa, V.: Long range SigFox communication protocol scalability analysis under large-scale, high-density conditions: IEEE Access 7, 35816–35825 (2019) 42. IoT for All: IoT Applications in Agriculture, https://www.iotforall.com/iot-applications-in-agr iculture/ (2018, January) 43. Mohanraj, R., Rajkumar, M.: IoT-Based smart agriculture monitoring system using raspberry Pi. Int. J. Pure Appli. Math 119(12), 1745–1756 (2018) 44. Moussa, F.: IoT-Based smart irrigation system for agriculture. J. Sens. Actuator Net. 8(4), 1–15 (2019) 45. Panchal, H., Mane, P.: IoT-Based monitoring system for smart agriculture. Int. J. Adv. Res. Comput. Sci.Comput. Sci. 11(2), 107–111 (2020) 46. Mane, P.: IoT-Based smart agriculture: applications and challenges. Int. J. Adv. Res. Comput. Sci.Comput. Sci. 11(1), 1–6 (2020) 47. Singh, P., Singh, M.K., Singh, N., Chakraverti, A.: IoT and AI-based intelligent agriculture framework for crop prediction. Int. J. Sens. Wireless Commun. Control 13(3), 145–154 (2023) 48. Sharma, D.R. Mishra, V., Srivastava, S. Enhancing crop yields through iot-enabled precision agriculture. In: 2023 International Conference on Disruptive Technologies (ICDT), pp. 279– 283. Greater Noida, India (2023). https://doi.org/10.1109/ICDT57929.2023.10151422 49. Gomathy, C.K., Geetha, V.: Several merchants using electronic-podium for cultivation. J. Pharmaceutical Neg. Res., 7217–7229 (2023)
Stream Data Model and Architecture Shahina Anjum, Sunil Kumar Yadav, and Seema Yadav
Abstract In the recent era, big data streams have had a significant impact, owing to the fact that many applications continuously generate large amounts of data at great velocity. Because of the inherently dynamic features of big data, it is hard to apply existing working models directly to big data streams. The solution to this limitation is data streaming. A modern data streaming architecture allows taking in, operating on and analyzing high volumes of high-speed data from a collection of sources in real time, to build more reactive and intelligent customer experiences. It can be designed as a set of five logical layers: Source, Stream Storage, Stream Ingestion, Stream Processing and Destination. This chapter comprises a brief assessment of stream analysis for big data, taking a thorough and organized look at the trends in technologies and tools used in the field of big data streaming, along with their comparison. We also cover issues such as scalability, privacy and load balancing and their existing solutions. The DGIM algorithm, which is used to count the number of ones in a window, the Flajolet-Martin (FM) algorithm for counting distinct elements, and others are also reviewed in this chapter.
S. Anjum (B) · S. K. Yadav Department of CSE, IEC College of Engineering & Technology, Greater Noida, Uttar Pradesh, India e-mail: [email protected]; [email protected] S. K. Yadav e-mail: [email protected]; [email protected] S. Yadav Department of MBA, Accurate Institute of Management and Technology, Greater Noida, Uttar Pradesh, India e-mail: [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145, https://doi.org/10.1007/978-981-97-0448-4_5
1 Introduction
In the recent era, big data streams have had a significant impact, owing to the fact that many applications continuously generate large amounts of data at great velocity. For various existing data mining methods, it is hard to apply techniques and tools directly to big data streams because of the inherently dynamic features of big data. The solution to this constraint is data streaming, also known as stream processing or event streaming. Before examining streaming data architecture, it is necessary to understand what data streaming actually means.
1.1 Data Streaming
Data streaming is not a very specialized concept; it is simply the general term for data that is created at very high speed, in enormous volumes, and in a continuous manner. In real life there are many examples of data streaming around us, with use cases in every industry: real-time retail inventory management, social media feeds, multiplayer games, ride-sharing apps, etc. From these examples we can observe that a stream data source captures events in real time. The data stream may be semi-structured or unstructured, usually key-value pairs in JSON or Extensible Markup Language (XML).
1.2 Batch Processing vs Stream Processing
Batch processing refers to processing a large amount of data in a batch within a definite time duration; it processes the whole dataset at once. Batch processing is used when data is collected over time and similar data is batched together. However, debugging batch processing is difficult, as it requires a dedicated expert to fix errors, and it is highly expensive. Stream processing refers to the immediate processing of a stream of data that is produced continuously; the analysis is done in real time. It is applied to data of unknown and unbounded size, in a continuous manner, and it is fast, though it is challenging to cope with the high speed and huge amount of data. Figures 1a and b represent the ways of processing in batch and data stream systems; a small code sketch contrasting the two appears after Fig. 1.
Fig. 1 a. Batch processing. b. Data stream processing
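To make the contrast concrete, here is a minimal Python sketch, assuming a simple stream of numeric readings: the batch version waits for the whole dataset, while the streaming version maintains a running result as each item arrives.

# Batch processing: operate on the complete, finite dataset at once
def batch_average(readings):
    return sum(readings) / len(readings)

# Stream processing: update the result incrementally, one event at a time
def stream_average(reading_iter):
    count, total = 0, 0.0
    for value in reading_iter:      # the iterator may, in principle, never end
        count += 1
        total += value
        yield total / count         # emit the up-to-date average after each event

readings = [4.0, 7.0, 5.0, 9.0]
print(batch_average(readings))               # 6.25, produced once at the end
print(list(stream_average(iter(readings))))  # [4.0, 5.5, 5.33..., 6.25]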
1.3 Steps in Data Streaming
Data streaming generally refers to sending, receiving and then processing information as a stream of data, in place of batches of discrete data. Six steps are involved, shown in Fig. 2 and described as follows [1]:
Step 1: Data Production
In this first step, data is generated and sent from different sources such as IoT devices, social media platforms or web applications. The data format may differ (e.g. JSON or CSV), and the data may be characterized in different ways, such as structured, semi-structured or unstructured, and static or dynamic.
• Structured data has a specific format and length. It is easy to store and analyze in highly organized settings. Its basic structure is the relational database table. It is robust in nature, but scaling it is difficult.
• Semi-structured data is irregular in nature: it may be incomplete, and it has a rapidly changing or unpredictable structure that does not conform to an explicit or fixed schema. Its basic structures are XML or RDF (Resource Description Framework). It is easier to scale than structured data.
• Unstructured data does not have any particular structure. Its basic structure is binary and character data. It is the most scalable.
• Static data is fixed data that remains the same after it is collected.
• Dynamic data changes continuously after it is recorded; the main objective here is to maintain its integrity.
There are many methods and protocols, such as HTTP or MQTT push and pull, that the data producer uses to send the data to the consumers.
Fig. 2 Data stream processing
Step 2: Data Ingestion
In this second step, known as data ingestion, the data is received and stored by consumers such as streaming platforms or message brokers. Data consumers can use different types of technologies and streaming data architectures to handle the variety, velocity and volume of the data, for example Estuary Flow or Kafka streaming pipelines. The consumers also perform some basic operations on the data, such as enrichment or validation, before transferring it to a stream processor.
Step 3: Data Processing
In the third step, data processing, the data is analyzed by data processing tools. In this phase, various complex operations such as aggregation, filtering, machine learning and transformation are performed, using various frameworks and tools. In general, the processing platform and the data ingestion platform are tightly coupled. Sometimes processing is part of the platform itself, as with Estuary's transformations; at other times a separate technology is paired with the streaming framework, such as Apache Spark paired with Kafka.
Step 4: Streaming Data Analytics
In this fourth step, further exploration and interpretation of the data is done by data analysts such as data scientists or business users. The analysts can use several methods and techniques to discover hidden trends and patterns in the data, including descriptive, predictive and prescriptive analytics.
Here, descriptive analytics tells us what has already occurred; predictive analytics tells us what could occur; and prescriptive analytics tells us what should occur in the future. The data analysts can also use several platforms and tools for data access and querying, such as Power BI, Python notebooks or SQL, and can generate different outputs based on the analysis, such as alerts, charts, dashboards and maps.
Step 5: Data Reporting
Summarizing and reporting the analyzed data is important for it to make sense, so the analyzed data is summarized and distributed by reporting tools. Many formats exist for report generation, and many channels are used to present and share the data with stakeholders, such as emails, slideshows, report documents and webinars. Data reporting can also use several indicators and metrics to measure and monitor performance against goals, objectives, KPIs, etc.
Step 6: Data Visualization and Decision Making
In the final step of data streaming, the data is visualized and acted upon by decision makers such as customers or managers. Several types and styles of data visualization, such as charts, graphs and maps, can be used to explore and understand the data, and features such as drill-downs, filters and sorts can be used to interact with the visualizations. Based on these insights, decision makers can make timely decisions, for example to enhance customer experience, improve products or optimize processes. Now, after covering the basic functionality of data streaming, let us take a look at the primary components of modern stream processing infrastructure.
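As a concrete illustration of Steps 2 and 3, the following is a minimal ingestion sketch, assuming the kafka-python client, a broker reachable at localhost:9092 and a hypothetical sensor-readings topic:

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: a data source pushes JSON events into the broker (Step 2)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"device": "plot1", "moisture": 41.5})
producer.flush()

# Consumer side: a stream processor pulls events for further processing (Step 3)
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)   # hand the event on to aggregation, filtering, ML, etc.
    break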
1.4 Event Stream Processing
One important methodology used in data processing, called Event Stream Processing (ESP), deals with a multiplicity of events and their online processing. Data comes from a variety of sources such as internal business transactions, logs of IoT devices, news articles, social media feeds, etc. [2]. Researchers have looked at how best to employ the platform in specific use cases. From a commercial point of view, decision makers want to know how to make the best use of those events with the least delay, in order to gain insight in real time, mine textual events and propose decisions. This needs a mix of batch processing, machine learning and stream processing technologies, which are usually optimized separately. However, combining all these technologies by constructing a real-world application with
good scalability is a big deal. In [2], the latest event stream processing methodologies are made clearly understandable through data flow architecture summarization and definition, frameworks and architectures, and textual use cases. In addition, the authors discuss uniting event stream processing with sentiment analysis and events in textual form to improve a reference model. In [3], we find that complex event processing (CEP) detects situations of interest by evaluating queries over event streams. When CEP is used for network-based applications, dividing query evaluation among the event sources can optimize performance: instead of collecting arbitrary events at one place for query evaluation, subqueries are located at network nodes to decrease the overhead of data transmission. To overcome the limitations of existing models, INEv graphs are used, which introduce fine-grained routing of the partial results of subqueries as an extra degree of freedom in query evaluation. The basic structure of INEv, used for in-network evaluation for event stream processing, is shown in Fig. 3, and Fig. 4 shows the various fields in modern systems that leverage the power of data streams through a streaming data architecture.
Fig. 3 In-network evaluation for event stream processing: INEv
Fig. 4 Streaming data architecture: power area of data streams in modern system
1.5 Patterns in Streaming Data Architecture
The basic design of a streaming data architecture generally depends on your objectives and requirements. On this basis, there are two general modern streaming data architecture patterns, named Lambda and Kappa.
• Lambda: a hybrid architecture which mixes traditional batch processing with real-time processing to deal with two types of data, i.e. historical data and real-time data streams. This combination gives the capability to handle huge amounts of data while still providing sufficient speed to handle data in motion. This comes at a cost in terms of latency, complexity and maintenance requirements, and the two paths are joined in an extra serving layer to obtain the greatest accuracy, fault tolerance and scalability.
• Kappa: in contrast to the Lambda architecture, the Kappa architecture concentrates only on real-time processing, whether the data is historical or real-time. In the absence of a batch processing system, the Kappa architecture is less costly, less complex and more consistent. The processed data is saved in a storage system which can be queried both in batches and as streams. This technique requires high performance, idempotency and dependability.
1.6 Benefits of Streaming Data Architecture Nowadays many organizations have been using streaming data analytics. It is not only because of rising demand for processing real-time data, but it is also because of many benefits which organizations can gain by using streaming data architecture. Some are listed below:
• Ease in Scalability
• Pattern Detection
• Modern Real-time Data Solutions
• Enabling Improved Customer Experience
2 Literature Review
The summary of real-time data stream processing for industrial fault detection is well explained in [4]. The main focus is on data stream analysis for industrial applications, identifying industrial needs and then the requirements for designing a potential Data Stream Management System. Recognizing industrial needs and challenges helps to find improvements in this area. A monitoring system based on a Data Stream Management System was proposed to implement the given suggestions; it benefits from combining various fault detection methods, such as analytical, data-driven and knowledge-based methods. Another data processing method, which performs online processing of various events, is called Event Stream Processing (ESP). Researchers have looked at how best to employ the platform in specific use cases. From a commercial point of view, decision makers want to know how to make the best use of those events with the least delay, in order to gain insight in real time, mine textual events and propose decisions. This needs a mix of batch processing, machine learning and stream processing technologies, which are usually optimized separately. However, combining all these technologies by constructing a real-world application with good scalability is a big deal. In [2], the latest event stream processing methodologies are made clearly understandable through data flow architecture summarization and definition, frameworks and architectures, and textual processing with sentiment analysis and textual events to improve a reference model. A general overview of real-time big data analytics, its present architecture, and available methods of data stream processing and system architectures is given in [5, 23]. The conventional approach to evaluating enormous data is unsuitable for real-time analysis; for that reason, analyzing streaming big data remains a decisive matter for many applications. It is vital in big data analytics and real-time analytics to process data where it arrives, with speedy response and good decision making, necessitating the development of an original model that works for high-speed, low-latency real-time processing. One important consideration is securing the real-time stream. Like other network security, stream security can be built on the pillars of the Confidentiality, Integrity and Availability (CIA) model [6]. However, the majority of realistic implementations focus only on the first two aspects, i.e. confidentiality and integrity, by means of various important techniques like encryption and signatures. An access
control mechanism is introduced to implement on the stream which adds extra security metadata to the streams. The use of this metadata can allow or disallow admittance to stream elements and also give protection to the isolation of data. All work is explained by taking an example of Apache Storm streaming engine. The analysis of present big data software models for a variety of discourse of domain and offers the outcome to support the researchers for future research. It has recognized recurring general motivations for taking big data software architectures, for example; to improve efficiency, to improve data processing in real time, reduction in development costs, supporting analytics process, and enabling novel services, together with shared work [7]. It has been studied that the business restrictions contrast for every application area, thus to target a software application of big data of particular application area requires couture of the common reference models to area-specific reference model to enhance. It will evaluate big data and its software architectures of distinct use cases from different application domains besides their consequences and talk about recognized challenges and probable enrichment. A phrase big data is used for composite data which is also hard to process. It contains numerous features called 6 Vs which popularly means—value, variability, variety, velocity, veracity and volume. Several applications can produce massive data & also grow quickly in short time. This speedy data is supposed to be handled with various approaches that exists in field of big data solutions. Some technologies in open source like Apache Kafka and NoSQL database were proposed to generate stream architecture for big data velocity [8, 10]. It has been evaluated that there has been enlarged interest in analyzing big data stream processing (means in motion— big data) rather than toward big data batch processing (i.e. big data at rest). It has been identified that some issues such as consistency, fault tolerance, scalability, integration, heterogeneity, timeliness, load balancing, heavy throughput and privacy need more research attention. After doing much work on these issues, mainly load balancing, privacy and scalability remain to focus. The layer of data integration allows geospatial subscription, using the GeoMQTT protocol. This is able to work for target-specific data integration at the same time as to preserve potentiality of congregation data from IoT devices because of the reason of efficiency in resource utilization in GeoMQTT. They have utilized the latest methods for stream processing and this framework is known as Apache Storm. It works as the center tool for their model and Apache Kafka as a tool for GeoMQTT broker and Apache Storm message processing system. Their planned design could be used to execute applications for many use cases where to deploy and to evaluate the distributed stream processing methods and algorithms that function on spatiotemporal data streams from the origin of IoT devices [11]. Introduction to a 7-layered architecture and its comparison with a 1-layered based architecture became important as till this point, no general architecture of data streaming analysis is scalable and flexible [12]. Data extraction, data transformation, data filtering and data aggregation are performed during the first six layers of the architecture. In the seventh and last layer, it carries analytic models. 
This 7-layered architecture consists of microservices and publish-subscribe software. Several studies have shown that this is the setup which can ensure a solution with
low coupling and high cohesion, which leads to increased scalability and maintainability; communication between the layers is asynchronous. Practical experience in the field of financial and e-commerce applications shows that this 7-layered architecture would be helpful for a large number of business use cases. A new data stream model named Aurora manages data streams for application monitoring [13]. It differs from traditional business data processing: the software is required to process and respond to frequent inputs coming from many and varied sources such as sensors, rather than from human operators, and this fact requires rethinking the basic architecture of a DBMS for this application area. So the authors present Aurora, a new DBMS, provide an overview of its architecture, and then describe a set of stream-oriented operators. Table 1 presents the important findings of various studies in the field of data streaming.
3 Data Stream Management System Data Stream Management System (DSMS) is a software application just like Database Management System (DBMS). DSMS involves processing and management of an endlessly flowing data stream rather than working on static data like excel, pdf etc. It deals with data streams from different sources like financial report, sensors data, social media field, etc. Similar to DBMS, DSMS also provides the broad range of operations such as analyzing, integration, processing, and storage and also generates the visualization and report use for data streams.
3.1 Peeping into the Data Stream Management System
Following [14], Fig. 5 shows all the components of the stream data model. A simple explanation of Fig. 5 follows.
• In a DSMS there is a stream processor, a high-level data-management system. Any number of streams can enter the system, and these streams need not arrive at a uniform rate. Streams may be archived in a large archival store, but we assume it is not feasible to answer queries from this archival store. So there is also a working store, where parts of streams or summaries may be placed, and this working store is used for answering queries. It may be disk or sometimes main memory, depending on the speed needed to process queries; in any case, it is of sufficiently limited capacity that it cannot store all of the data from all of the streams.
Table 1 Important findings of various studies

S. No | Reference | Contribution
1 | [4] | A Data Stream Management System based monitoring system was proposed to address industrial needs, their applications and the resulting requirements for data stream analysis
2 | [2] | To improve event stream processing (ESP), different ESP methodologies are discussed with the help of data flow architecture summarization and definition, architectures and frameworks, and other textual use cases
3 | [5, 23] | A general overview of real-time big data analytics: its present architecture, available methods of data stream processing and system architectures. The conventional approach to evaluating enormous data is unsuitable for real-time analysis; for that reason, analyzing data streams in the field of big data remains a decisive matter for many applications
4 | [6] | Introduction of an access control mechanism implemented on the stream, which adds extra security metadata to the streams. This metadata can allow or disallow access to stream elements and also protect the privacy of the data. The approach is explained using the Apache Storm streaming engine
5 | [7] | Business restrictions differ for every application area, so targeting a big data software application to a particular application area requires tailoring the common reference models into an area-specific reference model
6 | [8, 10] | Open-source technologies such as Apache Kafka and NoSQL databases were proposed to build a stream architecture for big data, especially to handle velocity. Interest has grown in analyzing big data stream processing (big data in motion) rather than big data batch processing (big data at rest)
7 | [11] | A data integration layer allows geospatial subscription using the GeoMQTT protocol. It supports target-specific data integration while preserving the ability to gather data from IoT devices, owing to the resource-utilization efficiency of GeoMQTT
8 | [12] | Introduction of a 7-layered architecture and its comparison with a 1-layer architecture, since until then no general architecture for data streaming analysis had been both scalable and flexible. The 7-layered architecture consists of microservices and publish-subscribe software, ensuring a solution with low coupling and high cohesion, which increases scalability and maintainability; communication between the layers is asynchronous
9 | [13] | A brief description of Aurora, which manages data streams for application monitoring and differs from traditional business data processing. Aurora is a new DBMS designed and operated at Brandeis University, Brown University, and M.I.T
Some examples of stream sources are sensor data, image data, and Internet and web traffic.
• Stream Queries: one way to query streams is to place standing queries inside the processor. These queries are, in a sense, permanently executing, and output is produced at appropriate times. The other kind of query is ad-hoc. To support a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store; a minimal sketch of such a window-based query is given below.
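The following is a minimal sketch of this idea, assuming a numeric stream and a fixed-length window; the window size and the average query are illustrative choices only:

from collections import deque

class SlidingWindow:
    """Working store that keeps only the last `size` elements of a stream."""
    def __init__(self, size):
        self.buffer = deque(maxlen=size)

    def append(self, element):
        self.buffer.append(element)       # oldest element is dropped automatically

    def query_average(self):
        """Ad-hoc query answered from the working store, not the full archive."""
        return sum(self.buffer) / len(self.buffer) if self.buffer else None

window = SlidingWindow(size=3)
for reading in [10, 20, 30, 40]:
    window.append(reading)
print(window.query_average())   # 30.0, the average of the last three readings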
Fig. 5 A data stream management system
3.2 Filtering Streams
Filtering, or selection, is a common operation on streams. Bloom filtering is used to handle large data sets.
3.2.1 Introduction to Bloom Filters
• A Bloom filter is a space-efficient, probabilistic data structure used to test the membership (presence) of an element in a set, i.e. whether it belongs to the set or not.
• For example, checking the availability of a username against the set of all registered usernames is a set-membership problem.
• The probabilistic nature of a Bloom filter means there is a chance of some false positive results: the filter "might" report that a given username is already taken even when it is actually not, and this is known as a false positive.
3.2.2 Working of Bloom Filter
First we have to take a bit array of m bits as an empty bloom filter and set all these bits to zero, like this—
• For a given input, we need K hash functions to calculate the hashes.
• Indices are calculated using the hash functions, so when we want to add an item to the filter, the bits at the K indices f1(y), f2(y), …, fK(y) are set.
Example 5.1: Suppose we want to enter the word "throw" into the filter and we have three hash functions. Initially a bit array of length 10 is produced, with all bits set to 0. First we calculate the hashes:
f1("throw") % 10 == 1
f2("throw") % 10 == 4
f3("throw") % 10 == 7
Note that these outputs are chosen arbitrarily for explanation purposes only. Now we set the bits at indices 1, 4 and 7 to 1.
Now again, if we want to enter the word "catch", we calculate the hashes in a similar manner:
f1("catch") % 10 == 3
f2("catch") % 10 == 5
f3("catch") % 10 == 4
Set the bits at indices 3, 5 and 4 to 1.
• Again, to check whether the word "throw" is present in the filter, we reverse the same process: we calculate the respective hashes using f1, f2 and f3 and check whether all the corresponding indices in the bit array are set to 1.
• If all of these bits are set to 1, we can say that "throw" is "probably present".
• Else, if any of the bits at these indices is 0, then "throw" is "definitely not present".
3.2.3 False Positive in Bloom Filters
The question that arises here is why we said "probably present", and where this uncertainty comes from. Let us take an example.
Example 5.2: Suppose we want to check whether the word "bell" is present. We calculate the hashes using f1, f2 and f3:
f1("bell") % 10 == 1
f2("bell") % 10 == 3
f3("bell") % 10 == 7
• Now, if we look at the bit array, the bits at these indices are all set to 1, even though the word "bell" was never added to the filter. The bits at indices 1 and 7 were set when we added the word "throw", and bit 3 was set when we added the word "catch".
• By controlling the size of the Bloom filter, we can control the probability of getting a false positive: a larger bit array lowers the false-positive probability, and the number of hash functions should then be chosen accordingly (see Eq. 3 below).
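A minimal sketch of such a filter, assuming m = 10 bits and K = 3 hash functions as in the examples above; the salted SHA-256 hashing is an illustrative choice, not the only possible one:

import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                      # number of bits
        self.k = k                      # number of hash functions
        self.bits = [0] * m

    def _indices(self, item):
        # k hash values derived from salted SHA-256 digests (illustrative choice)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # True  -> "probably present" (false positives possible)
        # False -> "definitely not present" (no false negatives)
        return all(self.bits[idx] for idx in self._indices(item))

bf = BloomFilter(m=10, k=3)
bf.add("throw")
bf.add("catch")
print(bf.might_contain("throw"))   # True
print(bf.might_contain("zebra"))   # False, unless a coincidental false positive occurs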
3.2.4 Probability of False Positivity, Size of Bit Array and Optimum Number of Hash Functions
The probability of false positivity is shown in Eq. (1):

P = (1 − [1 − 1/m]^(Kn))^K    (1)

where m = bit array size, n = number of elements expected in the filter, and K = number of hash functions.

Equation (2) gives the size of the bit array:

m = −(n log P) / (log 2)^2    (2)

The optimum number of hash functions is shown in Eq. (3):

K = (m/n) log 2    (3)
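As a quick numerical check of Eqs. (2) and (3), here is a minimal sketch assuming n = 1000 expected elements and a 1% target false-positive probability (illustrative figures only; natural logarithms are used, as in the standard Bloom filter sizing formulas):

import math

n = 1000          # expected number of elements in the filter
P = 0.01          # acceptable false-positive probability

m = math.ceil(-n * math.log(P) / (math.log(2) ** 2))   # Eq. (2): bit array size
K = round((m / n) * math.log(2))                        # Eq. (3): hash function count

print(m, K)   # roughly 9586 bits and 7 hash functions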
3.2.5 Interesting Properties of Bloom Filters
• It is interesting to note that Bloom filters never generate a false negative result; that is, if a username actually exists in the set, the filter will never tell you that it does not exist.
• It is not possible to delete elements from a Bloom filter, because clearing the bits (generated by the K hash functions) at the given indices to delete a single element may also effectively delete other elements.
• For example, if we delete the word "throw" (from the example above) by clearing the bits at indices 1, 4 and 7, we may end up deleting the word "catch" as well, because the bit at index 4 becomes 0 and the Bloom filter then claims that "catch" is not present.
3.3 Count Distinct Elements in a Stream
Another kind of processing needed is to count the distinct elements in a stream, where the stream elements are assumed to be chosen from some universal set. One would like to know, approximately, how many unique elements have appeared in the stream, counting either from the beginning of the stream or from some known point in the past. Here we describe FM, the Flajolet-Martin algorithm, for counting the distinct (unique) elements in a stream.
3.3.1 Introduction to the Flajolet-Martin (FM) Algorithm
• The Flajolet-Martin algorithm approximates the number of distinct (unique) objects in a stream or database in a single pass.
• If the stream contains n elements with m unique elements among them, the algorithm runs in O(n) time and requires O(log m) memory.
• The space consumption is logarithmic in the maximum number of possible distinct elements in the stream.
Example 5.3: Determine the distinct elements in the stream using the Flajolet-Martin algorithm.
• Input stream of integers: y = 1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1
• Hash function: f(y) = (6y + 1) mod 5
Step 1: Calculate the hash function f(y)
f(1) = (6(1) + 1) mod 5
= 7 mod 5 = 2
Similarly, calculating the hash function for the remaining elements of the input stream:
f(1) = 2, f(3) = 4, f(2) = 3, f(1) = 2, f(2) = 3, f(3) = 4, f(4) = 0, f(3) = 4, f(1) = 2, f(2) = 3, f(3) = 4, f(1) = 2
Step 2: Write the binary form of each calculated hash value
f(1) = 2 = 010, f(3) = 4 = 100, f(2) = 3 = 011, f(1) = 2 = 010, f(2) = 3 = 011, f(3) = 4 = 100, f(4) = 0 = 000, f(3) = 4 = 100, f(1) = 2 = 010, f(2) = 3 = 011, f(3) = 4 = 100, f(1) = 2 = 010
Step 3: Trailing zeros; count the trailing zeros in the binary form of each hash value
f(1) = 2 = 010 → 1, f(3) = 4 = 100 → 2, f(2) = 3 = 011 → 0, f(1) = 2 = 010 → 1, f(2) = 3 = 011 → 0, f(3) = 4 = 100 → 2, f(4) = 0 = 000 → 0, f(3) = 4 = 100 → 2, f(1) = 2 = 010 → 1, f(2) = 3 = 011 → 0, f(3) = 4 = 100 → 2, f(1) = 2 = 010 → 1
Step 4: Take the maximum number of trailing zeros
• The value of r = 2
• The distinct-element estimate R = 2^r = 2^2 = 4
• Hence, there are 4 distinct elements: 1, 2, 3 and 4.
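A minimal Python sketch of the same computation, assuming the hash function above and the convention (used in the worked example) that a hash value of 0 contributes zero trailing zeros:

def trailing_zeros(x):
    if x == 0:
        return 0          # convention used in the worked example above
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def flajolet_martin(stream, hash_fn):
    """Estimate the number of distinct elements as 2**R,
    where R is the maximum number of trailing zeros seen in any hash value."""
    r = 0
    for y in stream:
        r = max(r, trailing_zeros(hash_fn(y)))
    return 2 ** r

stream = [1, 3, 2, 1, 2, 3, 4, 3, 1, 2, 3, 1]
estimate = flajolet_martin(stream, lambda y: (6 * y + 1) % 5)
print(estimate)   # 4, matching the four distinct elements 1, 2, 3 and 4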
3.4 Counting Ones in a Window
For counting ones (1s) in a window, the DGIM (Datar-Gionis-Indyk-Motwani) algorithm is the simplest to use.
3.4.1 Introduction to the Datar-Gionis-Indyk-Motwani Algorithm
• The Datar-Gionis-Indyk-Motwani algorithm is designed to estimate the number of 1s in a given bit stream.
• The DGIM algorithm uses O(log² N) bits to represent and process a window of N bits.
• It estimates the number of ones in the window with an error of no more than 50%.
3.4.2 Components of DGIM Algorithm
• Timestamps and buckets are the two components of DGIM.
• Each arriving bit has a timestamp for its arrival position: if the timestamp of the first bit is 1, the timestamp of the second is 2, and so on. Positions are represented relative to the window size N (this size is generally taken as a multiple of 2).
• The window is divided into buckets consisting of 1s and 0s.
3.4.3 Rules for Forming the Buckets
i. The right side of a bucket must always end with a 1; if it ends with a 0, that 0 is not counted as part of the bucket. For example, 1001011 is a bucket of size 4: it contains four 1s and has a 1 at its right end.
ii. Each bucket must contain at least one 1, otherwise no bucket can be formed.
iii. The size of every bucket (its number of 1s) must be a power of 2.
iv. As we move to the left, the sizes of the buckets cannot decrease (they are in non-decreasing order toward the left).
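A minimal sketch of these rules in code, assuming raw (unreduced) timestamps and the usual merge of the two oldest buckets whenever three buckets of the same size appear; this is an illustrative implementation, not the only possible one:

class DGIM:
    """Sketch of DGIM bucket maintenance for counting 1s in the last N bits."""
    def __init__(self, window_size):
        self.N = window_size
        self.time = 0
        self.buckets = []            # (end_timestamp, size), newest first

    def add(self, bit):
        self.time += 1
        # drop a bucket once its end timestamp has left the window
        self.buckets = [(t, s) for (t, s) in self.buckets if t > self.time - self.N]
        if bit == 1:
            self.buckets.insert(0, (self.time, 1))
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 1][1] == self.buckets[i + 2][1]:
                    # three buckets of equal size: merge the two oldest,
                    # keeping the newer end timestamp and doubling the size
                    t_new, s = self.buckets[i + 1]
                    self.buckets[i + 1:i + 3] = [(t_new, 2 * s)]
                else:
                    i += 1

    def estimate_ones(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2   # count only half of the oldest bucket

dgim = DGIM(window_size=16)
for b in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    dgim.add(b)
print(dgim.estimate_ones())   # prints 5; the true count is 7, within the 50% error bound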
3.5 Data Processing Frameworks
In the vast field of big data it is essential to handle big data streams effectively. In this section, taking big data stream processing as the basis, we elaborate on several data processing frameworks. Most of these are free, open-source software [15].
3.5.1 Data Processing Frameworks
In this section, we compare and explain different stream processing engines (SPEs). We select six of them:
i. Apache Spark
ii. Apache Flink
iii. Apache Storm
iv. Apache Heron
v. Apache Samza
vi. Amazon Kinesis
Before describing these, note that their predecessor, Apache Hadoop, is also included, for historical reasons.
• Hadoop is known to be the very first framework that appeared for processing huge datasets using the MapReduce programming model. It is scalable in nature, because it can run on a single machine or extend to several clusters of multiple machines. Furthermore, Hadoop takes advantage of distributed storage to get better performance: instead of moving the data, it ships the code that is supposed to process the data. Hadoop also provides high availability and heavy throughput. However, it can have efficiency problems when handling small files.
• Above all, the main limitation of Hadoop is that it does not support real-time stream processing. To handle this limitation, Apache Spark came into the picture. Apache Spark is a framework for batch processing and data streaming that also allows distributed processing. Spark was intended to respond to three big troubles of Hadoop:
i. Avoid the overhead of iterative algorithms that make a number of passes through the data.
ii. Permit real-time streaming and interactive queries.
iii. In place of MapReduce, Apache Spark uses RDDs (Resilient Distributed Datasets), which are fault tolerant and able to perform parallel processing.
• Two years later, Apache Flink and Apache Storm appeared. Flink can do both batch processing and data streaming, and in Flink we can process streams with precise ordering requirements. Storm and Flink are comparable frameworks, with the following features:
i. Storm allows only stream processing.
ii. Both Storm and Flink can perform low-latency stream processing.
iii. Flink's API is high level and has rich functionality.
iv. To provide fault tolerance, Flink uses a snapshot algorithm, in comparison to Storm, which uses record-level acknowledgements.
v. Storm's limitations are its low scalability and the complexity of debugging and managing it.
• After Storm came Apache Heron.
• Apache Samza provides event-based applications, real-time processing, and ETL (Extract, Transform and Load) capabilities. It provides numerous APIs and has a model like Hadoop's, but instead of MapReduce it has the Samza API, and it uses Kafka instead of the Hadoop Distributed File System.
• Amazon Kinesis is the only framework in this section that is not affiliated with the Apache Software Foundation. Kinesis is in reality a set of four frameworks rather than a single data stream framework. Kinesis can easily be integrated with Flink. All of these frameworks are summarized in Table 2.
Table 2 Data processing frameworks [15]

S. No | Framework | Inventor | Incubation year | Processing | Delivery of events | Latency | Throughput | Scalability | Fault tolerance
1 | Apache Hadoop | Apache Software Foundation | N.A. | Batch | N.A. | High | High | High | Replication in the HDFS
2 | Apache Spark | University of California | 2013 | Micro batch (batch and stream) | Exactly once | Low | High | High | RDD (Resilient Distributed Dataset)
3 | Apache Flink | Apache Software Foundation | 2014 | Batch and stream | Exactly once | Low | High | High | Incremental checkpointing (with the use of markers)
4 | Apache Storm | Backtype | 2013 | Stream | At least once | Low | High | High | Record-level acknowledgements
5 | Apache Heron | Twitter | 2017 | Stream | At most once, at least once, exactly once | Low | High | High | High fault tolerance
6 | Apache Samza | LinkedIn | 2013 | Batch and stream | At least once | Low | High | High | Host affinity & incremental checkpointing
7 | Amazon Kinesis | Amazon | N.A. | Batch and stream | At least once | Low | High | High | High fault tolerance
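To give a feel for how one of these engines is driven from code, the following is a minimal Spark Structured Streaming sketch, closely modelled on the well-known word-count example; it assumes PySpark is installed and a text source is listening on a local socket (host, port and source are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-wordcount").getOrCreate()

# Unbounded table of lines arriving on the socket
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split lines into words and keep a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()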
3.6 Machine Learning for Streaming Data
In [17] several challenges for machine learning on streaming data are discussed. If these are overcome, it will help in:
i. Exploring the relationships between many AI developments (e.g., RNNs, reinforcement learning, etc.) and adaptive stream mining algorithms;
ii. Characterizing and detecting drifts when immediately labeled data is absent;
iii. Developing adaptive learning techniques that can work under verification latency;
iv. Incorporating preprocessing techniques that can transform the raw data in a continuous manner.
4 Challenges and Their Solutions in Streaming Data Processing This section briefly explains the challenges in streaming data processing and various solutions to overcome them for better outcomes.
4.1 Challenges in Streaming Data Processing
After going through many cases, it has come to light that there are many challenges to be faced when processing streaming data. Some noticeable challenges are:
i. Unbounded Memory Requirements for High Data Volume
The main aim of processing streaming data is to manage data that is produced at very high velocity, in huge volume, and continuously in real time. The sources of these data are very numerous. As there is no finite end to the continuously produced data streams, the data processing infrastructure must also be treated as having unbounded memory requirements.
ii. Architecture Complexity and Infrastructure Monitoring
Data stream processing systems are frequently distributed and need to handle a large number of parallel connections and data sources, which can be hard to manage and monitor for issues, particularly at scale.
iii. Coping with the Dynamic Nature of Streaming Data
Because of the dynamic nature of streaming data, stream processing systems have to be adaptive in order to handle concept drift, which renders some data processing methods inappropriate, and they must operate with restricted memory and time.
iv. Data Stream Query Processing
It is very challenging to process queries on data streams due to the unbounded data and its dynamic nature, and many subqueries are required to complete the process. Consequently, it is essential for the stream processing algorithm to be memory-efficient and capable of processing data rapidly enough to keep up with the rate at which new data items arrive.
v. Debugging and Testing of Streaming Data Processing
Debugging data streams in the system environment and testing bundled data is very challenging; it requires comparison with many other streaming data processes to enhance quality.
vi. Fault Tolerance
It is also important to check the scalability of fault tolerance in a DSMS.
vii. Data Integrity
Some type of data validation is essential to preserve the integrity of the data being processed by a DSMS. It can be done using various schemes such as hash functions, digital signatures, encryption, etc.
viii. Managing Delays
Due to various reasons, such as backpressure from downstream operators, network congestion or slow processors, delays can happen in data stream processing. There are various ways to handle these delays, depending on the specific requirements of the application; some delay-handling methods make use of watermarks and sliding windows.
ix. Handling Backpressure
Buffering the flow, adaptive operators, data partitioning and dropping data items are some techniques to deal with backpressure, a state that can occur in data stream processing when downstream operators consume data more slowly than upstream operators produce it. It leads to an increase in latency and may cause data loss if operator buffers start to fill up.
x. Computational and Cost Efficiency
As a DSMS is essentially used to handle high-volume data produced at high velocity from various sources, it is challenging to control its computational and cost efficiency.
4.2 Ways to Overcome the Challenges of Processing Streaming Data
Though data stream processing has key challenges, there are ways to overcome them, which include:
i. Use the proper mixture of on-premises and cloud-based resources and services [22].
ii. Choose the right tools.
iii. Set up consistent infrastructure for monitoring data processing and integration, and improve efficiency with data skipping and operator pipelining.
iv. Partition data streams to increase overall throughput.
v. Make processing-rate adjustment automatic with an adaptive operator.
vi. Avoid backpressure by implementing proper flow control.
vii. By adopting these techniques, we can overcome the challenges in processing streaming data and enhance its use in real-time data analytics.
5 Conclusion This chapter concludes that, in today's era, there is an unstoppable flow of data—unstructured, semi-structured, or structured—produced by sources such as transactional systems, social media feeds, IoT devices, and other real-time applications. This continuously produced data must be processed, analyzed, and reported. Traditional batch processing systems cannot handle it because they are designed for finite data, and this is where streaming data processing steps in to cope with data streams. In the data stream model, the query processor must be powerful enough to retrieve data at any scale and to manage the storage system properly. A DSMS does not work in a single pass; it comprises step-by-step processing: data production, data ingestion, data processing, streaming data analytics, data reporting, data visualization, and decision making. The importance of, and need for, more advanced event stream processing models is evident. The popularity of Apache Hadoop, together with its limitations in some areas, led to the development of Apache Spark, Apache Flink, Apache Storm, Apache Heron, Apache Samza, and Amazon Kinesis; these are only a few, and many hybrid forms exist. A study of the literature shows that load balancing, privacy, and scalability issues still need further work, and significant research effort should also be devoted to the preprocessing stage of big data streams. This chapter also sheds light on overcoming the challenges of streaming data: with the proper mix of approaches, data architecture, and resources, one can readily take advantage of real-time data analytics. Many researchers address methods to filter data streams, count distinct or unique elements, and count ones in a data stream; Bloom filtering, FCM, and DGIM exist for these purposes, but real-time data analysis requires extracting many more features, which can be the focus of future researchers. This may help enlarge the application area of data streaming.
References 1. Eberendu, A.: Unstructured data: an overview of the data of Big Data. Int. J. Emerg. Trends Technol. Comput. Sci. 38(1), 46–50 (2016). https://doi.org/10.14445/22312803/IJCTT-V38P109
2. Bennawy, M., El-Kafrawy, P.: Contextual data stream processing overview, architecture, and frameworks survey. Egypt. J. Lang. Eng. 9(1) (2022). https://ejle.journals.ekb.eg/article_2 15974_5885cfe81bca06c7f5d3cd08bff6de38.pdf 3. Akili, S., Matthias, P., Weidlich, M.: INEv: in-network evaluation for event stream processing. Proc. ACM on Manag. Data. 1(1), 1–26 (2023). https://doi.org/10.1145/3588955 4. Alzghoul, A.: Monitoring big data streams using data stream management systems: industrial needs, challenges, and improvements. Adv. Oper. Res. 2023(2596069) (2023). https://doi.org/ 10.1155/2023/2596069 5. Hassan, A., Hassan, T.: Real-time big data analytics for data stream challenges: an overview. EJCOMPUTE. 2(4) (2022). https://doi.org/10.24018/compute.2022.2.4.62 6. Nambiar, S., Kalambur, S., Sitaram, D.: Modeling access control on streaming data in apache storm. (CoCoNet’19). Proc. Comput. Sci. 171, 2734–2739 (2020). https://doi.org/10.1016/j. procs.2020.04.297 7. Avci, C., Tekinerdogan, B., Athanasiadis, I.: Software architectures for big data: a systematic literature review. Big Data Anal. 5(5) (2020). https://doi.org/10.1186/s41044-020-00045-1 8. Hamami, F., Dahlan, I.: The implementation of stream architecture for handling big data velocity in social media. J. Phys. Conf. Ser. 1641(012021) (2020). https://doi.org/10.1088/ 1742-6596/1641/1/012021 9. Kenda, K., Kazic, B., Novak, E., Mladeni´c, D.: Streaming data fusion for the internet of things. Sensors 2019. 19(8), 1955 (2019). https://doi.org/10.3390/s19081955 10. Kolajo, T., Daramola, D., Adebiyi, A.: Big data stream analysis: a systematic literature review. J. Big Data. 6(47) (2019). https://doi.org/10.1186/s40537-019-0210-7 11. Laska, M., Herle, S., Klamma, R., Blankenbach, J.: A scalable architecture for real-time stream processing of spatiotemporal IoT stream data—performance analysis on the example of map matching. ISPRS Int. J. Geo-Inf. 7(7), 238 (2018). https://doi.org/10.3390/ijgi7070238 12. Hoque, S., Miranskyy, A.: Architecture for Analysis of Streaming Data, Conference: IEEE International Conference on Cloud Engineering (IC2E) (2018). https://doi.org/10.1109/IC2E. 2018.00053 13. Abadi, D., Etintemel, U.: Aurora: a new model and architecture for data stream management. VLDB J. 12(2), 12–139 (2003). https://doi.org/10.1007/s00778-003-0095-z 14. Jure Leskovec, J., Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Cambridge University Press, England (2010) 15. Almeida, A., Brás, S., Sargento, S., Pinto, F.: Time series big data: a survey on data stream frameworks, analysis and algorithms. J Big Data 10(1), 83 (2023). https://doi.org/10.1186/s40 537-023-00760-1 16. Sujatha, C., Joseph, G.: A survey on streaming data analytics: research issues, algorithms, evaluation metrics, and platforms. In: Proceedings of International Conference on Big Data, Machine Learning and Applications, pp. 101–118 (2021). https://doi.org/10.1007/978-981-334788-5_9 17. Gomes, H., Bifet, A.: Machine learning for streaming data: state of the art, challenges, and opportunities. ACM SIGKDD Explor. Newsl. 21(2), 6–22 (2019). https://doi.org/10.1145/337 3464.3373470 18. Aguilar-Ruiz, J., Bifet, A., Gama, J.: Data stream analytics. Analytics 2(2), 346–349 (2023). https://doi.org/10.3390/analytics2020019 19. Rashid, M., Hamid, M., Parah, S.: Analysis of streaming data using big data and hybrid machine learning approach. In: Handbook of Multimedia Information Security: Techniques and Applications, pp. 629–643 (2019). 
https://doi.org/10.1007/978-3-030-15887-3_30 20. Samosir, J., Santiago, M., Haghighi, P.: An evaluation of data stream processing systems for data driven applications. Proc. Comput. Sci. 80, 439–449 (2016). https://doi.org/10.1016/j. procs.2016.05.322 21. Geisler, S.: Data stream management systems. In: Data Exchange, Integration, and Streams. Computer Science. Corpus ID: 12168848. 5, 275–304 (2013). https://doi.org/10.4230/DFU. Vol5.10452.275
22. Singh, P., Singh, N., Luxmi, P.R., Saxena, A.: Artificial intelligence for smart data storage in cloud-based IoT. In: Transforming Management with AI, Big-Data, and IoT, 1–15 (2022). https://doi.org/10.1007/978-3-030-86749-2_1 23. Abdullah, D., Mohammed, R.: Real-time big data analytics perspective on applications, frameworks and challenges. 7th International Conference on Contemporary Information Technology and Mathematics (ICCITM). IEEE. 21575180 (2021). https://doi.org/10.1109/ICCITM53167. 2021.9677849 24. Mohamed, N., Al-Jaroodi, J.: Real-time big data analytics: applications and challenges. International Conference on High Performance Computing & Simulation (HPCS). IEEE. 14614775 (2014). https://doi.org/10.1109/HPCSim.2014.6903700 25. Deshai, N., Sekhar, B.: A study on big data processing frameworks: spark and storm. In: Smart Intelligent Computing and Applications, 415–424 (2020). https://doi.org/10.1007/978-981-329690-9_43
Leveraging Data Analytics and a Deep Learning Framework for Advancements in Image Super-Resolution Techniques: From Classic Interpolation to Cutting-Edge Approaches Soumya Ranjan Mishra, Hitesh Mohapatra, and Sandeep Saxena
Abstract Image SR is a critical task in the field of computer vision, aiming to enhance the resolution and quality of low-resolution images. This chapter explores the remarkable achievements in image super-resolution techniques, spanning from traditional interpolation methods to state-of-the-art deep learning approaches. The chapter begins by providing an overview of the importance and applications of image super-resolution in various domains, including medical imaging, surveillance, and remote sensing. The chapter delves into the foundational concepts of classical interpolation techniques such as bicubic and bilinear interpolation, discussing their limitations and artifacts. It then progresses to explore more sophisticated interpolation methods, including Lanczos and spline-based approaches, which strive to achieve better results but still encounter challenges when upscaling images significantly. The focal point of this chapter revolves around deep learning-based methods for image SR. Convolutional Neural Networks (CNNs) have revolutionized the field, presenting unprecedented capabilities in producing high-quality super-resolved images. The chapter elaborates on popular CNN architectures for image superresolution, including SRCNN, VDSR, and EDSR, highlighting their strengths and drawbacks. Additionally, the utilization of Generative Adversarial Networks (GANs) for super-resolution tasks is discussed, as GANs have shown remarkable potential in generating realistic high-resolution images. Moreover, the chapter addresses various challenges in image super-resolution, such as managing artifacts, improving perceptual quality, and dealing with limited training data. Techniques to mitigate these challenges, such as residual learning, perceptual loss functions, and data augmentation, are analyzed. Overall, this chapter offers a comprehensive survey of the advancements in image SR, serving as a valuable resource for researchers, engineers, and practitioners in the fields of computer vision, image processing, and machine learning. It S. R. Mishra (B) · H. Mohapatra School of Computer Engineering, KIIT Deemed to Be University, Bhubaneswar, Odisha, India e-mail: [email protected] S. Saxena Greater Noida Institute of Technology, Greater Noida, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145, https://doi.org/10.1007/978-981-97-0448-4_6
highlights the continuous evolution of image SR techniques and their potential to reshape the future of high-resolution imaging in diverse domains.
1 Introduction Image SR, an essential task in the field of computer vision, plays a crucial role in enhancing the resolution and quality of low-resolution images. The ability to recover high-resolution details from low-resolution inputs has significant implications in various applications, including medical imaging, surveillance, remote sensing, and more [1]. As the demand for higher-quality visual content continues to grow, the development of advanced image super-resolution techniques has become a vibrant research area. In this chapter, we delve into the remarkable advancements in image super-resolution, tracing its evolution from classical interpolation methods to cutting-edge deep learning approaches. Our exploration begins with an overview of the fundamental concepts and importance of image super-resolution in diverse domains [2].
1.1 Importance of Image SR Image super-resolution has garnered significant attention due to its potential to enhance visual content quality and improve various computer vision tasks. In the realm of medical imaging, high-resolution images play a pivotal role in accurate diagnosis, treatment planning, and disease monitoring. In surveillance applications, super-resolution aids in identifying critical details in low-resolution footage, such as facial features or license plate numbers. Moreover, in remote sensing, super-resolution techniques enable clearer satellite imagery, facilitating better analysis of environmental changes and land use. The significance of image super-resolution is also evident in the entertainment industry, where high-quality visual content is crucial for creating immersive experiences in video games and movies. Figure 1 illustrates the importance of image super-resolution through side-by-side comparisons of low-resolution and super-resolved images: a person's face, a historical document, and a medical image (a medical image of a patient's tumor [3]).
1.2 Classical Interpolation Techniques Early image super-resolution methods relied on classical interpolation techniques to upscale low-resolution images. Bicubic and bilinear interpolation were widely used for decades to increase image resolution. Although straightforward, these methods
often produced blurred and visually unappealing results, leading to the introduction of more sophisticated interpolation techniques such as Lanczos and spline-based methods. While these techniques improved image quality to some extent, they struggled to handle significant upscaling factors and suffered from artifacts [4].
Fig. 1 Importance of image super-resolution in different areas
1.3 Deep Learning for Image SR The advent of deep learning has revolutionized the field of image super-resolution. Convolutional Neural Networks (CNNs) emerged as a powerful tool for learning complex mappings between low-resolution and high-resolution images. The chapter discusses the pioneering CNN-based models, including the Super-Resolution Convolutional Neural Network (SRCNN), the Very Deep Super-Resolution Network (VDSR), and the Enhanced Deep Super-Resolution Network (EDSR). These architectures leverage deep layers to extract hierarchical features from images, leading to impressive results in terms of visual quality and computational efficiency [5].
1.4 Enhancing Quality with Generative Adversarial Networks As the pursuit of higher visual fidelity continues, Generative Adversarial Networks (GANs) entered the stage to further elevate image super-resolution. GANs combine the power of a generator and a discriminator network to generate high-quality super-resolved images. The chapter explores GAN-based approaches for image super-resolution, such as SRGAN and ESRGAN, and examines the role of adversarial
loss in guiding the training process. GANs have proven to be particularly effective in generating photo-realistic details, demonstrating the potential to revolutionize high-resolution imaging [6].
1.5 Addressing Challenges in Image Super-Resolution Despite the impressive results achieved with deep learning-based approaches, image super-resolution still faces several challenges. One of the prominent challenges is managing artifacts that can arise during the super-resolution process [7]. Additionally, improving perceptual quality and ensuring that the enhanced images are visually appealing is critical. This chapter delves into the techniques used to overcome these challenges, including residual learning, perceptual loss functions, and data augmentation strategies to enrich the training data.
1.6 Real-World Applications and Practical Use Cases The chapter concludes by showcasing real-world applications and practical use cases of image super-resolution techniques. These applications span across various domains, such as medical imaging, surveillance, remote sensing, and entertainment. From enabling more accurate medical diagnoses to aiding in criminal investigations through clearer surveillance footage, image super-resolution demonstrates its vast potential in improving decision-making processes. Finally, we can conclude that image super-resolution has witnessed significant progress, transitioning from classical interpolation methods to deep learning-based approaches. The ability to reconstruct high-quality images from low-resolution inputs has paved the way for numerous applications in diverse fields. The advancements in deep learning, particularly the integration of GANs, have elevated image super-resolution to new heights, making photo-realistic high-resolution imaging a reality. While challenges persist, the ongoing research in this domain promises even more sophisticated techniques for achieving visually stunning and information-rich high-resolution images, further fueling the growth of computer vision applications in the future [8].
1.7 Role/Impact of Data Analytics and a Deep Learning Framework for Image SR Techniques
Role of Data Analytics
• Dataset Preparation: Data analytics helps in the curation and preprocessing of large datasets, which are essential for training deep learning models for image SR. A well-prepared dataset ensures that the model learns diverse and representative features.
• Feature Extraction: Data analytics techniques can be used to find relevant information in the image data. Understanding the characteristics of low-resolution and high-resolution images helps in designing effective models.
• Data Augmentation: Techniques like data augmentation, supported by data analytics, can be applied to artificially increase the diversity of the training dataset. This aids in improving the generalization capability of deep learning models.
• Performance Metrics: Data analytics is instrumental in defining appropriate performance metrics to evaluate the effectiveness of SR models.
Impact of Deep Learning Frameworks
1. Convolutional Neural Networks (CNNs): – Deep learning frameworks, particularly those supporting CNN architectures, have shown remarkable success in image super-resolution tasks. CNNs can automatically learn hierarchical features from LR images and generate HR counterparts.
2. Generative Adversarial Networks (GANs): – GANs, a type of deep learning framework, have been applied to image super-resolution, introducing a generative model and a discriminative model. This adversarial training helps in producing high-quality and realistic high-resolution images.
3. Transfer Learning: – Deep learning frameworks enable the use of transfer learning, where models pre-trained on large image datasets can be fine-tuned for specific super-resolution tasks. This is particularly useful when the available dataset for super-resolution is limited.
4. End-to-End Learning: – Deep learning frameworks facilitate end-to-end learning, allowing the model to directly map LR images to HR outputs. This avoids the need for handcrafted feature engineering and enables the model to learn complex relationships. 5. Attention Mechanisms: – Attention mechanisms, integrated into deep learning architectures, enable models to focus on relevant parts of the image during the SR process. This improves the overall efficiency and performance of the model. 6. Large-Scale Parallelization: – Deep learning frameworks support parallel processing, enabling the training of large and complex models on powerful hardware, which is essential for achieving state-of-the-art results in image super-resolution.
2 Classical Interpolation Nearest Neighbor (N-N) interpolation is the simplest technique: each pixel in the high-resolution image is assigned the value of the nearest pixel in the low-resolution image. This method is fast but often leads to blocky artifacts. As an example, we upscale a small low-resolution image to an 8 × 8 image using nearest neighbor interpolation: each pixel in the HR image is assigned the value of its nearest neighbor from the low-resolution image [9]. The resulting 8 × 8 image after applying nearest neighbor interpolation is shown in Fig. 2.
Fig. 2 Upscale it to 8 × 8 image using Nearest Neighbor interpolation metrics
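To make the index-copying behaviour concrete, the following is a minimal NumPy sketch that doubles a toy 4 × 4 array to 8 × 8 with nearest-neighbor interpolation; the input values are arbitrary and only illustrate the mechanism (the chapter's own example in Fig. 2 may use different values).

```python
import numpy as np

lr = np.arange(16, dtype=np.uint8).reshape(4, 4)   # toy 4x4 low-resolution image
scale = 2

rows = np.arange(4 * scale) // scale               # HR row -> nearest LR row
cols = np.arange(4 * scale) // scale               # HR col -> nearest LR col
hr = lr[np.ix_(rows, cols)]                        # 8x8 result: each HR pixel copies its neighbor

print(hr.shape)   # (8, 8)
print(hr)
```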
2.1 CNN-Based SR Techniques Super-Resolution Convolutional Neural Network SRCNN, introduced in 2014, is a three-layer deep convolutional neural network designed specifically for single-image SR. It processes an LR image as input and generates a corresponding HR image directly. The network learns to upscale the image while minimizing the reconstruction error. We trained an SRCNN model on our dataset of low- and high-resolution images. For simplicity, we use grayscale images and a smaller network configuration. We prepare a dataset of LR images (e.g., 32 × 32) and their corresponding high-resolution versions (e.g., 128 × 128). Figure 3 illustrates the LR-to-SR image conversion [10]. The SRCNN model consists of three layers, shown in Fig. 4 (a minimal model sketch is given after Fig. 4). The first layer performs feature extraction using a small kernel size (e.g., 9 × 9), the second layer increases the number of features, and the third layer maps the feature maps to the high-resolution image. We train the SRCNN model on the prepared dataset using Mean Squared Error (MSE) loss to minimize the difference between predicted and ground-truth high-resolution images. After training, we use the trained SRCNN model to upscale low-resolution test images and compare the results with the original high-resolution images. The trained SRCNN model shows impressive results compared to classical interpolation techniques. The output images exhibit higher levels of detail, sharper edges, and improved visual quality. Since the introduction of SRCNN, various other CNN-based super-resolution techniques have been developed, such as EDSR (Enhanced Deep Super-Resolution), SRGAN (Super-Resolution Generative Adversarial Network), and RCAN (Residual Channel Attention Networks) [11]. These models have achieved state-of-the-art performance, pushing the boundaries of image super-resolution and producing more realistic and visually appealing results.
Fig. 3 LR image to SR image conversion
Fig. 4 SRCNN model architecture
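To make the three-layer structure concrete, the following is a minimal PyTorch sketch of an SRCNN-style model and a single training step on random stand-in tensors. The 9–5–5 kernel sizes and 64–32–1 channel widths are a commonly used configuration and are assumptions here, not necessarily the exact settings used by the authors.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN: feature extraction -> non-linear mapping -> reconstruction."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4),   # patch extraction / feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2),  # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),   # HR reconstruction
        )

    def forward(self, x):
        # x is a bicubic-upsampled grayscale image, N x 1 x H x W
        return self.body(x)

model = SRCNN()
criterion = nn.MSELoss()                    # minimises the reconstruction error
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One illustrative training step; random tensors stand in for real image batches.
lr_up = torch.rand(8, 1, 128, 128)          # bicubic-upsampled inputs
hr = torch.rand(8, 1, 128, 128)             # ground-truth high-resolution targets
loss = criterion(model(lr_up), hr)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```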
2.2 Datasets In the field of image SR, researchers use various datasets to train and evaluate their networks. In a review of various articles, 11 datasets were identified as commonly used for these purposes (Ref: Table 1).

Table 1 Key characteristics of 5 popular image datasets

Dataset | Number of images | Average resolution | File format | Key contents
Set14 | 1,000 | 264 × 204 pixels | PNG | Objects: baby, bird, butterfly, bead, woman
DEVIKS | 100 | 313 × 336 pixels | PNG | Objects: humans, animals, insects, flowers, vegetables, comic, slides
BSDS200 | 200 | 492 × 446 pixels | PNG | Natural scenes: environment, flora, fauna, handmade NBCT, people, scenery
Urban100 | 109 | 826 × 1169 pixels | PNG | Urban scenes: animal, building, food, landscape, people, plant
ImageNet | 3.2 million | Varies | JPEG, PNG | Different objects and scenes

T91 Dataset: The T91 dataset contains 91 images. It comprises diverse content such as cars, flowers, fruits, and human faces. Algorithms like SRCNN, FSRCNN, VDSR, DRCN, DRDN, GLRL, DRDN, and FGLRL utilized T91 as their training dataset. Berkeley Segmentation Dataset 200 (BSDS200): Due to the limited number of images in T91, researchers supplemented their training by including BSDS200, which consists of 200 images showcasing animals, buildings, food, landscapes, people, and plants. Algorithms like VDSR, DRRN, GLRL, DRDN, and FGLRL
used BSDS200 as an additional training dataset, while FSRCNN used it as a testing dataset. Dilated-RDN also incorporated BSDS200 for training. DIVerse 2K resolution (DIV2K) Dataset: Widely used in many studies, the DIV2K dataset consists of 800 training and 200 validation images. These images showcase elements such as surroundings, plant life, wildlife, crafted items, individuals, and landscapes. ImageNet Dataset: This extensive dataset contains over 3.2 million images, covering mammals, avians, aquatic creatures, reptiles, amphibians, transportation, furnishings, musical instruments, geological structures, implements, blossoms, and fruits. ESPCN and SRDenseNet employed ImageNet for training their models. Set5 and Set14 Datasets: Set5 contains only five images, and Set14 contains 14 images. Both datasets are popular choices for model evaluation. Berkeley Segmentation Dataset 100 (BSDS100): Consisting of 100 images with content ranging from animals and buildings to food and landscapes, BSDS100 served as a testing dataset for various algorithms. Urban100 Dataset: With 100 images showcasing architecture, cities, structures, and urban environments, the Urban100 dataset was used as a testing dataset. Manga109 Dataset: Manga109 contains 109 PNG format images from manga volumes, and it was used as a testing dataset by RDN and SICNN.
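As a small illustration of how LR–HR training pairs such as the 32 × 32 / 128 × 128 pairs mentioned earlier can be prepared from any of these datasets, the sketch below bicubically downscales a high-resolution image; the file path, image sizes, and scale factor are placeholders, not the chapter's exact preprocessing.

```python
from PIL import Image

SCALE = 4

def make_pair(path, hr_size=128):
    """Build one (LR, HR) training pair by bicubic-downscaling the HR image."""
    hr = Image.open(path).convert("L").resize((hr_size, hr_size), Image.BICUBIC)
    lr = hr.resize((hr_size // SCALE, hr_size // SCALE), Image.BICUBIC)   # 32x32 input
    return lr, hr

# lr_img, hr_img = make_pair("T91/butterfly.png")   # hypothetical dataset path
```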
3 Different Proposed Algorithms The initial phase, patch extraction, involved capturing information from the bicubic-interpolated image. This image information was then channeled into the subsequent stage, non-linear mapping. Within this stage, the high-dimensional features underwent a transformation to correspond with other high-dimensional features, effecting a comprehensive mapping process [12]. Finally, the output of the last layer of the non-linear mapping phase underwent a convolutional process to accomplish the reconstruction of the high-resolution (HR) image. This final stage synthesized the refined features into the desired HR image, completing the SRCNN's process. SRCNN and sparse coding-based methods share similar fundamental operations in their image super-resolution processes. However, a notable distinction arises in their approach. While SRCNN enables optimization of filters through an end-to-end mapping process, sparse coding-based methods restrict such optimization to specific operations. Furthermore, SRCNN offers an advantageous flexibility: it permits the utilization of diverse filter sizes within the non-linear mapping step, enhancing the information integration process [13]. This adaptability contrasts with sparse coding-based methods, which lack such flexibility. As a result of these disparities, SRCNN achieves a higher PSNR (Peak Signal-to-Noise Ratio) value compared to sparse coding-based methods, indicating its superior performance in image super-resolution tasks (Ref: Algorithm 1).
Algorithm 1 Super-resolution algorithm (SRCNN)
1: Require: LR image ILR
2: Ensure: HR image IHR
3: Upsample ILR using bicubic interpolation to obtain Iup
4: Pass Iup through a CNN to obtain feature maps F
5: Divide F into patches
6: Apply a non-linear mapping to each patch to obtain an enhanced patch
7: Reconstruct the patches to form IHR
3.1 FastSR Convolutional Neural Network Subsequently, the authors of [14] made an intriguing observation regarding SRCNN's performance, noting that achieving improved outcomes necessitated incorporating more convolutional layers within the non-linear mapping phase. However, this enhancement came at a cost: the augmented layer count led to increased computational time and hindered the convergence of PSNR values during training. To address these challenges, the authors introduced FSRCNN as a solution. Notably, FSRCNN introduced novel elements compared to its predecessor, SRCNN. The introduction of a shrinking layer, responsible for reducing the dimension of the features extracted by the preceding layer, was a key distinction. Simultaneously, the addition of an expanding layer aimed to reverse the shrinking process, expanding the output features generated during the non-linear mapping phase. Another noteworthy deviation was the employment of deconvolution as the chosen up-sampling mechanism within FSRCNN. This mechanism facilitated the enhancement of image resolution. These combined innovations in FSRCNN addressed the limitations of SRCNN, ultimately providing an optimized solution that struck a balance between computational efficiency and convergence of PSNR values during training (Ref: Algorithm 2).
Algorithm 2 Fast SR algorithm
1: Input: LR image.
2: CNN: The LR image is passed through a CNN with three convolutional layers. The first and second layers have 64 and 34 filters of sizes 9 × 9 and 5 × 5, respectively, and the third layer has 1 filter of size 5 × 5.
3: ReLU: A ReLU activation function is applied to the output of each convolutional layer.
4: Deconvolution: The output of the CNN is then up-sampled using deconvolution, a process that reverses convolution.
5: Output: The final output is a high-resolution image.
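The following is a rough PyTorch sketch of the FSRCNN-style layout described above (feature extraction, shrinking, non-linear mapping, expanding, and deconvolution-based up-sampling). The channel counts and kernel sizes are illustrative assumptions rather than the exact configuration given in Algorithm 2.

```python
import torch
import torch.nn as nn

class FSRCNNSketch(nn.Module):
    """Illustrative FSRCNN-style network: extract -> shrink -> map -> expand -> deconvolve."""
    def __init__(self, scale=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 56, kernel_size=5, padding=2), nn.PReLU(),   # feature extraction
            nn.Conv2d(56, 12, kernel_size=1), nn.PReLU(),             # shrinking layer
            nn.Conv2d(12, 12, kernel_size=3, padding=1), nn.PReLU(),  # non-linear mapping
            nn.Conv2d(12, 56, kernel_size=1), nn.PReLU(),             # expanding layer
        )
        # Deconvolution (transposed convolution) up-samples the LR feature maps directly.
        self.upsample = nn.ConvTranspose2d(56, 1, kernel_size=9, stride=scale,
                                           padding=4, output_padding=scale - 1)

    def forward(self, x):
        return self.upsample(self.features(x))

out = FSRCNNSketch(scale=2)(torch.rand(1, 1, 32, 32))
print(out.shape)   # torch.Size([1, 1, 64, 64])
```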
Fig. 5 Deep learning SR Architecture
3.2 Deep Learning SR The architecture illustrated in Fig. 5, was conceptualized [28] to address the challenge encountered in SRCNN, where an increasing number of mapping layers was imperative for enhanced model performance. Deep learning SR innovatively introduced the concept of residual learning, a mechanism that bridged the gap between input and output within the final feature mapping layer. Residual learning was achieved by integrating the output features from the ultimate layer with the interpolated features. Given the strong correlation between low-level and high-level features, this skip connection facilitated the fusion of low-level layer attributes with high-level features, subsequently elevating model performance. This strategy proved particularly effective in mitigating the vanishing gradients issue that emerges when the model’s layer count grows. The incorporation of residual learning in Deep learning SR offered dual advantages compared to SRCNN. Firstly, it expedited convergence due to the substantial correlation between LR and HR images. As a result, Deep learning SR accomplished quicker convergence, slashing running times by an impressive 93.9% when compared to the original SRCNN model. Secondly, Deep learning SR yielded superior PSNR values in comparison to SRCNN, affirming its prowess in image enhancement tasks [15].
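A minimal sketch of the global residual learning idea follows: the network predicts only a residual (the missing high-frequency detail) and adds it back to the bicubic-upsampled input through a skip connection. The depth and width below are arbitrary choices for illustration, not the published VDSR settings.

```python
import torch
import torch.nn as nn

class ResidualSRSketch(nn.Module):
    """Global residual learning: output = upsampled input + predicted residual."""
    def __init__(self, depth=8, channels=64):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x_upsampled):
        residual = self.body(x_upsampled)     # learn only the high-frequency detail
        return x_upsampled + residual         # global skip connection

y = ResidualSRSketch()(torch.rand(1, 1, 64, 64))
print(y.shape)   # torch.Size([1, 1, 64, 64])
```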
3.3 Multi-Connected CNN for SR Authors of [16] highlighted a limitation in the previous approach: although it addressed the vanishing-gradient issue, it didn’t effectively harness low-level features, indicating untapped potential for performance enhancement. Moreover, the residual learning concept in VDSR was only integrated between the initial
and final layers of non-linear mapping, potentially resulting in performance degradation. To address these concerns, the Multi-Connected CNN model, depicted in Algorithm 3, was devised as a solution. The success of this improvement could be attributed not only to the refined loss function but also to the extraction and amalgamation of information-rich local features from the multi-connected blocks. By combining these localized features through concatenation with high-level features, Multi-Connected CNN effectively addressed the shortcomings of its predecessors, showcasing its capacity to elevate performance in the realm of image super-resolution (Ref: Algorithm 3).
Algorithm 3 Multi-Connected CNN for SR algorithm
1: Input: An LR image is input to the algorithm.
2: Bicubic interpolation: The LR image is up-sampled using bicubic interpolation, a simple but effective method that increases the resolution of an image without introducing too much blurring.
3: Convolutional neural network (CNN): The up-sampled image is passed through a CNN, which learns to extract features from the image that can be used to improve the resolution.
4: Patch extraction: The output of the CNN is divided into patches; each patch is processed individually.
5: Non-linear mapping: A non-linear mapping is applied to each patch to obtain an enhanced patch, preserving the details of the image while also improving the resolution.
6: Reconstruction: The patches are reconstructed to form the final high-resolution image.
3.4 Cascading Residual Network (CRN) In a more recent development, the Cascading Residual Network (CRN) was proposed as a solution to counteract the issue of extensive parameters within the EDSR network structure. This parameter bloat was a direct outcome of significantly increasing the network depth to enhance EDSR's performance. CRN's design was influenced by the architecture of EDSR, with a strategic replacement of EDSR's residual blocks with locally sharing groups (LSG). The LSG concept encompassed the integration of several local wider residual blocks (LWRB). These LWRB closely resembled EDSR's residual blocks, yet distinguished themselves through a variation in channel utilization. Every LSG, as well as every individual LWRB within it, was integrated with a residual learning connection. By adopting this novel approach, CRN effectively sidestepped the issue of unwieldy parameter expansion that had plagued deeper EDSR models. The strategic use of locally sharing groups and wider residual blocks showcased CRN's potential in maintaining model performance while effectively managing parameter growth—a vital stride toward more efficient and powerful image super-resolution networks [17].
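The sketch below illustrates one plausible reading of a local wider residual block and a locally sharing group, each wrapped by its own skip connection. The channel widths, depth, and grouping are assumptions for illustration only and are not taken from the CRN paper.

```python
import torch
import torch.nn as nn

class LocalWiderResidualBlock(nn.Module):
    """A wide conv pair with a local skip connection (illustrative widths)."""
    def __init__(self, channels=64, width=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, width, 3, padding=1)   # widen the channels
        self.conv2 = nn.Conv2d(width, channels, 3, padding=1)   # project back
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))          # local residual learning

class LocallySharingGroup(nn.Module):
    """A group of blocks wrapped by its own group-level skip connection."""
    def __init__(self, n_blocks=4, channels=64):
        super().__init__()
        self.blocks = nn.Sequential(*[LocalWiderResidualBlock(channels) for _ in range(n_blocks)])

    def forward(self, x):
        return x + self.blocks(x)

out = LocallySharingGroup()(torch.rand(1, 64, 32, 32))
print(out.shape)   # torch.Size([1, 64, 32, 32])
```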
3.5 Enhanced Residual Network for SR An alternative neural network, known as ERN, exhibited a slight performance improvement over the CRN model. The design of the ERN network was also inspired by the EDSR architecture, but with an innovative inclusion of an extra skip connection. This connection linked the low-resolution (LR) input to the output originating from the final LWRB (local wider residual block) through a multiscale block (MSB). An algorithmic representation of this configuration is provided in Algorithm 4. The primary objective behind the integration of the multiscale block (MSB) was to capture features from the input image across various scales. In contrast to CRN's utilization of the LSG (locally sharing group) technique, ERN opted for the application of LWRB in a non-linear mapping fashion (Ref: Algorithm 4).
Algorithm 4 Enhanced Residual Network for SR
1: Input: LR image
2: Output: HR image
3: Feature extraction: Apply convolutional layers to extract features from the LR image
4: Use a non-linear mapping to transform the features
5: Up-sampling: Upsample the features to the desired resolution
6: Residual learning: Add the up-sampled features to the low-resolution image
7: Output: The high-resolution image
3.6 Deep-Recursive CNN for SR Deep-recursive CNN for SR, introduced as the pioneering algorithm to employ a recursive approach for image super-resolution, brought a novel perspective to the field, as illustrated in Fig. 6. It comprised three principal components: the embedding, inference, and reconstruction nets. The embedding net's role was to extract relevant features from the interpolated image. These extracted features then traversed the inference net, notable for its unique characteristic of sharing weights across all filters. Within the inference net, the outputs of intermediate convolutional layers and the interpolated features underwent convolution before their summation generated a high-resolution (HR) image. The distinctive advantage of DRCN lay in its capacity to address the challenge encountered in SRCNN, where achieving superior performance necessitated a high number of mapping layers. By embracing a recursive strategy, Deep-recursive CNN harnessed shared weights. Furthermore, the amalgamation of intermediate outputs from the inference net brought substantial enhancement to the model's performance. Incorporating residual learning principles into the network contributed
Fig. 6 Deep-recursive CNN for SR
further to improved convergence. In the realm of outcomes, Deep-recursive CNN yielded a noteworthy 2.44% enhancement over SRCNN, underscoring its efficacy in pushing the boundaries of image super-resolution.
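The recursion-with-shared-weights idea can be sketched as follows: a single convolution whose weights are reused at every recursion step, with the intermediate outputs combined at the end. The number of recursions and the simple averaging fusion are illustrative simplifications of DRCN, not its exact architecture.

```python
import torch
import torch.nn as nn

class RecursiveSRSketch(nn.Module):
    """One shared convolution applied recursively, then intermediate outputs fused."""
    def __init__(self, channels=64, recursions=5):
        super().__init__()
        self.embed = nn.Conv2d(1, channels, 3, padding=1)           # embedding net
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)   # weights reused each step
        self.reconstruct = nn.Conv2d(channels, 1, 3, padding=1)     # reconstruction net
        self.recursions = recursions

    def forward(self, x_upsampled):
        h = torch.relu(self.embed(x_upsampled))
        intermediates = []
        for _ in range(self.recursions):
            h = torch.relu(self.shared(h))                          # same filter applied recursively
            intermediates.append(self.reconstruct(h) + x_upsampled) # residual output per step
        return torch.stack(intermediates).mean(dim=0)               # simple fusion of step outputs

print(RecursiveSRSketch()(torch.rand(1, 1, 48, 48)).shape)   # torch.Size([1, 1, 48, 48])
```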
3.7 Dual-Branch CNN The previous section examined various algorithms that primarily relied on stacking convolutional layers sequentially. However, this approach resulted in increased runtime and memory complexity. To address this concern, a dual-branch image super-resolution algorithm, named Dual-Branch CNN, was introduced. The network
Fig. 7 Dual-branch CNN architecture
architecture of Dual-Branch CNN is illustrated in Fig. 7. In Dual-Branch CNN, the network architecture diverged into two branches. One branch incorporated a convolutional layer, while the other branch employed a dilated convolutional layer. The outputs from both branches were subsequently merged through a concatenation process before undergoing up-sampling. Dual-Branch CNN also exhibited a distinctive approach by combining bicubic interpolation with alternative up-sampling methods, such as utilizing deconvolutional kernels for the reconstruction phase. Several notable advantages emerged from the Dual-Branch CNN architecture. Firstly, the dual-branch structure effectively circumvented the intricacies often encountered in chain-way-based networks, streamlining the model’s complexity. Secondly, the incorporation of dilated convolutional filters notably improved image quality during the reconstruction process. Thirdly, the integration of residual learning principles accelerated convergence, contributing to faster training [18].
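A rough sketch of the dual-branch idea follows: one branch uses ordinary convolutions, the other dilated convolutions, and their outputs are concatenated before reconstruction. The sub-pixel (pixel-shuffle) up-sampling used here is a stand-in for the deconvolution/bicubic combination described above, and all channel counts and depths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualBranchSketch(nn.Module):
    """Two parallel branches (plain and dilated convolutions) fused by concatenation."""
    def __init__(self, channels=32, scale=2):
        super().__init__()
        self.branch_a = nn.Sequential(                        # ordinary convolutions
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.branch_b = nn.Sequential(                        # dilated convolutions: wider context
            nn.Conv2d(1, channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        )
        self.reconstruct = nn.Sequential(                     # fuse and up-sample
            nn.Conv2d(2 * channels, scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x_lr):
        fused = torch.cat([self.branch_a(x_lr), self.branch_b(x_lr)], dim=1)
        return self.reconstruct(fused)

print(DualBranchSketch()(torch.rand(1, 1, 32, 32)).shape)   # torch.Size([1, 1, 64, 64])
```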
3.8 Quantitative Results Obtained from Different Algorithms Here, the features of the images generated by various algorithms were carefully observed. A comparison between SRCNN and the bicubic interpolation method revealed that images produced through interpolation appeared blurry, lacking clear details in contrast to the sharpness achieved by SRCNN. Comparing the outputs of FSRCNN with those of SRCNN, there appeared to be minimal discrepancy.
However, both FSRCNN exhibited superior processing speed [19]. The incorporation of residual learning within VDSR substantially improved image texture, surpassing that achieved by SRCNN. Models benefiting from enhanced learning through residual mechanisms displayed notable enhancement in image texture. DRCN, which harnessed both recursive and residual learning, yielded images with more defined edges and patterns, markedly crisper than the slightly blurred edges produced by SRCNN. CRN further improved upon this aspect, delivering even sharper edges than DRCN. On the other hand, GLRL generated significantly clearer images compared to DRCN, albeit with a somewhat compromised texture. Images generated by CRN exhibited superior texture compared to DRCN, while SRDenseNet managed to reconstruct images with improved texture patterns, effectively mitigating distortions that proved challenging for DRCN, VDSR, and SRCNN to overcome. Noteworthy improvements were observed in images produced by DBCN, showcasing a superior restoration of collar texture without introducing additional artifacts. This achievement translated to a more visually appealing outcome than what was observed with CRN. DBCN demonstrated an enhanced capacity to restore edges and textures, surpassing the capabilities of SRCNN in this domain. Figures 8, 9, 10, and 11 provide a comprehensive summary of the quantitative outcomes achieved by the respective algorithms developed by the authors (Table 2).
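For reference, the sketch below shows how PSNR (the metric reported in Figs. 8, 9, 10 and 11) is typically computed for 8-bit images; SSIM is usually taken from a library such as scikit-image rather than re-implemented. The arrays here are random stand-ins for ground-truth and super-resolved images.

```python
import numpy as np

def psnr(reference, reconstructed, max_val=255.0):
    """Peak Signal-to-Noise Ratio between two images of the same shape."""
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

# from skimage.metrics import structural_similarity as ssim   # optional SSIM

ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)     # stand-in ground truth
rec = np.clip(ref + np.random.randint(-5, 6, (64, 64)), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, rec):.2f} dB")
```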
Fig. 8 Mean PSNR and SSIM for set5 dataset
Fig. 9 Mean PSNR and SSIM for set14 dataset
Fig. 10 Mean PSNR and SSIM for Bsd100 dataset
4 Different Network Design Strategies Diverse approaches have been taken by numerous researchers to enhance the performance of image super-resolution models. Table 2 summarizes the key features of the network design strategies discussed above. At its core, the linear network was the foundational design, depicted in Fig. 12. This design concept drew inspiration from the residual neural network (ResNet), widely utilized for object recognition in images. The linear network technique was employed by models such as SRCNN, FSRCNN, and ESPCN. Although these three models employed a similar design approach, there were differences in their internal architectures and upsampling methods. For instance, SRCNN exclusively consisted of feature extraction,
Fig. 11 Mean PSNR and SSIM for Urban100 dataset
Table 2 Key features of network design strategies

Network design strategy | Key features
Linear network | Single layer of neurons, linear functions
Residual learning | Builds on top of simpler functions, residual connections
Recursive learning | Learns hierarchical representations, recursive connections
Dense connections | Connections between all neurons in a layer
and an up-sampling module, while FSRCNN integrated feature extraction, a contraction layer, non-linear mapping, an expansion layer, and an up-sampling module (Ref: Table 2). Nonetheless, the linear network approach fell short of fully harnessing the wealth of feature information present in the input data. The low-level features, originating from the LR (low-resolution) images, encapsulated valuable information that exhibited strong correlations with high-level features. Consequently, relying solely on a linear network could result in the loss of pertinent information. To address this limitation, the concept of residual learning was introduced, depicted in Fig. 12. The incorporation of residual learning significantly expedited training convergence and mitigated the degradation issues encountered with deeper networks. Within the assessed algorithms, two variations of residual learning were identified: local residual learning and global residual learning. With the exception of VDSR, all algorithms implemented both forms of residual learning. Local residual learning primarily established connections between residual blocks, while global residual learning linked the features from LR images to the final features.
Fig. 12 Different network design
5 Use of Image Super-Resolution in Various Domains Image super-resolution has found wide-ranging applications across diverse domains over the past three decades. Notably, fields such as medical diagnosis, surveillance, and biometric information identification have integrated image super-resolution techniques to address specific challenges and enhance their capabilities.
5.1 In the Field of Medical Imaging In the realm of medical diagnostics, accurate judgment is a critical skill. However, images obtained through techniques like CT, MRI, and PET-CT often suffer from low resolution, noise, and inadequate structural information, posing challenges for correct diagnoses [10]. Image super-resolution has garnered attention for its potential to enable zooming into images and enhancing diagnostic capabilities. A CNN-based U-Net algorithm was proposed to generate high-resolution CT brain images, surpassing traditional techniques like the Richardson–Lucy deblurring algorithm. Other authors introduced networks for enhancing MRI brain images to aid tumor detection, and [11] implemented a channel splitting network (CSN) for MRI brain images. Both applications showcased the advantages of CNN-based algorithms over traditional methods like bicubic interpolation, as seen in improved metrics like PSNR and SSIM.
5.2 Surveillance Applications Surveillance systems play a crucial role in security monitoring and investigations. However, unclear video quality from these systems, stemming from factors like small image size and poor CCTV quality, poses challenges. Image super-resolution techniques have been employed to overcome these issues. Some authors proposed deep CNNs for surveillance footage super-resolution, demonstrating the superiority of CNN-based methods over traditional techniques [20, 21]. Despite these advancements, applying image super-resolution in surveillance remains challenging due to factors such as complex motion, very large amounts of feature data, and varying image quality from CCTV.
5.3 Biometric Information Identification Applications Biometric identification methods, including face, fingerprint, and iris recognition, require high-resolution images for accurate detection and recognition. SRCNN was applied in [12] to enhance facial images used in surveillance. The authors of [13] employed deep CNNs for facial image resolution enhancement, and [14] utilized the PFENetwork to enhance fingerprint image resolution.
6 Conclusion The emergence of image SR-technology has garnered significant attention across various application domains. The evolution of deep learning has spurred researchers to innovate CNN-based approaches, seeking models that offer improved performance with reduced computational demands. With the inception of the pioneering SRCNN, a multitude of techniques, including diverse up-sampling modules and network design strategies, have enriched the landscape of image super-resolution algorithms. Yet, focusing solely on methodologies may not yield optimal results in model refinement. It is paramount to delve into the inherent characteristics of each approach, comprehending their merits and limitations. Such insights empower developers to judiciously select design pathways that enhance models intelligently. This synthesis of knowledge from diverse studies not only informs about methodologies but also imparts an understanding of the nuanced attributes that underlie them. This comprehensive review stands poised to provide invaluable guidance to developers aiming to elevate the performance of image super-resolution models in terms of both efficiency and quality. By encapsulating a profound understanding of diverse techniques and their intricacies, this review serves as a beacon for future image super-resolution advancements, illuminating the trajectory toward more sophisticated and impactful developments.
References 1. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for im- age super-resolution. In: European Conference on Computer Vision, pp. 184–199. Springer, Cham, Switzerland (2014) 2. Dong, C., Loy, C.C., Tang, X.: Accelerating the super-resolution convolutional neural network. In: European Conference on Computer Vision, pp. 391–407. Springer, Cham, Switzerland (2016) 3. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654. Las Vegas, NV, USA, 27–30 June 2016 4. Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1132–1140. Honolulu, HI, USA, 21–26 July 2017 5. Chu, J., Zhang, J., Lu, W., Huang, X.: A Novel multiconnected convolutional net- work for super-resolution. IEEE Signal Process. Lett. 25, 946–950 (2018) 6. Lan, R., Sun, L., Liu, Z., Lu, H., Su, Z., Pang, C., Luo, X.: Cascading and enhanced residual networks for accurate single-image super-resolution. IEEE Trans. Cybern. 51, 115–125 (2021) 7. Kim, J., Lee, J.K., Lee, K.M.: Deeply-recursive convolutional network for image superresolution. In: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1637–1645. Las Vegas, NV, USA, 27–30 June 2016 8. Hou, J., Si, Y., Li, L.: Image super-resolution reconstruction method based on global and local residual learning. In: Proceed- ings of the 2019 IEEE 4th Inter- national Conference on Image, Vision and Computing (ICIVC), pp. 341–348. Xiamen, China, 5–7 July 2019
9. Gao, X., Zhang, L., Mou, X.: Single image super-resolution using dual-branch convolutional neural network. IEEE Access 7, 15767–15778 (2019) 10. Ren, S., Jain, D.K., Guo, K., Xu, T., Chi, T.: Towards efficient medical lesion image superresolution based on deep residual networks. Signal Process. Image Communication. 11. Zhao, X., Zhang, Y., Zhang, T., Zou, X.: Channel splitting network for single MR image super-resolution. IEEE Trans. Image Process. 28, 5649–5662 (2019) 12. Rasti, P., Uiboupin, T., Escalera, S., Anbarjafari, G.: Convolutional Neural network super resolution for face recognition in surveillance monitoring. In: Articulated Motion and Deformable Objects, pp. 175–184. Springer: Cham, Switzerland (2016) 13. Deshmukh, A.B., Rani, N.U.: Face video super resolution using deep convolutional neural network. In: Proceedings of the 2019 5th International Conference on Computing, Communication, Control and Automation (ICCUBEA), pp. 1–6. Pune, India, 19–21 September 2019 14. Shen, Z., Xu, Y., Lu, G.: CNN-based high-resolution fingerprint image enhancement for pore detection and matching. In: Proceedings of the 2019 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 426–432. Xiamen, China, 6–9 December 2019 15. Chatterjee, P., Milanfar, P.: Clustering-based denoising with locally learned dictionaries. IEEE Trans. Image Process. 18(7), 1438–1451 (2009) 16. Xu, X.L., Li, W., Ling.: Low Resolution face recognition in surveillance systems. J. Comp. Commun. 02, 70–77 (2014). https://doi.org/10.4236/jcc.2014.22013 17. Li, Y., Qi, F., Wan, Y.: Improvements on bicubic image interpolation. In: 2019 IEEE 4th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC). Vol. 1. IEEE (2019) 18. Kim, T., Sang Il Park, Shin, S.Y.: Rhythmic-motion synthesis based on motion-beat analysis. ACM Trans. Graph. 22(3), 392–401 (2003) 19. Xu, Z. et al.: Evaluating the capability of satellite hyperspectral Im- ager, the ZY1–02D, for topsoil nitrogen content estimation and mapping of farm lands in black soil area, China.” Remote Sens. 14(4), 1008 (2022) 20. Mishra, S.R., et al.: Real time human action recognition using triggered frame extraction and a typical CNN heuristic. Pattern Recogn. Lett. 135, 329–336 (2020) 21. Mishra, S.R., et al.: PSO based combined kernel learning framework for recognition of firstperson activity in a video. Evol. Intell. 14, 273–279 (2021)
Applying Data Analytics and Time Series Forecasting for Thorough Ethereum Price Prediction Asha Rani Mishra, Rajat Kumar Rathore, and Sansar Singh Chauhan
Abstract Finance has been combined with technology to introduce new advances and facilities in the domain. One such technological advance is cryptocurrency, which works on Blockchain technology and has become a new topic of research in computer science. However, these currencies are volatile in nature, and forecasting them can be challenging, as there are dozens of cryptocurrencies in use around the world. This chapter uses a time series-based forecasting model to predict the future price of Ethereum, since it handles both logistic growth and piece-wise linearity of the data. The model is comparatively independent of seasonality contained in past or historical data. It is made suitable for real use cases after seasonal fitting, and is evaluated alongside a Naïve model, time series analysis, and the Facebook Prophet module (FBProphet). The FBProphet model achieves better accuracy than the other models. This chapter aims at drawing a better statistical model with Exploratory Data Analysis (EDA) on the basis of several trends from 2016 to 2020. The analysis carried out in the chapter can help in understanding various trends related to Ethereum price prediction.
A. R. Mishra (B) · R. K. Rathore · S. S. Chauhan Department of Computer Science, GL Bajaj Institute of Technology and Management, Greater Noida, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024 P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145, https://doi.org/10.1007/978-981-97-0448-4_7
1 Introduction Cryptocurrencies act as mediums of exchange based on computer networks and are not reliant on any central or government authority, bank, or financial institution. The network itself verifies whether the concerned parties have the money they claim to own, thereby reducing reliance on traditional banking. These currencies, or coins, are reliable in terms of data integrity and security because they are built on the Blockchain framework, and they have little chance of being converted or used for fraudulent activities. Cryptocurrencies have not yet been recognized and adopted globally and, except in a few countries, are still new to the market. From the pool of cryptocurrencies, Bitcoin and Ethereum are the most prominent ones, since they are strongly affected by news and
events. Bitcoin was introduced in 2008 by Nakamoto. Based on Blockchain technology, it allows peer-to-peer transactions. Cryptocurrency is so called because it relies on cryptography to verify transactions: storing and transmitting cryptocurrency data between wallets and to public ledgers involves advanced encoding. These currencies are widely popular because of their safety and security. A ledger or database that is distributed and shared by the nodes of a computer network is known as a Blockchain, which aids in the secure and decentralized recording of cryptocurrency transactions. A block is a data collection that stores transactions together with additional information such as the correct sequence and the creation timestamp. The initial stage in a Blockchain is a transaction, which represents a participant's activity. Machine learning algorithms are commonly used in this field because of their ability to dynamically select from a vast number of attributes that could possibly affect the results and to capture complex, high-dimensional relationships between features and targets. Because predictive capacity varies from coin to coin, machine learning and artificial intelligence appear to be better solvers. Compared to currencies with high volatility, low-volatility cryptocurrencies are easier to forecast, yet most cryptocurrency coins show strong volatility in their price charts. Prices are also affected by supply and demand: the number of coins in circulation and the number held by buyers are major factors influencing prices. This chapter uses a model to forecast the closing price of Ethereum, along with the Ethereum opening price, daily high price, daily low price, and trading volume. The value of Ethereum on a given day can be estimated using sophisticated machine learning algorithms that find hidden patterns in the data to make precise predictions.
2 Related Work Hitam et al. in [1] focus on six different types of cryptocurrency coins available in the market by collecting historical data of 5 years. They trained four different models on the dataset and the performances of different classifiers were checked. These models were namely, SVM, BoostedNN, ANNs, and Deep Learning giving accuracies of 95.5%, 81.2%, 79.4%, and 61.9%, respectively. Siddhi Velankar et al. in [2] use two different approaches for predicting prices—GLM/Random Forest and Bayesian Regression model. It uses Bayesian Regression by dividing data in the form of 180s, 360s, 720s and takes the help of k-means clustering to narrow down effective clusters and further calculating the corresponding weights from data using the Bayesian Regression method. In GLM/Random Forest model, the authors distribute data into 30, 60, and 120 min time series datasets. The issue is addressed by Chen et al. in [3] by splitting the forecasting sample into intervals of five minutes with a large sample size and daily intervals with a small sample size. Lazo et al. in [4] builds two decision trees based on datasets of two currencies—Bitcoin and Ripple where week 1 decision tree model gives best decisions for selling the coins 1 week after purchase and the week 2 decision tree model gives best investment advice on coins giving highest gains. Derbentsev et al. in [5] perform predictive analysis on the
cryptocurrency prices (BTC, ETH, XRP) using two different models, Stochastic Gradient Boosting Machine (SGBM) and Random Forest (RF). The same features were used to train both models. The authors use one-step-ahead forecasting on the most recent 91 observations to evaluate the effectiveness of the two ensembles; the SGBM and RF models give an RMSE of 5.02 and 6.72 for the Ethereum coin, respectively. Chittala et al. in [6] introduce two artificial intelligence frameworks, ANN and LSTM, for predictive analysis. The authors predict the price of Bitcoin one day into the future through ANNs with five different lengths of memory, whereas LSTM models the internal memory flow and how it affects future analysis. They conclude that the ANN and LSTM models are comparable and perform almost equally well in predicting prices, although their internal structures differ: ANN depends more on long-term history, whereas LSTM is driven by short-term dynamics, and hence, in terms of historical data, LSTM is able to extract more useful information. Soumya A. et al. in [7] follow a similar approach by comparing ANN and LSTM models. The ANN model improves markedly in predicting Ethereum prices when trained on all the historical information of the last month rather than on short-term windows; the LSTM model, on the other hand, performs just as well as ANN when predicting one-day-ahead prices based on MSE. Suryoday Basak et al. in [8] work on the prediction of stock prices and frame the problem as a classification task: forecasting whether stock prices will increase or decrease based on the prices of the previous n days. They propose an experimental technique for this classification using ensembles of decision trees, namely random forest and gradient-boosted decision trees (XGBoost). Poongodi M. et al. in [9] employ a time series of daily closing values of the Ethereum cryptocurrency and apply two well-known machine learning algorithms for price prediction: linear regression and SVM. The SVM method gives a higher accuracy (96.06%) than the linear regression model (85.46%), and the authors note that the accuracy can be raised to almost 99% by adding features to the SVM. Azeez A. et al. [10] design prediction models with two families of methods, namely boosting and ensemble techniques (GBM, AdaBoost, XGBoost, MLP) and deep learning neural networks (DFNN, GRU, CNN). Results show that the deep learning techniques achieve an explained variance score (EVS) between 88 and 98%, meaning they account for most of the overall variance in the observed data, whereas the boosted-tree techniques do not perform well enough in predicting daily closing prices for cryptocurrencies since insights are missing from the training set. Aggarwal et al. in [11] use the gold price to predict the price of the BTC currency with the help of three deep learning algorithms, namely Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Gated Recurrent Network (GRN). The price predicted from the gold price alone shows a certain deviation from the actual Bitcoin price; the best result was given by LSTM. Phaladisailoed et al. in [12] use four algorithms (Theil–Sen regression, Huber regression, LSTM, and GRU) to predict the price of the cryptocurrency Bitcoin; LSTM offers the highest accuracy of all, i.e., 52.78%. Carbó et al. in [13] show that adding the previous period's Bitcoin price to the explanatory variables of an LSTM-based model improves the RMSE
from 21 to 11%. Quantile regression boosting is used by Christian Pierdzioch et al. in [14] to forecast gold prices; trading rules derived from this approach may prove superior even to buy-and-hold strategies in settings with low trading costs and particular quantiles. Perry Sadorsky et al. in [15] predict gold and silver price direction and conclude that tree-based algorithms such as Random Forest (RF) are more accurate than logit models when technical indicators are used as features; here too, RF predictions generate trading rules capable of replacing a buy-and-hold strategy. Samin-Al-Wasee et al. [17] fitted ether price data to a number of basic and hybrid LSTM network types and used time series modeling to forecast future prices; to assess the efficacy of the LSTM networks, they carried out a comparative study of the models against other popular forecasting methods, with ARIMA as the baseline forecast. Sharma and Pramila in [18] proposed a hybrid model using long short-term memory (LSTM) and vector auto-regression (VAR) to forecast the price of Ethereum; compared with the standalone models, the hybrid model yielded the lowest values for the assessment measures. Existing work thus covers the use of multiple algorithms for price prediction and speculation in the stock market, where a large number of parameters are available. In the case of digital coins, however, fewer parameters are available while seasonality and fluctuation are far stronger than for stocks, so the range of algorithms discussed for predicting cryptocurrency prices is quite limited. Earlier, methods such as logistic regression, linear discriminant analysis, ARIMA, and LSTM were used for price prediction, and among these ARIMA was the most prominent; however, the higher seasonality also degraded ARIMA's predictions. There is therefore a need for a new approach that can overcome the limitations of the existing algorithms and give better predictions.
3 Research Methodology

Data analytics along with time series modeling can be used to predict the price of Ethereum by exploiting information about past trends and patterns. It helps to find correlated features for prediction models, and sentiment analysis based on social trends can help identify factors that influence the price. Developing precise forecasting models benefits from the examination of long-term, seasonal, and cyclical trends, and data analytics facilitates finding pertinent features for time series forecasting models. The predictive capacity of the model can be increased by combining sentiment analysis features, on-chain analytics, and technical indicators. In short, time series forecasting and data analytics work well together for predicting Ethereum prices: by utilizing past information, market sentiment, and sophisticated modeling approaches, they help analysts and investors make better decisions. Nonetheless, it is critical to recognize the inherent unpredictability of cryptocurrency markets and to constantly refine models in order to accommodate changing
circumstances. Predicting or commenting on cryptocurrencies is difficult because of their price volatility and visible dynamic fluctuations. From the existing work, different pros and cons of time series algorithms such as ARIMA and LSTM were identified: ARIMA, in particular, is unable to handle factors such as seasonality in the data and the dependence between data points. This problem can be reduced or eliminated by contrasting a few machine learning techniques for analyzing market movements. This chapter presents a methodology that predicts prices of the cryptocurrency Ethereum by using machine learning algorithms in a hybrid fashion. The data is smoothed, enhanced, and prepared before the Facebook Prophet algorithm is finally applied to it. Facebook Prophet is able to handle the shortcomings of the algorithms previously used for such predictions, such as dynamic behavior, seasonality, and holidays [16]. Figure 1 depicts the workflow used in the chapter. Because the data exhibits non-stationarity and noise and requires smoothing, feature engineering was carried out during the process. At each step, graph visualization was required to analyze the intermediate results and decide on the next step.

Fig. 1 Workflow used for prediction of Ethereum
(i). Fetching of raw data, which can be done from third-party APIs, web scraping, etc.
(ii). After cleaning the raw data, exploratory data analysis (EDA) is conducted to understand the behavior of the data.
(iii). On this data, a naïve model (also known as a baseline model), such as an autoregressive or moving-average model, is implemented.
(iv). After data cleaning, the next step is feature engineering, including a check of whether the data is stationary. For this, a visual inspection of the line plot and the Augmented Dickey–Fuller (ADF) statistical test are used. Checking the stationarity of the data is essential in time series analysis since it strongly affects the interpretation of the data.
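Steps (i) and (ii) can be sketched as follows; the data source, file name, and column names here are illustrative assumptions rather than the chapter's exact pipeline.

import pandas as pd

# Steps (i)-(ii): load daily Ethereum data (e.g. exported from a third-party API)
# and perform basic cleaning before EDA.
df = pd.read_csv("ETH-USD.csv")               # hypothetical export file
df["Date"] = pd.to_datetime(df["Date"])       # timestamp type needed for time series work
df = df.sort_values("Date").drop_duplicates(subset="Date")
df = df.dropna(subset=["Close"])              # drop rows with missing closing prices
print(df.describe())                          # quick summary statistics for EDA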
4 Results and Discussions

Numerous statistical models are built on the assumption that there is no dependence between the various points used for prediction. To fit a stationary model to the time series data being analyzed, one should check for stationarity and remove the trend/seasonality effect from the data: the statistical properties must remain constant over time. This does not mean that all data points have to be the same; rather, the general behavior of the data should be consistent. Time plots that look constant on a purely visual level are considered stationary; stationarity also means that the mean and variance are consistent with respect to time. In the data preprocessing step, since the time series-based Facebook Prophet model is used, the date feature must be an object of timestamp type; it is therefore first converted to date/time format and the data is then sorted by date. Exploratory data analysis (EDA) is carried out on the data sample shown in Fig. 2; the 'Close' feature, plotted in Fig. 3, is the quantity of interest, since the ultimate purpose is to forecast the closing price of Ethereum. The mean or average closing price can be found using the mean function, and the values are plotted by date on a weekly and yearly basis as shown in Figs. 4 and 5, respectively.
Fig. 2 Data sample
Fig. 3 Close feature plot
Fig. 4 Plot data sum on weekly basis
Fig. 5 Plot data on yearly basis
Fig. 6 Closing price on monthly basis
Fig. 7 Mean of ‘Close’ by week
The graph in Fig. 6 shows the trend of prices over yearly, weekly, and monthly periods using the mean function. In Fig. 7, the average weekly closing price is analyzed by taking the mean of 'Close' per week. The mean closing price per day is analyzed and plotted in Fig. 8, and in the same manner the average closing prices on a quarterly basis are analyzed and plotted in Fig. 9. The trend that closing prices follow on weekdays versus weekends was also analyzed and plotted in Fig. 10, which shows only minor differences between the two graphs.
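The aggregations behind Figs. 6–10 can be sketched as follows, continuing from the data-loading sketch above (column names remain illustrative assumptions).

# Mean closing price at several resolutions, plus a weekday/weekend split.
df = df.set_index("Date").sort_index()

weekly_mean = df["Close"].resample("W").mean()       # cf. Fig. 7
daily_mean = df["Close"].resample("D").mean()        # cf. Fig. 8
quarterly_mean = df["Close"].resample("Q").mean()    # cf. Fig. 9

is_weekend = df.index.dayofweek >= 5                 # Saturday/Sunday
weekend_vs_weekday = df["Close"].groupby(is_weekend).mean()   # cf. Fig. 10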
4.1 Naïve Model for Ethereum Price Prediction

Using this data, a baseline or naïve model is first used for prediction, as shown in Fig. 11. In a naïve model each data point is predicted from the previous data point, as shown in Fig. 12.
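A minimal sketch of such a persistence baseline, assuming the df DataFrame from the earlier sketches, is:

import numpy as np

# Naive (persistence) forecast: the prediction for each day is the previous day's close.
naive_forecast = df["Close"].shift(1)
errors = (df["Close"] - naive_forecast).dropna()
baseline_rmse = np.sqrt((errors ** 2).mean())
print("Naive baseline RMSE:", baseline_rmse)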
Fig. 8 Average closing price each day
Fig. 9 Average closing price of each Quarter
4.2 Seasonality Test

The next step is to determine whether or not the data exhibits seasonality. Seasonality is the existence of fluctuations or changes that recur regularly, such as once
Fig. 10 Weekend and Weekdays plot
Fig. 11 Naive prediction
a week, once a month, or once every three months. These periodic, repeating, often regular, and predictable patterns in the level of a time series may be attributed to a variety of events, including weather, vacations, and holidays. The seasonality of the curve is removed by applying a rolling (moving) average with a window of 7 to the data. The rolling mean and standard deviation are shown in Fig. 13: the blue line, which now overlaps the green curve in the graph, represents the rolling mean values, while the orange line reflects the original series. This leads to the conclusion that the rolling mean is not constant and varies over time. The series must therefore be de-seasonalised and transformed into a stationary state.
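The rolling statistics used for this check can be sketched as follows (a 7-period window, as in the text; the plotting details are illustrative).

import matplotlib.pyplot as plt

rolling_mean = df["Close"].rolling(window=7).mean()
rolling_std = df["Close"].rolling(window=7).std()

plt.plot(df["Close"], label="Original series")
plt.plot(rolling_mean, label="Rolling mean (window=7)")
plt.plot(rolling_std, label="Rolling std (window=7)")
plt.legend()
plt.show()          # cf. Fig. 13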
Fig. 12 Naïve prediction vs. actual values plot
Fig. 13 Mean and standard deviation plot
4.3 Augmented Dickey–Fuller (ADF) Test

The 'adfuller' function is easily imported from the statsmodels package and applied to the 'Close' data. This results in a p-value of 0.0002154535155876224. The null hypothesis is rejected since this value is less than 0.05, so the data is considered stationary. The log transformation is often used to reduce the skewness of a variable, as shown in Fig. 14, and the data is smoothed using the moving average, a technique used frequently in financial markets. The impact of the rolling window and log transformation is shown in Fig. 15.
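A sketch of this stationarity check and of the log/moving-average smoothing, with illustrative variable names, is shown below.

import numpy as np
from statsmodels.tsa.stattools import adfuller

# ADF test on the closing prices: a p-value below 0.05 rejects the
# null hypothesis of non-stationarity.
adf_stat, p_value, *_ = adfuller(df["Close"].dropna())
print("ADF statistic:", adf_stat, "p-value:", p_value)

# Log transform to reduce skewness, then a 7-day moving average (cf. Figs. 14-15).
log_close = np.log(df["Close"])
rolling_mean_log = log_close.rolling(window=7).mean()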
Fig. 14 Removal of seasonality factor
Fig. 15 Log transformation and moving average
Figure 16 shows that the null hypothesis is rejected and, as a result, the data turns out to be stationary: the time series is roughly stationary with a constant interval. Shifting is used to apply differencing and capture the remaining tendency of seasonality, as seen in Fig. 17; further seasonal adjustment results are shown in Fig. 18. From these it may be inferred that the Dickey–Fuller test gives a value well below the 1% level.
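Continuing from the previous sketch, the shift-based differencing and the repeated ADF test can be written as:

# Remove the moving-average trend, then difference via shift (cf. Fig. 17)
# and re-run the ADF test on the result.
detrended = (log_close - rolling_mean_log).dropna()
diffed = (detrended - detrended.shift(1)).dropna()

adf_stat, p_value, *_ = adfuller(diffed)
print("p-value after differencing:", p_value)   # expected to be well below 0.01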
Fig. 16 Rolling and moving average difference is stationary
Fig. 17 Using shift to apply difference
4.4 Forecast Using Facebook Prophet Model

The algorithm used here for prediction is FBProphet. The FBProphet machine learning algorithm employs a decomposable time series model that consists of three key components: trend, seasonality, and holidays. They are combined in Eq. (1):

y(t) = g(t) + s(t) + h(t) + εt    (1)

g(t): a piece-wise linear or logistic growth curve that models non-periodic changes in the time series.
s(t): periodic changes (e.g., weekly or yearly seasonality).
h(t): effects of holidays and other irregular schedules.
Fig. 18 ADF test gives value much lesser than 1%
εt: an error term that accounts for any unusual changes not accommodated by the model.

The 'fbprophet' library provides the Prophet model itself. It handles irregular intervals and irregular holidays, and it likewise copes with noise and outliers in the data. 'Fbprophet' helps forecast time series data that follows non-linear patterns, since the data has seasonality on a yearly, weekly, and daily scale along with holiday effects. Results are best when the model is trained on historical data covering several seasons and on time series with considerable seasonal influences. Before fitting, the data must be prepared in accordance with the Prophet model documentation and must comply with its conventions: the output feature is represented as 'y' and the date as 'ds'. The forecast is generated at a daily frequency over a 500-day span. 'yhat' gives the actual forecast, while 'yhat_upper' and 'yhat_lower' give the upper-bound and lower-bound predictions respectively, as shown in Fig. 19. To plot this forecast, the Fbprophet library's built-in plotting functionality is used, as shown in Fig. 20. The black dots in the curve represent the actual values or prices, whereas the blue line shows the prediction curve. The light blue line depicts the trend, and the weekly, monthly, and yearly components are shown in Fig. 21.
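A minimal sketch of this fitting and forecasting step, assuming the df DataFrame from the earlier sketches (the package is imported as prophet in recent releases and as fbprophet in older ones), is:

from prophet import Prophet      # older installs: from fbprophet import Prophet

# Prophet expects the date column as 'ds' and the target as 'y'.
train = (df.reset_index()
           .rename(columns={"Date": "ds", "Close": "y"})[["ds", "y"]])

model = Prophet()                # weekly and yearly seasonality enabled by default
model.fit(train)

future = model.make_future_dataframe(periods=500, freq="D")   # 500-day daily span
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())   # cf. Fig. 19

fig1 = model.plot(forecast)               # actual values (dots) vs. prediction (line), cf. Fig. 20
fig2 = model.plot_components(forecast)    # trend and weekly/yearly components, cf. Fig. 21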
4.5 Validation of Predicted Data

The forecast model must now be cross-validated to quantify the forecast error: actual and predicted values are compared to calculate this error. There is a
Fig. 19 Forecast values
Fig. 20 Forecast using Fbprophet
built-in cross-validation method in Facebook Prophet. Its parameters include the forecast horizon, the size of the initial training period, and the spacing between cutoff dates, as seen in Fig. 22. The root-mean-square error (RMSE) curve can then be plotted to measure the difference between the actual and predicted values over the period 2019–2020.
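A hedged sketch of this built-in cross-validation is given below; the window sizes are illustrative, not the chapter's exact settings.

from prophet.diagnostics import cross_validation, performance_metrics
from prophet.plot import plot_cross_validation_metric

# initial training period, spacing between cutoffs, and forecast horizon
df_cv = cross_validation(model, initial="365 days", period="90 days", horizon="90 days")
df_metrics = performance_metrics(df_cv)
print(df_metrics[["horizon", "rmse", "mae", "mape"]].head())

fig = plot_cross_validation_metric(df_cv, metric="rmse")   # RMSE over the horizon, cf. Fig. 22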
Fig. 21 Weekly, monthly, and yearly plot
In order to evaluate the FBProphet model, four types of error measures can be used, i.e., Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Root Relative Squared Error (RRSE), and Mean Absolute Percentage Error (MAPE), shown in Table 1. Here, ẑk denotes the predicted value for the kth sample, zk the actual value, z̄ the average of the actual values, and N the total number of test samples. The FBProphet algorithm used for the trend analysis proved robust, giving an accuracy of approximately between 94.5 and 96.6%. From the experimental results it has been observed that the RMSE falls in the range 0–100, which is about 5.56%; the majority of the values lie between 0 and 80, which indicates that the model has about a 4.44% error.
Fig. 22 Root mean square error
Table 1 Performance metrics for the FBProphet forecasting model

Type of statistical error: Formula
Mean Absolute Error (MAE): MAE = (1/N) Σ_{k=1}^{N} |ẑk - zk|
Root Mean Square Error (RMSE): RMSE = sqrt( (1/N) Σ_{k=1}^{N} (ẑk - zk)^2 )
Root Relative Squared Error (RRSE): RRSE = sqrt( Σ_{k=1}^{N} (ẑk - zk)^2 / Σ_{k=1}^{N} (z̄ - zk)^2 ), where z̄ = (1/N) Σ_{k=1}^{N} zk
Mean Absolute Percentage Error (MAPE): MAPE = (100/N) Σ_{k=1}^{N} |(ẑk - zk)/zk|
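For reference, the four metrics in Table 1 can be written directly in NumPy (an illustrative sketch; z_hat and z stand for the predicted and actual series).

import numpy as np

def mae(z_hat, z):
    return np.mean(np.abs(z_hat - z))

def rmse(z_hat, z):
    return np.sqrt(np.mean((z_hat - z) ** 2))

def rrse(z_hat, z):
    return np.sqrt(np.sum((z_hat - z) ** 2) / np.sum((np.mean(z) - z) ** 2))

def mape(z_hat, z):
    return 100.0 * np.mean(np.abs((z_hat - z) / z))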
5 Conclusion and Future Work

Predictive techniques and models are typically built from historical data, yet a model that is reliable for the present world is difficult to build from previous data alone. In earlier work, logistic regression achieved an accuracy score of 66% and linear discriminant analysis 65.3%, while multi-linear regression models on coins such as BTC and LTC gave an R² score of 44% for LTC and 59% accuracy for BTC. Most problems are solved by building models on historical data, but in the case of cryptocurrencies future results cannot be predicted from a purely historical model: seasonality or other issues in the prior data affect the models' ability to accurately predict patterns. Under cross-validation, the FBProphet model used here was able to achieve around 97% accuracy in forecasting the future Ethereum price, and even with seasonal data the overall gap between predicted and actual values remained small compared to other models. Further, to improve the model's
accuracy and make it more reliable on present data, external factors that may affect Ethereum market prices, such as social media posts, tweets, and trading volume, might be added as additional inputs.
References

1. Hitam, N.A., Ismail, A.R.: Comparative performance of machine learning algorithms for cryptocurrency forecasting. Indones. J. Electr. Eng. Comput. Sci. 11, 1121–1128 (2018). https://www.ije.ir/article_122162.html
2. Velankar, S., Valecha, S., Maji, S.: Bitcoin price prediction using machine learning. In: 2018 20th International Conference on Advanced Communication Technology (ICACT), pp. 144–147. IEEE (2018)
3. Chen, Z., Li, C., Sun, W.: Bitcoin price prediction using machine learning: an approach to sample dimension engineering. J. Comput. Appl. Math. 365, 112395 (2019). https://www.sciencedirect.com/science/article/abs/pii/S037704271930398X
4. Lazo, J.G.L., Medina, G.H.H., Guevara, A.V., Talavera, A., Otero, A.N., Cordova, E.A.: Support system to investment management in cryptocurrencies. In: Proceedings of the 2019 7th International Engineering, Sciences and Technology Conference (IESTEC), pp. 376–381. Panama (9–11 October 2019)
5. Derbentsev, V., Babenko, V., Khrustalev, K., Obruch, H., Khrustalova, S.: Comparative performance of machine learning ensemble algorithms for forecasting cryptocurrency prices. Int. J. Eng. Trans. A Basics 34, 140–148 (2021)
6. Yiying, W., Yeze, Z.: Cryptocurrency price analysis with artificial intelligence. In: 2019 5th International Conference on Information Management (ICIM), pp. 97–101. IEEE (2019). https://doi.org/10.1109/INFOMAN.2019.8714700
7. Livieris, I.E., Pintelas, E., Stavroyiannis, S., Pintelas, P.: Ensemble deep learning models for forecasting cryptocurrency time-series. Algorithms 13(5), 121 (2020). https://doi.org/10.3390/a13050121
8. Basak, S., Kar, S., Saha, S., Khaidem, L., Dey, S.R.: Predicting the direction of stock market prices using tree-based classifiers. North Am. J. Econ. Finance 47, 552–567 (2019). https://doi.org/10.1016/j.najef.2018.06.013
9. Poongodi, M., Sharma, A., Vijayakumar, V., Bhardwaj, V., Sharma, A.P., Iqbal, R., Kumar, R.: Prediction of the price of Ethereum blockchain cryptocurrency in an industrial finance system. Comput. Electr. Eng. 81, 106527 (2020). https://doi.org/10.1016/j.compeleceng.2019.106527
10. Azeez, A.O., Anuoluwapo, O.A., Lukumon, O.O., Sururah, A. Bello, Kudirat, O.J.: Performance evaluation of deep learning and boosted trees for cryptocurrency closing price prediction. Expert Syst. Appl. 213, Part C, 119233, ISSN 0957-4174 (2023)
11. Aggarwal, A., Gupta, I., Garg, N., Goel, A.: Deep learning approach to determine the impact of socio economic factors on bitcoin price prediction. In: 2019 Twelfth International Conference on Contemporary Computing (IC3), pp. 1–5. IEEE (2019). https://doi.org/10.1109/IC3.2019.8844928
12. Phaladisailoed, T., Numnonda, T.: Machine learning models comparison for bitcoin price prediction. In: 2018 10th International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 506–511. IEEE (2018). https://doi.org/10.1109/ICITEED.2018.8534911
13. Carbó, J.M., Gorjón, S.: Application of machine learning models and interpretability techniques to identify the determinants of the price of bitcoin (2022)
14. Pierdzioch, C., Risse, M., Rohloff, S.: A quantile-boosting approach to forecasting gold returns. North Am. J. Econ. Finance 35, 38–55 (2016). https://doi.org/10.1016/j.najef.2015.10.015
15. Sadorsky, P.: Predicting gold and silver price direction using tree-based classifiers. J. Risk Financ. Manag. 14(5), 198 (2021). https://doi.org/10.3390/jrfm14050198
16. Mishra, A.R., Pippal, S.K., Chopra, S.: Time series based pattern prediction using Fbprophet algorithm for Covid-19. J. East China Univ. Sci. Technol. 65(4), 559–570 (2022)
17. Samin-Al-Wasee, M., Kundu, P.S., Mahzabeen, I., Tamim, T., Alam, G.R.: Time-series forecasting of Ethereum price using long short-term memory (LSTM) networks. In: 2022 International Conference on Engineering and Emerging Technologies (ICEET), pp. 1–6. IEEE (2022). https://doi.org/10.1109/ICEET56468.2022.10007377
18. Sharma, P., Pramila, R.M.: Price prediction of Ethereum using time series and deep learning techniques. In: Proceedings of Emerging Trends and Technologies on Intelligent Systems: ETTIS 2022, pp. 401–413. Springer Nature Singapore (2020). https://doi.org/10.1007/978-981-19-4182-5_32
Practical Implementation of Machine Learning Techniques and Data Analytics Using R

Neha Chandela, Kamlesh Kumar Raghuwanshi, and Himani Tyagi
Abstract In this digital era, all e-commerce activities rely on modern recommendation systems: a company wants to analyse the buying patterns of its customers in order to optimize its sales strategy, which mainly means focusing on valuable customers, identified by the amount of purchases they make, rather than recommending products in the traditional way. In modern recommendation systems, different parameters are synthesized to design an efficient system. In this paper, data from 325 customers who have made purchases from a website is considered, with basic parameters such as age, job type, education, metro city, time since signing up with the company, and purchase history. The profitability of the e-commerce business model depends primarily on choice-based recommendation systems. Hence, this paper builds a predictive model using a machine learning-based linear regression algorithm. The study is carried out using the popular statistical tool R. The R tool is explored and its utility for designing recommendation systems and finding insights from data is demonstrated through various plots. The results are formulated and presented in a formal and structured way using R. During this study it has been observed that R has the potential to be one of the leading tools for research and business analytics.
N. Chandela
Computer Science and Engineering, Krishna Engineering College, Uttar Pradesh, Ghaziabad, India
e-mail: [email protected]

K. K. Raghuwanshi (B)
Computer Science Department, Ramanujan College, Delhi University, New Delhi, India
e-mail: [email protected]

H. Tyagi
University School of Automation and Robotics, GGSIPU, New Delhi, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024
P. Singh et al. (eds.), Data Analytics and Machine Learning, Studies in Big Data 145, https://doi.org/10.1007/978-981-97-0448-4_8
1 Introduction

Recommendation systems are considered to be the future of business and marketing: thanks to recommendation systems, products or services can be offered to new users. Without doubt, this has become a critical method for many online firms, such as Netflix, Flipkart, and Amazon [1].

Types of recommendation systems
• Content-based recommendation system: This system is built on product features, using content to suggest related items. For example, if a user is a fan of Chetan Bhagat and orders or reads a book by this author online, the algorithm will suggest other books by the same author or books in the same category. In other words, the system tries to figure out what made product X so appealing to the buyer and proposes items that contain that "what". Such systems are called content-based recommender systems [2].
• User-based collaborative filtering system: This approach does not rely on product features, but rather on customer feedback. It is based on locating comparable users and finding items that those users have enjoyed but that the current user has not yet tried: you search for all other users who bought product X, compile a list of the other things they bought, and take the products that appear most frequently on this list.
• Item-based collaborative filtering system: In this situation, products similar to the one the user purchased are located and recommended.

Where machine learning fits in: Both user-based and item-based collaborative filtering systems can rely on clustering as a foundation, though other machine learning algorithms may be better suited to the job depending on the project requirements. Clustering methods allow persons and products to be grouped by similarity, making them a natural choice for a recommendation engine [3]. Another approach to making recommendations could be to concentrate on the differences between users and/or items. Needless to say, the machine learning algorithms used will be heavily influenced by the characteristics of the particular project.
1.1 Theoretical Concepts

R programming
Data science has taken over the whole world today. Every sector of study and industry has been impacted as individuals increasingly recognise the usefulness of the massive amounts of data being generated. However, in order to extract value from those data,
one must be skilled in data science [4]. The R programming language has emerged as the de facto data science programming language. The adaptability, power, sophistication, and expressiveness of R have made it an indispensable tool for data scientists worldwide [5].

Features of R programming
1. R's syntax is quite similar to S's, making it easy for S-PLUS users to transfer over. While R's syntax is essentially identical to S's, R's semantics, although outwardly similar, are significantly different; in terms of how R operates under the hood, R is far closer to the Scheme language than to the original S language [6].
2. R now runs on nearly every standard computing platform and operating system. Because it is open source, anyone can adapt it to run on whatever platform they like; R has been reported to run on modern tablets, smartphones, PDAs, and game consoles [7].
3. R shares a great feature with many famous open-source projects: regular releases. There is a major annual release, usually in October, in which substantial new features are made available to the public, and smaller bug-fix releases are produced as needed throughout the year. The frequent releases and regular release cycle indicate active software development and ensure that defects are resolved in a timely way. While the core developers maintain the primary source tree for R, many individuals from all around the world contribute new features, bug fixes, or both.
4. R also offers extensive graphical capabilities, which set it apart from many other statistical tools even today. R's capacity to generate "publication quality" graphics has existed since its inception and has generally outperformed competing tools, a tendency that continues with more visualisation packages available than ever before. R's base graphics framework gives complete control over almost every component of a plot or graph, and more recent graphics tools, such as lattice and ggplot2, enable elaborate and sophisticated visualisations of high-dimensional data [8].
5. R has kept the original S idea of providing a language that is suitable for interactive work and that also incorporates a sophisticated programming language for developing new tools. This allows the user to gradually progress from applying existing tools to data to becoming a developer who creates new tools.
6. Finally, one of the pleasures of using R is not the language itself but the active and dynamic user community. A language succeeds in large part because it provides a platform for many people to build new things; R is that platform, and thousands of people from all over the world have banded together to contribute to R, create packages, and help each other use R for a wide range of applications. For over a decade, the R-help and R-devel mailing lists have been very active, and there is also a lot of activity on sites like Stack Overflow [9].
Free Software
Over many other statistical tools, R has the significant advantage of being free in the sense of free software (and also free in the sense of free beer). The R Foundation owns the primary source code for R, which is released under the GNU General Public Licence version 2.0. According to the Free Software Foundation, using free software grants the four freedoms listed below.
1. The freedom to run the programme for any purpose (freedom 0).
2. The freedom to study how the programme works and tailor it to your specific requirements (freedom 1). Access to the source code is required for this [10].
3. The freedom to redistribute copies in order to assist a neighbour (freedom 2).
4. The freedom to improve the programme and make your improvements available to the public so that the entire community benefits (freedom 3) [11, 12].

Limitations of R
1. There is no such thing as a flawless programming language or statistical analysis system, and R has its disadvantages. To begin with, R is built on nearly 50-year-old technology, dating back to the original S system developed at Bell Labs. Initially, there was little built-in support for dynamic or 3-D graphics (although things have changed substantially since the "old days") [13–15].
2. At a higher level, one "limitation" of R is that its usefulness depends on consumer demand and (voluntary) user contributions. If no one wants to implement your preferred approach, it is your responsibility to do it (or to pay someone to do so). The R system's capabilities largely mirror the interests of the R user community, and as the community has grown over the last ten years, so have the capabilities. In the early days of R there was very limited capability for the physical sciences (physics, astronomy, and so on), but some of those communities have since embraced R, and more code is being created for these types of applications [9, 16].
1.2 Practical Concepts

Linear Regression
To fill in the gaps, a linear regression approach might be utilised. As a refresher, this is the linear regression formula:

Y = C + BX    (1)

This is the straight-line equation we all learnt in high school: the dependent variable is Y, the slope is B, and the intercept is C. Traditionally, the formula for linear regression is stated as:

h = θ0 + θ1 X    (2)
Here 'h' is the hypothesis or predicted value, X is the input feature, and θ0 and θ1 are the coefficients. In this recommendation system, the other ratings of the same movie are used as the input X to predict the missing values, and the bias term θ0 is dropped:

h = θ X    (3)

θ1 is initialised at random and refined over iterations, just like in the linear regression technique, and the algorithm is trained on known values, much as in linear regression. Consider a movie's known ratings: using the formula above, those known ratings are forecast [17, 18]. After predicting the rating values, they are compared to the original ratings to determine the error term. The error for one rating is:

(θ^{(j)})^T x^{(i)} - y^{(i,j)}    (4)
Similarly, the error must be determined for each rating. Before going further, the notation used in the rest of this paper is introduced:
n_u = number of users
n_m = number of movies
r(i, j) = 1 if user j has rated movie i
y(i, j) = rating given by user j to movie i (defined only if r(i, j) = 1)
The total cost function, which captures the difference between the predicted and original ratings, is:

(1/2) Σ_{i: r(i,j)=1} ((θ^{(j)})^T x^{(i)} - y^{(i,j)})^2 + (λ/2) Σ_{k=1}^{n} θ_k^2    (5)
The error term is squared in the first term of this expression; the square avoids any negative numbers, the factor 1/2 is included for convenience in optimisation, and the error is computed only where r(i, j) = 1, i.e., where the user has actually provided a rating [19]. The second term in the equation above is the regularisation term, which can be used to counter any overfitting or underfitting issue [13].
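For concreteness, the regularised cost in Eq. (5) can be sketched as follows; the chapter itself works in R, so this NumPy version and its variable names are purely illustrative assumptions.

import numpy as np

def recommender_cost(theta, X, Y, R, lam):
    # theta: (n_users, k) user parameter vectors
    # X:     (n_movies, k) item feature vectors
    # Y:     (n_movies, n_users) rating matrix; R marks entries where r(i, j) = 1
    err = (X @ theta.T - Y) * R                      # errors only where a rating exists
    return 0.5 * np.sum(err ** 2) + (lam / 2.0) * np.sum(theta ** 2)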
2 How to Build a Recommendation Engine in R

Knowing the input data
The data has the following variables describing the customers:
• Age – tells the age of the customer
Fig. 1 Dataset features
Fig. 2 Dataset closer look
• Job Type – Employed, Unemployed, Retired
• Education – Primary, Secondary, Graduate
• Metro city – Yes (if the customer stays in a metro city), No (if the customer is not from a metro city)
• Signed in Since – the number of days since the customer signed in to the website for the first time
• Purchase made – the total amount of purchases made by the customer (Fig. 1)

STEP 1: Data Preparation (Fig. 2)
• Check the missing values and the mean, median, mode
summary(Data)
Using the summary command, we can check the mean, median, mode and missing values for each variable. In this case, there are 13 missing observations (NAs) for the variable Age. Hence, before going ahead, the missing values need to be treated first [20] (Fig. 3).
• Histogram to see how the data is skewed
hist(Data$Age)
We check the data distribution to decide which value should replace the missing observations. In this case, since the data is roughly normally distributed, the mean is used to replace the missing values (Fig. 4).
• Replacing the NA values for the variable Age with the mean of 39
Data$Age[is.na(Data$Age)] = 39
• Check whether the missing values have been replaced in the variable Age
summary(Data)
Here, we can see that the missing values (NAs) have been replaced by the mean value of 39 (Fig. 5).
Fig. 3 Histogram to see skewness in the dataset
Fig. 4 Quantitative Data Representation
Since the missing values have been handled, let's have a look at the data:
head(Data)
After handling the missing values, we can see that there are categorical variables such as marital status, metro city, and education, which need to be converted into dummy variables.

STEP 2: Creating New Variables
As seen in the data, four of our variables are categorical, and we need to create dummy variables for them first.
• Data$Job.type_employed