COMPUTATIONAL SCIENCE
AND ITS APPLICATIONS
Edited by
Anupama Chadha, PhD
Sachin Sharma, PhD
Vasudha Arora, PhD
First edition published 2024 Apple Academic Press Inc. 1265 Goldenrod Circle, NE, Palm Bay, FL 32905 USA
CRC Press 2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
760 Laurentian Drive, Unit 19, Burlington, ON L7N 0A4, CANADA
4 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN UK
© 2024 by Apple Academic Press, Inc. Apple Academic Press exclusively co-publishes with CRC Press, an imprint of Taylor & Francis Group, LLC.

Reasonable efforts have been made to publish reliable data and information, but the authors, editors, and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors are solely responsible for all the chapter content, figures, tables, data, etc. provided by them. The authors, editors, and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact [email protected]

Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for identification and explanation without intent to infringe.

Library and Archives Canada Cataloguing in Publication
Title: Computational science and its applications / edited by Anupama Chadha, PhD, Sachin Sharma, PhD, Vasudha Arora, PhD.
Names: Chadha, Anupama, editor. | Sharma, Sachin, editor. | Arora, Vasudha, editor.
Description: First edition. | Includes bibliographical references and index.
Identifiers: Canadiana (print) 20230469892 | Canadiana (ebook) 20230469949 | ISBN 9781774912751 (hardcover) | ISBN 9781774912768 (softcover) | ISBN 9781003347484 (ebook)
Subjects: LCSH: Computational intelligence.
Classification: LCC Q342 .C66 2023 | DDC 006.3

Library of Congress Cataloging-in-Publication Data
CIP data on file with US Library of Congress
ISBN: 978-1-77491-275-1 (hbk) ISBN: 978-1-77491-276-8 (pbk) ISBN: 978-1-00334-748-4 (ebk)
About the Editors
Anupama Chadha, PhD
MRIIRS, Faridabad, India
Anupama Chadha, PhD, earned her doctorate in data mining. She completed her MCA with distinction in 1999. After a short stint in industry, she pursued a career in teaching and has around 20 years of teaching experience in various reputed colleges and universities. She has authored or coauthored many research articles and papers in reputed journals and conferences. She is an editor and reviewer of many international and national journals. She is a member of research-oriented organizations, including CSI, ACM, and IETE. She has chaired sessions at reputed conferences and has delivered keynote talks. She has conducted and organized many workshops on the latest IT technologies. She has organized several national and international conferences in association with IEEE and CSI. She has coauthored books on data mining. Dr. Chadha's areas of research include data mining, big data, and machine learning.

Sachin Sharma, PhD
MRIIRS, Faridabad, India
Sachin Sharma, PhD, obtained his PhD degree in Computer Applications in the field of data mining. He is currently working as an Associate Professor in the Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad. He is engaged in teaching MCA and BCA classes and has teaching experience of more than 22 years. He has published more than 32 research papers in national and international journals. He has guided various MCA students in their project work. He is an editor and reviewer of several reputed Scopus-indexed journals, such as those of IEEE and Elsevier. He is a member of various research organizations, such as the Computer Society of India, ACM, TOJET, IJCST, IAENG, the Internet Society, IJET, and so on. He is an editorial board member of various national and international journals.
He has chaired various sessions at reputed conferences. He has delivered keynote talks, conducted workshops on upcoming technologies, and has organized several international conferences in association with IEEE and Springer. His main interest areas include data mining, data structures, big data, and data science. He has published four books, with two more in the pipeline. Three of his patents on IoT and smart cards have been published.

Vasudha Arora, PhD
Sharda University, Greater Noida, India
Vasudha Arora, PhD, obtained her doctorate in Computer Science Engineering in 2017. She completed her master's degree with distinction from Panjab University, Chandigarh, in 2009, and obtained her bachelor's degree with honors from Maharishi Dayanand University in 2003. She has over 16 years of rich academic and research experience in various reputed colleges and universities. She has authored or coauthored more than 25 scholarly research articles and conference papers. She is a reviewer for several reputed Scopus-indexed journals of IEEE, Springer, and Elsevier. She has delivered keynote talks, conducted workshops on upcoming technologies, and has organized several international conferences in association with IEEE and Springer. She has supervised five M.Tech dissertations and is currently supervising two PhD scholars. Dr. Arora's areas of research include cloud computing, network security, data science, and machine learning. Dr. Arora is an IBM-certified faculty member for several courses, including cloud computing and virtualization, cloud security, and cloud deployment models, and has achieved various online certifications from IBM, NPTEL, EICT Academy, and IIT Roorkee, to name a few.
Contents
Contributors......................................................................................................... ix
Abbreviations ....................................................................................................... xi
Preface ............................................................................................................... xiii
1. Artificial Intelligence ........................................................................... 1
   Kavita Arora
2. Machine Learning .............................................................................. 13
   Shashi Tanwar
3. Data Science ....................................................................................... 43
   Arti Chauhan and Ashirwad Samuel
4. Quantum Computing ......................................................................... 97
   Anupama Chadha, Sachin Sharma, Rahul Chaudhary, Reuben Vernekar, and Arun Rana
5. Image Processing ............................................................................. 117
   Swetta Kukreja, Rohit Sahoo, Deepak Jain, and Vasudha Arora
6. Evolutionary Algorithms ................................................................. 145
   Richa Sharma
7. Process Simulation .......................................................................... 163
   Kshatrapal Singh, Ashish Kumar, and Manoj Kumar Gupta
8. Need for Deep Learning .................................................................. 179
   Bobby Singh, Nikita Gupta, Sachin Sharma, and Anupama Chadha
9. Computational Intelligence for Big Data Analysis ........................ 199
   Anu Manchanda
10. Hybridization of Computational Intelligent Algorithms ............. 231
   Anupama Chadha
Index ................................................................................................................. 241
Contributors
Kavita Arora
Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
Vasudha Arora
Sharda University, Greater Noida, India
Anupama Chadha
Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
Rahul Chaudhary
Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
Arti Chauhan
Department of Computer Science & Engineering, School of Engineering, G D Goenka University, Gurugram, Haryana, India
Manoj Kumar Gupta
Faculty of Computer Science & Engineering, SMVD University, Katra, India
Nikita Gupta
Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
Deepak Jain
Department of Computer Engineering, Terna Engineering College, Navi Mumbai, India
Ashish Kumar
Department of Computer Science & Engineering, ITS Engineering College, Greater Noida, India
Swetta Kukreja
Amity School of Computer Science and Engineering, Amity University, Mumbai
Anu Manchanda
Department of MCA, CMR Institute of Technology, Bengaluru, India
Arun Rana
Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
Rohit Sahoo
Department of Computer Engineering, Terna Engineering College, Navi Mumbai, India
Ashirwad Samuel
Department of Computer Science & Engineering, School of Engineering, GD Goenka University, Haryana, India
Richa Sharma
ASET, Amity University, Noida, India
Sachin Sharma
Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
Bobby Singh
Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
Kshatrapal Singh
Department of Computer Science & Engineering, ITS Engineering College, Greater Noida, India
Shashi Tanwar
Aravali College of Engineering and Management, Faridabad, India
Reuben Vernekar
Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
Abbreviations
AES      Advanced Encryption Standard
AIoT     artificial-intelligence-enabled Internet of Things
ANN      artificial neural network
CH       cluster head
CNN      convolutional neural networks
cQPL     correspondence capable programming language
CRM      customer relationship management
CT       computed tomography
DB       database
DE       differential evolution
DL       deep learning
EAs      evolutionary algorithms
EC       evolutionary computation
ECC      elliptic curve cryptography
EDA      exploratory data analysis
EP       evolutionary programming
ES       evolution strategy
GA       genetic algorithm
GP       genetic programming
HDFS     Hadoop distributed file system
HICC     Hadoop Infrastructure Care Center
ILP      inductive logic programming
IoE      Internet of Everything
IoT      Internet of Things
LCS      Learning Classifier System
MFO      moth flame optimization
ML       machine learning
MRI      magnetic resonance imaging
NE       neuroevolution
NLG      natural-language generation
PCA      principal component analysis
PDF      probability density function
PET      positron emission tomography
PSO      particle swarm optimization
QPL      Qualified Public Liability Corporation
QPL      quantum programming language
RB       rule-base
RCNN     recurrent convolutional neural network
RGB      red, green, and blue
RMSE     root mean square error
SIFT     scale-invariant feature transform
SPECT    single-photon emission computed tomography
SVM      support vector machine
TSP      traveling salesman problem
Preface
Digital technologies are changing at an extraordinary pace. Technologies such as artificial intelligence (AI), machine learning (ML), the Internet of Things (IoT), big data analytics, and automation will continue to modify the way we work, be it at home or at the workplace. The vast changes prevailing in today's technological landscape have helped businesses to expand, resulting in an improving economy worldwide.

The generation of vast amounts of data has led to the development of big data technology at an incredibly fast rate. This technology, along with cloud computing and the Internet, has helped even small businesses to bloom. Using big data on the cloud does not require an elaborate setup, as all information can be retrieved remotely using an Internet connection.

Nowadays, every new technology contains a flavor of AI or ML. These two technologies have applications in all walks of life and have transformed every segment, be it healthcare, education, agriculture, and so on. Many IoT and cloud computing technologies are incorporating ML into gadgets to make them smart, hence making life smoother. ML has enabled people to attain more in life by using smart software, giving a human face to machines.

The main focus of this book is to discuss the overlapping behavior of some of the computational intelligence techniques. The book also provides a summary of the real-life applications of these technologies. It is designed keeping in mind the needs of students and researchers who want to pursue a career in any of the verticals of computational intelligence, and it explains almost all the horizons of computational intelligence along with the latest tools available to implement these techniques. The book will be of great help to teachers who teach any of the computational intelligence techniques in undergraduate or postgraduate computer science programs.
CHAPTER 1
Artificial Intelligence
KAVITA ARORA
Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
ABSTRACT
Artificial intelligence, blockchain, cloud computing, and the Internet of Things (IoT) are indispensable technologies paving the way toward the digitized era. The confluence of these technologies can bring phenomenal changes to the world of business and industry. Artificial intelligence with machine learning can be trained on data; these data can be stored safely on the cloud and used by blockchain for financial and other purposeful transactions, which in turn gives leverage to IoT. The amalgamation of all these technologies will also lead the way toward business and industrial digitization. This chapter discusses these concepts in detail to give an insight into how businesses and corporations can take advantage of this confluence.

1.1 INTRODUCTION
Artificial intelligence (AI) is conventionally connected with computer science and is aimed at enabling machines to perform various human-like activities. Furthermore, the interpretation of intelligence is extended to include a diverse set of abilities encompassing creativity, emotional knowledge, and self-awareness.
Artificial intelligence comes from the root word intelligent. Intelligence can connote being better, faster, more capable, and adapted to normal conditions; it can also mean the ability to understand. Intelligence is owned by everyone, provided that the knowledge can be materialized: even if someone is knowledgeable, if that knowledge cannot be put into practice, the person cannot be classified as intelligent. Fundamentally, AI aspires to pass Turing's test. It strives to recreate or mimic the intelligence of human beings in machines so that they can perform the tasks performed by a human being. This pursuit may rely on rules of learning to gather information and then use the gathered rules and information to arrive at an inference. Artificial intelligence also empowers machines to adopt self-corrective measures.
FIGURE 1.1 Position chart of artificial intelligence.
Artificial intelligence techniques show how we think and how we can better apply intelligence. They also make computers easier to use and make knowledge increasingly widespread among the public. The entire AI methodology is drawn from two schools of thought, namely conventional AI and computational intelligence. Conventional AI predominantly involves areas such as machine learning. It is exceptionally focused and centers on statistical calculations and results. This approach symbolizes the accumulation of components and logical reasoning. It was quite popular in the mid-1990s and was named good old-fashioned artificial intelligence (GOFAI). Conventional AI techniques include expert systems, case-based reasoning, Bayesian networks, and behavior-based AI.
Computational intelligence involves adaptive mechanisms for intelligent behavior in complex environments, such as the ability to adapt, generalize, abstract, discover, and associate.
Some examples of applications of AI are the following:
1. A Tesla car (America), which has an autopilot system on a production car, is an example of an automatic steering system on a car.
2. Self-parking systems, or automatic parking in modern cars, that can help the driver to park a car automatically.
3. Face recognition unlock systems on smartphones, which have the ability to recognize the owner's face.
4. Virtual assistants, like Siri from Apple, Cortana from Microsoft, and Google Assistant from Google, which can assist users according to their preferences.

1.2 CONVERGENCE OF BLOCKCHAIN, INTERNET OF THINGS, AND AI
Blockchain, the Internet of Things (IoT), and AI are key technologies that can be amalgamated to drive a new surge of digitization. These multidimensional and revolutionary technologies have the futuristic prospect of revamping contemporary business processes and generating novel business archetypes.
Self-governing agents such as cars, machines, cameras, and other IoT-enabled devices will boom with the digital advantage of the Internet of Things,
sending and receiving money using blockchain technology and independently making decisions employing artificial intelligence and data analytics. Blockchain, for instance, can enhance the trust, transparency, reliability, and authentication of business processes by administering a shared and consolidated distributed ledger which can act as a repository for a plethora of assets. In addition, IoT is a driving force for industry automation. Lastly, AI upgrades business affairs by detecting patterns and improving their outcomes. Their conjunction would definitely lead to a burgeoning of business models and the digitization of business syndicates.

1.2.1 INTRODUCTION TO IOT AND BLOCKCHAIN
IoT and blockchain are leading technologies that have been gaining prevalence ever since their conception. In the near future, IoT will influence virtually all the day-to-day items we use. With the spiraling use of this technology, the threat of misuse will increase too, and no existing technology on its own is sufficient to handle this. Thus, blockchain has surfaced as a powerful key for providing a security layer for IoT.
In the last few years, blockchain has acquired incredible acceptance. It is capable of transforming the worldwide architecture of technologies. Some specific domains may get maximum leverage from this technology because a distributed system can be created which does away with reliance on central servers and, furthermore, an entirely transparent and easily accessible database can be created. Blockchain may be thought of as a technology to manage and make payments over the Internet using virtual currency, but that is only the beginning of the ecosystem and its possibilities.
The Internet of Things refers to a loosely coupled system of manifold, diversified, and analogous systems capable of sensing, processing, and networking. The Internet of Things is not only an idea nowadays; it has also become a vital part of everyone's life. The "smartphone" is the foremost embodiment of the Internet of Things. Its application is not constrained to smart homes; it also ranges from industry to commerce, farming, public safety, and healthcare. Furthermore, it can be reckoned the "Internet of Everything (IoE)" due to its far-reaching concrete applications.
Distinct benefits of coupling blockchain and the Internet of Things are the following:
• Assurance: Blockchain dispenses a high degree of dependability and transparency, which permits organizations to verify information quickly in order to create trust, guard processes, and activate remittance of money.
• Coherence: Using blockchain, businesses can automate their processes and interchange data even without laying down centralized IT structures.
• Dexterity: Blockchain validates prescribed behavior between mobile entities without any arbitrator to validate the IoT transactions.

1.2.2 ROLE OF AI IN IOT
Artificial-intelligence-enabled Internet of Things applications (AIoT) represent a broad range of practices which leverage both AI and IoT. In AIoT applications, AI is embedded into numerous components, such as edge computing, chipsets, and software allied with IoT networks. In unison, they shape intelligent and connected systems, where AI acts as "the brain" to IoT's "body." Artificial intelligence brings machine-learning and decision-making vigor to IoT to enhance data management and analysis, permitting colossal production yields. Artificial intelligence is briskly converging with IoT, to the point that intelligence has turned out to be a prominent stipulation of connected entities. The main reason behind this is that AIoT empowers real-time analysis and response and also supports long-term analysis by permitting users to recognize patterns in historical data and discern trends which occur over a sustained duration.

1.2.3 AI AND IOT BASED ON CLOUD
IoT, cloud computing, and AI are avant-garde technologies which have actually revolutionized the world. The Internet of Things bridges the virtual and physical domains together using different protocols. These domains spawn bounteous data carrying pivotal statistics of the physical world.
FIGURE 1.2 Blend of AI and IoT. Source: Reprinted from Ref. [17].
In addition, the combination of these technologies will certainly enable both decision making and furnish expert, specialized experience to the users. Besides, the expeditious progression in AI, driven by escalating computational capability, the training of data scientists, and the availability of various machine learning tools geared toward developing progressive algorithms, is in fact turning the productive use of IoT into a dimension of viable and empirical aptness.

1.2.4 TRANSFORMING BUSINESS USING AI AND IOT
Artificial intelligence is envisioned to discharge a profusion of ingenious tasks such as voice recognition, language processing, and decision-making. This technique can also play a pivotal role in handling the spate of data being processed by IoT-enabled devices. In order to exploit its full prospects, IoT is now being associated with evolving AI technologies so as to help businesses come up with intelligent decision making in the absence of any intervention. This stupendous transformation induced by a blend of these technologies is absolutely redesigning the technological outlook. As anticipated, a massive number of businesses, such as manufacturing, smart homes, sensors, airlines, and drones, are regularly embracing and administering these technologies in myriad frameworks. The Internet of Things and AI together are leading to a wide spectrum of results. Here are a few of the most prominent boons of amalgamating these two technologies for businesses:
• Govern, analyze, and procure worthwhile insights from data
• Corroborate quick and error-free analysis
• Maintain protection from many attacks in the network
• Uplift operative competence
• Stay one step ahead in the management of risk
In a nutshell, it can be said that in the coming times, AI and IoT will play a pivotal role in business in numerous ways.
FIGURE 1.3 Role of AI in IoT.
1.2.5 APPLICATION OF AI IN BLOCKCHAIN
Blockchain is a distributed, decentralized, and immutable ledger for storing encrypted data, and AI is the propelling engine which enables analysis and decision making from the assembled data. It is an approach helpful in tracking transactions and is used commercially in applications such as tracking ownership of documents, digital assets, or voting rights. Blockchain and AI are among the hottest technologies trending presently. Even though each individual technology possesses a different level of complexity, together they make the most of each other.
FIGURE 1.4 Blockchain technology.
Moreover, blockchain is capable of making AI progressively more logical and explicit, so as to bring out a crystal-clear picture of the application of machine learning in decision making. Blockchain and its ledgers are the repositories where every possible input for decision making under machine learning is stored. Likewise, AI too can elevate the effectiveness of blockchain better than humans or standard computing can. The conjunction of AI with blockchain technologies generates what is conceivably a straight-shooting technology worldwide. The technical conjunction of these two will lead to all the below-mentioned aspects:
• Dependability: With the pursuit of AI, blockchain technology becomes shielded through secure application deployment.
• Coherence: Artificial intelligence is helpful in optimizing calculations to diminish miner load, resulting in lower network latency and quicker transactions. It also helps lessen the carbon footprint of blockchain technology.
• Conviction: Used in association with AI, it assists robots in trusting each other, escalating M2M interaction, and permits them to measure data and synchronize decisions predominantly.
• Repository: Blockchain is used for stockpiling confidential and personal data which, when smartly processed with AI, can further provide excellence as well as ease of use.
FIGURE 1.5 Application areas of AI and blockchain technologies. Source: Reprinted from Ref. [18].
Let us discuss the applicability of AI and blockchain technologies in conjunction:
1. Creating diverse datasets: Blockchain generates transparent networks which are easily accessible worldwide. When associated with APIs and various AI agents, it may lead to the design of several algorithms based on diverse data.
2. Safeguarding of data: Artificial intelligence in general, and machine learning algorithms in particular, are continually improving themselves. Furthermore, blockchain is a technology which stores enciphered data in its distributed ledgers and thus lays down the foundation of extensively secured databases.
3. Validation of data: Blockchain, as a distributed database, permits secure, transparent, and tamper-proof record keeping, and artificial intelligence supports intelligent decision-making on the basis of large amounts of data.
4. Artificial intelligence and the encryption required in blockchain go hand-in-hand, as blockchain bestows encryption and AI is a repository for the storage of huge data.
5. Blockchain can be helpful in tracking, understanding, and explaining decision making using AI.
6. Artificial intelligence can serve as a better manager for blockchain as compared to human beings or any other supercomputers.
7. Artificial intelligence can devise and market digital investment assets atop high-speed blockchains.
8. The banking sector progressively uses blockchain technology for interoperability. Taking extra advantage of AI, blockchain can furthermore augment this mechanism.
9. Improved business data models.
10. The combination of blockchain and AI enables synergies in both the scale and efficiency of handling financial matters.
11. Ingenious scrutiny and compliance systems.
In a nutshell, catering to large amounts of data is actually a ferocious job, and since the merger of these two technologies handles it so smoothly, this amalgamation is going to prove to be a boon for society, business, and markets, and pave the way for digitization.

1.2.6 REMODELING OF FINANCES USING ARTIFICIAL INTELLIGENCE AND BLOCKCHAIN
Artificial intelligence has made a pronounced impression on the economic sector. Almost every economic sector presently uses this technology to enhance value, diminish costs, and save time. Moving ahead, it will serve this industry to augment turnover, curtail risk, and escalate revenue, whether in banking, trade, or the investment lending business.
In a nutshell, blockchain, IoT, and AI are technologies which can work hand-in-hand in multidirectional aspects. Their conflux will lead to new business models and products, and these services will benefit from the amalgamation of these technologies.

KEYWORDS
• artificial intelligence
• blockchain
• cloud computing
• Internet of Things
REFERENCES
1. Banafa, A. IoT and Blockchain Convergence: Benefits and Challenges; IEEE Internet of Things, 2017.
2. Khan, M. A.; Salah, K. IoT Security: Review, Blockchain Solutions, and Open Challenges. Futur. Gener. Comput. Syst. 2018, 82, 395–411.
3. Banafa, A. IoT Standardization and Implementation Challenges; IEEE.org Newsletter, 2014.
4. Serrano, M.; Soldatos, J. IoT Is More Than Just Connecting Devices: The Open IoT Stack Explained, 2015.
5. Somov, A.; Giaffreda, R. Powering IoT Devices: Technologies and Opportunities. Newsletter, 2014.
6. https://www.tiempodev.com/blog/aiot-the-role-of-artificial-intelligence-in-the-internet-of-things/
7. https://opensource.com/article/18/7/digital-transformation-strategy-think-cloud
8. https://www.cisin.com/coffee-break/Enterprise/merge-of-ai-and-iot-is-an-great-tool-whether-you-apply-it-in-edge-or-cloud-computing.html
9. https://www.geeksforgeeks.org/the-role-of-artificial-intelligence-in-internet-of-things/
10. https://aibusiness.com/ai-brain-iot-body/
11. https://thenextweb.com/hardfork/2019/02/05/blockchain-and-ai-could-be-a-perfect-match-heres-why/
12. https://www.forbes.com/sites/darrynpollock/2018/11/30/the-fourth-industrial-revolution-built-on-blockchain-and-advanced-with-ai/#4cb2e5d24242
13. https://www.forbes.com/sites/rachelwolfson/2018/11/20/diversifying-data-with-artificial-intelligence-and-blockchain-technology/#1572eefd4dad
14. https://hackernoon.com/artificial-intelligence-blockchain-passive-income-forever-edad8c27844e
15. https://blog.goodaudience.com/blockchain-and-artificial-intelligence-the-benefits-of-the-decentralized-ai-60b91d75917b
16. https://aisuperior.com/2020/04/how-artificial-intelligence-is-transforming-the-finance-and-cryptocurrency-industry
17. Breed Reply. Why Are AI and IoT Perfect Partners for Growth? https://www.reply.com/breed-reply/en/content/why-are-ai-and-iot-perfect-partners-for-growth
18. BBVA Group. Blockchain and AI: A Perfect Match? 06 May 2019. https://www.bbvaopenmind.com/en/technology/artificial-intelligence/blockchain-and-ai-a-perfect-match/
CHAPTER 2
Machine Learning
SHASHI TANWAR
Aravali College of Engineering and Management, Faridabad, Haryana, India
ABSTRACT
Machine learning is a popular field which is growing rapidly with its advanced algorithms and its ability to recognize unknown patterns using models and statistical approaches. This chapter gives a brief summary of machine learning and its related terms, such as artificial intelligence and deep learning, with related algorithms and models.1 This chapter also describes the relation of machine learning with the Internet of Things, the web of things, the cloud, and embedded devices, along with their application areas. Applications of machine learning are also described here in detail with related examples. Challenges of different domains are also discussed, along with many innovative ideas enabled by machine learning. The content in this chapter was prepared with the help of a literature review and various useful resources.2

2.1 INTRODUCTION
The aim of the present chapter is to explore in detail the popular term "machine learning (ML)," a field which has the power to guide humans. The chapter begins with the basic concept of ML, and later various areas of ML, like image processing and computer vision, are discussed.3,4 We will study the different types of ML, how
these types are different from each other, methods that come under these types, different models, and others.
FIGURE 2.1 Machine learning.
This chapter will cover the following:
• Some fundamental concepts of machine learning
• Advantages and disadvantages of machine learning approaches
• Relation between machine learning, deep learning (DL), and artificial intelligence (AI)
• Machine learning with the Internet of Things (IoT) and with cloud computing
• Main areas where we use ML, and others

2.2 CONCEPT OF MACHINE LEARNING
2.2.1 INTRODUCTION
The term machine learning was first coined by Arthur Samuel, an American pioneer, in 1959, and later, in 1997, Tom Mitchell gave a well-versed definition of it. Later on, many experts gave their own definitions of this technology. Machine learning is a powerful field which assists humans in working automatically without explicit programming. It is a great learning ability of machines which makes our lives comfortable and makes it easy to improve things with automatic decision-making capability.4 It can create systems that can think better than a human being. Before ML, there were the following problems:
• How can a large, complex computer model be trained in computer engineering?
• How can researchers develop ideas about neuroscience problems?
• How can robust versions of AI be trained effectively in different areas?
To overcome these problems, ML came into existence. It grew out of AI and then moved toward its own methods and models.

Machine learning is a field in which any computer system's performance can be increased with past experience automatically. —Herbert Alexander Simon
Human knowledge is obtained only through experiences gathered throughout life. For machines, that knowledge needs to be fed in by collecting an extensive quantity of data on a definite application; machines can thus obtain knowledge in a short period of time. Machine learning is an innovative field that allows us to build intelligent software from data so that it displays some desired intelligent behavior.5 This intelligent behavior is obtained through software that has been trained. It is for this reason that, while ML is only one way to build an artificially intelligent system, for all practical purposes ML and AI are used interchangeably today. All applications of AI are also applicable in ML.
The machine learning process contains mainly two phases:
1. Learning: Learning works on training data. It includes preprocessing, learning, and error analysis.
2. Prediction: It involves applying the trained model to new data.

Machine Learning Examples
• The self-driving car
• Web search results
• Fraud detection
• Pattern recognition
• Credit scoring
• Market pricing models
• Social listening applications
• Text-based sentiment analysis
• Prediction of success and failures
• Online recommendations/offers on various popular ecommerce sites like Amazon, Netflix, and others

2.2.2 ADVANTAGES AND DISADVANTAGES OF MACHINE LEARNING
Advantages
• Accurate decision-making power
• Works continuously without breaks
• Machine learning can handle multidimensional and varied data in dynamic environments
• Machine learning allows efficient utilization of resources in less time
• Does not get tired or wear out easily
• Instagram and Facebook use ML to push relevant advertisements
• Very fast, accurate, and reliable
• Machine learning is able to work on large volumes of data
• Significant data processing and content handling power
Disadvantages
• Lack of creativity
• Security issues
• Incurs high cost
• Machine learning also needs a lot of training and test data
• Lack of variability
• Deployment challenges
• Technical challenges
• High error susceptibility
• Sometimes gives errors
2.2.3 MACHINE LEARNING ALGORITHMS
1. Linear Regression: Linear regression is a type of supervised learning. In this method, the output prediction is continuous and depends on a constant slope, so the predicted values lie in a continuous range. It is mainly used to model continuous quantities such as price, sales, and others. It is used as both a statistical and an ML algorithm, fitting a straight-line relationship between x and y.
For example, y = a1 + a2x, where a1 is the intercept and a2 is the slope.
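As a minimal, hypothetical illustration of fitting such a straight line (the data values below are invented purely for demonstration, not taken from this chapter), a least-squares fit in Python might look like this:

```python
import numpy as np

# Made-up data: x could be advertising spend, y the resulting sales.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of y = a1 + a2*x; np.polyfit returns the highest power first.
a2, a1 = np.polyfit(x, y, deg=1)
print(f"intercept a1 = {a1:.2f}, slope a2 = {a2:.2f}")

# Use the fitted line to predict the output for a new input.
print("prediction for x = 6:", a1 + a2 * 6.0)
```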
FIGURE 2.2 Linear regression slope.
There are mainly two types of linear regression:
1. Simple regression
2. Multivariable regression

2. Logistic Regression: Logistic regression is also part of supervised learning. In this classification method, a target variable (a type of dependent variable) is found. This variable mainly consists of two class values (a dichotomous variable) in binary form, either 1 or 0 (1 means success or Yes and 0 means failure or No). It is a very simple algorithm used for various classification problems like cancer detection, diabetes detection, and many other medical areas, and it can be helpful in the analysis of population growth. This method forecasts values on the basis of probability and represents them by an S-shaped curve mapped onto real-world values.

3. Decision Trees: A decision tree is a hierarchical structure containing a number of nodes, where every internal node represents a test. Each leaf node acts as a class label, and the branches show the collection of characteristics that lead to a particular label. The hierarchical diagram shows the hierarchy from the root node to the leaf nodes; the diagram below shows the different nodes and the dependencies between them.
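As a hedged sketch of how a decision tree classifier might be trained in practice (the iris dataset and the depth limit below are illustrative assumptions, not an example from this chapter), scikit-learn can be used as follows:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # small built-in dataset, used only for illustration
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each internal node tests one feature; each leaf carries a class label.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```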
4. Naive Bayes Classifier: This method is also a type of supervised learning algorithm. It is based on conditional probability and models the relationship between different events: the probability of one event depends on another event that has already happened. Its equation is written as:
FIGURE 2.3 Linear regression and logistic regression.
FIGURE 2.4(a) Decision tree.
FIGURE 2.4(b) Decision tree.
P(A|B) = P(B|A) P(A) / P(B)
Let us discuss the terms used in the above equation:
• P denotes probability: P(A|B) is the posterior probability of event A given that B has occurred, P(B|A) is the likelihood of B given A, P(A) is the prior probability of A, and P(B) is the probability of the evidence B.
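A tiny worked example may make the formula concrete; the numbers below are invented solely to show how a prior and a likelihood combine into a posterior (say, A = "email is spam" and B = "email contains the word offer"):

```python
p_A = 0.20            # prior P(A): assumed fraction of emails that are spam
p_B_given_A = 0.60    # likelihood P(B|A): assumed fraction of spam emails containing "offer"
p_B_given_notA = 0.05 # assumed fraction of non-spam emails containing "offer"

# Evidence P(B) by the law of total probability.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(spam | 'offer') = {p_A_given_B:.2f}")  # 0.12 / 0.16 = 0.75
```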
The Naive Bayes algorithm is a simple, fast, and scalable model. It requires little data as input for further processing.
5. Artificial Neural Network: Artificial neural networks came into existence in the 1970s; a network requires one input layer, one output layer, and a number of hidden layers. A number of neurons are attached to each layer, and each layer passes messages to the next, which helps in the final decision-making. An artificial neural network (ANN) works on the principle of function approximation. ANNs are used in many applications, such as image recognition, medical diagnosis, speech recognition, and security. They work on the basis of datasets, applying various nonlinear statistical models to produce the right pattern for a particular input. ANNs provide the best results in prediction among various datasets and are a cost-effective method.
FIGURE 2.5 ANN.
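A rough sketch of training such a network with one hidden layer is shown below; the digits dataset, layer size, and iteration count are assumptions made only for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)  # 8x8 digit images, used here purely as sample data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 64 neurons; weights are adjusted by backpropagation during fit().
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)

print("test accuracy:", net.score(X_test, y_test))
```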
6. Random Forest: Random forest is a method that comes under the supervised learning technique. It is a good way to address both classification and regression problems. The method is based on ensemble learning, in which various classifiers are combined to solve complex problems. A random forest gives its result on the basis of predictions made by many different trees rather than a single decision tree. It is a feature-based method that gives better results than random guessing.
7. Support Vector Machine: The support vector machine (SVM) is a popular and effective method that classifies data by finding a separating hyperplane in N-dimensional space. It is also used for regression problems and can handle multidimensional data. The decision boundary found by the model is called the hyperplane, and the positive and negative hyperplanes on either side pass through the closest data points; the SVM works by maximizing the margin between them to produce the correct result.
FIGURE 2.6 SVM.
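A minimal SVM sketch, assuming scikit-learn and a synthetic dataset (both are stand-ins, not the chapter's example), could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data stands in for a real problem.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear kernel learns a maximum-margin separating hyperplane.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```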
8. Gaussian Mixture Model: The Gaussian mixture model is a purely probabilistic model which represents a population as a mixture of subpopulations. The model contains a number of clusters (components), each of which covers a number of values. Each component has a mean value, often denoted u, around which its output is highest.
FIGURE 2.7 Gaussian mixture model. Source: Reprinted from Ref. [13].
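An illustrative fit of a two-component mixture is sketched below; the two subpopulations are generated from made-up means and are not real data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two invented subpopulations with different means form the overall population.
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("estimated component means:", gmm.means_.ravel())
print("mixture weights:", gmm.weights_)
```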
9. Anomaly Detection: Unexpected anomalies in data come under this category. These anomalies mostly occur within vast amounts of data and immediately disturb the normal process. Problems like credit card fraud, cyber attacks, machine failure, and others are considered here. Generally, anomalies can be categorized as follows:
• Point Anomaly: A single instance or tuple of the dataset that is anomalous.
• Contextual Anomaly: An abnormal situation in which data are anomalous only in a given context; this commonly occurs in time-series data.
• Collective Anomaly: A collection of data instances that is anomalous as a whole, which helps in finding abnormality.
10. K-Means Clustering: This method is used when unlabeled data are given; it is a part of unsupervised learning. In this algorithm, unlabeled data are categorized into a number of groups, or clusters, controlled by the variable k: the data are partitioned into k clusters. It is used to solve clustering problems, since partitioning a problem into clusters makes it easier to solve. Each cluster contains items with similar properties. A centroid is set for each cluster, and this centroid is used to assign a label to every point in the cluster. K-means clustering is commonly used in business cases with production algorithms.
FIGURE 2.8 K-means clustering. Source: Reprinted from Ref. [14].
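A short K-means sketch, assuming scikit-learn and three invented cluster centers, shows how unlabeled points are partitioned and how each cluster's centroid is recovered:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Unlabeled points scattered around three made-up centers.
points = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster centroids:\n", kmeans.cluster_centers_)
print("cluster label of the first point:", kmeans.labels_[0])
```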
2.2.4 TYPES OF MACHINE LEARNING
The main types of machine learning, together with their methods and typical applications, are summarized in Table 2.1.
TABLE 2.1 Machine Learning Types

1. Supervised Learning
   Definition: Direction (supervision) of the work is mandatory; the model is fully trained on labeled data, receives direct feedback, and predicts an outcome.
   Methods: A. Classification; B. Regression
   Applications: Fraud detection; image classification; client retention; diagnostics; speech recognition; advertising popularity prediction; market forecasting; weather forecasting; calculating life expectancy; prediction of population growth; rainfall prediction

2. Unsupervised Learning
   Definition: No supervision is needed; no labeled data and no feedback are used.
   Methods: A. Clustering or segmentation; B. Association; C. Dimensionality reduction
   Applications: Image segmentation; customer segmentation; video recommendation systems; biology; targeted marketing; city planning; visualization in big data; structure discovery; feature detection; image recognition and classification; feature extraction; feature selection

3. Semi-Supervised Learning
   Definition: Incomplete training signals are given.
   Methods: A. Classification; B. Clustering
   Applications: Text classification; lane finding on GPS data

4. Reinforcement Learning
   Definition: Reinforcement learning is behavior driven; positive and negative feedback are present.
   Methods: A. Classification; B. Control
   Applications: Real-time decisions; robot navigation; motion planning; driverless cars

2.2.5 MAIN MODELS OF MACHINE LEARNING
1. Classification: Predicting Discrete Labels
   • Classification models predict labels as two or more discrete categories; they mainly come under supervised learning.
   • Generate output category-wise.
   • Help to generate categorical data.
   • Algorithms: Logistic regression, SVM, and random forest
   • Example: Classify emails as spam or non-spam.
2. Regression: Predicting Continuous Labels
   • Regression is a process in which models predict continuous labels. It is also a type of supervised learning.
   • The output is a continuous quantity.
   • The main aim is to forecast or predict.
   • Algorithm: Linear regression
   • Example: Predict stock market prices.
3. Clustering: Inferring Labels on Unlabeled Data
   • Clustering is one common case of unsupervised learning in which data points are automatically assigned to some number of discrete groups.
• • •
FIGURE 2.9
25
Assign different clusters with different data points Algorithm: K-means, spectral clustering, and so on Example: Find all transactions which are fraudulent in nature.
Regression, clustering, and classification.
4. Dimensionality Reduction: Inferring Structure of Unlabeled Data Dimensionality reduction is an example of an unsupervised algorithm, in which levels or other information’s are inferred from the structure of dataset itself. Models which detect and identify the lower dimensional structure and higher dimen sional structure affect the reduction in random variables. The algorithms used for this approach are matrix factorization and feature selection. 5. Model Selection Model selection is an appropriate method that is used to compare, validate select, and tune to improve parameters. The models generally used for this purpose are Grid search, Cross-validation, matrices, and others. 6. Preprocessing Processing phase includes feature extraction and normalization under preprocessing. Data are transfer into some useful manner so that it can be served as input for ML algorithms.
26
Computational Science and Its Applications
2.3 CORRELATION BETWEEN AI, ML, AND DEEP LEARNING Artificial intelligence, ML, and DL are three terms that are interrelated to each other, and many times, a lot of people are confused between these terms. Artificial Intelligence In 1956, the word artificial intelligence was discovered by John McCarthy who was an American scientist. He discovered this term to think rationally, act purposefully, and deal effectively with the environment. Artificial intelligence is the capacity to acquire knowledge by a machine or computer and it is called a robot or automation.7 It is a very broad term in which ML and DL fall. AI refers to the ability to discover and take actions about any problem with the best chance of achieving a specific goal. This includes things such as making decisions, recognizing objects, or understanding speech. The goals of AI are machine training, learning, perception, and problem-solving. Artificial intelligence continuously enhances the machine intelligence by reasoning and self-correction method and algorithms. Artificial Intelligence can be categorized into two: 1. Weak AI: It includes particular one job only like video games, online applications, programs with online-assisted services, and others 2. Strong AI: Advance jobs are considered under this which behaves like a human being. For example: self-driving cars, automated machines for disease detection, and others Machine Learning: ML is a part of AI. The significant role of ML are image processing, character recognition, and forecasting in different sectors like agriculture, industry, education, healthcare, electrical engi neering, and outer space research & aerospace engineering, etc. ML is a term that helps to create AI Software with the help of training that software acts desired intelligent behavior. It is for this reason that while ML is only one way to build an artificially intelligent system, for all practical purposes. ML and AI are used interchangeably today. All the activities of AI can be solved by ML. Machine learning contains auto ML, DL, Decision tree, Naive Bayes, Regression, and many more.
Machine Learning
FIGURE 2.10 Relationships between AI, ML, and DL.
Deep Learning
The field of DL actually comes under ML, but its capabilities differ from those of ML, and the two fields work in different manners: in ML, the models need guidance, so if someone provides the wrong input or direction there is a need to fix the problem externally, while in DL models, decisions can be made automatically. Deep learning works with many defined layers, or hidden layers, which operate on the basis of trained models. Deep learning algorithms have better accuracy than normal algorithms, but DL needs more data, larger models, and more computational power. It is inspired by the neurons of the human brain, which its algorithms commonly imitate.

2.4 AUTOMATED MACHINE LEARNING
Automated machine learning is a way of learning which provides advanced methods and processes whose aim is to train any machine so that it can work automatically. This is an advanced technology whose aim is to cover real-life problems without any user interaction. It addresses challenges that occur in AI and enhances the productivity and capability to generate faster solutions to real-life problems. In automated ML, machines begin their work with the observation of data and take decisions directly on the basis of experience and instructions.8 The main aim of building an automated machine is that work can be done smoothly without user intervention. Its main focus is on automatically developed programs that can access data and use them for self-learning. Automated ML deals with all the problems
and approaches which come under ML and data science. The traditional approach to ML is a time-consuming, resource-intensive, and challenging process, and building ML models has always required a set pattern of steps. AutoML allows data analysts and developers to train their own custom models, and it brings together data engineers, data scientists, and ML experts to make ML advances easier than ever to implement.

Automated Machine Learning Includes
Automated ML provides a solution to a problem automatically; the following steps come under this process:
Benefits of Automated Machine Learning
• Helps to reduce cost with advanced features
• Enhancement in productivity with better reliability
• Data scientists and engineers can get data very fast for further processing
• Quick user responses, revenues, and user satisfaction
• Excellent speed in developing advanced models with more accuracy

Why Automated Machine Learning Is Important
• Manually preparing an ML model is a very complex task; this process requires mathematical expertise, domain knowledge, and programming skills, so there is a need to develop automated ML models.
• Constructing an ML model manually is a multistep process which requires computer science skills, domain knowledge, and expertise in these areas from a company or data scientist.
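As a rough illustration of the idea behind automated model selection (this is only a minimal sketch, not a full AutoML system; the candidate models and dataset are assumptions), several models can be scored and compared automatically:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate models are evaluated automatically; no manual trial-and-error is needed.
candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in candidates.items()}
print(scores)
print("selected model:", max(scores, key=scores.get))
```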
2.5 MACHINE LEARNING AND IOT
2.5.1 INTERNET OF THINGS
The IoT is a system of different connected physical devices that are accessed over the Internet. The Internet of Things connects these devices so that they can share their information with each other easily. It is an interesting field through which we can make our work automatic. Any type of device or gadget, such as a watch, remote, doorbell, TV, phone, lights, and so on, can be connected through the Internet for sharing information.
FIGURE 2.11 Internet of Things.
The Internet of Things enables quick access at any time and any place with the help of sensors, drones, actuators, and surveillance cameras. This is all possible with the help of IoT components. A number of components are used in IoT to provide automatic connectivity, and with their help multiple tasks and entries can be performed from a single place.

Components of IoT
• Devices and sensors: The physical devices which we can connect, for example, phone, TV, light, air conditioner, refrigerator, and
others. Various sensors and devices are used to collect data from the surrounding environment. The collected data contain various degrees of complexity and are presented in the form of temperature, volume, digital readings, audio, video, or any other mode. These sensors and devices help to collect data from various sensing devices and from the surrounding environment, and the data gathered by them show various degrees of variation.
FIGURE 2.12 Components of IoT.
• Cloud computing: Cloud computing provides the cloud, which is used to collect large amounts of data over the Internet; various resources are used for this purpose.
• User interface: End users interact with the machine in the form of queries. This is how any type of user can take advantage of the facility through their email, smartphone, or any other mode.
• Networking connection
• Gateway: Acts as a bridge between sensors and clouds. The data collected are sent to the clouds through different intermediate media or resources; various sensors establish connections through networks, satellite, WAN, Wi-Fi, and so on.
FIGURE 2.13 Applications of IoT.
The goals of IoT are to
• provide more comfort to the user within a limited time span;
• act as a bridge between the physical world and the virtual world;
• enable things to be connected anytime, at any place;
• make the Internet more reliable and pervasive.
FIGURE 2.14 Application domain.
2.5.2 APPLYING ML TO IOT DATA
Machine learning and IoT work together in the current era to handle the data generated by sensors and smart devices: one collects the data, and the other, by its intelligence, generates useful information and returns it to the user according to individual requirements. There are many ML algorithms covering the different techniques that are helpful in handling IoT data. One use case of ML is using a support vector machine for controlling smart city traffic. IoT is a collection of embedded technologies connecting wired and wireless communication, sensors, physical devices, and actuators to the Internet.9
FIGURE 2.15 IoT with ML.
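As a hedged example of applying ML to IoT sensor data (the temperature readings, the injected faults, and the contamination rate below are all made up for illustration), anomalous readings can be flagged with an isolation forest:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Simulated temperature readings from an IoT sensor, with a few injected faulty values.
readings = np.concatenate([rng.normal(25.0, 0.5, 500), [35.0, 5.0, 40.0]]).reshape(-1, 1)

detector = IsolationForest(contamination=0.01, random_state=0).fit(readings)
flags = detector.predict(readings)  # -1 marks a reading judged anomalous
print("flagged readings:", readings[flags == -1].ravel())
```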
Machine learning provides an effective and promising solution for IoT applications, which is the reason why ML came into the picture for IoT: it enables data processing, information inference, and intelligence for IoT devices.

2.5.3 MACHINE LEARNING AND IOT SECURITY
Current technology keeps growing, and with the help of upgraded devices it is easy to interact at any place at any time. The productivity of
input sensors and external factories is growing rapidly with ML. Output data analysis has become easier with ML, with automatic predictions about variables and other factors. ML is helpful for industry: in industrial applications, ML is used to produce fast and accurate results in less time, with predictive capabilities. IoT-collected data are sent to the servers, and ML plays an important role in analyzing the collected data against previous data with the help of ML algorithms; the results are displayed on a smartphone in a few seconds. ML-based IoT services have now become more powerful, secure, fast, and intelligent.11,12

Machine-Based IoT Security
• Problems in IoT can be detected quickly and solved easily using ML.
• Using ML in IoT, less bandwidth is used, minimizing WAN bandwidth consumption.
• ML helps to address privacy; user biometric data are stored locally.
• ML is capable of handling IoT hardware: modern CPUs, GPUs, and TPUs can run ML/AI algorithms.

2.6 MACHINE LEARNING AND CLOUD COMPUTING
FIGURE 2.16 Cloud computing. Source: Reprinted from Ref. [15].
2.6.1 CLOUD OR CLOUD COMPUTING
Cloud computing is also known simply as "the cloud." Cloud computing is the latest way to provide effective and speedy digital transformation. It is a platform that entails storing and accessing large amounts of data over the Internet; people are allowed to share and access the data at a broader level, and the data served on the cloud provide open access for people to collect useful information. It is Internet-based computing where most services, such as database management, storage, servers, and applications, are delivered by a client, a company, or any source on the Internet. It is a secure way of managing any business from anywhere in the world; it needs only a good Internet connection to provide highly secure, cost-effective, and useful resources. There are a number of cloud computing services available for business needs and requirements. Some cloud service vendors are the following:
1. Amazon Web Services
2. Microsoft Azure
3. Google Cloud Platform
4. IBM Cloud Services
5. Adobe Creative Cloud
6. SAP
7. Navisite
8. Dropbox
9. VMware
2.6.2 TYPES OF CLOUD
Cloud services are provided on the basis of cloud type, of which there are mainly four: private, public, hybrid, and community cloud.
FIGURE 2.17
Types of clouds.
TABLE 2.2  Various Types of Clouds

Private Cloud: Data operated solely by the cloud company and managed by the same or by a third party. Hosting is internal, for a single organization, with access from outside restricted.
Public Cloud: Data available to the public and managed by a third party. Cloud services are provided by the company and accessed by the public over the web; you pay for what you use.
Hybrid Cloud: Both public and private types of data are available here, managed by the company or a third party. It is a combination of private, public, or community clouds. Hosting can be internal as well as external. Supports portability features.
Community Cloud: A cloud shared by similar companies owing to shared concerns and managed by one of the companies or by a third party. Hosting is internal or external, by one organization at a time. Promotes cooperation.
Cloud Service Models
1. IaaS—Infrastructure as a Service
• This service includes resource provisioning, OS management, and other software applications.
• Server, storage, and network.
• Gives infrastructure as a service.
• Storage information, network capacity, and other essential resource information are kept by IaaS.
• Virtual infrastructure manager.
2. PaaS—Platform as a Service
• PaaS gives the facilities of the infrastructure plus a platform with its own software applications.
• Provides the platform as a service.
• Aimed at application/software developers.
• Deploy customer-created applications to a cloud.
• Cloud development environment.
3. SaaS—Software as a Service
• Service entirely based on the Internet.
• Aimed at end users.
• Use the service provider's applications over a network.
• Accessed through a web browser.
FIGURE 2.18
Cloud service models.
2.6.3 FEATURES OF CLOUD COMPUTING
Cloud computing is a recent technology whose name works like a metaphor for the Internet; it has many features. Some of the features are the following:
FIGURE 2.19
Features of cloud computing.
1. Great Availability of Resources—Many physical and virtual resources provide services according to customers' demands. In general, customers have no idea of, or control over, where the data are stored. Cloud computing helps them access every type of resource.
2. Large Network Access—The cloud contains a large amount of data uploaded by customers, servers, or organizations. People can access data from anywhere at any time with the help of a device and Internet connectivity.
3. On-Demand Self-Service—Customers have permission to connect with the cloud; they can access or monitor their own data according to their needs. Customers can monitor or check their data, server uptime, and allotted network storage space. It is a very important feature that gives customers permission to check computing capabilities.
4. Automatic System—The cloud automatically analyzes and arranges the user's data, and all updates are done automatically by the cloud. It provides a transparent environment for host and customer.
5. Availability—The cloud can be modified, and any customer can extend their storage space on the cloud by buying more cloud storage.
6. Security—Cloud computing provides better security features. The cloud keeps a copy of the original data, so whenever servers lose their data, it can be recovered from the cloud.
7. Economical—Cloud computing is a one-time investment technique.
8. Easy Maintenance—Cloud architecture is designed in a way that maintenance takes place very easily.
9. Pay as You Go—The cloud provides services on a paid basis; we pay for as much of the service as we want to use.
FIGURE 2.20
Features of cloud computing.
Security Issues with Cloud Computing
For data processing and data storage, cloud computing plays the main role in making services more digitized, but along with privacy and security there are some issues as well, such as the following:
• Data breaches: On the cloud, data are stored collectively, in massive form. There is a chance of hacker attacks in which the data can be misused. Hackers may damage the data and violate confidentiality, resulting in damaged data or server failure.
• Shared technology, shared dangers: Cloud computing helps share resources, and this may be a cause for distributing danger to other clients as well. Resource sharing may cause a situation where
wrong data are delivered along with resources to a client's server, ultimately damaging the original data.
Machine Learning Role in Cloud Computing
Machine learning is a field that helps train the machine so that it can afterwards work autonomously with minimal human interaction. The goal of ML in the cloud is to enhance the capabilities of the cloud with the help of ML, making the cloud more intelligent and able to learn from the data stored on it. The cloud can then work with more efficiency and accuracy, and it can predict or analyze situations in a better way with the help of ML.11 Machine learning works well with cloud computing servers because of the low cost of operations, expandability, and good processing speed. So, ML with cloud computing is more beneficial, as data handling is much easier.
2.7 APPLICATIONS OF MACHINE LEARNING
There are different areas where we can use ML. Some of the application areas are the following:
1. Travel and Hospitality
• For traffic pattern recognition and congestion control
• For prediction and dynamic pricing
• For text typing
• For aircraft scheduling
2. Manufacturing
• For demand forecasting
• Product optimization
• For analyzing the process
• Designing templates
3. Health Care and Life Science
• For proactive health management
• For alerts and diagnostics from real-time patient data
• Discovery of new diseases
• Disease diagnosis systems
• Drug discovery and manufacturing
4. Retail
• For generating recommendation engines
• For calculating lifetime value and customer feedback
• For inventory planning
• Matching people with products
• Selling and up-selling analysis
5. Energy, Feedstock, and Utilities
• For energy demand and supply optimization
• For smart grid management
• Data processing
• Power usage analytics
6. Financial Services
• For credit card evaluation
• Fraud detection
• For risk management
• For document analysis
• For customer segmentation
• For risk analytics and regulation
7. Education
• Personalized learning
• Assessment
• Adaptive learning
• Predictive analytics
• Increasing efficiency
• Learning analytics
• Online learning
8. Agriculture
• Soil management
• Water management
• Crop management
• Yield prediction
• Crop quality
• Weed detection
FIGURE 2.21
Machine learning applications.
KEYWORDS
• machine learning
• techniques
• applications
• challenges
REFERENCES
1. Yu, F. R.; He, Y. Introduction to Machine Learning. In Deep Reinforcement Learning for Wireless Networks; Springer: Cham, 2019; pp 1–13.
2. Kubat, M. An Introduction to Machine Learning; Springer International Publishing AG, 2017.
3. Kodratoff, Y. Introduction to Machine Learning; Elsevier, 2014.
4. Bonaccorso, G. Machine Learning Algorithms; Packt Publishing Ltd, 2017.
5. Aldwairi, M.; Hasan, M.; Balbahaith, Z. Detection of Drive-By Download Attacks Using Machine Learning Approach. In Cognitive Analytics: Concepts, Methodologies, Tools, and Applications; IGI Global, 2020; pp 1598–1611.
6. Lantz, B. Machine Learning with R: Expert Techniques for Predictive Modeling; Packt Publishing Ltd, 2019.
7. Simard, P. Y.; Amershi, S.; Chickering, D. M.; Pelton, A. E.; Ghorashi, S.; Meek, C.; Wernsing, J. Machine Teaching: A New Paradigm for Building Machine Learning Systems, 2017. arXiv preprint arXiv:1707.06742.
8. Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J. T.; Blum, M.; Hutter, F. Auto-Sklearn: Efficient and Robust Automated Machine Learning. In Automated Machine Learning; Springer: Cham, 2019; pp 113–134.
9. https://www.geeksforgeeks.org
10. https://github.com/mhrezvan/SVM-on-Smart-Traffic-Data
11. https://www.internetsociety.org/iot/iot-security-policy-platform/
12. https://www.iot.org.au/
13. Carrasco, O. C. Gaussian Mixture Models Explained: From Intuition to Implementation. Towards Data Science, Jun 3, 2019. https://towardsdatascience.com/gaussian-mixture-models-explained-6986aaf5a95
14. Jeffares, A. K-means: A Complete Introduction. Towards Data Science, Nov 19, 2019. https://towardsdatascience.com/k-means-a-complete-introduction-1702af9cd8c
15. Peters, L. Cloud Computing Trends for 2019. Networks Unlimited, February 22, 2019. https://www.networksunlimited.com/cloud-computing-trends-for-2019/
CHAPTER 3
Data Science
ARTI CHAUHAN and ASHIRWAD SAMUEL
Department of Computer Science & Engineering, School of Engineering, G. D. Goenka University, Haryana, India
ABSTRACT
What is data science? The simplest definition is that it is the study of data. Real-world data are raw, and data science uses tools and techniques to extract meaningful information from the raw data. It incorporates different fields such as statistics, mathematics, computer engineering, machine learning, data mining, and artificial intelligence to analyze large amounts of data. Nowadays there are various applications available to automatically capture and store this large amount of data, such as online systems and payment portals. Organizations are overwhelmed with this huge volume of data and want to make inferences from it so as to enhance business and productivity and also to give users a better experience. Data science is helping to reveal gaps and uncover new patterns, be it in health, medicine, finance, or e-commerce. Data science is a broader term that covers multiple challenges, such as capturing, cleaning, and transforming data to finally make inferences from it. Data mining, on the other hand, is mainly about extracting knowledge and unknown patterns from huge amounts of data; hence, it is also called the "knowledge discovery process." Machine learning is an automated technique that uses complex algorithms for data processing and provides a trained model as output; we can say that it is a technique to train a model on the given data and make predictions. Artificial intelligence goes one step further and uses machine
learning algorithms to make intelligent systems that can work on their own. These techniques have made data processing faster and much more efficient. It is because of the different expertise required in this field that data science is showing strong growth.
3.1 DATA SCIENCE AND DIFFERENT DATA ANALYTICS TECHNIQUES
Data science has been impacting our lives for a decade now. Are you aware that the majority of the data science approaches we use today began in the early 1900s and have been used in academia since then? It is also helping us with many daily activities that we do not even recognize, such as searching Google's web pages, where in just a blink billions of pages appear, and Google Maps, which we use on a daily basis to reach a destination. Did we ever wonder how Google is doing that? Is there data science behind all this? If yes, and if it has always been the thing, then why is it the buzzword around the corner? A study by SINTEF in 2013 stated that 90% of the world's data had been created in less than 2 years. The Global Datasphere, according to the International Data Corporation (IDC), was expected to grow exponentially from 4.4 ZB to 33 ZB between 2013 and 2018. IDC expects that by 2025 the Global Datasphere will have grown to 175 ZB and will continue to rise. Can you even imagine how big 175 ZB is? A thousand exabytes, a billion terabytes, or a trillion gigabytes are all approximations of a ZB, which is 2 to the 70th power bytes. If every gigabyte in a zettabyte were a kilometer, it would be the equivalent of walking approximately 25,000 rounds around the world (40,075 km). These data are enormous, much beyond our wildest dreams, yet one issue arises: where does this massive amount of data come from? This is answered by Kenneth Neil Cukier and Viktor Mayer-Schoenberger's article, "The Rise of Big Data," published in the May–June 2013 edition of Foreign Affairs. In it, they came up with a term called "datafication." Mainly, it is the process of storing what we do online. They define datafication as a process of turning every aspect of life into data. In this era of social media, with high-speed Internet carrying huge amounts of information, every one of us is contributing to the data
heap every time we turn to our search engines for answers. On average, more than 40,000 searches are processed by Google every second, which accumulates to 3.5 billion searches per day. Though most of the searches are being made on Google, do not forget the contribution of other search engines as well; there are 5 billion searches a day in the world. The giant social networking site Facebook, with an active user count of 2.2 billion monthly, generates a heap of 4 PB of information each day. Other social media sites have also shown impressive growth: every day approximately 95 million pictures are shared on Instagram and a similar number of videos are uploaded, and the "stories" feature of Instagram is used by 100 million people daily. You can yourself imagine the data being generated with this. Skype users are making approximately 180,000 calls every minute, and text messages contribute approximately 16 million daily. Further data are generated by online shopping, Netflix, Amazon, YouTube video streaming, Uber rides, weather forecasting, Snapchat, LinkedIn, and so on. You name it and see millions of data points being generated. But again the question arises: what are we going to do with this huge data heap collected day by day? Where is it being stored? How can these raw data be useful? These questions are being answered by data science. If we look at the broad definition of data science, it is extracting knowledge or useful patterns from datasets by using different methodologies. It is also explained as an interdisciplinary field taking into account statistics, computer science, and mathematics to get insights from data in different forms. Think of it like this: you have data. You want to get insights from it, so as to make better business decisions, but for that, the data need to be relevant and well organized as per your requirement. So, now you are required to convert your raw data into a well-organized form that can be processed further to get useful information to fulfill your needs. You must differentiate here between data and information. Data are in a raw form that is unorganized and might not be fully relevant to your work requirement. But once the data are organized and processed properly to provide useful insights and patterns that ultimately provide knowledge, the data become informative, and now this is called information, which is going to help you in better decision-making. All the steps that are performed to store and convert raw data into useful information, and the further analysis of that information for future predictions, are collectively termed data science. Data are processed at different
stages using different analytical techniques as per the need of the hour, and you must get acquainted with these technologies. There are four main kinds of data analytics, as described below:
1. Descriptive analytics—Out of all the four, this is described as the simplest form. It interprets the raw data and provides insights into the past, that is, it guides us about the things that took place in the past. Let us clear our understanding with the example mentioned below. You must be aware of the marketing campaigns that many organizations run using different types of media. One such medium is e-mail, which is immensely credible and profitable. Our e-mail inboxes are sometimes full of such marketing e-mails. The role of these e-mails is very important in enhancing the sales of a product. Hence, whenever a company runs an e-mail campaign, or any campaign, it becomes important for them to analyze that campaign to understand whether the campaign was successful or not. You must be wondering here what they analyze. The e-mails are analyzed on different aspects, like how many people opened the e-mails, which shows how many people the campaign was actually able to reach. Further, the link inside the e-mail is analyzed to check how many users clicked and viewed the product, and next, customer purchase behavior is analyzed, showing whether the campaign really helped in improving the conversion rates. To collect all such data for analysis based on different metrics, the e-mails, links, and websites are properly tagged with JavaScript tags, and the data are properly collected and stored in database systems. Hence, descriptive analytics is helpful in the summarization of the raw data, thus providing useful information.
2. Diagnostic analytics: This technique takes a deeper look at the data and tries to find out the root cause of an event. In other words, it tells why something happened in the past. For example, let us continue with the above marketing campaign example. Once the campaign is analyzed and it is clear whether it was successful or not, then there is a need to get insights into the question of why. Why was the campaign successful, or, if it failed, then why did it fail? These insights help in future campaign launches. Hence, if you need to
look into why anything happened, diagnostic analytics is the best choice. 3. Predictive analytics: As the name suggests, it is used to predict future outcomes of an event. In other words, it helps in predicting the probabilities of the happening of an event. A predictive model is fundamentally built on descriptive analytics and it uses machine learning algorithms to predict outcomes of the event. For example, based on past knowledge of different marketing campaigns using predictive analytics the probability of performance of a new marketing event can be examined. Hence, predictive analytics plays a vital role in the forecasting of an event. 4. Prescriptive analytics: It is based on descriptive and predictive analytics but emphasizes more on finding the best possible solution for a given situation. It can also be considered as an extension of predictive analytics. In particular, prescriptive analytics takes “what we know” (descriptive analytics) and tries to predict “what could happen” (predictive analytics) and further suggests the best possible solution (“what you should do”). For this, prescriptive analytics uses machine learning, artificial intelligence, and different algorithms to conclude. As we have understood descriptive and prescriptive analytics better with the marketing campaign example, prescriptive analytics too plays a major role in marketing. For data-driven marketing, take all the descriptive, diagnostic, and predictive data and further analyze it with prescriptive techniques. This can help reach better outcomes and thus better decision-making. Marketing managers can use prescriptive analytics to create good marketing campaigns. For example, finding out which product is trending in which area and why? Also, it helps in analyzing future trends, hence allowing marketers to launch targeted and timely campaigns. 3.2 DATA SCIENCE PROCESS So, now you all must have got a pretty good idea about the techniques that are used in data science. But I know it still needs clarity, when I started
with data science and had to complete a task, it was pretty confusing where I should start, which technique I should use first, and what the first step should be. Do not worry; we are going to get answers to all such questions that come up when one is new to the field. Now, suppose you have been given a problem and, being a data scientist, you have been asked to work on it and provide the solution. The first question you might come up with is the same one we just discussed: where do you start? The data science process or life cycle remains pretty much the same despite the many different problems and datasets. Be it finance, education, health care, marketing, or any other complex business problem, the steps involved in untangling the problem remain almost the same. These steps primarily involve the following:
1. Framing the problem or setting the research goal—Before moving toward the major analytical steps, you must have a proper understanding of the problem you are handling. First, get clear about what you are solving and why.
2. Data collection/retrieval—Once you have set up the research goal, you must check whether you have the data available as per the problem you need to solve. Most of the time, the organization in which you are working will have all the data stored in the DBMS/NoSQL databases that they use to collect or store data; then your work is to retrieve the data useful to you from that heap of data dumped in the database. If not, then you must work on collecting the data as per the requirement.
3. Data preprocessing—Once you have the necessary data with you, what next? You should start preparing that data. The analytical techniques that you are going to apply need properly structured, neat, and consistent data, but the data that you have collected might contain noise and errors and could be inconsistent. The data could have missing values, inconsistent records, outliers, and many other challenges that you might have to handle before applying different techniques. This step is also called data cleaning and wrangling, where you work with the
inconsistencies of the data and try to prepare a stable, consistent, and structured form of the data.
4. Exploratory data analysis—Once you have transformed the raw data into clean, usable data that is fit for use, the next step is to explore the data and get a deep understanding of it. This can be done by applying statistical methods and visualization techniques. This is where descriptive analytics plays a role. Descriptive statistics such as measures of central tendency (mean, mode, and median) and measures of data dispersion (quartiles, variance, and interquartile range) are very useful in understanding the distribution of the data. Further, data can be visualized by plotting bar graphs, pie charts, line charts, scatter plots, box plots, histograms, correlations, and many other techniques to have a better understanding and visual examination of the data. This step is very helpful in identifying hidden patterns in the data. You also get an understanding of the key attributes that you will use in modeling.
5. Data modeling—Once you have a better understanding of the data, with different patterns and insights from it, you go for data modeling; but do remember that not all patterns are useful for your final findings. You further need to do pattern evaluation to identify patterns and features that are beneficial as per your research goal. These features are then used in model building with different methodologies and learning techniques such as machine learning algorithms. Once you have your model built, it becomes very important to evaluate its performance and to know whether it best summarizes the data. There are many different model evaluation techniques, like cross-validation, jackknife, and bootstrapping, that can be used to validate the model. Once you are fully satisfied with the performance of the model, you can further use it to predict and find inferences from unknown data.
6. Data visualization—There you come to the final step of the data science process, presenting the analysis results. All your work will go in vain unless you represent your findings in an efficient manner. Data visualization is the process of presenting your findings from the data through visual representation for easy understanding.
50
Computational Science and Its Applications
The below diagrams clearly explain the data science procedure:
Now let us understand the data science process through a case study using Python. You must be wondering, why Python? Python is one of the easiest and simplest programming languages. Because of its ease of use and vast selection of machine learning libraries, it is the most widely used language by data scientists.
First step: First, get yourself familiar with the dataset that you will be using for analysis, from which you need to find some useful insights and later use those insights for data modeling. This particular dataset has been taken from Kaggle for the sole purpose of analysis.
The description of the dataset is given below. The dataset contains historical sales data of a supermarket company, recorded in three different branches over 3 months. Attribute information:
• Invoice id: Unique invoice identification number
• Branch: Three branches of the supercenter, marked as A, B, and C
• City: Location of the supermarkets
• Customer type: Type of customer, Member versus Normal
• Gender: Gender of customers
• Product line: Categorization of general items
• Unit price: Price of products in $
• Quantity: Amount of products purchased
• Tax: 5% tax
• Total: Total price, including tax
• Date: Date of purchase (January 2019 to March 2019)
• Time: Time of purchase (10 a.m. to 9 p.m.)
• Payment: Payment method used by customers
• COGS: Cost of goods sold
• Gross margin percentage: Net sales minus the cost of goods sold
• Gross income: Total income
• Rating: Ratings by customers on a scale of 1–10
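Before the preprocessing described in the next step, the dataset first has to be loaded and inspected. A minimal sketch using pandas is given below; the file name supermarket_sales.csv is only an assumption about how the Kaggle download is saved locally.

# Load the supermarket sales dataset and take a first look at it.
# The file name is an assumption about the local copy of the Kaggle file.
import pandas as pd

df = pd.read_csv("supermarket_sales.csv")
print(df.head())          # first five rows (cf. Table 3.1)
print(df.shape)           # expected: (1000, 17)
df.info()                 # column types and non-null counts (cf. Table 3.2)
print(df.isnull().sum())  # missing values per column (all zero here)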
Second step: Once you are well acquainted with the dataset, start with the preprocessing of the data; clean the dataset by treating null values and any other inconsistencies, if present. The null-value check (see Table 3.2) shows that there are zero null values present in the dataset, which is great. As you understand the dataset, you can figure out which columns can provide some good insights and which are of no use, so it is better to remove the unnecessary columns. Hence, here we will be removing the column Tax 5% and the column Invoice ID. Have you noticed one thing while checking the brief summary of the dataset using the df.info() command? The type of the column Date was object; instead, it should be a date type. Hence, we need to change the type of the column.
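A hedged sketch of these cleaning steps, assuming the DataFrame df loaded above, might be:

# Drop columns that add no insight for the analysis.
import pandas as pd

df = df.drop(columns=["Tax 5%", "Invoice ID"])

# Convert the Date column from object (string) to datetime64[ns].
df["Date"] = pd.to_datetime(df["Date"])

# Derive month-, day-, and hour-level columns for later month-wise,
# day-wise, and hour-wise analysis.
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Year"] = df["Date"].dt.year
df["Hour"] = pd.to_datetime(df["Time"], format="%H:%M").dt.hour

print(df["Date"].dtype)   # should now print datetime64[ns]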
TABLE 3.1  Dataset Used from Kaggle (first five rows)

Row 0: Invoice ID 750-67-8428; Branch A; City Yangon; Customer type Member; Gender Female; Product line Health and beauty; Unit price 74.69; Quantity 7; Tax 5% 26.1415; Total 548.9715; Date 1/5/2019; Time 13:08; Payment Ewallet; Cogs 522.83; Gross margin percentage 4.761905; Gross income 26.1415; Rating 9.1
Row 1: Invoice ID 226-31-3081; Branch C; City Naypyitaw; Customer type Normal; Gender Female; Product line Electronic accessories; Unit price 15.28; Quantity 5; Tax 5% 3.8200; Total 80.2200; Date 3/8/2019; Time 10:29; Payment Cash; Cogs 76.40; Gross margin percentage 4.761905; Gross income 3.8200; Rating 9.6
Row 2: Invoice ID 631-41-3108; Branch A; City Yangon; Customer type Normal; Gender Male; Product line Home and lifestyle; Unit price 46.33; Quantity 7; Tax 5% 16.2155; Total 340.5255; Date 3/3/2019; Time 13:23; Payment Credit card; Cogs 324.31; Gross margin percentage 4.761905; Gross income 16.2155; Rating 7.4
Row 3: Invoice ID 123-19-1176; Branch A; City Yangon; Customer type Member; Gender Male; Product line Health and beauty; Unit price 58.22; Quantity 8; Tax 5% 23.2880; Total 489.0480; Date 1/27/2019; Time 20:33; Payment Ewallet; Cogs 465.76; Gross margin percentage 4.761905; Gross income 23.2880; Rating 8.4
Row 4: Invoice ID 373-73-7910; Branch A; City Yangon; Customer type Normal; Gender Male; Product line Sports and travel; Unit price 86.31; Quantity 7; Tax 5% 30.2085; Total 634.3785; Date 2/8/2019; Time 10:37; Payment Ewallet; Cogs 604.17; Gross margin percentage 4.761905; Gross income 30.2085; Rating 5.3

Make sure that you are familiar with Python, as we will be using it to perform the analysis; the steps shown below were performed in a Jupyter notebook.

Some information regarding the dataset:
RangeIndex: 1000 entries (0–999)
Data columns (total 17 columns)
TABLE 3.2  Dataset Info.

Column                      Non-null count    Dtype
Invoice ID                  1000 non-null     object
Branch                      1000 non-null     object
City                        1000 non-null     object
Customer type               1000 non-null     object
Gender                      1000 non-null     object
Product line                1000 non-null     object
Unit price                  1000 non-null     float64
Quantity                    1000 non-null     int64
Tax 5%                      1000 non-null     float64
Total                       1000 non-null     float64
Date                        1000 non-null     object
Time                        1000 non-null     object
Payment                     1000 non-null     object
Cogs                        1000 non-null     float64
Gross margin percentage     1000 non-null     float64
Gross income                1000 non-null     float64
Rating                      1000 non-null     float64

Dtypes: float64(7), int64(1), object(9). Total rows: 1000, total columns: 17.

Checking for null values: every column, from Invoice ID through Rating, shows 0 null values (dtype: int64).
After changing the type of the Date column:
0    2019-01-05
1    2019-03-08
2    2019-03-03
3    2019-01-27
4    2019-02-08
5    2019-03-25
6    2019-02-25
7    2019-02-24
8    2019-01-10
9    2019-02-20
Name: Date, dtype: datetime64[ns]
Sometimes there is a requirement to find insights month-wise, date-wise, hour-wise, and so on; hence it is a good practice to convert the column to a datetime type. This is how one should clean the data: first explore the dataset properly, find discrepancies and inconsistencies in it, and then treat them with proper values to get neat and consistent data. Now, let us move ahead to the next step.
Third step: Now we have a clean dataset, which we need to dive deep into to find useful insights, information, and patterns that can be used further in data modeling. Here, we also perform statistical analysis. This step is exploratory data analysis (EDA).
• There are different ways to go about EDA.
• It is sometimes better to prepare a questionnaire and then plan to answer those questions from the available dataset.
1. A few examples are:
a. What are the different available columns in the dataset?
b. What are the categorical columns?
c. What are the different customer types?
d. What is the customers' count on the basis of their gender?
e. What is the count of customers on the basis of customer type?
f. What is the average sale per branch?
g. What is the average profit per branch?
2. You can frame such questions, starting from simple ones and moving to more complex ones, and then try to answer them. Once you start answering, you will get some good insights from the data and can move slowly through the EDA step. It is also good if you visualize the dataset or answer these questions in the form of different graphs; that gives more clarity and a much better understanding.
3. Follow the steps below to understand more.

Columns in the table: Invoice ID, Branch, City, Customer type, Gender, Product line, Unit price, Quantity, Tax 5%, Total, Date, Time, Payment, Cogs, Gross margin percentage, Gross income, Rating, Date time, Hour, Day, Month, Year

Categorical columns: ['Branch', 'City', 'Customer type', 'Gender', 'Product line', 'Time', 'Payment']

Member 501, Normal 499 (Name: Customer type, dtype: int64)
Female 501, Male 499 (Name: Gender, dtype: int64)
Female 261, Male 240 (Name: Gender, dtype: int64)
Male 259, Female 240 (Name: Gender, dtype: int64)
Yangon 340, Mandalay 332, Naypyitaw 328 (Name: City, dtype: int64)

Now, do you see the difference between simply counting the numbers, as we did earlier, and plotting them instead, which gives a better understanding of the patterns in the dataset? The plot above makes it clear that there are mainly two types of customers in the dataset, described as Member and Normal, and that there is no great difference in their counts. In a similar manner, different plots can be plotted, as shown below. The city-wise plot clearly shows the popularity of home and lifestyle products in the city of Yangon; food and beverages and fashion accessories are more popular in Naypyitaw; and in Mandalay, sports and travel and fashion accessories are sold in high numbers. If you want to look at the overall sales of products combining all three branches, refer to the graph below; it shows that fashion accessories are sold the most.
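A hedged sketch of how such counts and count plots might be produced (assuming the cleaned DataFrame df and the seaborn and matplotlib libraries):

import matplotlib.pyplot as plt
import seaborn as sns

# Raw counts, as printed above.
print(df["Customer type"].value_counts())
print(df["Gender"].value_counts())
print(df["City"].value_counts())

# The same information as plots gives a much quicker read of the patterns.
fig, axes = plt.subplots(1, 3, figsize=(16, 4))
sns.countplot(data=df, x="Customer type", ax=axes[0])
sns.countplot(data=df, x="Customer type", hue="Gender", ax=axes[1])
sns.countplot(data=df, x="City", hue="Product line", ax=axes[2])
plt.tight_layout()
plt.show()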
TABLE 3.3  Dataset Summary

Attribute                 Count  Mean        Std           Min       25%         50%       75%         Max
Unit price                1000   55.672130   26.494628     10.0800   32.875000   55.2300   77.935000   99.96
Quantity                  1000   5.510000    2.923431      1.0000    3.000000    5.0000    8.000000    10.00
Total                     1000   322.966749  245.885335    10.6785   124.422375  253.8480  471.350250  1042.65
Cogs                      1000   307.587380  234.176510    10.1700   118.497500  241.7600  448.905000  993.00
Gross margin percentage   1000   4.761905    6.220360e-14  4.761905  4.761905    4.761905  4.761905    4.761905
Gross income              1000   15.379369   11.708825     0.5085    5.924875    12.0880   22.445250   49.65
Rating                    1000   6.972700    1.718580      4.0000    5.500000    7.0000    8.500000    10.00
Hour                      1000   14.910000   3.186857      10.0000   12.000000   15.0000   18.000000   20.00
Day                       1000   15.256000   8.693563      1.0000    8.000000    15.0000   23.000000   31.00
Month                     1000   1.993000    0.835254      1.0000    1.000000    2.0000    3.000000    3.00
Year                      1000   2019.0      0.0           2019.0    2019.0      2019.0    2019.0      2019.0
FIGURE 3.1
Count of customer type.
FIGURE 3.2
Count of customer type per branch.
Data Science
59
FIGURE 3.3
Count of customers as per genders in each branch.
FIGURE 3.4
Product popularity by city.
The above plot compares product sales per hour across the branches, by product; with this, we can clearly identify which product is in demand during which hour of the day, so that the stock can be refilled if the product is not available. The graph above also shows that there is a spike in male customers shopping during the daytime around 2 p.m., and around 6 p.m. there is a drop in shopping by female customers. Female customers mostly shop between 10 a.m. and 12 noon.
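One hedged way such hour-wise comparisons might be produced (a sketch, assuming the Hour column derived during preprocessing):

import matplotlib.pyplot as plt

# Hour-wise sales totals split by gender; a line plot makes the afternoon
# spike and the evening drop described above easy to spot.
hourly = df.groupby(["Hour", "Gender"])["Total"].sum().unstack()
hourly.plot(kind="line", marker="o", figsize=(8, 4),
            title="Sales per hour by gender")
plt.xlabel("Hour of day")
plt.ylabel("Total sales")
plt.show()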
FIGURE 3.5
Total sales per product.
FIGURE 3.6
Sales per hour over branch basis product.
Data Science
61
FIGURE 3.7
Sales per hour basis gender.
FIGURE 3.8
Sales per hour basis customer type.
The above graph shows a comparison between customer types. It shows that around the 10th day of the month there is a spike in shopping by Member customers, and on the same day there is a drop in shopping by Normal customers. It is possible that during that day some discount was provided to Member customers, which is why sales were high; we can figure out the reason for this same-day spike and drop later. This is where diagnostic analytics comes into play, to answer why this happened. In the plot above, a comparison is made between branches on the basis of customer gender to understand their contribution to the profit of the supermarket. The boxplot also informs us about outliers in the data; outliers are data points that differ in behavior from the other observations. As shown in the graph, most of the male customers in Branch C contribute up to a maximum value of 45, but there are two customers, shown as outliers, who contribute more than 45. Hence, it is a great plot for exploring the dataset.
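A sketch of the boxplot comparison just described, assuming gross income is the contribution measure being compared across branches and genders:

import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of gross income per branch, split by gender; points beyond the
# whiskers are the outliers mentioned above.
sns.boxplot(data=df, x="Branch", y="gross income", hue="Gender")
plt.title("Gross income per branch by gender")
plt.show()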
FIGURE 3.9
Payment methods as per branch.
The above graph shows the different kinds of payment methods being used by customers. Also, you can figure out that in Branch C cash
payments are used more, while in Branch A the e-wallet is used more than the other methods.
FIGURE 3.10
Bar chart to show gross income versus branch.
The above graph shows that although Branch C generates more profit than the other two branches, there is still no major difference. You can also plot the same findings in a different manner, as shown below. The correlation heat map shows the correlation matrix between the attributes of the dataset. This can inform us about correlated attributes that affect each other.
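A minimal sketch of how such a correlation heat map might be generated from the numeric columns (assuming seaborn is available):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns, drawn as an annotated heat map.
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between data attributes")
plt.show()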
FIGURE 3.11
Gross income versus branch.
The above correlation heat map clearly shows: • Quantity shows a relationship with columns total, cogs, and gross income. This means that quantity is affecting the outcomes of these columns. • Similarly, unit price also affects the same columns. • Also, total and cogs are directly related to gross income. • Other than this, there is no such relation between the attributes. The three different graphs are showing some attributes with respect to date. This is a great way to visualize trends in the data.
FIGURE 3.12
Heat map showing correlation between data attributes.
FIGURE 3.13
Sale over time.
FIGURE 3.14
Profit generated.
FIGURE 3.15
Profit generated by branches.
It is a good practice to check the distribution of the data before moving to the modeling step, because if the data are skewed, the model will not produce proper outcomes, and before modeling we would need to unskew the dataset. The graph below shows how the data are distributed.
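A hedged sketch of such a distribution check, plotting histograms of Quantity and gross income to eyeball skewness before modeling:

import matplotlib.pyplot as plt

# Histograms to inspect how Quantity and gross income are distributed;
# a strongly skewed shape would suggest transforming the column first.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["Quantity"].plot(kind="hist", bins=10, ax=axes[0], title="Quantity distribution")
df["gross income"].plot(kind="hist", bins=20, ax=axes[1], title="Gross income distribution")
plt.tight_layout()
plt.show()
print(df[["Quantity", "gross income"]].skew())  # numeric skewness check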
FIGURE 3.16
Quantity distribution.
We can compare attributes with each other in one go, as shown in the plot below. It becomes easy to visualize and compare attributes and get insights from them.
FIGURE 3.17
Gross Income distribution.
This marks the end of the EDA phase.
Build the Model
Once you have clean data and a brief but good knowledge of how each attribute relates to the others, move on to the next step, that is, building the model. Model building is mainly part of machine learning. Machine learning models have the ability to learn from the data automatically and improve their outcomes without human intervention. Depending on the type of questions you want to answer, there are many different types of modeling algorithms available. Machine learning is basically divided into two types: supervised and unsupervised learning.
• Supervised learning is the modeling of labeled data; labeled data have predefined classes of objects. An example that can clarify this is categorizing e-mail as spam or not spam. When Gmail asks you to classify an e-mail as spam or not, this is labeling the data, and later Gmail itself predicts spam e-mails for you. Similarly, you can have labeled data as per the domain you are working in. I hope you now have an understanding of what labeled data is. Once a model is built, it can be used to predict values from
unknown and unlabeled data. Further techniques in supervised machine learning are classification to classify discrete categories; the above example of spam e-mail is an example of classification technique. Another is regression that works on continuous data values. We will understand regression in more detail in this section.
FIGURE 3.18
Correlation between the attributes.
• Another machine learning algorithm is unsupervised learning; as the name suggests it works on unlabeled data, data which does not have predefined classes/labels. It can be further categorized into clustering which group objects on the basis of similarities and association which finds relationships between the attributes in
the dataset. An example of an association technique is market basket analysis. Some other unsupervised algorithms are neural networks, the Apriori algorithm, and so on.
To move on with data modeling:
• First, try to figure out the question you want answers to. According to that, identify the modeling technique and the attributes that you will feed to the model.
• Second, generate the model.
• Third, compare the generated models.
Now, let us proceed with data modeling on the market data. The aim here is to find the "Quantity" to be sold to generate a particular amount of revenue, "gross income." Thus, this is a regression problem. You must have learned about regression in your mathematics or statistics class. The formula for regression is y = mx + c, where m is the slope, c is the intercept, y is the dependent variable, and x is the independent variable. You can choose differently as per the insight you want to get from the data. Another problem on the same dataset could be the classification of customers on the basis of their contribution to the revenue generation of the company. This is a crucial step for understanding your regular customers and retaining them. For that problem, you would first have to label the data and then generate a classification model on the labeled data, which could then be used on unknown data. Keeping that aside, let us import the libraries required for data modeling; for regression, we will use the library Scikit-learn. This is the library used most for machine learning in Python. It also contains a few inbuilt datasets that you can work on. As we are moving ahead with linear regression, the first step is to choose one dependent and one independent variable; as per our aim, we identify "Quantity" on the basis of the revenue "gross income." Hence, the dependent variable is quantity and the independent variable is gross income. Once the x and y variables are identified, split them into training and testing data. Training data are the data on which the model is trained, and testing data are the data on which the fitted model is tested. Here, we are splitting the data in an 80:20 split. Make sure that your data are large enough to get good, meaningful results. Also, the test and training sets should belong to the same dataset.
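A minimal sketch of this simple linear regression, with gross income as the independent variable, Quantity as the dependent variable, and an 80:20 split (assuming the cleaned DataFrame df; the random split used in the book is unknown, so the exact numbers will differ):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = df[["gross income"]]   # independent variable
y = df["Quantity"]         # dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)   # e.g., (800, 1) (200, 1)

model = LinearRegression().fit(X_train, y_train)
print("Model slope:", model.coef_)
print("Model intercept:", model.intercept_)

y_pred = model.predict(X_test)
print("Mean absolute error:", metrics.mean_absolute_error(y_test, y_pred))
print("Root mean square error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("R-square:", metrics.r2_score(y_test, y_pred))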
TABLE 3.4  Dataset
The first five rows of the dataset used for modeling (the same records shown in Table 3.1), with the Date column now of type datetime64[ns].
Training data: (800, 1)
Testing data: (200, 1)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Model slope: [0.17693728]
Model intercept: 2.7736610470448335

       Actual   Predicted
993    10       4.320978
859    9        6.566046
298    4        4.897262
553    6        3.952594
672    3        4.722006
971    7        5.059425
27     2        4.324870
231    3        3.054726
306    7        5.265026
706    4        4.295322

Actual quantity: 1109
Predicted quantity: 1093.84480402307
Mean absolute error: 1.6160936307749183
Mean square error: 4.039677805235893
Root mean square error: 2.009894973682927
Coefficient of determination (R-square of the model): 0.5090313466878675
5.828534421434526
Now, let us try out a multiple linear regression model. In multiple linear regression, you can choose more than one independent variable. The equation of multiple linear regression is y = m1x1 + m2x2 + … + mnxn + c.
Index(['Unit price', 'Quantity', 'gross income', 'Branch_A', 'Branch_B', 'Branch_C', 'Product line_Electronic accessories', 'Product line_Fashion accessories', 'Product line_Food and beverages', 'Product line_Health and beauty', 'Product line_Home and lifestyle', 'Product line_Sports and travel'], dtype='object')
     Actual   Predicted
0    10       6.644509
1    9        7.926120
2    4        4.119328
3    6        5.738307
4    3        2.860868
5    7        6.247536
6    2        0.964683
7    3        5.258493
8    7        6.473455
9    4        4.675414
Actual quantity: 1109
Predicted quantity: 1117.5342425481858
Mean absolute error: 0.8892876819154968
Mean squared error: 1.5345063051340884
Root mean squared error: 1.2387519142806958
Coefficient of determination (R-square of the model): 0.8135013408361003
3.5648907017768536

Let us try out random forest:
MAE: -0.164 (0.025)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=1, verbose=0, warm_start=False)
     Actual   Predicted
0    10       9.57
1    9        8.97
2    4        4.09
3    6        6.14
4    3        3.34
5    7        6.83
6    2        2.00
7    3        2.40
8    7        6.97
9    4        3.91

Actual quantity: 1109
Predicted quantity: 1101.6799999999998
Mean absolute error: 0.15119999999999997
Mean squared error: 0.05467999999999999
Root mean squared error: 0.23383755044902432

FIGURE 3.19
Actual versus predicted.
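A hedged sketch of how the multiple linear regression and random forest runs above might be reproduced, assuming the branch and product line columns are one-hot encoded as in the column index shown earlier (the exact feature set and split used in the book may differ):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# One-hot encode the categorical predictors alongside the numeric ones.
features = pd.get_dummies(
    df[["Unit price", "gross income", "Branch", "Product line"]],
    columns=["Branch", "Product line"])
target = df["Quantity"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=1)

models = [("Multiple linear regression", LinearRegression()),
          ("Random forest", RandomForestRegressor(n_estimators=100, random_state=1))]
for name, model in models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name,
          "| MAE:", metrics.mean_absolute_error(y_test, pred),
          "| RMSE:", np.sqrt(metrics.mean_squared_error(y_test, pred)))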
Model Comparison
We have created three different models:
1. A linear regression model with mean absolute error 1.62 and root mean square error (RMSE) 2.01.
2. A multiple linear regression model with mean absolute error 0.89 and root mean squared error 1.24.
3. A random forest regression model with mean absolute error 0.15 and root mean squared error 0.23.
The lower the error values, the better the model. Hence, the random forest regression model is the best of the three. We can work further and adjust values to make it better. Also, the better the quantity and quality of the training dataset, the better the model will be. This is how you can work with data in Python.
Further data visualization is explained in Section 3.4.
3.3 AUGMENTED ANALYTICS
So, the process discussed above is the manual process of understanding and analyzing the data and discovering insights and knowledge. This is great, is it not? One who is new to the field sometimes sees all these processes as magic, where data scientists are no less than magicians: provided with data, they do some mumbo-jumbo, hocus-pocus, abracadabra kind of things, and whoop! In just the blink of an eye, results appear. But believe me, it is not as simple as it looks. It is time-consuming and resource-intensive, and moreover requires analytical thinking and patience too. Studies show that data scientists spend almost 80% of their time making raw data digestible for model building, that is, cleaning, maintaining, and preprocessing the data so as to give it a proper structure, and only about 20% of the time actually finding insights from it, which is basically the main job of getting actionable knowledge to help decision-making. Starting from traditional systems of storing data to this era of artificial intelligence, different techniques have emerged and, together with human intelligence, have transformed our lives and work environment. One such technique is augmented analytics. You all must be wondering, what is augmented analytics now? Have patience; we are paving the way toward it.
This new concept was first introduced by Gartner in its 2017 report; Gartner, one of the top market research companies in the world, describes augmented analytics as "an approach that automates insights using natural-language generation (NLG) and machine learning techniques, and this marks the next wave of disruption in the analytics market." To reframe it, augmented analytics enhances data analytics processes by leveraging machine learning, artificial intelligence, and natural language processing algorithms. In simpler words, it makes the work of a data scientist easier. As mentioned above, most of the time in a data science life cycle goes into cleaning, preprocessing, and manipulating the raw data so as to make sense of it, but with augmented analytics this step can be accelerated. Augmented analytics makes it easier to clean and preprocess the data, thus allowing data scientists to focus on the main task of finding the significant figures. You may now be thinking that machine learning was already embedded in the analytics process, which is what we use to create models, so how is augmented analytics any different? Believe me, when I first heard about it, I had the same question. According to Gartner, augmented analytics mainly includes three different areas of work:
1. Augmented data preparation—This step is a helping hand to all those data scientists who find data preparation the least fascinating part. It automates the data preparation process by including artificial intelligence and machine learning algorithms. Data collection, organization, manipulation, transformation, reduction, and moreover data cleaning tasks become handy.
2. Augmented data discovery—This step is very helpful to business people or citizen data scientists, while also providing expert data scientists with more time to work on important aspects of model building. Citizen data scientists are people who do not have expertise in the field of data science but are tech-savvy enough to perform moderate data analysis. With this step of augmented analytics, they can dive deep into machine learning models without writing a single line of code. It enables them to use machine learning to automatically find and visualize interesting patterns and insights, which otherwise is a technical job, with an accurate and unbiased machine learning model.
3. Augmented data science and machine learning—This step finally automates the key aspects of the data science process, reducing the dependency on experts to generate and manage the analytical model. It automates the machine learning process by automatically choosing the right algorithm, performing feature selection, validating and deploying models, and further monitoring the life cycle of the deployed model.
Microsoft powerBI, a data visualization tool, also uses augmented analytics, which makes the life of the data scientist a lot easier. One such model in powerBI is "Q&A," where powerBI generates specific questions, or you can speak up and ask any question you want answered from the data. PowerBI automatically generates the graph required to answer the question asked. It is an amazing automated model, as can be seen in the figure below, where different questions are being answered through graphs.
3.4 BUSINESS INTELLIGENCE AND DATA VISUALIZATION TOOLS
Business intelligence is more of an application-based usage of data science to make business-oriented decisions, where the data analysis provides empirical evidence of the current scenario and, accordingly, an apt decision may be made or suggested by the concerned management with regard to the business operations. In simple terms, it is the use of data to make intelligent decisions for the better functioning of an organization. So, here the emphasis is on the extent of increase or decrease in variables, so as to highlight how good or bad the current scenario is and which components can be modified, introduced, or removed for the organization to be most productive. This is successfully conveyed using data visualization. Data visualization is the representation of data to provide a concise, correct, and comprehensive view, in such a way that the picture provided is worth the entire dataset. This makes data visualization interesting, like an art, by opening room for innovation. However, it also needs to maintain the integrity of the data by not distorting or misrepresenting it, which would otherwise lead to Bad Graphics—improper graphical presentation of data that may provide misleading information or a misrepresentation of the raw data.
FIGURE 3.20
PowerBI generating specific questions.
Some of the common data visualization types/techniques are the following:
• Bar charts—Usually used to compare the quantity of a single variable for categorical data. Advanced methods emphasize the positive and negative values above and below the axis.
• Pie charts—Used when the proportion or percentage distribution of a whole is to be represented, signifying the differences in the shares contributed by individual segments.
• Line graph—Used often for tracking changes in values over a time span.
• Histogram—Looks similar to a bar chart, but here the distribution of the dataset is provided for numerical data by grouping values into ranges.
• Box plot—Gives a clearer understanding of the spread of the data by representing the quartile distribution of the data points in a dataset using the minimum value, maximum value, and quartile ranges.
• Scatterplot—Used when two different variables together determine the position of the data points, assisting in displaying the relationship between the two variables.
• Radar or spider chart—In the case of multiple variables or attributes, the distribution of data points is depicted in a spider-web format using multiple axes, illustrating the inclinations against the variables graphically.
Other well-known charts include density maps, bubble charts, heat maps, and so on. Power BI, Google Data Studio, Tableau, and QlikView are some of the extensively used tools for data visualization. We are going to explore two of these tools, powerBI and Google Data Studio, to further understand how they can help in effective representation of the data. Let us first start with powerBI.
3.4.1 MICROSOFT POWERBI
PowerBI, developed by Microsoft, is an effective data visualization tool that makes the work very convenient and generates user-friendly, user-oriented reports. The PowerBI interface is so user-friendly that anyone
can use it; it is basically modeled on the foundation of Microsoft Excel but is more powerful. It is a very useful tool for business intelligence analysts whose primary work is analyzing data and converting it into information that can be presented in an efficient and understandable manner. With Power BI, it is possible to connect to a variety of datasets and clean them up for better understanding. The visuals created using powerBI can then be shared with other powerBI users and can also be downloaded in different formats such as pdf and powerBI, that is, a powerBI data report. Accessing powerBI online is extremely easy, or it can be downloaded. There are different signup options available; you can opt for a premium account, or if you just want to try how it works, you can sign up for free, explore the services provided by powerBI, and create reports by visualizing your data. After successfully signing up, you first need to upload your data; you can either connect to an available data source or upload a file. Here, we are going to analyze the same supermarket data that we used earlier. The data we have uploaded are in csv format, and this is how they are shown in powerBI. To create reports, click on the Creating Reports button (Figure 3.1) and explore the different graphs available to visualize your dataset (Figure 3.2). Different charts, as can be seen in Figure 3.3, can be created based on the information you need to generate, which is clearly visible in Figure 3.4:
1. The pie chart shows the gross income generated in each branch, and it can be seen that Branch C has the highest gross income.
2. The stacked bar plot shows the count of gender versus Member and Normal customers of the supermarket. Among "Member" customers, females are more numerous, whereas among "Normal" customers, males are more numerous.
3. The bar chart shows the count of Payment versus Branch. The e-wallet is used by the majority of customers in Branch A, cash is frequently used in Branch C, and in Branch B every payment method is used about equally.
This is how different graphs can be generated and interpreted. You can see in Figure 3.5 that different graphs are used to show the same thing, the unit price of each product in the supermarket. So, this is how you
FIGURE 3.21
Data in PowerBI.
FIGURE 3.22
Home page of powerBI.
FIGURE 3.23
Report page in powerBI.
FIGURE 3.24
Graphs generated in powerBI.
FIGURE 3.25
Graphs generated in powerBI.
can generate different graphs and visualize your data as per the data and your requirements. This is the Microsoft powerBI data visualization tool, which can give us insights not only into the past but also into the present, and helps us understand future trends too.
3.4.2 GOOGLE DATA STUDIO
Google Data Studio is another data visualization tool that is freely available and easy to sign up for with a Google account. It lets you create interactive dashboards and customized reports. Data Studio is built on the foundation of Google Analytics. It allows you to add as many pages, filters, and charts as you want. Different calculated metrics and fields can be visualized or added to the data as well. It is loved most for its dynamic filter feature, where many different filters can be created in the report so that users can easily slice and dice through it. The homepage of Data Studio looks as shown in Figure 3.6. There are different tutorial dashboards available to see and learn how it works. To create new reports, just click on the "Create" icon that can be seen on the left; this is how data are added. There are more than 200 built-in data sources that you can navigate through (Figure 3.7) and choose from. Also, there are three different options available that you can explore: reports, data sources, and explorer. Further reports can be created by adding different charts and graphs as per the data available and the insights that need to be generated. The generated report can be shared with other users or clients and can also be downloaded in pdf format (Figure 3.9).
3.5 DATA SCIENCE AND BIG DATA
Once the data become too big to be handled by a traditionally used framework, language, or tool, a bigger framework is required for the analysis to be conducted. This is where the concept of distributed systems comes into the picture, where expansion of storage and fast processing is achieved
FIGURE 3.26 Homepage of Google Data Studio.
FIGURE 3.27 Options to connect to data.
FIGURE 3.28 Graphs generated in Google Data Studio.
by adding multiple systems, which is much cheaper than massively upgrading a single system, which has its own constraints. Many efforts have been made to explain this concept of big data using "Vs" such as volume, velocity, variety, veracity, and value, with many further additions, to point out that an enormous volume of data is being generated exponentially and needs to be processed even when the data formats are wide ranging. Two established frameworks that support such large volumes of data on distributed file systems are the Google File System and Apache Hadoop.
FIGURE 3.29 Report formed in Google Data Studio.
The Google File System is a distributed file system that was developed to meet such demands: it stores large files that mostly need to be read (rather than written) while tolerating failures in individual nodes through data replication. Doug Cutting and Mike Cafarella, inspired by Google's work on the "Google File System" and "MapReduce: Simplified Data Processing on Large Clusters," worked on the Nutch project, which was later developed into Apache Hadoop, an open-source framework comprising a storage component, the Hadoop Distributed File System (HDFS), and a processing component based on the MapReduce programming model. Many big data tools, technologies, and software packages, such as Pig, Hive, HBase, and Spark, have been developed on top of the Hadoop framework, extending its benefits and adding to its popularity.
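To make the MapReduce programming model mentioned above concrete, the following is a minimal, self-contained word-count sketch in Python. It only imitates the map, shuffle, and reduce phases in memory; a real Hadoop job would distribute each phase across the cluster, and the sample documents here are invented for illustration.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Reduce: aggregate all counts emitted for the same key.
    return word, sum(counts)

documents = ["big data needs distributed processing",
             "hadoop brings distributed storage and distributed processing"]

# Shuffle: group intermediate pairs by key, as the framework would between phases.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)   # e.g., {'distributed': 3, 'processing': 2, ...}
```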
3.5.1 HADOOP ECOSYSTEM

The Hadoop ecosystem refers to the environment that contains the basic components of the Hadoop framework together with additional facilities that support and enhance its working, with HDFS as a key advantage. Many of the storage problems associated with big data computing are solved by HDFS, while MapReduce performs batch processing and is a good choice for everyday data preprocessing. A brief outline of the ecosystem is shown below:

Ambari: management and monitoring
Oozie: workflow and scheduling
Zookeeper: coordination
Sqoop: data integration
HBase: NoSQL database
Mahout: machine learning
Pig: scripting
Hive: query
MapReduce: distributed processing
HDFS: distributed storage
• Apache Ambari is an open-source administration tool deployed over a Hadoop cluster; it supports the HDFS and MapReduce programs and manages and monitors running applications by keeping track of them. It follows a master/slave architecture wherein the master node keeps track of the state of the infrastructure and instructs the slave nodes to report back the state of every action they undertake.
• Apache Cassandra is an open-source column-oriented database, a distributed storage system for managing large amounts of data with high availability.
• HBase, developed as a part of Apache Hadoop, provides BigTable-like capabilities on top of HDFS. It blends real-time query capabilities with high processing speed, making it a useful tool for deriving analytic reports across large scales of data.
• Hive is a data warehouse tool used mainly for analyzing and querying data with SQL-like queries.
• Pig is a high-level framework that allows us to analyze data in conjunction with either Apache Spark or MapReduce, adding another level of abstraction to the data-processing process.
• Sqoop is a tool for transferring data from relational database systems such as Oracle and MySQL into a Hadoop environment.
• Oozie is a Hadoop workflow management and scheduling solution that executes workflow routes for successful task completion.
• Zookeeper is an open-source centralized service that allows distributed Hadoop applications to communicate with one another. It keeps track of the cluster's configuration, naming, and synchronization.
• Mahout is a machine-learning and data-mining package that allows mathematicians, statisticians, and data scientists to implement their own algorithms quickly. Its algorithms fall into four primary groups: collaborative filtering, classification, clustering, and parallel frequent pattern mining. The Mahout library belongs to the subset of libraries that can be run in a distributed environment using MapReduce.
Other notable big data tools are Spark and Storm:
• Spark, originally developed in 2009, open-sourced in 2010, and moved to the Apache Software Foundation in 2013, is considered an improvement of the MapReduce cluster computing paradigm with the assistance of in-memory caching. It provides an interface for implicitly data-parallel programming of entire clusters, reading input at the cluster level and conducting all necessary analytical operations before outputting the results at the same level. Spark works on all the data at the same time, unlike MapReduce, which works in stages; working in memory reduces latency between steps and speeds up iterative algorithms. Thus, Spark is up to 10 times faster for batch processing and up to a hundred times faster for in-memory analysis. It is also accessible as a Python library called PySpark, which can handle larger datasets than libraries like pandas can.
• Storm is a real-time distributed computational system used for manipulating and handling enormous volumes of high-speed data, with each node in a cluster capable of processing over a million records per second. Storm implements a fault-tolerant method
for pipelining multiple computations over data as it flows into a system; it processes data in parallel as it streams. Storm ensures one-at-a-time processing to avoid the inherent latency overhead imposed by micro-batching.

3.6 DATA SCIENCE AND INTERNET OF THINGS

The term Internet of Things (IoT) was coined in 1999 by Kevin Ashton for sensors or devices connected to a system and sending information for further processing. A leader in data generation, such devices are estimated by IDC to number up to 41.6 billion and to generate 79.4 ZB (1 ZB = 1,000,000,000 TB) by 2025. As IoT is a major source of data from smart devices, data science techniques are required to mine these data, feed them into machine learning algorithms, or even employ artificial intelligence to make the best use of this outpouring of data. Take, for example, humans in their daily routine: they use and adjust these sensors and devices to their personal requirements for waking up, getting ready, navigating to destinations, choosing what to eat, meeting and conversing with people they know, and following their interests and hobbies before they sleep at night. All these devices are connected via the Internet, and gathering data and trends from them provides personalized information about a person's preferences, which is helpful and insightful to service providers. However, there is also a need for caution so as not to violate an individual's dignity by invading his or her privacy, which is a long-debated matter. Similarly, such insights can also be gathered and used well in an organization, both internally for interdepartmental dependencies and externally toward its clients. This realistic data being generated, already customized to cater to people's personal preferences, cannot be ignored and can be properly utilized to serve customers' requirements for better service. The potential benefits of using data science for IoT have largely been identified in descriptive and predictive analytics that understand customers' preferences for building smart projects such as smart cities, smart roads, and smart hospital beds. These have been beneficial for the retail sector, manufacturing and maintenance, infrastructure, automobiles, and even professional sportspersons.
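As a small illustration of how such device data can be mined, the sketch below aggregates a few invented temperature readings from two hypothetical sensors with pandas; a real IoT pipeline would read from a message broker or a time-series store rather than an in-memory list.

```python
import pandas as pd

# Invented readings from two hypothetical sensors (id, timestamp, temperature in Celsius).
readings = pd.DataFrame({
    "sensor":      ["kitchen", "kitchen", "bedroom", "bedroom", "kitchen", "bedroom"],
    "timestamp":   pd.to_datetime(["2024-01-01 08:00", "2024-01-01 09:00",
                                   "2024-01-01 08:00", "2024-01-01 09:00",
                                   "2024-01-01 10:00", "2024-01-01 10:00"]),
    "temperature": [21.5, 22.1, 19.8, 20.2, 23.0, 20.9],
})

# Descriptive analytics: average and peak temperature per sensor.
summary = readings.groupby("sensor")["temperature"].agg(["mean", "max"])
print(summary)
```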
KEYWORDS

• data science
• machine learning
• data mining
• artificial intelligence
• algorithms
• computer engineering
• information
• inferences
• patterns
REFERENCES

1. Hastie, T.; Tibshirani, R.; Wainwright, M. Statistical Learning with Sparsity: The Lasso and Generalizations, 2015.
2. Hamilton, B. A. The Field Guide to Data Science; Booz Allen, 2013.
3. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning, 2nd ed.; Springer.
4. Zaki, M. J.; Meira, W., Jr. Data Mining and Analysis, 2nd ed.; Cambridge University Press, 2020.
5. Fawcett, T.; Provost, F. Data Science for Business; O'Reilly Media Inc., 2013.
6. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press, 2016.
7. Weihs, C.; Ickstadt, K. Data Science: The Impact of Statistics; International Journal of Data Science and Analytics, 2018.
8. Dumontier, M.; Kuhn, T. Data Science—Methods, Infrastructure, and Applications; Journal of Data Science, 2017.
9. Doan, A.; Konda, P.; Paul Suganthan, G. C.; Ardalan, A.; Ballard, J. R.; Das, S.; Govind, Y.; Li, H.; Martinkus, P.; Mudgal, S.; Paulson, E.; Zhang, H. Toward a System Building Agenda for Data Integration (and Data Science), 2017.
10. Brodie, M. L. On Developing Data Science; Springer: Cham, 2019.
11. Brodie, M. L. What Is Data Science; Springer: Cham, 2019.
12. Bauckhage, C. NumPy/SciPy Recipes for Data Science: Information Theoretic Vector Quantization, 2020.
13. Filonenko, E.; Seeram, E. Big Data: The Next Era of Informatics and Data Science in Medical Imaging: A Literature Review; J Clin Exp Radiol, 2018.
14. Provost, F.; Fawcett, T. Data Science and Its Relationship to Big Data and Data-Driven Decision Making; Big Data, Liebertpub.com, 2013.
15. Agarwal, R.; Dhar, V. Big Data, Data Science, and Analytics: The Opportunity and Challenge for IS Research; pubsonline.informs.org, 2014.
16. George, G.; Osinga, E. C.; Lavie, D.; Scott, B. A. Big Data and Data Science Methods for Management Research; journals.aom.org, 2016.
17. Dobre, C.; Xhafa, F. Intelligent Services for Big Data Science; Future Generation Computer Systems, 2014.
18. Song, I. Y.; Zhu, Y. Big Data and Data Science: What Should We Teach? Expert Systems; Wiley Online Library, 2016.
19. Van der Aalst, W.; Damiani, E. Processes Meet Big Data: Connecting Data Science with Process Science; IEEE Transactions on Services Computing, ieeexplore.ieee.org, 2015.
20. Sanchez-Pinto, L. N.; Luo, Y.; Churpek, M. M. Big Data and Data Science in Critical Care; Chest, 2018.
21. Chen, C. H.; Gorkhali, A.; Lu, Y.; Ma, Y.; Li, L. Big Data Analytics and Big Data Science: A Survey; Journal of Management Analytics, 2015.
22. Cielen, D.; Meysman, A. D. B.; Ali, M. Introducing Data Science: Big Data, Machine Learning, and More, Using Python Tools; thuvienso.vanlanguni.edu.vn, 2016.
23. Daniel, B. K. Big Data and Data Science: A Critical Review of Issues for Educational Research; British Journal of Educational Technology, 2019.
24. Jagadish, H. V. Big Data and Science: Myths and Reality; Big Data Research, Elsevier, 2015.
25. Fairfield, J.; Shtein, H. Big Data, Big Problems: Emerging Issues in the Ethics of Data Science and Journalism; Journal of Mass Media Ethics, Taylor & Francis, 2014.
26. Van der Aalst, W. Data Science in Action; Process Mining, Springer, 2016.
27. Baesens, B. Analytics in a Big Data World: The Essential Guide to Data Science and Its Applications; books.google.com, 2014.
28. Chintagunta, P.; Hanssens, D. M.; Hauser, J. R. Marketing Science and Big Data. Market. Sci. 2016, 35 (3), 341–342. https://doi.org/10.1287/mksc.2016.0996
29. Jifa, G.; Lingling, Z. Data, DIKW, Big Data and Data Science; Procedia Computer Science, Elsevier, 2014.
30. Klašnja-Milićević, A.; Ivanović, M.; Budimac, Z. Data Science in Education: Big Data and Learning Analytics; Computer Applications, Wiley Online Library, 2017.
31. Dhar, V.; Jarke, M.; Laartz, J. Big Data; Springer, 2014.
CHAPTER 4
Quantum Computing

ANUPAMA CHADHA, SACHIN SHARMA, RAHUL CHAUDHARY, REUBEN VERNEKAR, and ARUN RANA
Faculty of Computer Applications, Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
ABSTRACT

In this chapter, we discuss how quantum computers differ from classical computers and their advantages over classical computers. We also discuss some hypotheses regarding how a quantum computer works.

4.1 INTRODUCTION

Quantum computing is the art of using all the possibilities that the laws of quantum mechanics give us to tackle computational problems. Customary or "traditional" PCs use only a limited subset of these possibilities. There are many brilliant things that, if we only had a large enough quantum PC, we would be able to do. Probably the most interesting of these is that, in materials chemistry, physics, and information science, we would be able to conduct simulations of quantum mechanical processes that classical PCs could never come close to. People use PCs for drawing, watching films, music, social networking, and whatnot. Additionally, engineers write programs for easy-to-use applications, sites, and software. To compare quantum and classical computing, we will walk through the working of a PC from the ground up to see how both function.
4.2 DIFFERENCE BETWEEN CLASSICAL AND QUANTUM COMPUTATION

To understand the difference between classical and quantum computers, we first need to understand the difference between "bits" and "qubits."

Bits and Qubits
• Bits take only the binary values 0 and 1 in conventional computing technology and cannot take other values, whereas qubits can express 0, 1, and a superposition of both values. That means a qubit can be used in quantum computing to describe a combination of 0 and 1, where it is very important to account for all the values in the system.
• Qubits store combinations of binary digits, whereas bits store plain binary data. This allows qubits to run three times as quickly as a traditional PC; the amount of data collected and exchanged is enormous, and this allows data to be transmitted at a faster pace.
• With bits, a problem is solved on the machine as if by hit and trial. With quantum computing, the problem is tackled with parallel processing, supporting all four values at a time and solving it at a faster speed.
• The processing power increases at an exponential rate as more qubits are added to a quantum computer, while when bits are added to a standard computer, the power does not increase in this way and operations are still performed one at a time. This occurs in quantum computation due to superposition.1

4.3 QUANTUM COMPUTER AND CLASSICAL COMPUTER

The idea of digital logic and bits is used in the classical computers we use on a daily basis, which means the things we can do with a classical computer are very limited compared with quantum computers. The distinction is so significant that operations taking billions of years on classical computers will take days or hours on quantum devices with enough qubits. But as fascinating as it sounds, constructing quantum computers is extremely difficult because they require extreme isolation and the proper temperature for quantum artifacts. This is not the case with conventional
computers, which someone with hardware expertise can design and make work for all the conditions a user requires. Therefore, the number of quantum computers is very limited, and their use has only recently increased. The storage space needed for bits by conventional computers is enormous and takes up a lot of room. This can be avoided with qubits because a device with a small area can store enormous amounts of information. Today, as systems and devices become smaller, qubits are helping to reimagine the world of technology with very lightweight devices that are easy to carry around.2

4.4 APPLICATIONS OF QUANTUM COMPUTATION

4.4.1 QUANTUM SIMULATION

When considering quantum computers, it is important not to think of them simply as really fast computers. Although these computing marvels can exponentially outperform classical computers, for now this holds only for very specific types of operations, and one of those operations is quantum simulation. A huge obstacle for scientists studying quantum mechanics is the inability of current computers to simulate what is happening in the quantum universe; unlike simulating events and conditions at the macro level, quantum mechanics on a classical computer just does not compute. This is something very exciting about quantum computers: since they already operate in the quantum world, they fundamentally have the ability to do quantum simulation. Researchers are extremely excited about the prospect of being able to accurately simulate the circumstances of quantum mechanics, and they think this will lead to incredible breakthroughs in understanding what is happening in the ever so mysterious quantum universe.

4.4.2 CYBER SECURITY

In today's digital Internet landscape, our entire lives are online, as is the security of our information, whether it is a credit card used on Amazon or personal photos uploaded to the cloud. Everything is secured using a common technique across the board: encryption. Public key cryptography has been proven time and time again to be a successful way to digitally
secure our data, and with the way these techniques work, it would take a modern classical computer centuries, or in some cases longer than the universe has existed, to crack a single one of these cryptographic keys. In the modern world of quantum computers, that is just not the case. In fact, a quantum algorithm called Shor's algorithm could render pretty much all modern-day cryptography useless; but do not panic, there is plenty of time for the Internet to prepare before a quantum computer that is large enough is not only created but available to the public.3

4.4.3 SEARCH

Shor's algorithm and quantum cryptography both originate from a technique employed by quantum computers called quantum search. It is not exactly what you think of as regular search on a computer, which modern classical computers are already very efficient at. Quantum search is more like searching for the correct solution out of billions and billions of possible answers, for example, passwords. If you created a very secure password with upper- and lowercase letters, a few numbers, and some special characters like a dollar sign, it would take the most advanced classical supercomputer about 174 years to guess that password, and a quantum computer with the same processing power about 7 s; granted, the cutting edge is nowhere near creating a quantum computer that big, but it is still worth discussing for comparison's sake. In a related direction, stir together one part machine learning and one part quantum computing and you get robots taking over the world: this new field of study takes advantage of the machine learning technology that already exists along with quantum computers' unique ability to model extremely complex scenarios. Classical computers manage the process while offloading computationally difficult problems to quantum systems; it is a team effort between quantum and classical computers. Experiments have already been carried out successfully in which such systems teach themselves to recognize cars within the images they are shown. Development is continuing, bigger and bigger quantum computers are being created, and when implemented, quantum computers will provide a serious fast lane for machine learning technology, and the robots will be knocking on our doors.
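The rough arithmetic behind that comparison can be sketched in a few lines of Python: an unstructured search over N possibilities needs on the order of N classical guesses but only about (pi/4)·sqrt(N) Grover iterations. The figures below are purely illustrative (an assumed 8-character password over an assumed 72-symbol alphabet) and ignore the actual speed of either machine.

```python
import math

alphabet = 26 + 26 + 10 + 10          # lower case, upper case, digits, a few special characters
length = 8                             # an assumed 8-character password
n = alphabet ** length                 # size of the search space

classical_guesses = n / 2              # expected number of brute-force guesses
grover_iterations = (math.pi / 4) * math.sqrt(n)   # queries needed by Grover's algorithm

print(f"search space:       {n:.2e}")
print(f"classical guesses:  {classical_guesses:.2e}")
print(f"Grover iterations:  {grover_iterations:.2e}")
```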
4.4.4 HEALTHCARE

In biology, the inability to simulate large, complex molecules is a major bottleneck for today's ordinary computers. Quantum computing changes that: by simulating different complex molecules, scientists can create all kinds of environments to analyze various drugs and their effects on our bodies. Researchers are also using machine learning algorithms on quantum computers to get a better understanding of certain diseases and thus of how those diseases might be prevented.3

4.4.5 FINANCE

Macroeconomics and global financial markets have always been heavily dependent on cutting-edge technology, and quantum computing has already been tapped for this difficult job. Modeling markets with mathematics is no easy feat, and economists and finance mathematicians have had their share of challenges in the past few decades. Computers and algorithms have totally taken over how markets fundamentally work, but these algorithms are lacking, and quantum computers are being used to refine the process, particularly when it comes to pricing stock options. Stock options are a derivative of stocks and require complex computations to price: an option's price correlates with the underlying stock's price, but not always in the same way, since there are hundreds, sometimes even thousands, of different options that can be bought and sold per stock, and their prices can fluctuate in varying ways in correlation with the underlying stock's price. Researchers are looking for quantum algorithms to improve this process, which would in turn create a more efficient market, and an efficient market, where prices are as accurate as humanly possible, stands to benefit the everyday investor.3

4.4.6 ASTRONOMY

If you remember all the way back to the first application, we talked about using computers to simulate quantum mechanics. A curious feature of the small world of the quanta is that it actually helps scientists understand the infinitely large world of the universe. Being able to model quantum mechanics accurately allows
scientists to seek explanations for most of the mysterious phenomena in the greater universe. Think of things like black holes and neutron stars; these mysterious objects are a big question mark to us because they do not necessarily behave according to classical physics. This tells us that they are governed by the laws of the quantum world, which becomes clearer and clearer to us as we are able to model it using quantum computers. Another side of this sword is that quantum computing makes it possible for researchers to model the Big Bang more accurately and thus helps us understand the history of the universe.

4.4.7 RANDOM NUMBER GENERATION

It may come as a surprise, but it is actually incredibly difficult for a computer to generate truly random numbers. Computers use very complex algorithms to generate what are known as pseudorandom numbers. They rely on physical properties to seed their randomization algorithms, but they are still considered pseudorandom because they rely on a seed number, and no matter how many fancy algorithms you run your seed number through, some kind of pattern is still bound to arise. Random number generation may sound like a trivial problem at first glance, but if this process is compromised by predictable numbers, it is much easier for an attacker to get in and steal your information. This is yet another area in which scientists are excited to use quantum computers, because quantum physics is fundamentally random. Truly random numbers can be generated on the fly; in quantum computing, the process takes advantage of the fact that certain events at the quantum level cannot, in principle, be predicted.4

4.5 MATHEMATICAL DESCRIPTION OF QUANTUM STATES AND QUANTUM OPERATIONS

The mathematical formulation of quantum mechanics is the mathematical framework that permits a rigorous description of quantum computing. This formalism uses functional analysis, in particular Hilbert spaces, a kind of linear space. This distinguishes it from the mathematical formalisms of physical
theories developed before the mid-1900s, through its use of abstract mathematical structures such as infinite-dimensional Hilbert spaces (chiefly L2 space) and operators on these spaces. In other words, values of physical observables such as energy and momentum are no longer considered values of functions on phase space but as eigenvalues, more precisely as spectral values of linear operators on Hilbert space. These formulations of quantum mechanics are still used today. At the core of the description are the notions of quantum state and quantum observable, which are fundamentally different from those used in earlier models of physical reality. While the mathematics permits the calculation of many quantities that can be measured experimentally, there is a definite theoretical limit to the values that can be measured simultaneously. Heisenberg first explained this limitation through a thought experiment, and it is expressed mathematically in the new formalism by the noncommutativity of the operators representing quantum observables.

Mathematical Structure
A physical system is generally described by three basic ingredients: states, observables, and dynamics or, more generally, a group of physical symmetries. A classical description can be given in a fairly direct way by a phase-space model:
• States are points in a symplectic phase space.
• Observables are real-valued functions on it.
• Time evolution is given by a one-parameter group of symplectic transformations of the phase space.
• Physical symmetries are realized by symplectic transformations.
A quantum description normally consists of a Hilbert space of states; observables are self-adjoint operators on the space of states; time evolution is given by a one-parameter group of unitary transformations of the Hilbert space; and physical symmetries are realized by unitary transformations.

Postulates of Quantum Mechanics
The following summary of the mathematical framework of quantum mechanics can largely be traced back to the Dirac–von Neumann axioms.
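Before the postulates are listed, the finite-dimensional version of this state/observable picture can be written down directly with NumPy: a qubit state is a unit vector in a two-dimensional Hilbert space, an observable is a Hermitian matrix whose eigenvalues are the possible measurement results, and expectation values come from ⟨ψ|A|ψ⟩. This short sketch is an editorial illustration, not part of the original chapter.

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)          # basis state |0>
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)    # Hadamard operator (unitary)
psi = H @ ket0                                   # superposition (|0> + |1>)/sqrt(2)

Z = np.array([[1, 0], [0, -1]], dtype=complex)   # observable (Pauli-Z), a Hermitian matrix
eigenvalues, _ = np.linalg.eigh(Z)               # possible measurement outcomes: -1 and +1

probabilities = np.abs(psi) ** 2                 # Born rule: |amplitude|^2 for each outcome
expectation = np.vdot(psi, Z @ psi).real         # <psi|Z|psi>

print(eigenvalues)     # [-1.  1.]
print(probabilities)   # [0.5 0.5]
print(expectation)     # 0.0
```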
All physical systems are associated with a (topologically) separable complex Hilbert space H with inner product 〈φ|ψ〉. Rays (i.e., subspaces of complex dimension one) in H are associated with the quantum states of the system. In other words, quantum states can be identified with equivalence classes of vectors of length one in H, where two vectors represent the same state if they differ only by a phase factor. Separability is a mathematically convenient hypothesis, with the physical interpretation that countably many observations are enough to determine the state. "The quantum mechanical state is a ray in projective Hilbert space, not a vector. Many textbooks fail to make this distinction, which is partly a result of the fact that the Schrödinger equation involves Hilbert-space 'vectors,' with the result that the imprecise use of 'state vector' rather than ray is hard to avoid."
1. Associated with any particle moving in a conservative field of force is a wave function that determines everything that can be known about the system.
2. With each physical observable q there is associated an operator Q which, when operating upon the wave function associated with a definite value of that observable, will yield that value times the wave function.
3. Any operator Q associated with a physically measurable property q will be Hermitian.
4. The set of eigenfunctions of the operator Q will form a set of linearly independent functions.
5. For a system described by a given wave function, the expectation value of any property q can be found by performing the expectation value integral with respect to that wave function.
6. The time evolution of the wave function is given by the time-dependent Schrödinger equation.

4.5.1 HYPOTHESIS 1

The properties of a quantum mechanical system are determined by a wave function Ψ(r, t) that depends on the spatial coordinates of the system and on time, r and t. For a single-particle system, r is the set of coordinates of that particle, r = (x1, y1, z1). For more than one particle, r is used to represent the complete set of coordinates r = (x1, y1, z1, x2, y2, z2, …, xn, yn, zn). Since the state of a system
is characterized by its properties, ψ specifies or identifies the state and is sometimes called the state function instead of the wave function.

4.5.2 HYPOTHESIS 2

The wave function is interpreted as a probability amplitude, with the absolute square of the wave function, ψ*(r, t)ψ(r, t), interpreted as the probability density at time t. A probability density times a volume is a probability, so for one particle ψ*(x1, y1, z1, t)ψ(x1, y1, z1, t) dx1 dy1 dz1 is the probability that the particle is in the volume dx dy dz located at x1, y1, z1 at time t. For a many-particle system, we write the volume element as dτ = dx1 dy1 dz1 … dxn dyn dzn, and ψ*(r, t)ψ(r, t) dτ is the probability that particle 1 is in the volume dx1 dy1 dz1 at x1 y1 z1, particle 2 is in the volume dx2 dy2 dz2 at x2 y2 z2, and so on. Because of this probabilistic interpretation, the wave function must be normalized:

∫ ψ*(r, t)ψ(r, t) dτ = 1

The integral sign here represents a multidimensional integral over all coordinates x1 … zn. For instance, an integral in three-dimensional space is an integral over dV, which can be expanded as dV = dx dy dz in Cartesian coordinates or dV = r dr dθ dz in cylindrical coordinates.

4.5.3 HYPOTHESIS 3

For each observable property of a system there is a quantum mechanical operator. The operator for the position of a particle in three dimensions is just the set of coordinates x, y, and z, which is written as a vector

r = (x, y, z) = x î + y ĵ + z k̂
The operator for a component of momentum is

P̂x = −iℏ ∂/∂x

and the operator for kinetic energy in one dimension is

T̂x = −(ℏ²/2m) ∂²/∂x²

and, in three dimensions,

P̂ = −iℏ∇  and
T̂ = −(ℏ²/2m)∇²
The Hamiltonian operator Ĥ is the operator for the total energy. As a rule, only the kinetic energy of the particles and the electrostatic or Coulomb potential energy due to their charges are considered, but in general all terms that contribute to the energy appear in the Hamiltonian. These additional terms account for things such as external electric and magnetic fields and magnetic interactions due to the magnetic moments of the particles and their motion.

4.5.4 HYPOTHESIS 4

The time-independent wave functions of a time-independent Hamiltonian are found by solving the time-independent Schrödinger equation:

Ĥψ(r) = Eψ(r)

These wave functions are called stationary-state functions because the properties of a system in such a state, that is, a system described by the function ψ(r), are independent of time.

4.5.5 HYPOTHESIS 5

The time evolution or time dependence of a state is found by solving the time-dependent Schrödinger equation:

Ĥ(r, t)ψ(r, t) = iℏ ∂ψ(r, t)/∂t

For the situation where Ĥ is independent of time, the time-dependent part of the wave function is e^(−iωt), where ω = E/ℏ or, equivalently, ν = E/h,
which shows that the energy–frequency relation used by Planck, Einstein, and Bohr results from the time-dependent Schrödinger equation. This oscillatory time dependence of the probability amplitude does not affect the probability density or the observable properties, because in the computation of these quantities the imaginary part cancels on multiplication by the complex conjugate.

4.5.6 HYPOTHESIS 6

If a system is described by an eigenfunction ψ of an operator Â, then the value measured for the observable property corresponding to Â will always be the eigenvalue a, which can be determined from the eigenvalue equation

Âψ = aψ

4.5.7 HYPOTHESIS 7

If a system is described by a wave function ψ that is not an eigenfunction of an operator Â, then a distribution of measured values will be obtained, and the average value of the observable property is given by the expectation value integral

⟨A⟩ = ∫ ψ*Âψ dτ / ∫ ψ*ψ dτ

where the integration is over all coordinates involved in the problem. The average value ⟨A⟩, also called the expectation value, is the average of many measurements. If the wave function is normalized, the normalization integral in the denominator equals 1.

4.5.8 LIST OF MATHEMATICAL TOOLS

Part of the folklore of the field concerns the mathematical physics textbook Methods of Mathematical Physics, put together by Richard Courant from David Hilbert's Göttingen University courses. The story is told by physicists that they had dismissed the material as not relevant to current research areas until the advent of Schrödinger's equation, when it was realized that the mathematics of the new quantum mechanics was already laid out in it. It is said that
Heisenberg was advised by Hilbert about the matrix method: Hilbert observed that his own experience with infinite-dimensional matrices had come from differential equations, advice which Heisenberg ignored, missing the opportunity to unify the theory as Dirac and Weyl did a few years later. Whatever the basis of the stories, the mathematics of the theory was conventional, whereas the physics was radically new. The basic tools include the following:
• Linear algebra: complex numbers, eigenvectors, and eigenvalues
• Functional analysis: Hilbert spaces, linear operators, and spectral theory
• Differential equations: partial differential equations, separation of variables, ordinary differential equations, Sturm–Liouville theory, and eigenfunctions
• Harmonic analysis: Fourier transforms

4.6 ENGINEERING CHALLENGES FACED IN DEVELOPING QUANTUM COMPUTERS

4.6.1 HISTORY

If in the 1940s or 1950s you had proposed cramming a room-sized mainframe into a shoebox, people would have laughed at you; if you had promised a machine a hundred times more capable, they would have laughed even harder, but that is the scale of the quantum challenge ahead of everyone. Progress on the hardware front is expected to continue, but at a slower pace than that seen with conventional PCs from the 1940s to the 1990s. One might expect that we will progress at a faster rate, perhaps completing in 10–20 years what previously took 20–50 years, but we also face a slew of additional challenges that are often overlooked in favor of creative speculation, rather than being mindful of the amount of money thrown at the problem.

4.6.2 HARDWARE AND ALGORITHMS

In fact, specific niche problems may achieve adequate solutions without a great portion of the necessary advances, but genuinely
general-purpose, widely accessible, practical quantum PCs will require most of those advances. We are still a long way from being able to replace all or most conventional PCs with quantum PCs, so we are looking at a future with a blend of quantum PCs and traditional processors. In general, quantum computers may remain coprocessors for a long time. The distinction between quantum and traditional computation is less like that between a propeller-driven plane and a jet plane, and more like that between air travel and space travel: almost no knowledge carries over in a straightforward way. Instead of deep similarity and familiarity, one is met with confusion and new ideas that are unlike the old ones on all fronts. For traditional computing, the notion of an assembly language and bytes, and constantly agonizing over an extremely strict limit on memory and how information is managed, is a relic of a bygone era; but for quantum processing, these and other related factors remain front and center and are seriously slowing, if not outright obstructing, any rapid progress. To put the hardware challenges in context, we need the following:
1. Sufficiently many more qubits: 64, 128, 192, 256, 512, and 1,024 are the relevant register sizes.
2. To go on to much larger numbers of qubits, sometimes even millions. Even though a 1,000 by 1,000 lattice (matrix) contains 1,000,000 qubits, it holds a relatively modest amount of information by current standards.
3. Significantly increased connectivity (entanglement) with far fewer, if any, limitations.
4. Far fewer errors.
5. Much longer coherence.
6. Significantly greater circuit depth.
7. True fault-tolerant error correction, which requires a large amount of redundancy per qubit.
8. A significantly lower total cost of the system, and operation at noncryogenic temperatures.
So, before it can handle anything more than a few high-value specialty applications, we will need a much more elaborate algorithmic framework and the hardware that supports it.
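One way to see why the qubit counts in that list matter is to note that an n-qubit state holds 2^n complex amplitudes, so even storing it on a classical machine grows exponentially. The quick calculation below is an editorial illustration (not from the original text) and assumes 16 bytes per double-precision complex amplitude.

```python
# Memory needed just to store the state vector of an n-qubit register classically.
for n in (16, 32, 48, 64):
    amplitudes = 2 ** n
    bytes_needed = amplitudes * 16          # one double-precision complex number per amplitude
    print(f"{n:2d} qubits: {amplitudes:.2e} amplitudes, "
          f"about {bytes_needed / 2**40:.1e} TiB")
```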
Applied physics:
• Long-lasting qubits
• High fidelity for a universal set of qubit operations
• Scaling up qubit processors
• All of this at a reasonable cost
Theoretical physics and related mathematics (Lie groups, knot theory, and so forth):
• New quantum models of computation and new kinds of quantum error correction
• Quantum information processing with new particles and physical systems
• (Algorithms for) simulating interesting physical systems on various quantum PCs
• Effective quantum control in reproducible quantum systems
Theoretical computer science, algorithms, and related mathematics (number theory, matrix decompositions, and so forth):
• Finding useful problems that admit a quantum speedup and discovering quantum algorithms that achieve it
• Proving that a quantum speedup is not possible for certain problems
• Assessing the complexity and power of various quantum computation models, proof systems, and games (some of which are proposed by theoretical physicists), and reducing some of them to others
Quantum computer engineering:
• More practical quantum circuits and control sequences
• Simulation and verification of quantum circuits, algorithms, control sequences, and so on
• Better quantum error correction, perhaps tailored to specific quantum algorithms
• All of the above for specific quantum computers built by applied physicists
Interdisciplinary challenges:
• Keeping the hype in check (quantum computers may never run your e-mail or web browser)
• Obtaining funding to buy new equipment
4.6.3 TACKLING THE CHALLENGES

4.6.3.1 MATTERS INTERNAL TO A QUANTUM COMPUTER

Consider a modern CPU: regardless of the large number of different companies and products, the processor layout in any setting follows a similar pattern. Most numerical computations take place in an ALU (arithmetic and logic unit). Branch prediction, together with instruction sequencing, controls program execution. Data are either stored locally in one of a few places or off-loaded to random-access memory. Various input/output interfaces are used for system administration and configuration.

4.6.3.2 IMPROVING THE CLASSICAL/QUANTUM INTERFACE

Quantum computers are currently envisioned as complex minicomputers: they are given a difficult computational task to complete and then left to solve it. Nonetheless, it might be more helpful to think of them as distinct processing devices that interact with classical PCs in complex ways. What is the best way to deal with the massive classical data streams created by quantum error correction (QEC), for example? How does one combine classical and quantum subroutines in a single program? Is there a way for a classical compiler to improve quantum circuits by lowering qubit or gate counts or increasing protection from operational faults? A few important QEC measures are probabilistic, allowing quantum programmers to use choice-based control. Can we devise a strategy to treat coherent quantum bits as valuable assets, despite the fact that they have a finite "lifetime" and limited precision?

4.6.3.3 NORMALIZE OR OPTIMIZE

The layered structure described earlier was, undoubtedly, an attempt to standardize the design of quantum computers. The argument for normalization is that it allows different quantum computer designs to be considered while the planning cycle is standardized, making the development and activity of design teams more productive. In any case, it is not clear whether this is the best approach. Much cutting-edge computing development from vendors
such as Intel relies on proprietary designs and strategies that are tailored to a specific purpose or to the assets of a specific organization.

4.7 PROGRAMMING QUANTUM COMPUTERS

Quantum computers are a new type of computer. They supposedly burn through massive numbers of parallel universes in order to run programs faster and make calculations that even Einstein could not figure out. They are the entangled wonder boxes that will leave your semiconductor boxes in the dust. At the very least, that is what popular science articles will tell you. They unquestionably succeed in making this new technology sound exciting. However, they can also make quantum computing appear to be a mysterious art form reserved for only the most astute of researchers, and by all means, I do not believe this is justified. Furthermore, with organizations like IBM and Google actually building quantum devices, now is a great time to begin experimenting with quantum programming. You do not need to do anything too strenuous at first. You can begin your journey into quantum programming in the same way that many of us begin our journey into traditional programming: by making games. Do not worry; you will not need your own quantum computer. Simple quantum programs can easily be simulated on a standard PC, and thanks to the IBM Q Experience, we can also get some time on a real quantum device. Here we describe some simple programs created with IBM's quantum SDK. Each will be its own Battleships variant, with basic quantum operations used to implement the simple game. There are two kinds of quantum programming languages: imperative languages and quantum pseudocodes.

Imperative languages
QCL (Quantum Computer Language), LanQ, and Q are the main representatives of the imperative languages. The quantum programming language QCL is one of the most widely used. Its syntax is modeled on the syntax of the C programming
language, and its classical data types are similar to data types in C. The fundamental quantum data type in QCL is called qureg (quantum register), which can be interpreted as an array of qubits (quantum bits). The most important feature of QCL is its support for user-defined operations and functions: new operations for manipulating quantum data can be defined, just as in modern classical programming languages.

Q language
The Q programming language was created as an extension of the C++ programming language. It provides classes for common quantum operations such as QHadamard, QFourier, QNot, and QSwap, all of which are derived from the base class Qop. New operators can be defined using the C++ class framework.

qGCL
P. Zuliani presented the Quantum Guarded Command Language (qGCL) in his PhD thesis. It is based on Edsger Dijkstra's Guarded Command Language and can be described as a quantum programming language.

4.7.1 FUNCTIONAL QUANTUM PROGRAMMING

Several quantum programming languages based on the functional programming paradigm have been proposed in recent years. The various strengths of classical functional programming languages allow computations to be expressed clearly.

4.7.2 QPL AND cQPL

Peter Selinger described QPL (Quantum Programming Language), and cQPL is its extension, which adds a few constructs for describing quantum communication. This extended language was named cQPL, the communication-capable quantum programming language.
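For readers who want to try one of these ideas directly, the snippet below uses Qiskit, the Python SDK behind the IBM Q Experience mentioned above; the choice of library is ours, not the chapter's. It prepares a single qubit in superposition and inspects the ideal measurement probabilities.

```python
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector

qc = QuantumCircuit(1)      # one qubit, starting in |0>
qc.h(0)                     # Hadamard gate: equal superposition of |0> and |1>

state = Statevector.from_instruction(qc)   # ideal (noise-free) simulation of the circuit
print(state.probabilities())               # approximately [0.5, 0.5]
```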
4.7.3 QUANTUM LAMBDA CALCULUS

Philip Maymin made the first attempt to describe a quantum lambda calculus in 1996. In 2003, André van Tonder developed an extension of the lambda calculus suitable for proving the correctness of quantum programs, and he also demonstrated an implementation in the Scheme programming language. Quantum lambda calculi are based on the lambda calculus introduced by Alonzo Church and Stephen Cole Kleene in the 1930s. They are an alternative model of quantum computation, but, as usual, they can be shown to have equivalent computational power.

KEYWORDS

• quantum bits
• qubits
• simulation
• hypotheses
• quantum computing
REFERENCES

1. https://www.educba.com/qubits-vs-bits/ (accessed on 06.08.2021).
2. https://www.quantumcomputers.guru/learn/difference-between-quantum-computer-and-classical-computer/ (accessed on 09.08.2021).
3. https://analyticsindiamag.com/top-applications-of-quantum-computing-everyone-should-know-about/ (accessed on 06.08.2021).
4. https://faun.pub/quantum-random-number-generator-qrng-c254335ef445 (accessed on 09.08.2021).
5. https://www.cmu.edu/tepper/news/stories/2021/february/quantum-computing-tayur.html (accessed on 09.08.2021).
6. https://chem.libretexts.org/Bookshelves/Physical_and_Theoretical_Chemistry_Textbook_Maps/Book%3A_Quantum_States_of_Atoms_and_Molecules_(Zielinksi_et_al)/03%3A_The_Schr%C3%B6dinger_Equation/3.09%3A_Postulates_of_Quantum_Mechanics (accessed on 09.08.2021).
7. https://www.ibm.com/quantum-computing/what-is-quantum-computing/ (accessed on 09.08.2021).
8. https://arxiv.org/abs/1402.5172 (accessed on 09.08.2021).
9. https://www.quantiki.org/wiki/quantum-programming-language (accessed on 09.08.2021).
10. https://www.nature.com/articles/d41586-021-00533-x (accessed on 09.08.2021).
11. https://quantumcomputing.stackexchange.com/questions/1474/what-programming-languages-are-available-for-quantum-computers (accessed on 09.08.2021).
CHAPTER 5
Image Processing

SWETTA KUKREJA, ROHIT SAHOO, DEEPAK JAIN, and VASUDHA ARORA
Amity University, Mumbai
Terna Engineering College, Navi Mumbai
Sharda School of Engineering & Technology, Sharda University, Greater Noida, India
ABSTRACT

Image-processing techniques have a significant role from image acquisition, preprocessing, and segmentation to image analysis. This chapter provides a systematic study of the importance of image processing and its applications in the domain of computer vision. It also gives clear conceptual information about image file formats and the types of images used for image processing. In the image-processing pipeline, the image is first acquired, and multiple preprocessing techniques are used, such as resizing, noise removal, filtering, and contrast enhancement. Then, several image segmentation techniques are applied to this image, such as edge detection, thresholding, region growing, and clustering, to obtain an enhanced, high-quality image on which further analysis can be performed. Image analysis is a key step in image processing, which uses segmentation, feature extraction, and classification for effective analysis and to extract important information from the image. Our study not only provides a substantial introduction to image-processing techniques but also presents the importance of image processing in the machine learning and deep learning domains. Machine learning has taken image processing to an advanced level, since the detection of objects within the image is now achievable because of
many machine learning algorithms. Deep learning uses multilayer neural networks like CNN or RCNN that can help in extracting important, specific information from input images.

5.1 INTRODUCTION

Digital image processing is a subcategory of digital signal processing. It has many advantages over analog image processing: a large number of algorithms can be applied to the input data to avoid problems such as noise and distortion. Digital image processing refers to the processing of digital images, which are composed of a number of elements referred to as objects, picture elements, or pixels, where "pixel" is the term most extensively used to denote the elements of a digital image.1 Three factors that affect the development of digital image processing are as follows:
• Computer development
• Mathematics development
• Applications in the environment, that is, a wider range of demand
Image: An image is defined as a two-dimensional (2D) function F(x, y), where x and y are spatial coordinates and F is the amplitude at the coordinates (x, y), which is called the intensity.
Digital image: A digital image is composed of a finite number of elements, each of which has a particular value at a particular location.

Human Visual System
The human visual system consists of the eye and the brain; together they form the picture we perceive. The eye captures the image and sends signals to the brain. The different parts of the human eye are as follows:
Primary lens: The cornea, through which the incoming light passes.
Iris: It controls the amount of light entering the eye, so the eye can function in different light conditions, such as bright or dark.
Retina: The light-sensitive screen at the back of the eye, on which the image is formed.
FIGURE 5.1 The human eye.
There are two types of light-sensitive cells in the retina, rods and cones; there are about 120 million rod cells and 6 million cone cells. Rod cells provide vision in low light, while cone cells provide vision in brighter, daytime conditions.

Classification of Digital Images
There are two types of images:
1. Raster or bitmap images
2. Vector images
1. Raster or bitmap image: A bitmap or raster image has a fixed number of pixels, so the quality of the image degrades while zooming. BMP, JPG, PNG, GIF, and TIFF are formats of bitmap or raster images.
2. Vector image: A vector image stores information such as length, color, and thickness. These images are used for line art, fonts, and so on.

Digital Image File Formats
There are several types of digital image file formats: JPEG, GIF, TIFF, PNG, and BMP. These file formats are for compression or
for reducing the image size or file of the image. If the image is
black and white, then it contains only two intensities; one is black
and another is white.
There are two types of compression in images:
1. Lossless compression: In this type of image compression, the image file is compressed without any loss of information. The original data can be extracted from the compressed image; it has a lower compression ratio.
2. Lossy compression: In this type of image compression, there is a loss of information during the compression process. It is used for low bit rates and for applications such as streaming media.3 Exact reconstruction of the original image from this type of compression is not possible.
• TIFF (Tag Image File Format): A compression technique based on both lossy and lossless compression.
• PNG (Portable Network Graphics): A lossless compression technique.
• JPEG (Joint Photographic Experts Group): A lossy compression technique. Since much of the original information of an image is eliminated by JPEG compression, reconstruction of the original image from the compressed image is not possible.
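A quick way to feel the difference between the lossless and lossy formats above is to save the same picture as PNG and as JPEG and compare the files. The sketch below uses the Pillow library and a synthetic image; the exact sizes will vary from run to run.

```python
import os
import numpy as np
from PIL import Image

# A synthetic 256 x 256 grayscale image (random noise compresses poorly but shows the idea).
pixels = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)
img = Image.fromarray(pixels)

img.save("sample.png")                 # PNG: lossless, the pixels can be recovered exactly
img.save("sample.jpg", quality=30)     # JPEG: lossy, detail is discarded to shrink the file

print("PNG bytes :", os.path.getsize("sample.png"))
print("JPEG bytes:", os.path.getsize("sample.jpg"))
```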
Types of Images
There are different types of images:
• Binary: It contains only two pixel values, 0 and 1, where "0" refers to black and "1" refers to white. This is also called a monochrome image.
• Black and white: It contains only black and white colors.
• 8-bit color format: It contains 256 different shades of color, in which 0 stands for black, 255 for white, and 127 for gray.
• 16-bit color format: A color image format in which the color distribution is not the same as in a grayscale image. The 16-bit format is divided into three components, that is, red, green, and blue (RGB).
5.1.1 DIGITAL IMAGE REPRESENTATION

Image represented as a matrix: an image is represented in rows and columns, for example:

           | f(0, 0)     f(0, 1)     f(0, 2)     …  f(0, N−1)   |
           | f(1, 0)     f(1, 1)     f(1, 2)     …  f(1, N−1)   |
f(x, y) =  | …           …           …              …           |
           | f(M−1, 0)   f(M−1, 1)   f(M−1, 2)   …  f(M−1, N−1) |
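The same idea can be tried directly in Python: a grayscale digital image is just a NumPy matrix of intensity values, and indexing it by row and column returns the pixel f(x, y). The array below is a made-up 3 x 4 example, not data from the chapter.

```python
import numpy as np

# A made-up 3 x 4 grayscale image: each entry is an intensity from 0 (black) to 255 (white).
f = np.array([[  0,  64, 128, 255],
              [ 32,  96, 160, 224],
              [ 16,  80, 144, 208]], dtype=np.uint8)

M, N = f.shape                 # M rows and N columns, as in the matrix above
print(f[0, 3])                 # pixel f(0, N-1) -> 255 (a white pixel)
print(f.min(), f.max())        # intensity range of the image
```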
Matrices and digital images are closely related, as a digital image is represented in the computer by a pixel matrix. The image matrix stores the pixel data of the digital image, that is, the pixels and the intensity of the image color.4 For a grayscale image, each pixel corresponds to one matrix element, and the numeric values range uniformly from 0 to 255, that is, from black pixels to white pixels. For color images, the RGB (red, green, blue) color model is represented by three grayscale image matrices, one for each of the color components, red, green, and blue. In image processing, an operation performed on the image matrix has the effect of performing the operation on the image in the computer; in any image-processing technique, such as edge detection or thresholding, the operation is applied over the pixel matrix. A good matrix representation captures the nature of the image elements and the mutual relations between them, where the nature of the elements can be positive or negative, real or complex, and the mutual relation is the relation between element magnitudes.

5.1.2 PHASES OF IMAGE PROCESSING
FIGURE 5.2 Phases of image processing.
Acquisition: Collecting the image in digital form; its main tasks involve scaling and conversion from RGB to gray or vice versa.
Image enhancement: Improving the quality of an image and extracting hidden details; it is one of the most interesting areas of image processing.
Image restoration: Restoring the image by removing noise, blurring, or scratches. There are two categories of image restoration, the deterministic and the stochastic methods: the deterministic method can be used when the degradation information is already known, whereas the stochastic method is used when the degradation information is not known.5
Color image processing: Color is important in image processing; it is a powerful descriptor that helps in object detection and extraction.
Wavelets and multiresolution processing: The foundation for representing images at various degrees of resolution.
Image compression: Compressing the file or data and removing redundancy.
Morphological processing: Processing based on the shape of the image; during the operation, the value of each pixel is compared with the corresponding neighborhood pixels in the input image.
Segmentation: Partitioning or dividing the image into a number of segments, for example, for the detection of lines or curves in the image. The image segmentation techniques used in image processing fall into two main categories, block-based segmentation methods and layer-based segmentation.6
Representation and description: Converting the raw data produced by segmentation into processed data.
Object detection and recognition: Detecting or localizing the object in the image and recognizing it.

5.2 COMPUTER VISION

Image processing is a subfield of computer vision, and it comes under artificial intelligence. Computer vision analyzes images and videos and generates the desired output in real time; it is basically about extracting information from the input image. Computer vision focuses on the labels and coordinates of objects.
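The acquisition, enhancement, and segmentation phases described above can be strung together in a few lines with OpenCV; the file names are placeholders and the parameter values are arbitrary choices for illustration.

```python
import cv2

img = cv2.imread("scene.jpg")                      # acquisition: load the image (path is illustrative)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)       # conversion from color (BGR in OpenCV) to gray
small = cv2.resize(gray, (320, 240))               # rescaling
denoised = cv2.GaussianBlur(small, (5, 5), 0)      # enhancement: noise removal by smoothing
edges = cv2.Canny(denoised, 100, 200)              # segmentation: edge detection
cv2.imwrite("edges.png", edges)                    # store the processed result
```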
Image Processing
FIGURE 5.3
Fields of computer vision.
Computer vision applies machine learning techniques to identify the patterns and objects present in images; in other words, it extracts information from an image by applying machine learning algorithms. Because of the growing advancement in technology, computer vision has been extended to other domains, such as robotics, healthcare, agriculture, geographical remote sensing, military, and satellite communication.7 Researchers can use the knowledge produced by computer vision, after extracting features from the images, to perform extensive analysis or to predict events. The subfields of computer vision are as follows:
1. Sensor technology
2. Image and signal processing
3. Computer graphics
4. Machine learning
5. Artificial intelligence
6. Image recognition
7. Image classification
8. Image feature extraction
5.2.1 IMAGE PROCESSING VERSUS COMPUTER VISION
Image processing focuses on smoothing, sharpening, and brightening the image through image-processing algorithms: filtering, resizing, noise removal, edge detection, and color processing. Computer vision, by contrast, extracts information from the visual data, for example detecting an object in an image, differentiating objects, and classifying them, such as counting how many cats pass a particular point in a street as recorded by a video camera.
TABLE 5.1 Comparison Between Image Processing and Computer Vision
Domain             Input                                            Output
Image processing   Image                                            Image
Signal processing  Signal                                           Signal; quantitative information (e.g., peak location)
Computer vision    Image/video                                      Image; quantitative/qualitative information (e.g., size, color, shape, classification)
Machine learning   Any feature signal (e.g., image, video, sound)   Signal; quantitative/qualitative information; image
➢ Image processing
• It operates on raw input images to improve or enhance their quality.
• It is a subset of computer vision.
• Image processing can use different methods or techniques, for example, hidden Markov models or independent component analysis.
• Typical image-processing applications are rescaling and correcting illumination.
➢ Computer vision
• It is basically about extracting information from an image or video.
• It is a superset of image processing.
• Image processing is one of the methods used within computer vision, alongside other techniques such as machine learning and convolutional neural networks.
• Typical computer vision applications are face detection, object detection, and so on.
FIGURE 5.4
Different domains of computer vision.
As shown in Figure 5.5, there are computer vision tasks for a single object as well as for multiple objects. For a single object, the tasks are to classify the object and to localize it, that is, to locate its position. For multiple objects, the tasks also involve detecting each object and segmenting the objects and their boundaries, for example, curves and lines.
FIGURE 5.5
Computer vision tasks.
Source: Image by Mike Tamir. Reprinted from Ref. [21].
5.2.2 IMAGE PREPROCESSING
Image preprocessing aims to enhance the image or improve its quality; it is also used to reduce redundancy and to increase the accuracy of subsequent algorithms. There are a number of image preprocessing methods.
Techniques of Preprocessing
Sometimes it is useful to discard unnecessary information from the image in order to reduce complexity and storage space. For example, converting a color image into a grayscale image reduces the amount of pixel data: the color information is often not necessary to recognize an object, and a grayscale image is enough for recognition, whereas color images contain more information, add complexity, and take more space in memory. In some applications, however, color is essential, for example skin cancer detection, where recognizing red rashes in a medical image depends on color; in medical images, color carries a lot of information.
FIGURE 5.6
Image preprocessing.
Source: Reprinted from Ref. [22]. https://freecontent.manning.com/the-computer-vision-pipeline-part-3-image-preprocessing/
Preprocessing in image processing comprises various types of operations, such as geometric or radiometric corrections, which help to perform data analysis over the images.8
Standardization: Some algorithms, for example machine-learning models, require every image to be resized to an identical width and height before it is fed into the algorithm.
Data augmentation: Another preprocessing technique is augmenting the dataset from the existing images. Different operations can be applied to any kind of image, for example decolorizing, flipping/rotating, and scaling.
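As a minimal sketch of these two steps (assuming the Pillow library is installed; sample.png and the target size are illustrative assumptions), the snippet below standardizes an image to a fixed size and generates a few augmented variants:

from PIL import Image, ImageOps

TARGET_SIZE = (224, 224)   # identical width and height expected by the downstream algorithm

img = Image.open("sample.png")

# Standardization: resize every image to the same dimensions
standardized = img.resize(TARGET_SIZE)

# Data augmentation: create extra training images from the existing one
augmented = [
    standardized.convert("L"),       # decolorize (grayscale)
    ImageOps.mirror(standardized),   # horizontal flip
    standardized.rotate(15),         # small rotation
    standardized.resize((112, 112))  # rescaling
]

for i, variant in enumerate(augmented):
    variant.save("augmented_%d.png" % i)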
General methodology of preprocessing:
FIGURE 5.7
Methodology of preprocessing.
The steps are as follows:
Step 1: Acquire the image and convert the color image into grayscale to reduce complexity.
Step 2: Resize the image to the required dimensions. Interpolation is needed to resize an image, and it can be done in two ways: downsampling and upsampling.
Step 3: Remove noise using filtering methods. There are several types of noise, for example salt-and-pepper noise and Gaussian noise.
• Salt-and-pepper noise: Also known as impulse noise, it appears as randomly occurring white and black pixels. The circumstances in which such swift transients take place give rise to what is called spike or impulsive noise.9
• Gaussian noise: Noise whose probability density function is the normal distribution, also known as the Gaussian distribution.
• Filtering methods: Filters are used to enhance an image, for example by suppressing high frequencies to smooth it. Filtering has many applications, such as smoothing, sharpening, noise removal, and edge detection.
Step 4: Image enhancement transforms the digital image to improve its appearance or to convert it into a form better suited for further analysis.
5.2.3 IMAGE SEGMENTATION
Segmentation plays an important role in image processing, and it is the first stage of image compression. Segmentation is based on a number of features, for example shape, texture, and color,10 and on two basic properties, discontinuity and similarity. Image segmentation is an approach for processing and analyzing a digital image in which the image is subdivided into subgroups, that is, multiple parts or regions, based on the characteristics of its pixels. This reduces the complexity of the image, so that analyzing it becomes simpler. Image segmentation also involves clustering regions based on similarities in the shape or color of pixels, or separating the foreground from the background of the image. It is generally used to detect objects and boundaries in images. There are two main categories of image segmentation, layer-based and block-based.11 Image segmentation is generally based on one of two basic properties, that is, discontinuities or similarities in intensity.
1. Segmentation based on discontinuities: the image is partitioned based on sudden changes in intensity, such as points, lines, or edges.
2. Segmentation based on similarities: the image is partitioned into regions that are similar according to a set of predefined criteria; this includes region growing, region splitting, and region merging.
There are a number of segmentation techniques:
➢ Edge detection
➢ Thresholding
➢ Region growing
➢ Clustering
Edge detection
Edge detection is an image-processing technique used for image segmentation: it identifies the edges of the objects present in an image and is a significant tool for segmentation. Edge detection methods transform the original image into an edge image by analyzing the gray tones of the image and identifying discontinuities in brightness. It is mainly used in image processing, data extraction, image segmentation, and computer vision. Different edge detection techniques include the Sobel, Prewitt, Canny, and Roberts operators, as well as fuzzy-logic methods. In computer vision, edge detection supports image processing by outlining objects and separating their edges from other objects and from the background. The edge detection technique is commonly used to detect substantial discontinuities in the intensity levels of an image, or to identify local variations in the image.
FIGURE 5.8
Edge detection techniques.
Source: https://www.researchgate.net/publication/307138664_USING_F-TRANSFORM_TO_REMOVE_IMAGE_NOISE
Prewitt’s and Sobel Operator
The Prewitt edge detection masks are the most extensively used technique for edge detection in images. The operator approximates the first derivative and assigns similar weights to all the neighbors of the candidate pixel whose edge strength is being calculated. The following masks are the Prewitt edge detection masks, which approximate the first derivatives Fx and Fy:12

Fx:
 –1   –1   –1
  0    0    0
  1    1    1

Fy:
 –1    0    1
 –1    0    1
 –1    0    1
Fx:
 –1   –2   –1
  0    0    0
  1    2    1

Fy:
 –1    0    1
 –2    0    2
 –1    0    1
In Sobel edge detection, higher weights are assigned to the pixels close to the candidate pixel.
Canny edge detection
The Canny operator is a first-derivative edge detector coupled with noise cleaning; it is less susceptible to noise than the Sobel and Roberts masks13 and is one of the significant derivative operators, like the Laplacian-of-Gaussian operator. In Canny edge detection, the image is first smoothed with a Gaussian low-pass filter, and then the first derivative is taken. The method can detect a broad range of edges in images: the detector identifies edges with the help of local maxima, and the Gaussian filter is used in calculating the gradient of f(x, y).
Robert’s Mask
The Roberts operator was the first edge detector, proposed by Lawrence Roberts in 1963. The Roberts cross-operator computes the sum of the squares of the differences between diagonally adjacent pixels and approximates the gradient with the help of discrete differentiation of the image; the edge detection gives better results when these cross-differences are taken instead of straight differences. The following are Robert’s masks:14

Fx:
  1    0
  0   –1

Fy:
  0    1
 –1    0
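A hedged illustration of these operators, assuming OpenCV and NumPy are installed and sample.png is a hypothetical grayscale input, is sketched below; it uses the library's built-in Sobel and Canny detectors rather than hand-coded masks:

import cv2
import numpy as np

gray = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)

# Sobel: first derivatives in x and y, combined into a gradient magnitude image
gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
magnitude = cv2.convertScaleAbs(np.sqrt(gx ** 2 + gy ** 2))

# Canny: Gaussian smoothing, gradient, non-maximum suppression, hysteresis thresholding
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

cv2.imwrite("sobel_edges.png", magnitude)
cv2.imwrite("canny_edges.png", edges)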
Thresholding
Thresholding is a technique that converts each pixel to black or white, or keeps it unchanged, depending on whether its value falls within the range given for the threshold. Image thresholding is an efficient way of subdividing an image into the foreground objects and the background: the objects are isolated from the background by converting a grayscale image into a binary image. The technique is most effective on images with high levels of contrast. Thresholding is used in image processing for image segmentation because converting an image's pixels in this way makes the analysis more straightforward and efficient. There are two thresholding techniques, local and global thresholding, which differ in how the threshold values are chosen.15 In the thresholding process, the threshold value determines whether a pixel belongs to the foreground or the background: if a pixel has a value greater than the threshold, it is labeled an "object" pixel; otherwise it is labeled a "background" pixel. The value "1" is assigned to object pixels and the value "0" to background pixels. When different threshold values are used for different regions of the image, the method is called adaptive or dynamic thresholding. In the basic adaptive thresholding approach, the original image is subdivided into subimages, and the thresholding process is applied over each of these subimages.
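The following sketch (assuming OpenCV is available; the input file and threshold values are illustrative) contrasts a single global threshold with adaptive thresholding:

import cv2

gray = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)

# Global thresholding: one threshold for the whole image
# Pixels above 127 become object (255), the rest become background (0)
_, global_binary = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Adaptive thresholding: a different threshold is computed for each 11x11 neighborhood
adaptive_binary = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 11, 2)

cv2.imwrite("global_threshold.png", global_binary)
cv2.imwrite("adaptive_threshold.png", adaptive_binary)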
Region growing
Region growing is an approach that groups neighboring pixels or subregions into larger regions. It is an image segmentation technique in which neighboring pixels are examined and added to a region class. It is the most straightforward approach to image segmentation: it starts from a selected set of initial seed points and grows regions from them by including every neighboring pixel with a similar property, which can be gray level, shape, color, or texture. The process is iterated for each boundary pixel in the image, and if adjacent regions are identified, a region-merging algorithm is applied. Region growing-based segmentation has proven to be better than edge detection-based techniques on noisy images, since it is challenging to detect edges there; the region growing algorithm can deal with the noise present in the image. The seed pixel, that is, the first pixel of a region, is compared with other pixels, and if a pixel fulfills the homogeneity function, the value of the pixel is changed to the value of the seed pixel.16 The advantage of region-growing methods is that they can produce segmentation results with precise edges from the original image; the basic concept is to select a small number of initial seed points that eventually grow into regions. The disadvantage of the region-growing method is that it is computationally expensive.
Clustering
Clustering is an essential unsupervised learning technique that examines the data points to find structure in an unlabeled dataset. Cluster-based image segmentation techniques are used in various domains such as mathematics, medicine, computing, and engineering.17 Clustering in image segmentation is an efficient technique for dividing an image into regions based on coherent properties like color and texture; another approach is to cluster together sets of coherent tokens in an image with high similarity. Cluster-based methods involve clustering algorithms such as k-means, fuzzy c-means, and the improved fuzzy c-means algorithm. Cluster-based image segmentation describes an image through clusters of pixels that belong together: pixels are added to a cluster based on coherent properties, such as the same color or texture, or because they are nearby. The clusters can be formed in two different ways, partitioning and grouping. In partitioning, the image is decomposed into different regions based on coherent properties such as color or texture, whereas in grouping, tokens from the image are collected together to form a cluster. K-means clustering can be used for cluster-based image segmentation; it is an iterative process that partitions an image into k clusters based on its properties.
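As a minimal sketch of cluster-based segmentation (assuming OpenCV and NumPy; the input file and the choice of k = 3 are arbitrary), k-means can be run directly on the pixel colors:

import cv2
import numpy as np

img = cv2.imread("sample.png")                   # color image, shape (M, N, 3)
pixels = img.reshape(-1, 3).astype(np.float32)   # one row per pixel, one column per channel

# k-means clustering of pixel colors into k coherent groups
k = 3
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
_, labels, centers = cv2.kmeans(pixels, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)

# Replace every pixel by the center of its cluster to obtain the segmented image
segmented = centers[labels.flatten()].astype(np.uint8).reshape(img.shape)
cv2.imwrite("segmented.png", segmented)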
5.2.4 IMAGE ANALYSIS
Image analysis in image processing extracts meaningful information from images with the help of various image-processing techniques, such as preprocessing, segmentation, feature extraction, classification, and interpretation. Image analysis performs different types of analysis over the image to extract important information; the techniques it uses may be automatic or semiautomatic, and the final output is numerical rather than another image. Image analysis methods therefore differ from other image-processing methods such as image enhancement or restoration.
Steps of image analysis:
1. Preprocessing
2. Segmentation
3. Feature extraction
4. Classification and interpretation
FIGURE 5.9
Steps of image analysis.
1. Segmentation in Image Analysis
Image segmentation is an important step in image analysis. Segmentation subdivides the image into multiple parts or regions based on the characteristics of its pixels and distinguishes the foreground of the image from the background. This step significantly reduces the complexity of the image, so that image analysis becomes simple. The segmentation operation only subdivides the image into multiple parts or regions; it does not recognize the segmented parts. Segmentation techniques used for image analysis are:
1. Thresholding
2. Boundary-based detection
3. Template matching
4. Region-based segmentation
5. Texture segmentation
2. Feature Extraction in Image Analysis
Feature extraction is a dimensionality-reduction process that extracts the most relevant information from the original image and represents it in a lower-dimensional space.18 It divides the initial set of raw data and reduces it to more manageable groups. Feature extraction makes the task of classifying a pattern simpler, since it describes by a formal procedure the important shape information contained in the pattern. If the input data to the algorithm are too large to be processed for image analysis, and some of the data are suspected to be redundant, the input data are transformed into a lower-dimensional space, that is, into a reduced set of features. This process of transforming image data into a set of features during image analysis is called feature extraction. The resulting features can be processed easily and allow the dataset to be interpreted with better accuracy while keeping the original information. The types of features extracted in the feature extraction step of image analysis are as follows:
1. Spatial features
2. Transform features
3. Edges and boundaries
4. Shape features
5. Moments
6. Texture
3. Classification in Image Analysis
Image classification in image analysis classifies images, using segmentation, according to their features: the objects present in the image are assigned to appropriate categories or labels. Convolutional neural networks can be used to classify and interpret the image for analysis. The classification techniques used for image analysis are as follows:
1. Clustering
2. Decision trees
3. Neural networks
4. Statistical classification
5. Similarity measures
5.3 MACHINE LEARNING AND IMAGE PROCESSING
Image processing has diverse applications in different fields, such as medicine, the military, the automobile industry, and agriculture. Machine learning has emerged as a critical component of computer vision programs, and as more research is conducted, the importance of combining machine learning and image processing has grown because image datasets are now easily available. Innovative integration of machine learning in image processing is very likely to benefit the field significantly, contributing to extracting more knowledge from complex images.
Machine learning has brought a tremendous change to the image-processing industry. Initially, image processing alone was used to analyze objects in an image; with the help of machine learning, images can be processed more deeply, and every minute piece of information in the image can be captured by the machine. Machine learning uses datasets of stored images, which help the machine build a better understanding of objects with different shapes, sizes, and colors. Datasets therefore play an important role in image processing with machine learning: the more data/images a dataset contains, the more the machine can learn from it and the more accurate its results become. Datasets can vary in size, depending on the data in them.
Machine learning has brought image processing to an advanced level, making it possible to detect almost anything in an image, so image processing with machine learning has vast applications in industry: healthcare, defense, the automobile industry, and agriculture have started to use it significantly. Machine learning is well suited to pattern recognition, object extraction, and color classification problems in the image-processing domain.
Image Processing Using Machine Learning
FIGURE 5.10
Image processing using machine learning.
Feature extraction methods include the scale-invariant feature transform (SIFT), bag of visual words, and extraction based on color, texture, shape, and statistics.
Bag of Visual Words
The concept of the "bag of visual words" is derived from the "bag of words" used for information retrieval in natural language processing. In a bag of visual words, the words are subparts of the image and their associated feature vectors, instead of text keywords; the bag of visual words records the frequencies of these words with respect to a precomputed corpus (vocabulary) of feature vectors.19 There are three steps to construct the bag of visual words:
1. Feature extraction: The first step is to obtain descriptors from each image, for example by detecting key points and extracting SIFT features from the images.
2. Codebook construction: After feature extraction, a visual vocabulary is constructed, typically by clustering the extracted feature vectors with a clustering algorithm such as k-means.
3. Vector quantization: Each object is then modeled by assigning every feature vector to its nearest codebook word and generating a histogram of all visual words from the distribution of feature vectors; this process is called vector quantization.
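A compact sketch of these three steps is given below, assuming OpenCV (with ORB used in place of SIFT for availability), NumPy, and scikit-learn are installed; the list of training image paths and the vocabulary size are hypothetical:

import cv2
import numpy as np
from sklearn.cluster import KMeans

image_paths = ["img1.png", "img2.png"]   # hypothetical training images
orb = cv2.ORB_create()

# Step 1: feature extraction - keypoint descriptors from every image
all_descriptors = []
per_image_descriptors = []
for path in image_paths:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = orb.detectAndCompute(gray, None)
    per_image_descriptors.append(desc)
    all_descriptors.append(desc)
all_descriptors = np.vstack(all_descriptors).astype(np.float32)

# Step 2: codebook construction - cluster the descriptors into a visual vocabulary
vocabulary_size = 50
codebook = KMeans(n_clusters=vocabulary_size, n_init=10).fit(all_descriptors)

# Step 3: vector quantization - histogram of visual-word frequencies per image
histograms = []
for desc in per_image_descriptors:
    words = codebook.predict(desc.astype(np.float32))
    hist, _ = np.histogram(words, bins=np.arange(vocabulary_size + 1))
    histograms.append(hist)

Each histogram can then serve as the feature vector of its image for a downstream classifier.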
FIGURE 5.11
Image to information.
Classification methods used for image processing:
• Decision trees
• Instance-based learning
• Bayesian methods
• Reinforcement learning
• Inductive logic programming
• Genetic algorithms
• Support vector machines
• k-Nearest neighbors
5.4 DEEP LEARNING WITH IMAGE PROCESSING
Deep learning is a more advanced subset of machine learning. It gives much higher accuracy than general machine-learning techniques, but it requires more time to train and typically needs a GPU for training. Deep learning offers different algorithms to train models and to perform many tasks besides image processing. It is based on multilayer neural networks.
Deep learning trains a machine to perform tasks largely by itself. Used with image processing, it performs classification tasks directly on images, sound, or text. Neural network-based architectures with many layers are generally used for deep learning. Deep learning overcomes a limitation of classical machine learning, which is often unable to extract differentiating features on its own, by using artificial neural networks, multilayered structures of algorithms.20 Deep-learning models have multilevel structures that help in extracting important, specific information from input images. The applications of deep learning with image processing range widely, embracing medicine, agriculture, robotics, security, receipts and invoices, and surveillance. Convolutional neural networks (CNNs) reduce the computation time of image processing by exploiting the GPU for computation, a benefit many other networks fail to utilize. For image processing, noise removal can be carried out by a pretrained neural network that identifies the noise in the image and removes it. For image preprocessing, image augmentation, and feature extraction, images are processed by deep-learning techniques such as the CNN or the recurrent convolutional neural network (RCNN) for numerous applications. There is a wide range of applications of image processing using deep learning, such as face recognition, voice recognition, text recognition, self-driving vehicles, fraud detection, hand-sign recognition, and traffic-sign recognition. Data preparation in deep learning is described below.
1. Image classification
Image classification is a technique that assigns images to multiple categories based on their features; using a CNN increases its accuracy. The process of image classification involves preprocessing the data, determining the model architecture, training the model, and estimating its performance. Image classification uses neural networks that seek to classify and comprehend the entire image at once. For image classification, the input parameters of the image data can be the number of pixel levels, the dimensions of the images, and the number of images.
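A minimal, hedged sketch of such a classifier, assuming TensorFlow/Keras is installed and that training images have already been loaded into NumPy arrays x_train of shape (N, 64, 64, 3) with integer labels y_train (these names, shapes, and the number of classes are illustrative assumptions, not part of the text):

import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 10   # assumed number of categories

# Determine the model architecture: a small convolutional neural network
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

# Train the model and estimate its performance on a held-out split
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10, validation_split=0.2)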
2. Data labeling
Deep-learning algorithms give good predictions when the input data are labeled, so that the algorithms can learn faster and more efficiently. NLP pipelines can be implemented for tagging and annotating the images. For better performance and shorter training time, the rectified linear unit (ReLU) can be used as the nonlinear activation function, and data augmentation can be used to imitate the existing images and transform them in order to enlarge the training dataset. Data labeling is used in image processing with deep learning to create AI models that recognize objects with higher accuracy from a high-quality dataset. The image dataset is used for training the AI model, which is a visual perception-based model; for such models, image annotation is done so as to make the images easier to process for recognition and to improve the learning of the neural networks, so that they give predictions with greater accuracy. The different image labeling techniques are bounding boxes, cuboid annotation, polygon annotation, and polyline annotation.
3. Using RCNN
The RCNN is a region-based convolutional technique that helps to detect the objects in an image easily. Different variants of the RCNN, such as Fast RCNN, Mask RCNN, and Faster RCNN, are used for better recognition of objects within images. Recurrent networks can handle the incorporation of additive Gaussian noise. Recurrent neural networks are attractive because they are more practical than other feedforward networks and have a better ability to detect and identify objects within an image under multiple conditions. The RCNN combines two significant types of neural networks, the CNN and the recurrent neural network; its underlying idea is to add recurrent connections to every convolutional layer of the feedforward CNN.
KEYWORDS
• image processing
• computer vision
• machine learning
• feature extraction
• digital image
REFERENCES 1. Ravindra, H. S. Image Processing: Research Opportunities and Challenges. In National Seminar on Research in Computers, 2010. 2. Understanding Digital Image Processing, Vipin Tyagi Jaypee University of Engineering and Technology, September 2018. 3. Singh, M.; Kumar, S.; Singh, S.; Shrivastava, M. Various Image Compression Techniques: Lossy and Lossless. Int. J. Comput. App. 2016, 6, 23–26. 4. Prabhune, O.; Sabale, P.; Sonawane, D. N.; Prabhune, C. L. Image Processing and Matrices. 2017 International Conference on Data Management, Analytics and Innova tion (ICDMAI), Pune, 2017; pp 166–171. 5. Sumithra, K.; Buvana, S.; Somasundaram, R. A Survey on Various Types of Image Processing Technique. In International Journal of Engineering Research & Technology (IJERT), 2015. 6. Zaitouna, N. M.; Aqelb, M. J. Survey on Image Segmentation Techniques. In Inter national Conference on Communication, Management and Information Technology (ICCMIT), 2015. 7. Wiley, V.; Lucas, T. Computer Vision and Image Processing: A Paper Review. Int. J. Artif. Intell. 2018, 2, 29–36. 8. Ravi, A. A. Analysis of Various Image Processing Techniques. Int. J. Adv. Netw. App. (IJANA), 2017. 9. Perumal, S.; Velmurugan, T. Preprocessing by Contrast Enhancement Techniques for Medical Images. Int. J. Pure Appl. Math. 2018. 10. Gurusamy, V.; Kannan, S.; Nalini, G. Review on Image Segmentation Techniques, 2014. 11. Zaitoun, N. M.; Aqel, M. J. Survey on Image Segmentation Techniques. Procedia Comput. Sci. 2015, 65. 12. Malik, S. Comparative Analysis of Edge Detection Between Gray Scale & Color Image. CAE Commun. Appl. Electron. 2016, 5 (2). ISSN: 2394-4714, 13. Malik, S. Various Edge Detection Techniques on Different Categories of Fish. IJCA Int. J. Comput. App. 2016, 135 (7). 14. Bala Krishnan, K.; Ranga, S. P.; Guptha, N. A Survey on Different Edge Detection Techniques for Image Segmentation. Indian J. Sci. Technol. 2017. 15. Kuruvilla, J.; Sukumaran, D.; Sankar, A.; Joy, S. P. A Review on Image Processing and Image Segmentation. In 2016 International Conference on Data Mining and Advanced Computing (SAPIENCE), Ernakulam, 2016; pp 198–203. 16. Mary Synthuja, M.; Preetha, J.; Padma Suresh, L.; John Bosco, M. Image Segmentation Using Seeded Region Growing. In 2012 International Conference on Computing, Electronics and Electrical Technologies (ICCEET), Kumaracoil, 2012; pp 576–583. 17. Sulaiman, S. N.; Mat Isa, N. A. Adaptive Fuzzy-K-Means Clustering Algorithm for Image Segmentation. IEEE Trans. Consumer Electron. 2010, 56 (4), 2661–2668. 18. Kumar, G.; Bhatia, P. K. A Detailed Review of Feature Extraction in Image Processing Systems. In 2014 Fourth International Conference on Advanced Computing & Communication Technologies, Rohtak, 2014; pp 5–12. DOI: 10.1109/ACCT.2014.74.
19. Mukherjee, J.; Mukhopadhyay, J.; Mitra, P. A Survey on Image Retrieval Performance of Different Bag of Visual Words Indexing Techniques. In Proceedings of the 2014 IEEE Students' Technology Symposium, Kharagpur, 2014; pp 99–104. 20. Krishna, M.; Neelima, M.; Harshali, M.; Rao, M. V. Image Classification Using Deep Learning. Int. J. Eng. Technol. 2018, 7 (2.7), 614–617. 21. Khan, A. Machine Learning in Computer Vision. Procedia Comput. Sci. 2020, 167. 22. Khurana, S. Comparative Study on Threshold Techniques for Image Analysis. Int. J. Eng. Res. Technol. 2015, 4 (6), 551–554. 23. Lv, Q., et al. Deep Learning Model of Image Classification Using Machine Learning. Advanced Pattern Recognition Systems for Multimedia Data, 2022.
CHAPTER 6
Evolutionary Algorithms
RICHA SHARMA
ASET, Amity University, Noida
ABSTRACT An evolutionary algorithm (EA) belongs to the field of computational intelligence. It is a population-based algorithm inspired by biological evolution process. It is a step-by-step process that involves reproduction, mutation, recombination, and selection. The possible solutions to the optimization problem act as individuals in a population, and the fitness function evaluates the quality of these solutions. EAs often perform well approximating solutions to all types of problems. This chapter discusses general implementation of different EAs along with their characteristics. ECs have the capability of solving several combinatorial optimization problems and continuous optimization problems through natural evolu tionary mechanism. This chapter also highlights genetic algorithm and its variants. It also discusses genetic programming and languages used for genetic programming. 6.1 EVOLUTIONARY ALGORITHMS Evolutionary algorithms (EAs) are population-based metaheuristic techniques used to solve various complex optimization problems. These algorithms fall under the category of evolutionary computation (EC) and are inspired by the natural evolutionary process.1 EC basically is a subfield
of computational intelligence which is further a subfield of an artificial intelligence domain; ECs have the capability of solving several combinatorial optimization problems and continuous optimization problems through natural evolutionary mechanism. Figure 6.1 depicts the classification of artificial intelligence techniques.
FIGURE 6.1
Classification of artificial intelligence techniques.
EAs undergo different biological processes to evolve. These processes are reproduction, mutation, recombination, and selection. EAs are also termed as population-based algorithms because they constitute a large number of candidate solutions that participate in solving several complex real-world optimization problems. EAs follow Charles Darwin’s theory named “Survival of the Fittest,” which means only fit or deserving candidates will survive in coming generations. Based on the type and complexity of the optimization problem, a fitness function for the evaluation of the quality of the candidate solutions is derived. This fitness function is obtained by considering few important and essential parameters that help in determining the quality of each individual solution. Hence, for most of the real-world problems, the computation complexity in solving them using EAs is completely dependent on fitness function evaluation.2
6.1.1 GENERAL IMPLEMENTATION OF EVOLUTIONARY ALGORITHMS
An EA applies bio-inspired mechanisms iteratively to a population of potential candidate solutions in each generation or iteration, and it is typically able to find a near-optimal solution to an optimization problem in a reasonable amount of time.3,4 The general implementation of any EA can be described as follows:
1. The candidate solutions that make up the initial population are generated randomly.
2. The following steps are performed repeatedly on this population in each generation or iteration until some termination criterion is met:
a) Derive a fitness function depending on the nature and complexity of the problem.
b) Select the fittest candidates, those with high fitness values, from the population for the reproduction process; these fittest candidates are also termed parent solutions.
c) Produce offspring of the parent solutions through crossover and mutation operations, the fundamental operations of any evolutionary technique.
d) Replace the least-fit candidates in the population with these newly generated individuals.
6.1.2 PSEUDOCODE OF CLASSICAL EVOLUTIONARY ALGORITHM
Initialize the population randomly: P_0 = {P_0^1, P_0^2, …, P_0^M}, P_0^i ∈ N, where N is the search space, P_0 is the initial population of the first generation, and i indexes the candidate solutions
Derive a fitness function or objective function (F_i) and determine the fitness of each individual in the population
Store the candidate solutions with the highest fitness value
For each generation
    While (termination condition is not met) do
        Evaluate each individual candidate
        Select the fitter candidates and update the population using evolutionary operators such as crossover, mutation, and selection
        Replace the least-fit candidates with the produced offspring
        Increment the generation by one
    End while
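A runnable Python sketch of this classical loop is given below; the fitness function (maximizing the number of 1-bits, i.e., the Maxone problem), the population size, and the rates are illustrative assumptions, not part of the pseudocode above:

import random

GENOME_LENGTH = 20
POPULATION_SIZE = 30
CROSSOVER_RATE = 0.9
MUTATION_RATE = 0.01
GENERATIONS = 50

def fitness(individual):
    # Illustrative objective: count of 1-bits (Maxone problem)
    return sum(individual)

def tournament_selection(population, k=3):
    # Pick k random candidates and return the fittest of them
    return max(random.sample(population, k), key=fitness)

def crossover(parent1, parent2):
    # One-point crossover
    if random.random() < CROSSOVER_RATE:
        point = random.randint(1, GENOME_LENGTH - 1)
        return parent1[:point] + parent2[point:]
    return parent1[:]

def mutate(individual):
    # Bit-flip mutation
    return [1 - gene if random.random() < MUTATION_RATE else gene for gene in individual]

population = [[random.randint(0, 1) for _ in range(GENOME_LENGTH)] for _ in range(POPULATION_SIZE)]

for generation in range(GENERATIONS):
    offspring = []
    for _ in range(POPULATION_SIZE):
        p1 = tournament_selection(population)
        p2 = tournament_selection(population)
        offspring.append(mutate(crossover(p1, p2)))
    # Survivor selection: keep the best individuals of parents and offspring
    population = sorted(population + offspring, key=fitness, reverse=True)[:POPULATION_SIZE]

best = max(population, key=fitness)
print("Best fitness:", fitness(best))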
FIGURE 6.2
General implementation of evolutionary algorithms.
6.1.3 CHARACTERISTICS OF EVOLUTIONARY ALGORITHMS 1. EAs are robust and efficient population-based search techniques based on Darwinian principle to obtain global optimal solutions to optimize complex real-world problems. 2. EAs are easy to implement and have a higher probability of locating a near-optimal solution in beginning stages of the optimization process.
3. Unlike other optimization methods, EAs require no fitness-gradient information to operate, and they also support parallelism.
4. EAs have the capability to escape from getting stuck in local minima and converge without being affected by whether the optimization function is continuous or differentiable.
5. EAs can efficiently handle both single- and multi-objective optimization problems, since they can be applied with different kinds of variables, such as integer, discontinuous, or discrete design variables.
6. In terms of computational complexity, EAs grow linearly with the size of the problem, whereas other optimization methods grow exponentially, thereby increasing the problem complexity.
6.2 TYPES OF EVOLUTIONARY ALGORITHM
The idea behind the working of any EA is to mimic what nature does. The different types of EAs include the genetic algorithm (GA), differential evolution (DE), genetic programming (GP), and evolutionary programming (EP). Figure 6.3 represents the anatomy of EAs.5
FIGURE 6.3
Anatomy of evolutionary algorithms (EAs).
Among these EAs, the GA is the simplest and the most widely used evolutionary method, devised by John Holland in 1975. The algorithms differ from each other in their implementation aspects. Each type is described briefly below.
1. Genetic algorithm: A well-known evolutionary technique for solving real-world optimization problems through a natural evolution process. The basic concept of this algorithm is to optimize a given problem through meaningful recombination and survivor selection. The recombination of candidate solutions, also termed parent solutions, is done based on their fitness; here, fitness means the capability of producing better offspring for upcoming generations. The GA relies more on crossover than on the mutation operation.
2. Genetic programming: GP works in a fashion similar to the GA. It differs from the GA in that the candidate solutions selected for the evolution process are computer programs, whose fitness is their capability to solve a particular computational problem.
3. Evolutionary programming: Another branch of EC that follows an evolution process similar to GP but works on programs having a fixed structure with varying numerical parameters.
4. Evolution strategy: The evolution strategy (ES) follows the principle of self-adaptation, which means it allows control parameters like the mutation rate, crossover probability, and population size to adapt to the problem requirements; in other words, this approach facilitates adjusting the control parameters as the process evolves. In ES, candidate solutions are represented as vectors of real numbers having a self-adaptive nature.
5. Differential evolution: DE also works on vectors and is suited to solving numerical optimization problems. It evaluates vector differences to produce new and better candidate solutions in each generation. Unlike the GA, it relies more on its mutation factor and crossover probability rate.
6. Neuro-evolution: The neuro-evolution (NE) technique evolves artificial neural networks by describing their architecture and connection weights.
7. Learning classifier system: In these systems, the candidate solutions are represented by sets of classifiers, where each classifier represents a certain rule or condition to be fulfilled.
6.3 COMPONENTS OF EVOLUTIONARY ALGORITHM
The fundamental approach repeatedly followed by these EAs is (i) initialization of the initial population, (ii) selection of the fittest parents from the current
population for offspring generation, and (iii) replacement of parents with their offspring based on their fitness to generate new population for next generation. As all EAs follow the same principle of biological evolution, hence they all constitute of similar components. Figure 6.4 represents the components of EAs.
FIGURE 6.4
Component of evolutionary algorithms.
All these components are explained in brief below: i) Representation of candidate solutions: The first component of any EAs is the representation of individual candidate solutions. These candidate solutions combined form an initial population to work with in order to optimize any given problem. ii) Derivation fitness function: A fitness function is derived to determine the quality of each individual from the population. This fitness function constitutes of various parameters whose values directly or indirectly affect the given optimization problem under consideration. It is derived based on the nature of the optimization problem, that is, in other words, this function sets a requirement condition or criterion that each individual from the population should satisfy to survive. iii) Parent recombination: Based on the fitness values, parents are selected to combine and modified to produce new solutions called
offspring. The basic operations applied on these parent solutions are reproduction, crossover, mutation, and selection. These cross over and mutation operators are also called control parameters because they determine the quality and efficacy of the search process. Crossover and mutation operators are both stochastic in nature that actually focuses on the factors like how and what parts of parents’ solutions are to be recombined or mutated to generate new solutions of good quality. iv) Survivor selection: After the production of new solution, their fitness value is compared with the fitness of their parent solu tions. If the newly produced solutions have more fitness than their parents, then they will replace their parents in the next generation. This mechanism is known as survivor selection. v) Update population for next generation: The final population for coming generation is updated after selecting the survivors as mentioned in step iv. Each EA undergoes all these components repeatedly per each iteration or generation until some termination condition is met. 6.4 GENETIC ALGORITHM AND VARIANT OF GA GA follows “Darwinian principle” to select best candidate solution from the group of given possible solutions. It is a kind of iterative search procedure to explore a large search space to find an optimal solution. GA is a kind of heuristic search procedure that helps in solving several popular problems like traveling salesman problem (TSP), target number problem, or Maxone problem. The fundamental parameters of a GA are population size, generations count, recombination probability, and mutation factor. The basic terminology used while working with GAs is explained in brief below: 1. Chromosome: A chromosome is a set of genes that combined form a candidate solution to work with. It is also termed as an individual solution. 2. Population: It is a collection of individual solution that constitutes an initial population per each generation. 3. Fitness: Fitness is a value assigned to each solution in the popula tion that depicts how much close or far that individual is from the best or optimal solution.
4. Fitness function: It is a problem-specific function that is derived to evaluate the quality of each individual. It helps in selecting the most-fit or deserving candidate from the mating pool to be allowed to go under recombination process. 5. Selection operation: Selection procedure selects the parent solu tions from the entire population to combine and produce new off springs. There are several types of selection methods like Roulette Wheel Selection, Tournament Selection, and Truncation method. 6. Recombination: Genes of the fittest parents are combined in some way to create a new solution. 7. Mutation: It is a process of doing modifications to the bits of the newly generated off-spring solution. 8. Termination criterion: It represents a condition on the fulfillment of which the iterative procedure is stopped. 6.4.1 VARIANTS OF GENETIC ALGORITHM There exists a wide variation of GA in our literature based on the need of application areas. The most widely known variant of GA is steady-state GA in which the newly produced solutions after the recombination opera tion replace the worst-fit solutions among the population only if they have better fitness than them. All variants of GA vary in terms of the type of selection method, recombination method, or mutation procedure adopted by them.7 a) Different kinds of encoding methods: Encoding is a way of representing an individual solution called chromosome while solving a given problem. Each chromosome is a string of bits that each bit called gene depicts necessary information about that individual solution that shows the characteristics or properties of that individual. Encoding can be of different types like binary encoding, real value encoding, order or permutation encoding, or tree encoding. All these methods differ in the way they represent a single candidate solution that constitute an initial population for problem solving and optimization. i. Binary encoding: It is one of the simplest encoding methods in which chromosomes are represented as a string of 1s and 0s that is in the form of binary numbers. The string can be
of variable and fixed length. In such representations, each bit position represents a particular property or characteristics of the given problem.
One of the most common problems that can make use of this encoding method is 0–1 knapsack problem. ii Permutation encoding: It represents the individual solution as a sequence of elements. In other words, this type of encoding is useful in solving problems that require ordering of events such as the TSP. In TSP, every chromosome is a string of numbers, each of which represents a city to be visited. That is why this encoding method is alternatively termed as order encoding.
iii) Real value encoding: This type of encoding is used in problems that include complicated values in the form of real numbers. This encoding method is best suitable for continuous search space optimization problem.
iv) Tree encoding: In this type of encoding, chromosomes are represented as objects in a tree-like hierarchy. In other words, individual solutions are represented in the form of a tree, for example the binary-tree formulation of the floor-planning problem.
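To make these encodings concrete, a small Python illustration follows; the knapsack items, city tour, real values, and expression tree are invented examples, not drawn from the text:

# Binary encoding: one gene per item of a 0-1 knapsack problem (1 = item taken)
knapsack_chromosome = [1, 0, 1, 1, 0, 0, 1]

# Permutation (order) encoding: a TSP tour visiting cities 0..5 exactly once
tsp_chromosome = [3, 0, 5, 1, 4, 2]

# Real-value encoding: continuous design variables of an optimization problem
real_chromosome = [0.73, -1.25, 3.07]

# Tree encoding: a nested structure, e.g. the expression (x + 2) * y
tree_chromosome = ("*", ("+", "x", 2), "y")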
b) Different kinds of selection method: The selection method in any EA is used in selecting the fittest individual from the initial popu lation. This is also termed as parent selection procedure. Different types of selection methods are briefly explained below: i. Tournament selection: This selection strategy randomly selects individuals from the population and runs a tournament in between them. The individuals that succeed in the tournament are fittest and are then selected and constitute new popula tion for the next generation. In a k-way tournament selection approach, k-individuals are selected and participate in this fitness evaluation mechanism. This selection method is suit able for negative fitness values. ii. Rank selection: Just like tournament selection, rank selection also has capability to work with negative fitness values. This selection strategy is suitable for the population in which indi viduals have nearly close fitness to each other. This selection strategy gives approximately same chances of being selected from the mating pool. This approach thus avoids selection pressure for selecting fittest individuals from the population. iii. Steady-state selection: The basic principle of this selection method is that a big portion of a particular individual must
survive to the coming generation. The rest of the selection procedure is same as done in any EA based on the fitness of the individual solutions. iv. Elitism selection: During any evolutionary process, it is found that there are chances of losing best chromosomes at the time of mutation and crossover operation. So, in order to avoid losing those best solutions, those solutions are preserved first for the next population in elitism selection mechanism. For the rest of the population, the evolution procedure is same as other EAs based on their fitness. c) Different kind of recombination method: Crossover operation is a recombination procedure in which two parents selected using selection method as discussed above are recombined and new off springs are produced. Crossover probability to perform crossover on selected candidate solution depicts how often recombination of those solutions is to be performed. Thus, 0% crossover probability means the newly generated solution will be completely copy parent solutions whereas 100% probability means that new solutions are generated by recombination and not a copy of parent. i. One-point crossover: In this type of crossover operation, a single crossover point is selected and at that point both parents are recombined. Crossover probability is the rate of how much two selected parents are going to recombine.
ii. Multipoint crossover: Multi-point crossover is an extension to classical single-point crossover wherein alternative bits of selected parent chromosomes are swapped to produce new candidates.
iii. Uniform crossover: This type of crossover operation merges the bits of the selected parents uniformly. For each bit position, a uniform real random number between 0 and 1 decides whether the bits of the two selected parents are swapped; the operation creates two child solutions of p bits chosen uniformly from both parents, the random number telling whether the first offspring takes the qth bit from the first or the second parent.
d) Different kinds of mutation method: Mutation is a genetic operation applied to newly produced solutions to maintain population diversity between two generations. In other words, mutation is a very small random adjustment made to the offspring produced through crossover in order to obtain a better solution; it alters the values of one or more genes of the chromosome and is therefore performed with a very low mutation rate. Mutation is concerned with the "exploration" of the problem's search space: mutation operators prevent the optimization process from getting trapped in local minima and keep the candidate solutions in the population from becoming identical to each other.
a. Bit flip mutation: Bits of the individual solution are flipped at random locations.
b. Random resetting: This mutation operation is applied for integer representation. It assigns a randomly selected value from the set of chosen values to any of the randomly selected gene in a chromosome. c. Swap mutation: In this mutation, two positions in the chromosome are randomly selected and interchanged. This type of mutation is commonly used with permutation-based encoding scheme.
d. Scramble mutation: This mutation operation is based on scrambling or shuffling of the bits: a group of genes from the chromosome is selected and their values are shuffled randomly.
e. Inversion mutation: This mutation operation works in a manner similar to scramble mutation, but instead of shuffling the selected subset of genes, it inverts their order.
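The genetic operators described above translate directly into a few lines of Python; this is an illustrative sketch, not tied to any particular GA library:

import random

def one_point_crossover(p1, p2):
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def uniform_crossover(p1, p2):
    # Each gene of the child is drawn uniformly from one of the two parents
    return [random.choice(pair) for pair in zip(p1, p2)]

def bit_flip_mutation(chrom, rate=0.01):
    return [1 - g if random.random() < rate else g for g in chrom]

def swap_mutation(chrom):
    i, j = random.sample(range(len(chrom)), 2)
    chrom = chrom[:]
    chrom[i], chrom[j] = chrom[j], chrom[i]
    return chrom

def scramble_mutation(chrom):
    i, j = sorted(random.sample(range(len(chrom)), 2))
    segment = chrom[i:j]
    random.shuffle(segment)
    return chrom[:i] + segment + chrom[j:]

def inversion_mutation(chrom):
    i, j = sorted(random.sample(range(len(chrom)), 2))
    return chrom[:i] + chrom[i:j][::-1] + chrom[j:]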
6.5 GENETIC PROGRAMMING
GP is a biologically inspired form of computing that works in a fashion similar to the GA. It tackles complex optimization problems using principles inspired by the natural evolution process, but it differs from the GA in that the candidate solutions selected for the evolution process are computer programs, whose fitness is represented by their capability to solve a particular computational problem.8 GP is a subset of EC that creates a working computer program through automatic programming; this process of creating a computer program by genetic breeding of several existing computer programs is also known as program synthesis.9 The process is likewise modeled on the principles of the Darwinian theory of biological evolution, through operations like selection, recombination, and mutation.
6.5.1 PREPARATORY STEPS OF GENETIC PROGRAMMING
The basic terminology a human user supplies to convert a problem given as a high-level statement into a GP system is as follows:
1. Terminals: Terminals are the set of independent variables of the problem or random constants of an evolutionary-breeded computer program. 2. Primitive functions: These are the functions defined in to-be evolved computer program to perform certain or specific task just like other programming languages. 3. Fitness factor: It includes the parameters to measure the fitness of individual from the population. 4. Evolution operations: These operations like mutation, crossover, and selection are used to control the entire evolutionary process. 5. Termination: The stopping criterion is to stop the iterative evolu tionary process on the fulfillment of certain conditions. 6.5.2 IMPLEMENTATION OF GENETIC PROGRAMMING The fundamental working of GP is given step by step below: 1. Randomly initializes the population of computer programs that consist of set of terminals and finite functions. 2. Implement the following substeps until some stopping condition is met: a) Evaluate the fitness of each individual computer program to determine its quality. b) Select two individual programs with the highest fitness from the population. c) Generate new individual program(s) by performing genetic operations mentioned below: i. Recombination/Crossover: Based on prespecified cross over probability, new offspring program(s) are generated by recombining the randomly selected parts from two fittest programs. ii. Mutation: New offspring is mutated based on the mutation rate by considering the randomly chosen part of program. 3. After the stopping criterion is met, the single best program in the population generated during the run (the best-so-far individual) is harvested and designated as the result of the run. If the run is successful, the result may be a solution (or approximate solution) to the problem.
6.6 LANGUAGES FOR GENETIC PROGRAMMING GP commences the evolution process with a randomly generated computer programs considered as the initial population. It applies genetic operations over the population of computer programs to produce a new generation of computer programs that will act as new population for the next iteration. The iterative conversion of the selected computer programs is done inside the main loop per each generation during the each run of GP. GP was usually adopted to solve problems from different domains like quantum computing, electronic design, game playing, sorting, and searching. Lisp is a famous language used for artificial intelligence development and for GP. LISP stands for list processing. This programming language was intro duced first by John McCarthy in 1959 to easily manipulate data strings. It is a machine-independent language that is represented as a mathematical notation to process symbolic data. It follows an iterative methodology to provide high-level program debugging. 6.7 GENETIC ALGORITHM IN MACHINE LEARNING GAs have been used to solve problem-related to classification and predic tion and create rules for learning and classification. The incorporation of GA with machine learning makes the learning process involved a natural one instead of algorithmic-based approach. Solving complex problems by making use of GA in machine learning includes the evaluate of fitness value of the rule using different types of learning methods like reinforcement learning and deep learning. Using reinforcement learning methods allows an agent in enhancing its performance through the feedback provided by the environment. In reinforcement learning, the system receives feedback which makes it closer to supervised learning. 6.8 APPLICATIONS OF GENETIC ALGORITHM IN REAL WORLD GA is very famous for its multidisciplinary usage in domains like natural and earth sciences, data science, machine learning, finance, and economics and artificial intelligence. It has several application areas like feature selec tion, pattern recognition, building neural network architecture, and neural network training.6,10,11 Few of the real-world applications of GAs are:
a) Data clustering can be done using GA to optimize several fitness functions. Clustering is a process of dividing the data points into disjoint sets such that the points or data objects lying in one group are similar to each other in one or more aspects and are different from data points lying in other groups.
b) GAs can be used to solve various optimization problems like portfolio optimization, the TSP optimization problem, job scheduling, sound quality optimization, the Maxone problem, and the target number problem.
c) GA can be utilized to design energy-aware routing in networking. It is used to cluster wireless sensor networks to achieve energy efficiency and to prolong network lifetime.
d) GAs can be utilized to evaluate scheduling problems in the manufacturing sector. One of the most common examples is the job scheduling algorithm.
e) Genetic-based approaches can be utilized in the field of electromagnetics. GAs can also be utilized for test pattern generation and pattern recognition.
f) GAs can be utilized for multiple sequence alignment, which is a part of sequence analysis. The basic objective is to design an objective function to evaluate multiple sequence alignments.

KEYWORDS

• evolutionary algorithms
• genetic programming
• optimization problems
• fitness function
• crossover operation
REFERENCES

1. Yu, X.; Gen, M. Introduction to Evolutionary Algorithms; Springer Science & Business Media, 2010.
2. Fortin, F. A.; De Rainville, F. M.; Gardner, M. A.; Parizeau, M.; Gagné, C. DEAP: Evolutionary Algorithms Made Easy. J. Machine Learn. Res. 2012, 13 (1), 2171–2175.
3. Sharma, R.; Vashisht, V.; Singh, U. Nature Inspired Algorithms for Energy Efficient Clustering in Wireless Sensor Networks. In 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence); IEEE, 2019; pp 365–370.
4. Eiben, Á. E.; Hinterding, R.; Michalewicz, Z. Parameter Control in Evolutionary Algorithms. IEEE Trans. Evol. Comput. 1999, 3 (2), 124–141.
5. Mirjalili, S. Genetic Algorithm. In Evolutionary Algorithms and Neural Networks; Springer: Cham, 2019; pp 43–55.
6. Sharma, R.; Vashisht, V.; Singh, U. EEFCM-DE: Energy-Efficient Clustering Based on Fuzzy C Means and Differential Evolution Algorithm in WSNs. IET Commun. 2019, 13 (8), 996–1007.
7. Sharapov, R. R. Genetic Algorithms: Basic Ideas, Variants and Analysis; IntechOpen, 2007.
8. Koza, J. R.; Poli, R. Genetic Programming. In Search Methodologies; Springer: Boston, 2005; pp 127–164.
9. Koza, J. R. Genetic Programming: On the Programming of Computers by Means of Natural Selection; MIT Press, 1992.
10. Roeva, O., Ed. Real-World Applications of Genetic Algorithms; BoD–Books on Demand, 2012.
11. Sofge, D. Using Genetic Algorithm Based Variable Selection to Improve Neural Network Models for Real-World Systems. In ICMLA, 2002; pp 16–19.
CHAPTER 7
Process Simulation

KSHATRAPAL SINGH1, ASHISH KUMAR1, and MANOJ KUMAR GUPTA2

1 Department of Computer Science & Engineering, ITS Engineering College, Greater Noida, India

2 Faculty of Computer Science & Engineering, SMVD University, Katra, India
ABSTRACT

Process simulation is a model-based representation of chemical, physical, biological, and other technical processes and unit operations in software. The fundamental prerequisites for the model are the chemical and physical properties of pure components and mixtures, of reactions, and of mathematical models which, in combination, allow the software to calculate process properties. Process simulation software describes processes in flow diagrams where unit operations are positioned and connected by product streams. The objective of process simulation is to find optimal conditions for a process.1 This is normally an optimization problem, which must be addressed in an iterative manner.

7.1 NEED FOR PROCESS SIMULATION

Process simulation is a powerful software tool that allows facility owners, operators, and engineers to virtually model a process in great detail without investing the time, labor, or money of physically testing their
plan in a real environment.2 It is regularly performed during the planning stage, or before a plant becomes fully operational, to see how changes in equipment specifications, scheduling, downtime, and maintenance can influence a process throughout the span of its life cycle. The following factors are the main reasons for which process simulation is needed:

7.1.1 RISK-FREE ENVIRONMENT

Simulation modeling gives a protected way to test and investigate diverse "what-if" scenarios. The result of adjusting staffing levels in a manufacturing unit can be made visible without putting production at risk, so the right decision can be settled on before rolling out real changes.

7.1.2 SAVE MONEY AND TIME

Virtual experiments with simulation models are highly economical, and they take less time compared with experiments on real resources. Marketing initiatives can be tried out without alerting the competition or unnecessarily spending money.

7.1.3 VISUALIZATION

Simulation models can be animated in 2D/3D, permitting concepts and ideas to be more easily verified, communicated, and understood.3 Analysts and engineers gain trust in a model by viewing it in action and can plainly demonstrate findings to management.

7.1.4 INSIGHT INTO DYNAMICS

In contrast to spreadsheet- or solver-based analysis, simulation modeling permits the observation of system behavior over the
long run, at any degree of detail, for instance, checking a distribution center's storage space usage on any given date.

7.1.5 INCREASED ACCURACY

A simulation model can capture a larger number of details than an analytical model, yielding increased accuracy and more precise forecasts. Mining organizations, for example, can significantly reduce costs by optimizing resource use and understanding their future hardware requirements.4

7.1.6 HANDLE UNCERTAINTY

Uncertainty in activity times and outcomes can be effectively represented in simulation approaches, thus permitting risk measurement and more robust solutions to be found. In logistics, a realistic picture can be built using simulation of unpredictable data, for example, shipment drive times.

7.2 FEATURES OF PROCESS SIMULATION

The desirable features of process simulation are given below:

7.2.1 MODEL BUILDING FEATURES

• Input-data analysis capability: Assess statistical distributions from raw data.
• Graphical model-building: Process flows, block diagrams, or network methods.
• Conditional routing: Route entities based on prescribed conditions or attributes.
• Simulation programming: Ability to express procedural logic with a high-level simulation language.
• Syntax: Easy, simple, consistent, unambiguous, English-like.
• Input flexibility: Accept data from outside files, databases, spreadsheets, or interactively.
• Modeling conciseness: Powerful actions, blocks, or nodes.
7.2.2 RUNTIME ENVIRONMENT

• Execution speed: Many runs are required for each scenario, as well as for replications.
• Model size (count of entities, variables, and attributes): Should be made in minutes.
• Interactive debugger: Monitor the simulation in depth as it progresses; able to break, trap, run until, and step; and to show status, attributes, and variables.
• Model status and calculation: Show the all-time simulation duration.
• Runtime license: Able to modify parameters and run a model (but not to modify logic or make a new model).
7.2.3 ANIMATION AND LAYOUT FEATURES

• Kind of animation: True to scale or iconic (for example, process flow diagrams).
• Import of drawing and object files: From CAD drawings (vector format) or icons (bit-map or raster graphics).
• Display step: Control of animation pace.
• Selectable objects: Changing status and statistics shown upon selection.
• Hardware requirements: Typical or specific video card and RAM requirements.
7.2.4 OUTPUT FEATURES

• Scenario management: Make user-defined scenarios for simulation.
• Run manager: Build all runs (scenarios and replications) and save outputs for future analysis.
• Warm-up ability: For steady-state analysis.
• Autonomous replication: Using a distinct set of random numbers.
• Optimization: Genetic algorithm.
• Standard report: Concise report including average, count, minimum, maximum, and others.
• Customized report: Tailor presentation for managers.
• Statistical analysis: Confidence intervals, designed experiments, and others.
• Business graphics: Bar charts, pie charts, time lines, and others.
• Cost module: To include activity-based costs.
• File transport: Input to spreadsheets or databases for custom processing.
• Database maintenance: Store output in an organized manner.
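The "autonomous replication" and "statistical analysis" features above work together: each replication is run with its own random-number stream, and a confidence interval is then formed across the replication outputs. The short sketch below illustrates the idea in Python; the replication function and all numbers are illustrative stand-ins, not part of any particular simulation package.

```python
import random
import statistics

def one_replication(seed):
    """Stand-in for a single simulation run; returns one output measure."""
    rng = random.Random(seed)                # distinct random-number stream per replication
    samples = [rng.expovariate(1 / 4.0) for _ in range(500)]
    return statistics.mean(samples)          # e.g., average time a customer spends in the system

outputs = [one_replication(seed) for seed in range(30)]   # 30 autonomous replications
mean = statistics.mean(outputs)
half_width = 1.96 * statistics.stdev(outputs) / len(outputs) ** 0.5   # normal-approximation 95% CI
print(f"estimate: {mean:.2f}, 95% CI: [{mean - half_width:.2f}, {mean + half_width:.2f}]")
```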
7.3 SIMULATION APPROACHES

Although simulation procedures vary depending on the size and complexity of the process, along with the software being used to perform the modeling, there are two primary kinds of process simulation. The first type is named steady-state (consistent-state) simulation, where an engineer can model different situations by "fiddling" with design parameters.5,6 This type of simulation is regularly performed during the conceptual phase of a project, with the goal of gaining a sound understanding of how an idea can be modified to get the most out of the process from both a business point of view and a design point of view. The second type of simulation is known as dynamic simulation, which differs from steady-state simulation in that it permits the operator to essentially execute a process (that has already been designed) under various conditions to see how it performs. The motive of such a simulation method is to ensure that the process will remain safe under stressful or unfamiliar conditions.

7.4 APPLICATIONS OF SIMULATION IN VARIOUS SECTORS

Some important application areas of simulation are as follows:

7.4.1 LOGISTIC SIMULATION

Optimize difficult and dynamic logistic processes with simulation.
7.4.2 SIMULATION IN PRODUCTION

Covers modeling of single manufacturing lines, from the layout of production resources and buffer extents to the simulation of whole manufacturing plants.7

7.4.3 EXACT PRODUCTION PLANNING

Gaining a sound starting plan while taking into consideration dynamic conditions, such as current availability of resources, inventory, filling levels, etc.

7.4.4 PLANNING OF MACHINE SCHEDULING

Enhancing machine capacity utilization by decreasing set-up times and preventing standby and waiting periods.

7.4.5 CONTROL STATION SIMULATION

Improving control strategies with the help of simulation.

7.4.6 PERSONNEL SIMULATION

Help for personnel resource planning.

7.4.7 SUPPLY CHAIN SIMULATION

To model and analyze supply networks.

7.5 SIMULATION SOFTWARES

Steady-state and dynamic plant simulations are powerful tools that assist professionals in making ideal process designs, analyzing plant activities, creating performance improvement systems, and monitoring and upgrading operations to a significantly higher level.
We are giving a rundown of process simulator packages with their main qualities.8–10 They are available for various industries, purposes, and scales under various business settings. While a few of them are very expensive, others are reasonably priced, and a few of them are completely free. In this way, there are no more excuses for not utilizing process simulation tools.

7.5.1 ASPEN PLUS

Developer: AspenTech

Main Features: Aspen Plus is quite possibly the most widely accepted process simulator in business and perhaps the most expensive one. It enables a large scope of computation options for the design, operation, and improvement of safe, productive manufacturing facilities. It enables the steady-state and dynamic simulation of chemical and pharmaceutical processes, including nonideal, electrolytic, and solid systems. Mixed solution methodologies can be utilized to accomplish fast computation and to provide full specification flexibility. It leverages modeling assumptions by scaling from a single model to a full facility flowsheet.

7.5.2 CHROMWORKS

Developer: YPSO Facto

Main Features: ChromWorks is a chromatographic process simulation software. It permits the routine utilization of trial data and the simulation of common unit segments as well as complex continuous multicolumn processes. It is a software package for the simulation of ion exchange processes that permits simulating very different circumstances, including amino acid purification, organic acid recovery, and hydrometallurgy. Based on demonstrated technical considerations and supplemented by a cost assessment module, this powerful and easy-to-understand simulation tool is intended to match the methodology and necessities of scientists and organic chemists.
7.5.3 CHEMCAD

Developer: Chemstations Inc.

Main Features: Chemical process simulation software that incorporates libraries of chemical components, thermodynamic methods, and unit operations to permit steady-state and dynamic simulation of continuous chemical processes from laboratory scale to full scale. Chemcad is a standard software for clients who need to design processes or rate existing processes in steady-state cases. Its dynamic process simulation software takes steady-state simulations to the next degree of fidelity to permit the dynamic analysis of flowsheets. The possibilities are wide-ranging: operability checking, PID loop tuning, operator training, even online process control and soft-sensor functionality, and it is suitable for clients who need to design or rate dynamic processes.

7.5.4 DESIGN II FOR WINDOWS

Developer: WinSim Inc.

Main Features: Design II performs complete heat and material balance calculations for a large assortment of pipeline and processing applications. The simulator's easy-to-build flowsheets permit specialists to focus on engineering, as opposed to computer tasks. A minimum amount of information is needed to use DESIGN II for Windows. WinSim's simulator highlights include, for example, sizing and rating of heat exchangers and separators inside the flowsheet. The DESIGN II for Windows database contains 1000+ pure components, and others can be added.

7.5.5 COCO

Developer: AmsterCHEM

Main Features: COCO is a CAPE-OPEN to CAPE-OPEN simulation environment with modules that have been given intriguing names, for example, COFE—the
CAPE-OPEN Flowsheet Environment—which is an intuitive GUI for chemical flowsheeting. COFE displays stream properties, manages unit conversions, and provides plotting facilities. TEA—COCO's Thermodynamics for Engineering Applications—is based on the code of the thermodynamics library of ChemSep and incorporates a data bank of more than 410 commonly used chemicals. This package offers in excess of 100 property calculation methods with their analytical or numerical derivatives.

7.5.6 EMSO

Developer: ALSOC Project

Main Features: EMSO is the abbreviation for Environment for Modeling, Simulation, and Optimization. The ALSOC project creates and maintains specifications of modeling languages appropriate for the synthesis, simulation, optimization, and process control of common processes. EMSO is completely written in C++; at present, it is available for Windows and Linux; however, it can be compiled for other platforms whenever necessary. It is an equation-oriented simulator and has an enormous set of built-in functions. Models are written in the modeling language, so the client need not be a software engineer/programmer. It supports both static and dynamic simulation. A graphical UI can be utilized for model development, simulation execution, and visualization of results.

7.5.7 HYDROFLO

Developer: Tahoe Design Software

Main Features: HYDROFLO determines the steady-state flows, pressures, and other operating parameters in single-source/unit-discharge, gravity, and pumped flow systems. Pumped systems can be closed-loop or open reservoir/tank systems, and practically any incompressible fluid system commonly found in the industrial process, fire
protection, chemical process, irrigation, and HVAC industries can be modeled.

7.5.8 ITHACA

Developer: Element Process Technology

Main Features: It is an easy dynamic process simulator for chemicals, mining, and minerals. The features of ITHACA include a graphical interface for building process flow diagrams, constant information about the degrees of freedom of the simulation (global) and per equipment item (local), integration with Microsoft Excel, output export in plain text, dynamic simulation export to MSO (to be utilized by the EMSO process simulator), updated thermodynamics libraries with the latest models, tools for oil assay modeling by means of pseudocomponents, a specific library for water/steam processes, and communication with data logging and operating systems by means of OPC.

7.6 VERIFICATION AND VALIDATION TESTING IN SIMULATION

One of the real concerns that the simulation expert faces is validating the model. The simulation design is true only if the design is a correct depiction of the actual system; otherwise it is not valid.11,12 As shown in Figure 7.1, validation and verification are the two parts of any simulation activity used to validate a model. Validation is the task of comparing two outputs: we need to compare the depiction of the conceptual design to the real system. If the comparison is correct, then the model is valid; otherwise it is not. Verification is the task of comparing two or more outputs to ensure their correctness: we have to compare the model's execution and its associated data with the developer's conceptual explanation and specification.

Verification and Validation Methods

There are different methods applied to do verification and validation of a simulation design.13–15 Following are some of the standard methods:
FIGURE 7.1 Verification and validation.

Techniques to Perform Verification of Simulation Model
The following are the actions to carry out the verification of a simulation model:
• By applying programming skills to write and debug the program in sub-programs.
• By applying the "structured walk-through" approach, in which more than one person reads the program.
• By tracing the intermediate outcomes and correlating them with observed results.
• By evaluating the simulation design output using different input combinations.
• By contrasting the final simulation output with analytic outputs.

Methods to Validate Simulation Model

Step 1: Build a model with high validity. This can be achieved using the following steps:
• The design must be discussed with the system professionals.
• The model must be communicated to the client throughout the task.
• The result must be supervised by system professionals.

Step 2: Test the design at assumption data. This can be achieved by applying the assumption data to the design and testing it properly.
Step 3: Evaluate how representative the result of the simulation design is. This can be achieved using the following steps:
• Calculate how near the simulation result is to the real system result.
• A comparison can be made using the Turing test: the data are presented in the system format, which can be judged by professionals only.
• A statistical approach can be applied to compare the design result with the real system result.

7.7 DISADVANTAGES OF SIMULATION

The following are the main disadvantages of process simulations:
• They always depend on the final judgment of the professional.
• They are never fully accurate.
• They are based on different designs, iterations, and numerical data, which can contain errors.
• Process simulations are not an achievement in themselves, but rather a tool used to reach another achievement.
• They may still report different risks.

7.8 CASE STUDY

The use of simulation is illustrated with a case study: Simulation Modeling for Efficient Customer Service.

This particular model may likewise be applicable to the broader issue of managing human and technical resources, where organizations normally look to bring down the expense of underutilized assets, technical specialists, or equipment, for instance.16,17 The task is to search for the ideal count of staff needed to deliver a predefined quality of service to clients coming to a bank. First, for the bank, the level of service was characterized as the average queue size. Significant system scopes were then chosen to fix the parameters of the simulation model—the count and frequency of customer arrivals, the time a teller takes to attend to a customer, and the natural variations that can happen in all of these, specifically lunch-hour rushes and complex requests.18
A flowchart corresponding to the structure and processes of the office was then made (Figure 7.2). Simulation plans need only focus on those variables that affect the issue being investigated. For instance, the availability of office services for corporate accounts, or the credit division, has no impact on those people, as they are physically and practically independent.
FIGURE 7.2 Flowchart of the structure and processes of the office.
Lastly, after feeding the model data, the simulation could be executed and its action visible over time, permitting refinement and analysis of the outputs (Figure 7.3). If the standard queue size exceeded the given limit, the count of accessible staff was expanded and a new analysis was done. It is attainable for this to happen automatically until the best result is obtained.
FIGURE 7.3 Process simulation.
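To make the case study concrete, the sketch below shows how such a bank-teller model might look as a small event-driven simulation in Python, increasing the staff count until the average queue length falls below a service-level limit. The arrival rate, service rate, working horizon, and the queue limit are all illustrative assumptions, not figures from the study described above.

```python
import heapq
import random

def simulate_bank(num_tellers, arrival_rate=1.0, service_rate=0.4, horizon=480.0):
    """Event-driven sketch: customers arrive at `arrival_rate` per minute and each
    busy teller finishes at `service_rate` per minute; returns the average queue length."""
    rng = random.Random(0)
    events = [(rng.expovariate(arrival_rate), "arrival")]    # (time, kind) event list
    queue, busy = 0, 0
    area, last, now = 0.0, 0.0, 0.0                          # time-integral of queue length
    while events and now < horizon:
        now, kind = heapq.heappop(events)
        area += queue * (now - last)                         # accumulate queue-length area
        last = now
        if kind == "arrival":
            heapq.heappush(events, (now + rng.expovariate(arrival_rate), "arrival"))
            if busy < num_tellers:
                busy += 1                                    # a free teller starts service
                heapq.heappush(events, (now + rng.expovariate(service_rate), "departure"))
            else:
                queue += 1                                   # all tellers busy: join the queue
        else:                                                # departure: serve the next customer, if any
            if queue > 0:
                queue -= 1
                heapq.heappush(events, (now + rng.expovariate(service_rate), "departure"))
            else:
                busy -= 1
    return area / last if last > 0 else 0.0

# Expand the staff until the average queue stays under the quality-of-service limit
for tellers in range(1, 8):
    avg_queue = simulate_bank(tellers)
    print(f"{tellers} tellers -> average queue {avg_queue:.2f}")
    if avg_queue < 2.0:                                      # illustrative service-level limit
        break
```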
Generally, different scenarios may be investigated rapidly by varying parameters. They can be reviewed and questioned while running and compared against one another. The results of the modeling and simulation, hence, give confidence and clarity to analysts, engineers, and managers alike.

KEYWORDS
• simulation
• environment
• visualization
• validation
• test
REFERENCES

1. Choi, K.; Bae, D.-H.; Kim, T. An Approach to a Hybrid Software Process Simulation Using the Devs Formalism. Softw. Process 2006, 11 (4), 373–383.
2. Kitchenham, B. Guidelines for Performing Systematic Literature Reviews in Software Engineering. Technical Report, Keele University, 2007.
3. Zhang, H. Qualitative and Semi-Quantitative Modelling and Simulation of Software Engineering Processes; PhD thesis, University of New South Wales, 2008.
4. Zhang, H.; Jeffery, R.; Zhu, L. Hybrid Modeling of Test-and-Fix Processes in Incremental Development. In International Conference on Software Process; Springer: Leipzig, Germany, 2008.
5. Zhang, H.; Kitchenham, B.; Pfahl, D. Reflections on 10 Years of Software Process Simulation Modelling: A Systematic Review. In International Conference on Software Process; Springer: Leipzig, Germany, 2008.
6. Banks, J. et al. Discrete-Event System Simulation; Pearson Prentice Hall, 2005.
7. Hexmoor, H.; Venkata, S. G.; Hayes, D. Modelling Social Norms in Multiagent Systems. J. Exp. Theor. Artif. Intell. March 2006, 18 (1).
8. Pach, F. P.; Gyenesei, A.; Arva, P.; Abonyi, J. Fuzzy Association Rule Mining for Model Structure Identification. 10th Online World Conference on Soft Computing in Industrial Applications, 2005.
9. Müller, M.; Pfahl, D. Simulation Methods. Guide to Advanced Empirical Software Engineering, Section I; Springer-Verlag: New York, 2008; pp 117–152.
10. Monteiro, V.; Araújo, M. A.; Travassos, G. H. Towards a Model to Support In Silico Studies Regarding Software Evolution. ESEM 2012, Sept 2012.
11. Stopford, B.; Counsell, S. A Framework for the Simulation of Structural Software Evolution. ACM Trans. Model. Comput. Simulation 2008, 18.
12. Ferreira, S.; Collofello, J.; Shunk, D.; Mackulak, G. Understanding the Effects of Requirements Volatility in Software Engineering by Using Analytical Modeling and Software Process Simulation. J. Syst. Softw. 2009, 82, 1568–1577.
13. Ambrosio, B. G.; Braga, J. L.; Resende-Filho, M. A. Modeling and Scenario Simulation for Decision Support in Management of Requirements Activities in Software Projects. J. Softw. Maintenance Evol. 2011, 23 (1), 35–50.
14. Lopes, V. P. Repositório de Conhecimento de um Ambiente de Apoio a Experimentação em Engenharia de Software. Dissertação de Mestrado. PESC-COPPE/UFRJ, 2010.
15. França, B. B. N.; Travassos, G. H. Reporting Guidelines for Simulation-Based Studies in Software Engineering. In 16th International Conference on Evaluation & Assessment in Software Engineering—EASE 2012; Ciudad Real, Spain, May 2012; pp 156–160.
16. Lauer, C.; German, R.; Pollmer, J. Discrete Event Simulation and Analysis of Timing Problems in Automotive Embedded Systems. In IEEE International Systems Conference Proceedings, SysCon 2010; San Diego, CA, 2010; pp 18–22.
17. Tokoz, E. Industrial Engineering and Simulation Experience Using Flexsim Software. Comput. Educ. J. Dec 2017, 8 (4).
18. Akshay, D.; Shahare, A. S. Productivity Improvement by Optimum Utilization of Plant Layout: A Case Study. 2017, 04 (06).
CHAPTER 8
Need for Deep Learning

BOBBY SINGH, NIKITA GUPTA, SACHIN SHARMA, and ANUPAMA CHADHA

Manav Rachna International Institute of Research and Studies, Faridabad, India
ABSTRACT

In this chapter, the basics of deep learning, the working of deep learning, a comparison of machine learning with deep learning, and deep learning models are discussed.

8.1 INTRODUCTION

Deep learning is another form of machine learning that helps a computer learn what to do next just as a human brain makes decisions naturally for any particular situation, that is, by learning from examples. Deep learning models are trained on large collections of images, text, or sound and use many layers of neural network architecture to tackle particular situations or problems. This large amount of labeled data, processed through the neural network architectures, is called training data. These models and training data help the computer to understand and visualize the real world just as humans do.
FIGURE 8.1 Concept of deep learning.

Source: Reprinted from https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/03/AI-vs-ML-vs-Deep-Learning.png.
In artificial neural networks, the early layers of the network gradually learn to detect low-level features, and the subsequent layers build on them to represent the complete picture. Sometimes deep learning even achieves a level of accuracy that exceeds human-level performance. Many applications of artificial intelligence (AI) are coming into existence thanks to deep learning. The idea of creating an artificial neural network that helps to simulate the working of the human brain has been present in the field of AI since the 1950s. It is true that the training process takes considerable time, because the network has to be trained with a large amount of data. The performance improves continuously as more data are fed; the more the system gets trained, the more accurate the results we get.

8.2 WORKING OF DEEP LEARNING

The evolution of deep learning and the digital era have come to light together, as the world generates an exponentially increasing amount of data of all forms, such as images, text, sound, and video, year by year. Such a massive amount of data gets collected from different sources.
As we know, deep learning is the concept of creating an artificial brain having billions of neurons, each connected to thousands of its neighbors, in a form that a machine can work with.
FIGURE 8.2 Organization of neural networks.

Source: Reprinted from https://miro.medium.com/v2/resize:fit:1582/1*l5lnSeoDD63iteO3c4vbg.png.
In fact, we only have a rough idea of how deep learning works, as no one can really program a PC to do these things perfectly. A tremendous amount of data is fed into these neural nets, and the algorithms work out how to recognize various patterns. Just as our brains have neurons, and a signal coming from one neuron travels downstream into the dendrites of the next (the connection over which the signal passes is known as a synapse), a single neuron is of little use on its own, while many of them working together form a huge network. This is the idea behind the deep learning technique: you take the input from observation and place it in one layer; that layer creates an output, which in turn becomes the input for the subsequent layer, and so on, until you get the final score. In deep learning, the output of the previous layer becomes the input for each successive layer in the network, and this is how the layers learn to transform their input data into an increasingly abstract and accurate representation.
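The layer-by-layer flow just described can be seen in a few lines of NumPy. The layer sizes, the random weights, and the ReLU activation below are arbitrary choices made only for illustration; a trained network would instead use weights learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Three fully connected layers: each layer's output becomes the next layer's input
layer_sizes = [4, 8, 6, 2]                          # input, two hidden layers, output
weights = [rng.normal(size=(m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

activation = rng.normal(size=(1, 4))                # one input example with 4 features
for w, b in zip(weights, biases):
    activation = relu(activation @ w + b)           # output of this layer feeds the next

print(activation)                                   # final "score" produced by the last layer
```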
8.3 COMPARISON BETWEEN MACHINE LEARNING AND DEEP LEARNING

While defining the terms deep learning and machine learning, we can differentiate them in basic terms like a connected chain: deep learning is a division of machine learning, while machine learning is a division of AI. These terms often get interchanged when discussing the concept of AI learning because they have similar functions yet differ in their capabilities. Let us discuss the differences in detail on the basis of the given parameters.

8.3.1 DATA CONDITIONS

The most important difference between deep learning and traditional machine learning is the size of the data. Deep learning algorithms need a lot of data to understand a problem well, whereas conventional machine learning algorithms can work with hand-crafted rules on smaller data sets.

8.3.2 HARDWARE CONDITIONS

A deep learning algorithm typically requires high-end machines, whereas conventional machine learning works on low-end machines, because deep learning algorithms inherently perform a large number of matrix multiplication operations.

8.3.3 FEATURE DESIGNING

Feature engineering is the process of putting domain knowledge into feature extractors to reduce the complexity of the data. This process is difficult and costly in terms of time and skill. For instance, features can be pixel values, shapes, textures, position, and orientation.

8.3.4 PROBLEM-SOLVING APPROACH

When using a standard machine learning approach, it is generally recommended to break the problem into different parts, solve them individually, and combine the results, whereas deep learning is better suited to solving the problem end to end.
FIGURE 8.3 Difference between the architecture of deep learning and machine learning.

Source: Reprinted from https://www.edge-ai-vision.com/wp-content/uploads/2019/05/MathWorksFigure3.png.
In a typical machine learning approach, you would partition the problem into two stages, object detection and object recognition. To begin with, you would use a bounding-box detection algorithm to discover all the potential objects.

8.3.5 EXECUTION TIME

Normally, a deep learning algorithm needs a long time to train. This is because there are a very large number of parameters in a deep learning algorithm; a state-of-the-art deep learning model such as ResNet takes around 14 days to train from scratch. At inference time, however, the deep learning algorithm needs substantially less time to run, while on the other hand, the test time of k-nearest neighbors (a kind of machine learning algorithm) increases as the size of the data grows. Although this does not apply to all machine learning algorithms, some of them have short testing times as well.

8.3.6 INTERPRETABILITY

Interpretability is an important factor when comparing machine learning and deep learning, and it is a basic consideration for how deep learning is used in industry.

8.4 MODELS OF DEEP LEARNING

8.4.1 LEARNING NEURAL NETWORKS AND THEIR PARAMETERS

8.4.1.1 AUTOENCODERS

Autoencoders are an unsupervised learning technique in which we use a neural network for the task of representation learning.
Autoencoders are a family of neural networks for which the input is the same as the output. They compress the input into a lower-dimensional latent representation and then reconstruct the output from this representation; the term autoencoder refers to this compression and decompression being implemented with a neural network.

8.4.1.2 DEEP BELIEF NET

Deep belief networks are a class of neural network whose algorithm is modeled after the human brain. A deep belief network has a unique structure because it has large, hidden structure between the input and output layers. It can be seen as a stack of restricted Boltzmann machines, where each subnetwork's hidden layer serves as the visible input layer for the adjacent layer of the network. The hidden variables are used as the observed variables to train each layer of the deep structure, so each layer of the network is trained independently and greedily. Having more layers means having more pathways through which information travels in the network, allowing the network to perform highly complex tasks such as high-speed image and video analysis.

8.4.1.3 CONVOLUTION NEURAL NETWORKS

A convolution neural network is a type of feed-forward multilayer network in which each neuron responds to overlapping regions within the visual field. A deep convolutional neural network works by repeatedly combining information from small parts of the input and aggregating it deeper in the network. In simple terms, the first layer identifies edges and draws an edge template, the next layers combine the edges and draw different object-part templates, the following layer tries to match the input with all the patterns, and the final prediction is a weighted combination of all of them, like putting the missing puzzle pieces together to complete the final object template.
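As an illustration of this layered structure, the sketch below defines a small convolutional network using the Keras API in Python. The input shape, the number of filters, and the ten output classes are assumptions chosen only for the example.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Early layers detect edges; deeper layers combine them into object-like templates
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),               # e.g., small grayscale images
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),        # final prediction over 10 classes
])
model.summary()
```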
8.4.1.4 RECURRENT NEURAL NETWORKS

A convolution neural network works on a fixed-size input and produces a fixed-size vector as output, but in a recurrent network, the connections between layers form a directed sequence. The steps of a recurrent neural network are not independent; they depend on each other. In other words, the input and output both depend on the previous steps, we cannot produce the result in the absence of these parameters, and the same parameters are shared at every step. A variation called the bidirectional recurrent neural network is utilized in numerous applications; in a bidirectional network, training is done in two directions. In both bidirectional and straightforward recurrent neural networks, deep learning can be accomplished by introducing multiple hidden layers. Speech, image, and natural language processing are the fields where we use recurrent neural networks.

8.4.1.5 REINFORCEMENT LEARNING TO NEURAL NETWORK

Reinforcement learning is learning to interact with the surroundings and obtain the best result based on a trial-and-error method. In this learning, there is a reward for a correct or wrong answer and, accordingly, the model trains itself. Once trained, it is prepared to make predictions on new data. Some popular algorithms used in reinforcement learning are Q-learning, SARSA, and many more. Some examples of reinforcement learning problems are a chessboard, a building, or position and speed on a racetrack. Deep reinforcement learning is a stimulating and demanding area and will undeniably be a crucial part of the upcoming AI landscape.

8.4.2 DESIGN AND TRAINING OF DEEP LEARNING MODELS

In deep learning, models are built using neural networks that contain a huge number of processing units working in parallel. An artificial neural network is modeled using hidden layers, each of which receives input and uses an activation function with a threshold to determine whether the message is passed along. In the deep learning model, there are multiple layers, where the first one is called the input layer and the last one is called the output layer.
These layers contain either one or multiple neurons. The first tier receives the input information; in the same way, each succeeding tier gets the output from the prior one. An artificial neural network is a group of neurons that are connected to each other. A neuron receives its input from predecessor neurons. The network consists of different connections; every neuron passes an output that ultimately becomes the input for the next neuron, and each connection is assigned a particular weight. There are functions through which we can determine the output of a neuron.

8.4.2.1 NEURAL NETWORK LEARNS FROM THE TRAINED DATA

A neural network is given a large amount of data in the preliminary stages; in general, we provide the input in the form of training data elements and obtain the output. For example, facial detection is a recent technique implemented by a number of mobile phones, where the input is analyzed on the basis of some training data sets. To train the neurons, we have to provide the data in a form the network can use, and the network builds up its internal representation for a better result. Some rules are defined through which it is decided what should be sent to the next layer, taking into consideration the input from the prior one.

8.4.3 ACCELERATING DEEP LEARNING MODELS

Deep learning has enabled the rapid development of various technologies such as speech recognition, object detection, and many more. Deep learning rapidly scans a large amount of data so that it can find subtle connections in the information. Machine learning follows numerous algorithms so that the system can learn from data and improve its performance based upon real experience. In a similar way, deep learning is the division of machine learning that trains multilayer neural networks and uses inference for learning from new data. In deep learning, the neural network learns from its initial training and runs subsequently. The main purpose of a deep learning model is to solve a problem in the same manner a human brain would. These networks have been applied in various application areas.
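As a concrete illustration of this design-and-training flow, the sketch below builds and trains a tiny fully connected network with the Keras API in Python. The synthetic data set, the layer sizes, and the training settings are assumptions made only for the example.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic training data: 1000 examples with 20 features and a binary label
x_train = np.random.rand(1000, 20).astype("float32")
y_train = (x_train.sum(axis=1) > 10).astype("float32")

model = keras.Sequential([
    layers.Input(shape=(20,)),                     # input layer
    layers.Dense(32, activation="relu"),           # hidden tier receives the prior tier's output
    layers.Dense(1, activation="sigmoid"),         # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Each pass over the training data adjusts the connection weights to reduce the loss
model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(x_train, y_train, verbose=0))
```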
8.5 NEURAL NETWORKS AND DEEP LEARNING

8.5.1 NEURAL NETWORK ELEMENTS

To properly understand neural networks and their elements, we need to take a look at the working and structure of actual neurons present in the human body and brain, because artificial neural networks attempt to replicate the functioning of the human brain. Neurons have an axon, dendrites, and a cell body. The nucleus is present in the cell body. The axon extends from the cell; dendrites extend from the neuron cell body and collect messages, in the form of signals, from other neurons. Synapses are the joining points between two neurons through which they communicate with each other. The dendrites are surrounded by synapses formed by the ends of axons from other neurons. All the neurons are connected to one another and form tissue. There is a tiny gap between neurons, so they do not touch each other; this gap is part of the synapse. Across these gaps there can be electrical synapses or chemical synapses, and the chemicals help in passing the message signal from one neuron to another.

Artificial neural networks fundamentally consist of various algorithms, designed after the way the human brain works, to recognize patterns. These networks work on sensory data and recognize patterns in the form of images, sound, text, or time series. The first job of a neural network is to classify and cluster the input we give and store that data for further processing according to the algorithms working on it; this can be anything, such as extracting some specific information. The elements of neural networks are as follows:

a) Processing elements
The processing elements are made up of two units: the summing unit followed by the output unit. The main function of the summing unit is to take n input values and compute their weighted sum. The activation value is basically this weighted sum, and the output is produced on the basis of this activation value.
Both the inputs and the outputs can be either discrete or continuous, and they can be either fuzzy or deterministic.

b) Topology
This is an essential step to organize all the processing elements in an appropriate manner so that they can accomplish the task of pattern recognition. Basically, topology simply means the arrangement or organization of the inputs, outputs, processing elements, and their interconnections. Normally, in a neural network, all the processing units are organized into particular layers. Each layer shares the same output values and activation values. The layers are associated with each other in various ways, such as a processing unit of one layer being connected to a unit of another layer, and so on. The most commonly used neural network topologies are the following:
• Group of outstar
• Bidirectional associative memory
• Autoassociative memory
• Instar
• Outstar
• Group of instar
c) Learning algorithm
This is the last element of a neural network. This element is very important because all the operations in any neural network are processed or governed by the neural dynamics, which consist of the activation state dynamics as well as the synaptic weight dynamics. These learning laws can be unsupervised, supervised, or a hybrid of both. Some of the commonly used learning algorithms are as follows:
• Correlation learning law
• Instar learning law
• Hebb's law
• Perceptron learning law
• Widrow and Hoff LMS learning law
• Delta learning law
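To tie the three elements together, the short sketch below applies the perceptron learning law from the list above to a single processing element, that is, a summing unit followed by a threshold output unit. The toy AND-gate data, the learning rate, and the number of passes are illustrative assumptions.

```python
import numpy as np

# Toy data: learn the logical AND of two binary inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

weights = np.zeros(2)
bias = 0.0
rate = 0.1

for _ in range(20):                                # repeat until the weights settle
    for x, target in zip(X, y):
        summed = weights @ x + bias                # summing unit: weighted sum of the inputs
        output = 1.0 if summed > 0 else 0.0        # output unit: threshold activation
        error = target - output
        weights += rate * error * x                # perceptron learning law: adjust the weights
        bias += rate * error

print(weights, bias)
```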
8.5.2 CONCEPTS OF DEEP NEURAL NETWORKS

The concept behind this network is multiple layers of a neural network, one after another, which process the data set, automatically filter the data, and give us the desired output or result, because all the data pass through multiple steps of pattern recognition. In a deep neural network, each layer has its own functionality and features based on the layer before it. The further we go into the network, the more complex and difficult it becomes to understand, as the layers aggregate and recombine features from the previous layer again and again; this is known as the feature hierarchy. This capability of the neural network helps in handling large amounts of data with millions of parameters that pass through nonlinear functions. The feature hierarchy is a hierarchy of increasing abstraction and complexity.

The majority of data in the world is unstructured or unlabeled, but deep learning is capable of processing this unstructured data. Unstructured data means raw data in the form of pictures, text, video recordings, and audio recordings. The major problem deep learning solves is the processing of raw data, unlabeled media, anomalies, and so on, that no one in the world has structured in a relational database. As an example, a deep neural network can process up to millions of images and, as a result, gives us clusters of different kinds of data containing similar images, for example, dogs in one place, all images of your mother in another place, and all images of different places in a third. From this example, we can understand the basis of smart photo albums. A deep neural network applies this same processing technique to other data types as well, such as creating different kinds of clusters from raw text (unlabeled or unstructured). News blogs and e-mails come under this category of raw text. Different kinds of clusters can be made according to the data provided to the deep neural network: news blogs about disasters can go into one cluster, news blogs about terrorist activities into a different cluster, and e-mails containing complaints into yet another category. Deep neural networks are also very useful in customer relationship management (CRM). If we apply these different processing techniques to the e-mails, voice mails, and messages regarding a customer, we are able to filter them by making different clusters.
This results in a better understanding of the customer, such as their behavior, likes, dislikes, and interests; this will change the way of marketing in the upcoming years. As we can see nowadays, most of the large companies have started focusing more on CRM through this technique of deep neural networks so that they can grow more or influence the market with their products.

Deep neural networks also perform the feature extraction process automatically, without any human intervention, unlike other machine learning algorithms. In this extraction process, data from different kinds of data sets are handled on the basic principle of neural networks, which is to learn from the results or optimal solutions given by previous layers, by recognizing the patterns in these results and also making connections between the inputs and outputs of the different layers of the deep neural network.

8.6 LANGUAGES FOR DEEP LEARNING

8.6.1 DEEP LEARNING AND MATLAB

Deep learning is an AI function that tries to copy, or perform the same functions as, the human brain for the processing of different kinds of data. Translating languages, detecting objects, making decisions, and speech recognition are some functions that we can perform with the help of deep learning. Without any kind of directions or support from humans, deep learning can extract information from any kind of data set, whether it is unlabeled or unstructured. Deep learning is also helpful in increasing security because it is able to detect money laundering or fraud, among other functions. Some examples where deep learning is the technology behind them are voice control in most devices like wireless speakers and mobiles; it is also used in driverless cars, enabling them to recognize signals while the driverless car is moving on the streets. MATLAB is a programming language; many deep learning algorithms are written in this language, as it makes deep learning understandable and easy. This language also has different kinds of toolboxes that are easily compatible for working with neural networks, machine learning, computer vision, and automated driving.
The toolboxes of this language help in easily managing a big collection of data, or in other words, large data sets. This language is so easy that we can do deep learning without being an expert, and within just a few lines of code, we can visualize models and get them running on servers. Some good features of MATLAB for deep learning are as follows.

8.6.1.1 VISUALIZING AND CREATING MODELS EASILY

This language allows us to write a minimum amount of code for making deep learning models. Pretrained models and data can be imported through this language, and we can perform operations such as visualization and debugging. We can also adjust their training parameters.

8.6.1.2 PERFORMING DEEP LEARNING QUICKLY

Many of us do not have knowledge of deep learning before entering this field and learn about deep learning while on the job. With this language, it is easy for us to learn about this area, visualize it, and also try it practically.

8.6.1.3 AUTOMATING GROUND TRUTH LABELING OF VIDEOS AND IMAGES

Automated ground truth labeling of videos, and interactively labeling objects within images, can lead us to better results in a short time period with the help of this language.

8.6.1.4 USING DEEP LEARNING IN A SINGLE WORKFLOW

This language offers a number of toolboxes and functions for deep learning so that we can think and program in a single environment, providing a range of domains for using deep learning algorithms such as data analysis, signal processing, and computer vision.
8.6.2 DEEP LEARNING AND PYTHON

Python is a commonly used high-level programming language; it is mainly used in the area of data science and also for the purpose of writing or creating deep learning algorithms. In this language, we have a large number of libraries and frameworks for solving complex problems. Some popular and widely used libraries are NumPy, SciPy, pandas, and matplotlib (graph plotting), along with many others. There are also some highly used frameworks, such as Theano, TensorFlow, and Keras; this language has huge community support, and with this language, it is easy to solve analytical and quantitative computing problems. Python is dynamically typed but very easy to understand. Some popular libraries used in Python for deep learning are the following.

8.6.2.1 NUMPY

This stands for numerical Python, which provides various operations on n-dimensional arrays in Python.

8.6.2.2 SEABORN

It provides visual representation of statistical models.

8.6.2.3 MATPLOTLIB

It is used to draw charts, histograms, and plots.

8.6.2.4 PANDAS

It is used for data wrangling, data manipulation, aggregation, and visualization.

8.6.2.5 SCIKIT-LEARN

It is an open-source library, built on NumPy, that we can use for data analysis and machine learning.
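The short sketch below combines several of the libraries listed above—NumPy, pandas, and scikit-learn—in a single small workflow. The synthetic data set, the column names, and the choice of logistic regression are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Build a small synthetic data set with NumPy and pandas
rng = np.random.default_rng(0)
frame = pd.DataFrame({"feature_a": rng.normal(size=200),
                      "feature_b": rng.normal(size=200)})
frame["label"] = (frame["feature_a"] + frame["feature_b"] > 0).astype(int)

# Fit and evaluate a scikit-learn model on a train/test split
x_train, x_test, y_train, y_test = train_test_split(
    frame[["feature_a", "feature_b"]], frame["label"], test_size=0.25, random_state=0)
model = LogisticRegression().fit(x_train, y_train)
print("test accuracy:", model.score(x_test, y_test))
```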
8.7 DEEP LEARNING APPLICATIONS

While talking about applications, we can say that many AI applications have come into existence in this high-profile digital life, all thanks to deep learning. So, what can be done by a system with deep learning? Let us learn more about these applications, as mentioned below.

8.7.1 IN THE FIELD OF ROBOTICS

• Self-supervised learning: A method where a computer gets trained without any specific instructions from humans (like providing the image of an apple together with the label "apple") and is able to identify situations, problems, or objects on its own and draw conclusions about them.
• Multi-agent learning: A method of learning among different agents to improve the capability of making decisions through different experiences and ways of communicating and coordinating with other agents.

8.7.2 IN THE FIELD OF RECOGNITION

• Voice recognition: A method that is able to differentiate audio on the basis of measuring the level of pitch, tone, and noise included in a recording, for example, voice passwords.
• Speech recognition: A process that breaks down audio, which can be in any language, into small pieces and tries to identify the best-fit word for that particular recording, for example, Siri and Alexa.
• Handwriting recognition: A method where software receives input from an image or a note, recognizes the handwriting, and extracts the text from that note, for example, Google handwriting input.
• Facial recognition: Here, biometric mapping is used to match facial features from any photograph or video and find the related information in a database, although it sometimes raises privacy issues.
• Object detection and recognition: A method where a computer learns to answer questions like "what is the object in the image?", which is called recognition of the object, whereas detecting the object determines where it is, what size it is, and at what angle it is visible in the given picture or video, for example, Google Lens.

8.7.3 IN THE FIELD OF FASHION TECHNOLOGY

• Apparel designing: As you already know, trending fashion styles and brands are utilizing innovation to understand client needs and design better clothing, for example, Zalando (a German company partnered with Google) making great sales by offering the customer's preferred design, color, and texture on demand.
• Manufacturing: It helps to identify current trends and make accessories in seasonal demand, for example, for top brands like H&M and Zara.

8.7.4 IN THE FIELD OF MEDICATIONS

• Disease diagnosis: In this process, a large number of medical records is provided to a system, which is then able to identify both old and new diseases, for example, Google DeepMind Health developing technology for macular degeneration in aging eyes.
• Personalized treatments: It helps to detect a set of diagnoses and notifies patient risks according to the patient history records without compromising accuracy, for example, IBM Watson Oncology providing cancer treatments.

8.7.5 IN THE FIELD OF GAMING

This allows a system to imitate human strategies for playing games. Different game systems were at first trained on rounds played by human players and afterward were further trained by playing games against themselves, for example, AlphaGo, DeepChess (super-human performance level), StarCraft, and many car racing games.
8.7.6 IN THE FIELD OF SECURITY AND SURVEILLANCE

Yes, it is true that this area is still at a nascent stage, but the fact that it has a great future ahead cannot be neglected, for example, GitHub and BriefCam.

8.7.7 IN THE FIELD OF SALES AND MARKETING

Deep learning is a game changer in the field of sales and marketing, as it helps in content generation, real-time bidding on advertising systems, chatbots, speech recognition, and natural language processing.

8.7.8 ADDING COLORS TO BLACK AND WHITE IMAGES

This is a method that helps restore the color of images that are black and white, faded, or damaged by time, and renews them without compromising on accuracy, for example, the Algorithmia API.

8.7.9 ADDING SOUNDS TO SILENT VIDEOS

This is a method in which a system tries to learn how every sound differs when striking various surfaces and having different pitches; for example, some MIT researchers have produced algorithms that can actually give the appropriate sound to a silent video.

8.8 FUTURE OF DEEP LEARNING

By the end of the decade, deep learning will adopt various tools; current deep learning has a limited set of tools, most of which are open source. The tools that deep learning will adopt in the future include TensorFlow, BigDL, OpenDeep, and many more, and it would appear that the community is leaning toward support for TensorFlow. Much deep learning development depends on Spark and Hadoop, which are data analytics platforms. Hadoop is basically used for analyzing data; it stores the data in a data warehouse and is used for finding the hidden structures of the data.
On the other hand, Spark is the platform used for scaling and accelerating deep learning algorithms, and deep learning cannot deal with these algorithms without the data analytics capability provided by these platforms. Deep learning uses various technologies to train networks, such as hyperparameter optimization, fast in-memory processing, data cleaning, and data processing. Apart from these, work is being done on compressing the size of neural networks. Due to the size of the networks, it has become difficult to build mobile apps that use multiple networks, and loading a network into RAM requires extra memory and time. To reduce these complexities, work is being done on deep compression, which is similar to JPEG compression. Deep compression is able to reduce the memory needed for loading the various networks into memory without losing accuracy. Most services have moved their solutions toward a self-service, cloud-based delivery model. Five to ten deep learning tools and languages will become a standard component of every software development toolkit. Deep learning will adopt integrated, cloud-based environments that provide access to a wide range of pluggable algorithms.

KEYWORDS
reinforcement big data spark NumPy python
CHAPTER 9
Computational Intelligence for Big Data Analysis ANU MANCHANDA Department of MCA, CMR Institute of Technology, Bengaluru, India
ABSTRACT

In this chapter, the concept of big data is discussed along with its framework. Various challenges and security issues involved in managing such data are also described.

9.1 NEED FOR BIG DATA

1. Big Data: What Is It?

At first glance, the term 'Big data' appears to refer simply to data that is huge, where we consider only the volume and not the potential value that could be extracted from it. The reality is entirely different: the term does not convey the real picture of Big data and is, in that sense, a misnomer. Big data is often explained as data sets that have grown so large that they can no longer be handled by the tools we have been using to date. Data has always existed and has grown steadily with the advent of new technologies. Today we live in a machine-driven world, where data keeps emerging frequently and rapidly from heterogeneous sources such as IoT devices, social media, and the internet. Considerable variation occurs in the format of the data because it arrives from multiple sources, and the data itself keeps changing rapidly. Berman1 suggested several mechanisms through which Big data comes into existence:
1. Data collected as part of normal routine activities is normally used to improve those activities based on the information gathered, rather than to observe something new; Big data is then required to reorganize the routine activities.
2. Data that has already been gathered is now required to support many new activities.
3. A decision is made to build a model based on Big data resources.
4. Group entities share their data sources.
5. Organizations collect data and utilize it for the benefit of themselves and their clients, although high skill levels are required for implementation.
6. Big data resources are built for the benefit of other organizations for the purpose of discoveries.

Big data is characterized by three V's, discussed below:

Volume: Deals with voluminous data.
Velocity: Data varies frequently and must be acted upon quickly.
Variety: Data is generated rapidly and arrives in various formats such as images, text, video, and audio.

These three V's complete the definition of Big data. According to the definition given by Gartner,2 Big data is data of high volume, high velocity, and wide variety whose processing can result in better decision making.

2. How Does It Differ from Traditional Data?

Earlier, organizations used to process their data using Enterprise Resource Planning and Customer Relationship Management tools to perform their day-to-day data analysis and transactions. Data captured from different sources was structured and stored in databases. Data maintained in this manner was suitable only for transaction processing, not for query processing. For that purpose, data was moved to data warehouses, where it was utilized for different kinds of reporting using consolidation tools called ETL (Extract, Transform, and Load) tools. These tools were used to transform and move data to construct data warehouses. Due to the rapid increase in the amount of data generated and its unstructured form, it cannot be processed using traditional data processing tools. So, other technologies
are needed that have the computational intelligence to work on such data. Many tools now exist to process Big data and fetch relevant information from it.

3. Importance of Big Data

Big data helps improve the way information is collected and utilized by organizations so that they can make decisions efficiently.3 It is ubiquitous in nature, and the benefit of Big data technologies lies in the fact that they can work on unstructured data and process huge volumes of data. There are many sectors in which businesses can take advantage of Big data and its computational intelligence. It helps in knowing customer preferences and buying behavior and in predicting future buying trends. Based upon Big data analysis, products can be recommended to customers. Social media data analysis can also help organizations reach a large community of potential customers. Companies can introduce modified products and even plan the locations of their stores and outlets by knowing the buying preferences and trends of their customers. The computational power of Big data may let organizations identify risks beforehand and take preventive measures before something disastrous happens. There are many other sectors, such as healthcare, education, banking, and automobiles, where the computational intelligence of Big data may bring revolution.

9.2 BIG DATA MANAGEMENT

Big data consists of large amounts of data in varied forms. Traditional processing systems were not computationally efficient enough to extract meaningful information or to handle the variety and volume of such data, so a set of tools and technologies was required to overcome these challenges.

9.2.1 EVOLUTION OF APACHE HADOOP

Apache Hadoop was developed to let companies manage and process large volumes of data conveniently. It is an ecosystem of various tools and technologies to store and analyze Big data. Apache Hadoop is an open-source framework developed in Java and can handle huge sets of unstructured data.
The Hadoop project was created by Doug Cutting in 2005. Its first version was launched in 2006, and Hadoop 1.0 was made available to the public in 2012. Its revised second version, Hadoop 2.3.0, was launched and made available to the public in 2014 by the Apache Software Foundation.

9.2.2 FRAMEWORK OF APACHE HADOOP

Hadoop provides a distributed file system (HDFS) for distributed storage and utilizes the MapReduce programming model for processing. HDFS is capable of handling voluminous data in structured and unstructured forms. Large files are partitioned into blocks and are distributed across commodity servers grouped into clusters. Servers can be added to or removed from the clusters as needed. Hadoop can recognize server failures and cope with them automatically. Its basic framework has the following four main components:

• Hadoop Common
• Hadoop Distributed File System
• Hadoop YARN
• Hadoop MapReduce

Hadoop Common: It contains libraries and common utilities to manage the other components. Filesystem, RPC, and serialization are the libraries of Hadoop Common. Its main function is to manage failures in Hadoop clusters.

HDFS: It stands for Hadoop Distributed File System. Data is stored in the form of blocks across the cluster, and each block of data is replicated many times.

YARN: It stands for Yet Another Resource Negotiator. It allocates resources to applications.

MapReduce: The MapReduce programming model has two functions—the map function creates key–value pairs, and in the reduce phase the final output is produced using these key–value pairs.
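To make the key–value flow of this model concrete, the following short sketch expresses a word count in plain Python. It is illustrative only and does not use Hadoop itself; the names map_phase and reduce_phase are introduced here purely for the example.

# A minimal, framework-free sketch of the MapReduce idea: a map step emits
# (key, value) pairs and a reduce step aggregates the values per key.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(key, values):
    # Combine all values collected for one key into a single result.
    return key, sum(values)

lines = ["big data needs distributed processing",
         "hadoop processes big data with mapreduce"]

# Shuffle/sort: group every emitted value under its key.
grouped = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)   # e.g. {'big': 2, 'data': 2, ...}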
9.2.2.1 HDFS ARCHITECTURE HDFS keeps large volume of data in a fault-tolerant storage system. It builds clusters on commodity machines. It keeps on operating even in the
case of a cluster failure, because the work is taken over by another machine in the cluster. Files are stored as blocks on the servers, and multiple copies of each block are kept across the servers.4
FIGURE 9.1
HDFS architecture.
Source: Adapted from Ref. [5].
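Before looking at the NameNode and DataNode roles in detail, the following toy sketch shows the storage idea in miniature: a file is cut into fixed-size blocks, and each block is copied onto several servers so that a single failure loses nothing. This is plain Python, not HDFS code; the node names, block size, and file contents are invented for illustration.

# Toy illustration of block splitting and replication (not real HDFS code).
import itertools

BLOCK_SIZE = 16          # bytes per block here; real HDFS blocks are far larger
REPLICATION_FACTOR = 3   # number of copies kept of each block

data_nodes = ["node1", "node2", "node3", "node4"]
file_bytes = b"records collected from many heterogeneous sources ..."

# Split the file into blocks.
blocks = [file_bytes[i:i + BLOCK_SIZE]
          for i in range(0, len(file_bytes), BLOCK_SIZE)]

# Place each block on REPLICATION_FACTOR different nodes (round-robin here;
# real placement is rack-aware).
placement = {}
node_cycle = itertools.cycle(data_nodes)
for block_id, _ in enumerate(blocks):
    placement[block_id] = [next(node_cycle) for _ in range(REPLICATION_FACTOR)]

for block_id, nodes in placement.items():
    print(f"block {block_id} -> {nodes}")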
It has a master/slave architecture.5 It has one NameNode and many DataNodes. The NameNode is responsible for managing the namespace and controlling access to files by clients, whereas DataNodes are responsible for managing the data storage on their nodes. Storage and computation are distributed across the DataNodes, which allows the clusters to grow horizontally.

NameNode

In HDFS, a cluster has one NameNode that acts as the master server and maintains metadata information about all the DataNodes. A file is stored in the form of blocks on these DataNodes, which are multiple in a cluster. The NameNode handles operations related to the files and directories in the namespace, such as open, close, and rename. DataNodes remain in continuous contact with the NameNode and keep checking for updates on the tasks to be performed. The NameNode maintains all the information about the status of the DataNodes and is critical for the proper functioning of the cluster.

NameNode Functions
1. Manages DataNodes
2. Handles metadata of DataNodes
3. Updates metadata whenever required
4. Regularly updates DataNode status
5. Maintains information about DataNode replication
6. Identifies a new DataNode in case of failure of a DataNode

DataNode

DataNodes are slave nodes and run on commodity machines. DataNodes are managed by the NameNode. In a cluster, nodes are collected into racks, and the NameNode maintains a rack ID to identify the location of DataNodes in the cluster; clusters in HDFS are therefore rack-aware. Since blocks of data are replicated on various DataNodes, it becomes mandatory to maintain data integrity, which HDFS ensures through different procedures. It maintains transaction logs to keep track of various operations. Data blocks are replicated on several DataNodes so that failures do not affect the efficiency of the system. The replication factor is normally set to three and is generally decided at the time of cluster implementation.

DataNode Functions
1. Store data
2. Perform read and write requests
3. Send heartbeats to the NameNode
4. Create, delete, and replicate blocks

9.2.2.2 HADOOP MAPREDUCE

MapReduce is a programming model used to carry out distributed and parallel processing of voluminous datasets. MapReduce consists of two main tasks, Map and Reduce. In the Map phase, key–value pairs are produced after processing the data. The key–value pairs generated by multiple mappers are taken as input by the reduce phase, and the reducer works on this input data to produce the results. MapReduce is the heart of the system: it carries out all its tasks in a fault-tolerant manner and produces data in a form that is more relevant for further processing or utilization.

Map

Generation of key–value pairs is the first phase of processing input data. An instance of the map function works on each key–value pair. The OutputCollector takes
the intermediate output from the mapper and provides it to the reducer. A reporter function keeps track of the map tasks. The intermediate results generated by the map need to be collected and shuffled; this task is carried out by the partitioner and sort. The output from the mapping phase is then given to the reducer.

Reduce

Reduce works on each of the outputs generated by the map and can begin only after getting the results from the map. The output of the reduce phase is also a key–value pair, and the results are finally stored in HDFS.

9.2.2.3 HADOOP YARN

In Apache Hadoop 2.0, YARN was introduced to separate the tasks of resource management and job scheduling. It serves two main functions—global resource management through the ResourceManager and per-application management through the ApplicationMaster. The ResourceManager controls the NodeManagers. The ResourceManager has a Scheduler that allocates the required resources to running applications; it does not monitor the status of the applications but performs its scheduling depending upon the resource requirements of the applications. Each node has a NodeManager, which reports to the ResourceManager about the usage of resources such as CPU, disk, memory, and network. An ApplicationMaster runs for each submitted job to monitor its need for additional resources; it reports this need to the NodeManager, which in turn reports to the ResourceManager. The NodeManager also checks the progress of the tasks running on its node.

9.2.3 HADOOP ECOSYSTEM

Various other technologies are built upon this framework and together form the Hadoop ecosystem. These components are required to build and manage Big data applications. The following technologies form part of the Hadoop ecosystem:
FIGURE 9.2
YARN architecture.
Source: Adapted from Ref. [5].
9.2.3.1 APACHE HBASE

Apache HBase is written in Java and can store all types of data. It is modeled after Google's BigTable and runs on top of HDFS. HBase is an open-source, non-relational database. It can store huge volumes of data and is therefore suitable for real-time Big data applications. HBase tables normally have many rows and columns, and fault tolerance is the main feature it provides for storing data.

9.2.3.2 APACHE HIVE

Apache Hive is a batch-oriented data warehouse infrastructure tool built on top of Hadoop. It provides a way to store and process data on commodity hardware. Query processing in Hive takes time, so it is better suited to batch workloads than to interactive, ad hoc queries. MapReduce capabilities are utilized for data analysis, and users can access structured data through HiveQL. Hive data is organized into Tables, Partitions, and Buckets (Clusters).

Tables: Data is stored in tables in the form of rows and columns.
Partitions: A partition is a storage unit. Tables can have partition keys that determine how the data is stored and let the user know which rows match a specified criterion. Partitions represent the distribution of data.

Buckets (Clusters): Partition data may be further divided into buckets. Buckets are created based on the hash value of a column in a table.

9.2.3.3 APACHE PIG

Apache Pig is a high-level programming platform. It lets programmers analyze bulk datasets while saving the time of writing MapReduce programs. Pig has two components: Pig Latin and a runtime environment. Pig Latin is a language in which input data is processed with a series of operations to generate the output; these operations are translated into MapReduce tasks. The Pig execution environment has two modes: local mode and MapReduce mode.

9.2.3.4 APACHE SQOOP

It is often necessary to move data back and forth between Hadoop and other data stores. Sqoop offers the capability to move data from other data stores, transform it into a Hadoop-usable form, and load it. Sqoop is a command-line interpreter, and commands are executed one at a time. A few features of Sqoop are:

• Import of an entire database with a single command.
• Direct import of a database to Hadoop and vice versa.
• Loading of data directly into Hive or HBase.
• Incremental load of parts of a table as and when data gets updated.
9.2.3.5 APACHE SPARK Apache Spark is cluster computing framework. It is used for real-time processing and is capable of processing huge datasets. It gives adaptation to faults and overcomes the limitations of MapReduce. It is best suited for machine learning algorithms.
There are many more tools, apart from those discussed above, that can be added to the basic Hadoop framework to provide more functionality.

9.3 BIG DATA TOOLS AND SOFTWARE

Traditional data systems are suitable for storing, managing, and working with structured data. Big data involves data in many different formats, which traditional systems cannot handle in terms of storage, management, and analysis. Various types of challenges are associated with Big data in terms of data, processing, and management.6 Data challenges refer to the Big data characteristics and their visualization. Processing challenges include data capture and storage, extraction of information, analysis, and so on. Management challenges involve data security, governance, sharing of data, and its ownership. Various software tools are used by applications to work with Big data and can be categorized as follows:

1. Storage and management tools
2. Data analysis tools
3. Data visualization tools
These tools will be discussed under headings mentioned above.
9.3.1 STORAGE AND MANAGEMENT TOOLS

Big data is generated in large amounts, and its storage is a critical issue; data needs to be stored efficiently for further processing. Various tools are available to store and manage Big data, and a few of them are explained below.

Cloudera

Cloudera is an open-source Apache Hadoop distribution. Services and support are provided through Cloudera Enterprise Data Hub, Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data Science and Engineering, and Cloudera Essentials. Operational DB, Data Science & Engineering, Analytic DB, and Cloudera Essentials are part of the database management platform of Cloudera Enterprise Data Hub.
The Cloudera distribution provides security and easy management of Hadoop and is fast for business use.

Chukwa

Apache Chukwa is an open-source system built on top of HDFS. It is used for collecting data on distributed systems and offers a toolkit to display, manage, and examine the collected outputs. It has four components: Agents, Collectors, MapReduce, and the Hadoop Infrastructure Care Center (HICC). Agents collect the log information, Collectors save the logs collected from the agents, MapReduce performs archiving and parsing of the data, and HICC is used to display the data.

Cassandra

Cassandra is a free, open-source, distributed NoSQL database management system. It is capable of handling large amounts of data spread over commodity servers. It provides multi-node Cassandra clusters to store data and avoids any single point of failure. It is a column-oriented database.

MongoDB

MongoDB does not store data in the form of rows and columns; instead, it uses documents and collections. It is a cross-platform, document-oriented database program. Documents contain key–value pairs, and groups of documents form collections.

CouchDB

CouchDB uses JSON to store documents, and JavaScript is used to access them. It uses a schema-free model to store data.

9.3.2 DATA ANALYSIS TOOLS

Data analysis is the process of extracting useful results from large datasets. Many data analytics tools are available to perform this analysis, and a few of them are discussed below.
Tableau Public

Tableau is a business intelligence tool used to analyze data. It processes data by connecting to Big data sources, works on raw data, and presents it in a format that can be easily understood by users. It lets users create and distribute interactive dashboards. Features of Tableau Public include the ability to collaborate and to analyze data in real time.

OpenRefine

OpenRefine is a tool used to clean data and perform data analytics on Big data. It also serves the purpose of converting data from one format to another and can parse data from websites. It has various extensions and plugins that make working with Big data easier.

RapidMiner

RapidMiner is used for data mining as well as predictive analytics and is mainly used for commercial and business applications. It accepts raw data in the form of a database or text and then analyzes it automatically. It can be utilized for data preparation, visualization of results, and model validation.

R Project

R is a programming language and environment for performing statistical analysis of data. It offers a range of statistical techniques, such as linear and nonlinear modeling, time series analysis, and clustering and classification. It is open-source software that is easy to learn and use.

9.3.3 DATA VISUALIZATION TOOLS

Data visualization tools let designers create visualizations of huge datasets. Data visualizations are mainly used to prepare annual reports, dashboards, and similar material. These tools present information in easy-to-interpret formats such as charts, tables, graphs, and maps, and they let users identify hidden trends and patterns in large datasets that are not easy to detect in non-graphical form. A large range of data visualization tools is available, and a few of them are discussed below.
Computational Intelligence for Big Data Analysis
211
Google Chart

Google Chart is a very popular free visualization tool for large datasets. It supports a wide range of chart types, from simple drawings to complex charts. The charts created in Google Chart can be customized using CSS editing, and the tool is compatible across browsers. Real-time data is managed efficiently by Google Charts, and it can also work on data held in other Google applications. The charts created can be controlled through an interactive dashboard.

Sisense

Sisense provides users with an easy-to-use interface. It lets the user build a single repository of data by collecting data from different sources. It performs analysis on large datasets and lets the user visualize the results in different formats such as graphs, charts, and maps.

Datawrapper

Datawrapper is an easy-to-use, open-source data visualization tool. It lets the user create visualizations in the form of maps, graphs, tables, and charts, and it is mainly used by journalists and publishers.

Fusion Charts

Fusion Charts is based on JavaScript and is used to create web and mobile dashboards. Data is collected in XML or JSON format and then visualized as charts through JavaScript. It integrates with the AngularJS and React frameworks and with PHP and ASP.NET. Fusion Charts lets users embed ready-to-use code for all its charts and maps in their websites.

Microsoft Power BI

Power BI is a popular and powerful tool for data visualization. Multiple data sources can be connected easily to Microsoft Power BI. It comes in two versions—Power BI Desktop and Power BI Mobile. The desktop version is free, whereas the other version is commercial. Power BI is easy to use for preparing data for visualization, and users can create interactive custom visualizations, reports, and dashboards.
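The tools above are largely point-and-click products. As a generic illustration of what a visualization layer does, and only as a hedged sketch, the snippet below turns a tiny tabular dataset into a chart using pandas and matplotlib (neither library is named by the chapter; the sample data is invented):

# Turn a small table into a bar chart, the kind of figure a dashboard embeds.
import pandas as pd
import matplotlib.pyplot as plt

sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "revenue": [120, 95, 143, 88],
})

ax = sales.plot.bar(x="region", y="revenue", legend=False)
ax.set_ylabel("Revenue (in thousands)")
ax.set_title("Quarterly revenue by region")
plt.tight_layout()
plt.savefig("revenue_by_region.png")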
9.4 SECURITY ISSUES IN BIG DATA

It is known undoubtedly that Big data provides a plethora of prospects in sectors such as banking, education, healthcare, entertainment, transportation, and social media. The amount of data being generated escalates each day, and this massive amount of data brings with it many challenges in terms of security and privacy. Security- and privacy-related issues arise mainly because of Big data's massive volume, variety, and velocity. Vulnerability is mainly caused by the varying formats of data and its sources, streaming data acquisition, and data migration. The utilization of cloud infrastructure provides a range of software platforms across networks, leading to threats of security breaches. Initially, Big data usage was limited to very few organizations, and they had their own infrastructure to handle it. Nowadays the trend has changed, and organizations from small to large are utilizing the benefits of public cloud infrastructure. Various software frameworks allow parallel processing of data by distributing it across the network, and the use of the public cloud for Big data has led to new security challenges. In addition, IoT lets every device connect to other devices using the internet, and these devices can communicate with one another. The emergence of cloud and IoT is propelling the amount of data, and security is the main concern to be addressed. Traditional security mechanisms are not adequate to secure Big data; mechanisms such as firewalls are not capable of providing full information security. The main security issues pertaining to Big data are discussed below.

There are four main aspects of Big data security according to the Big Data Working Group at the Cloud Security Alliance: security of infrastructure, privacy of data, data management, and integrity and reactive security.7 Big data security involves the confidentiality, integrity, and availability of data. These four aspects are discussed in detail below:

1. Security of Infrastructure: The two main challenges under this category are securing the distributed processing of data and establishing security best practices for non-relational databases. Distributed programming involves parallel processing and storage for handling huge amounts of data, and the data needs to be protected in this distributed
environment. NoSQL databases do not provide complete security for the data; they focus mainly on other objectives rather than data security and do not offer strong mechanisms to secure the data.

2. Data Privacy: Organizations using Big data are concerned about data privacy. An enormous amount of data is utilized by organizations to extract meaningful information, and Big data can easily lead to intrusions of privacy. Various tools and technologies are available to protect the data while still allowing organizations to meet their goals. Different techniques are used to maintain the privacy of data, including cryptography, access control, confidentiality, social network privacy, and privacy-preserving analytics. Cryptography is used to protect data and can be utilized for the protection of Big data as well.

3. Data Management: This aspect involves secure data storage and transaction logs, granular audits, and data provenance. It is important to maintain the log of transactions in multitiered storage media; auto-tiering solutions fail to maintain such logs, thus posing more threats to data security. Audit information needs to be maintained in order to identify attacks. In Big data applications, provenance metadata is itself huge, and analyzing it for security- or confidentiality-critical applications is a rigorous task.

4. Integrity and Reactive Security: This aspect consists of end-point validation and filtering and real-time security monitoring. Data is stored at various nodes, and data storage and sharing are challenging tasks. As data is collected from different sources, it becomes very important to validate the input, and a decision has to be made regarding the reliability of the data sources. Real-time security monitoring is another challenge in Big data, since the size of the problem increases with the volume of data.

9.4.1 ADDRESSING THE CHALLENGES FOR BIG DATA SECURITY

Big data security concerns are magnified by its volume and velocity. The traditional methods used to protect data are not suitable for Big data, and the mechanisms used to store and process it make it vulnerable to malicious attacks and create threats to its security.
Encryption could be the fundamental solution for securing the data, but querying encrypted data must remain efficient; many techniques are available, from searching over encrypted data to homomorphic encryption.8 Legitimate users must have access to the data, and proper access control techniques must be in place for that purpose. Many access control techniques are available for Big data management systems such as NoSQL databases and social network data.9–11 Access control mechanisms are also required to support policies based on the relationship between users and data.12

Another important factor to be considered is the identification of the various data sources. There is a need to identify and protect sensitive data by implementing access control techniques and strict data handling procedures. Security mechanisms need to be implemented at the origin of the data, and access control and data retrieval prevention mechanisms must work in conjunction.13 Data provenance is the main factor affecting the quality of the data involved in decision making.14 Machine learning algorithms are now utilized in almost every aspect, and their impact on our decision making needs to be analyzed.15 Big data security solutions should cover security mechanisms from the organization to the public cloud.16 Security must be provided to data from its collection until its usage, and proper mechanisms must be in place to handle Big data. Data confidentiality must be maintained using encryption techniques; traditional encryption techniques are not enough to protect Big data and to handle the problems involved in computing over it.17 New techniques are therefore required to keep the data secure while performing computations on selected data. Organizations are collecting a lot of private data from individuals, and there is a need to make these organizations accountable for maintaining the privacy of personal data. Recent regulations such as the GDPR ask organizations to utilize data only for the purpose for which consent was given.18

Apart from the recommendations discussed above, it is also mandatory to think about providing infrastructure security. Networks are prone to security breaches, and if an intruder gains access to the network, he would gain access to all the data. Therefore, new control and security measures are required, and continuous monitoring and control over the data is a must. New approaches are necessary to improve security at the various levels of the data life cycle.
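As a hedged illustration of encrypting data at rest with symmetric encryption, the sketch below uses the third-party cryptography package, which is an assumption of this example only; the chapter does not prescribe a specific library, and the sample record is invented.

# Symmetric encryption of a record before it is written to shared storage.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # the key itself must be stored securely
cipher = Fernet(key)

record = b"patient_id=1842;diagnosis=...;consent=yes"
token = cipher.encrypt(record)       # what would be written to distributed storage
print(token[:32], b"...")

# Only holders of the key can recover the plaintext.
assert cipher.decrypt(token) == record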
9.5 BIG DATA APPLICATIONS

Businesses are always under pressure to maximize their profits and reduce risks to a minimum. Business organizations know the advantage of performing analytics on huge amounts of data, and such analytics is becoming an integrated part of their decision-making routines. Organizations are taking advantage of the value gained through the computational intelligence of Big data: it gives businesses a vision to accept challenges and make profitable moves. In the last decade, Big data analytics has gained widespread popularity among academicians and industry, with industry using it to meet customer demands better and to exploit the power of the voluminous data at hand. Various technologies are amalgamated to gain this computational intelligence, thereby increasing the chances of meeting targets and goals. Big data analytics has picked up momentum in diverse application areas, and its benefits have been utilized by many industry domains. The main applications are discussed in the following sections.

9.5.1 RETAIL BUSINESSES

Almost all big brands use the computational intelligence of Big data for purposes such as recommending products to customers, tracking customers' buying behavior patterns, introducing new products, identifying locations for new stores, and improving the quality of service to their customers.

9.5.1.1 PRODUCT RECOMMENDATION

Big brands like Amazon utilize Big data to recommend products to their customers by monitoring the products a customer is buying. The recommendation is made by performing analytics in the background on the records of customers' purchase histories. Retailers maintain the purchase order details of online shopping, and the buying patterns of customers are then filtered from this historical data by applying various analytical techniques. Recommendations are made to customers based upon the results of collaborative filtering19 techniques, as sketched below.
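The following minimal sketch illustrates the collaborative filtering idea only; the item names and the tiny purchase matrix are invented, and real recommenders work on vastly larger data with more sophisticated models.

# User-based collaborative filtering: recommend items that similar buyers bought.
import numpy as np

items = ["shoes", "jacket", "backpack", "watch"]
# Rows = customers, columns = items; 1 means the customer bought the item.
purchases = np.array([
    [1, 1, 0, 0],   # customer 0
    [1, 1, 1, 0],   # customer 1
    [0, 0, 1, 1],   # customer 2
])

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

target = 0  # recommend for customer 0
similarities = [cosine(purchases[target], purchases[other])
                for other in range(len(purchases))]
similarities[target] = 0.0  # ignore self-similarity

# Score each item the target has not bought by the similarity-weighted
# purchases of the other customers, then recommend the best-scoring item.
scores = {items[j]: sum(similarities[u] * purchases[u, j]
                        for u in range(len(purchases)))
          for j in range(len(items)) if purchases[target, j] == 0}
print(max(scores, key=scores.get))   # -> "backpack"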
216
Computational Science and Its Applications
Recommendations for movies, songs, and videos are also made using collaborative filtering techniques.

9.5.1.2 PREDICTION OF PURCHASE PATTERNS

Customer details are collected through various transactions. With the help of analytical tools, purchase patterns are identified in order to plan activities such as advertisement placement and the arrangement of products on shelves.

9.5.2 BANKING SECTOR

An enormous amount of data is generated from the offline and online services offered by banks. To interact better with their customers, banks provide services in many different ways, each of which is also a channel of data generation; social networking sites and the reviews posted by customers are just two of them. Banks use these data to detect and prevent fraud, retain their customers, and analyze and manage risk.

9.5.2.1 DETECTION AND PREVENTION OF FRAUD

The banking sector observes various types of fraud, such as card fraud and money laundering. Banks need real-time mechanisms to identify and detect fraudulent activities and to take preventive measures to control them. Since financial institutions are prone to fraud, they need tools that enable them to study the hidden patterns in fraudulent transactions. These anomalies can be detected by analyzing historical transactional data, which can help distinguish fraudulent from non-fraudulent transactions. Pattern analysis identifies fraudulent activities by comparing them with similar types of stored data. The fraud detection process involves analytical techniques such as pattern detection, graphical techniques, and time series analysis.
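As a toy illustration of the anomaly idea (the transaction amounts are invented, and real systems combine far richer signals than a single amount), a transaction can be flagged when it lies far from a customer's historical distribution:

# Flag transactions whose amount is an outlier relative to past behavior.
import statistics

history = [42.0, 55.5, 38.2, 61.0, 47.8, 52.3, 44.1]   # past transaction amounts
new_transactions = [49.0, 980.0, 58.5]

mean = statistics.mean(history)
stdev = statistics.stdev(history)

for amount in new_transactions:
    z_score = (amount - mean) / stdev
    status = "FLAG FOR REVIEW" if abs(z_score) > 3 else "ok"
    print(f"amount={amount:8.2f}  z={z_score:6.1f}  {status}")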
9.5.2.2 RISK MANAGEMENT

Banks utilize Big data tools to identify and mitigate various kinds of risks, mainly those pertaining to loan accounts and credit card accounts. Big data tools, together with data mining techniques, identify the relevant features; they also improve response time and risk prediction. The risk management department can obtain real-time data from various sources and benefit from the predictive power of risk models to mitigate risks.

9.5.3 EDUCATION SECTOR

Educational data is generated in many ways, such as students interacting with their learning systems, examination results, course curricula, and different learning activities.20 This captured data can be analyzed for various purposes:

1. Monitoring the academic performance of students and giving feedback
2. Monitoring dropout patterns
3. Identifying students requiring special attention
4. Designing the course curriculum
5. Identifying students who are at risk of failing
6. Applying teaching pedagogy as per student needs
9.5.4 HEALTHCARE

A large amount of data is produced through different sources such as biomedical sensors, mobile apps, patient records, different types of scans, and lab tests. Big data from the healthcare industry holds a lot of potential information, and its analysis can provide insights for improved healthcare in the future.21 A few of the uses of Big data analysis in the healthcare domain are:

1. Reduction in treatment cost
2. Early diagnosis of disease and treatment
3. Improved quality of service to patients
4. Innovation of new treatments
5. Improved study of drug effects on patients
6. A fair claim review process
9.5.5 TELECOMMUNICATIONS

The telecommunications industry needs to maintain its customer base, as retaining old customers is far less expensive than acquiring new ones. There is a lot of competition between companies to retain, maintain, and acquire customers. Companies need to act proactively to continue providing services that hold their customers, and for this purpose the data collected through social media, customer emails, feedback surveys, call centers, and similar channels is analyzed to understand customer behavior. Further analysis of Big data is performed for the following purposes:

1. Improving the user experience
2. Identifying new services based on existing subscription plans
3. Improving services
4. Retaining customers
5. Promoting new services
6. Detecting and preventing fraud
9.5.6 MARKETING

Organizations are always looking for improved marketing strategies and strive to get value for money. Marketing analytics offers services to organizations by applying different analytical tests to data. The analysis results are used by organizations for the following purposes:

1. Improving marketing strategies
2. Identifying advertisement positioning
3. Offering personalized products and services to customers
4. Identifying and optimizing product prices
5. Analyzing marketing performance
6. Analyzing consumer buying patterns
7. Analyzing market trends
8. Understanding consumer behavior
9.5.7 MANUFACTURING INDUSTRY

With increased competition among manufacturing industries, it is the need of the hour to understand the requirements and demands of customers.
The results obtained through Big data analysis are useful to the manufacturing industry in the following ways:

1. Optimizing the cost of products
2. Predicting rises in demand
3. Timely arrangement of raw materials based on demand forecasts
4. Predicting sales
5. Demand analysis
9.6 BIG DATA AND FUZZY LOGIC

Technologies such as IoT, social media, and mobile devices are booming and generating a huge volume of data, which in turn poses challenges in terms of acquisition, storage, and processing. Discovering information in Big data requires computationally effective methods that are efficient in both speed and accuracy, and developing them is both a necessity and a challenge. Big data holds a lot of significant 'hidden' knowledge and information that needs to be extracted, and data analysis has become vital for companies to survive in a competitive world. The underlying information and patterns in the data must be learned through innovative technologies. Computational intelligence using fuzzy logic techniques is one of the alternatives for managing data smartly and extracting knowledge and relevant information from Big data.

9.6.1 COMPUTATIONAL INTELLIGENCE

Because of the high influx of data, it must be processed at a fast pace and within short periods. Modern techniques with effective algorithms are capable of uncovering the underlying information in the data, and computational intelligence is a very effective approach for Big data. Computational intelligence is mainly concerned with replicating human behavior rather than examining the mechanisms that produce that behavior.22 The use of computational intelligence needs to be focused on the problem at hand. Fuzzy logic is one of the technologies associated with computational intelligence.
9.6.1.1 FUZZY LOGIC

Human reasoning is not precise, yet it is able to handle the uncertainties associated with human activities. Humans can make decisions under uncertainty, but this is a complex thing for computers. Fuzzy logic plays a vital role when imprecise reasoning is involved: it assigns each element a degree of membership in a fuzzy set, a value between 0 and 1. Fuzzy logic is applied to Big data when uncertainty is involved in its processing, and it can be coupled with other decision-making techniques such as probability and neural networks. Fuzzy logic is efficient and relevant for handling Big data through linguistic variables, which also let data analysts interpret the results conveniently. Fuzzy systems have applications in control systems, traffic signal control, and related areas.

9.6.1.2 FUZZY CLASSIFIER

A classifier is an algorithm that assigns a class label based on the description of an object. The object description is derived from the values of its attributes, which are then used to classify it. Based on the training algorithm and the training data set, a classifier can predict the class of an object. Fuzzy classifiers comprise several families of models, mainly fuzzy rule-based classifiers and fuzzy prototype-based classifiers.

9.6.1.3 FUZZY MODELS FOR BIG DATA/FUZZY RULE-BASED SYSTEMS

Fuzzy models are well suited to handling the veracity and variety of Big data.23,24 These models are based on fuzzy logic and can manage information that is vague and uncertain, so Big data can be handled with fuzzy models depending on its characteristics. Fuzzy rule-based systems interpret the information and express it as fuzzy models or as easy-to-understand linguistic rules. Fuzzy models for Big data have mostly emphasized classification applications, and the most relevant family of algorithms for designing and implementing fuzzy classification models is that of fuzzy rule-based classifiers.
9.6.1.3.1 FUZZY RULE-BASED CLASSIFIERS

A fuzzy rule-based classifier (FRBC) has a rule base (RB) and a database (DB); the database stores the fuzzy set definitions used by the reasoning method. The rule base consists of M rules written as:

Rm: IF X1 is Am,1 AND … AND XF is Am,F THEN XF+1 is Cm with RWm

where Am,1, …, Am,F are the fuzzy sets (antecedents) of the mth rule, Cm is the class label of the mth rule, and RWm is the rule weight. The Chi-FRBCS algorithm can be used to obtain the RB and DB.
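The following toy sketch illustrates this rule structure in the simplest possible way; the linguistic terms, rules, weights, and example point are all invented for the illustration, and real FRBCs learn these from data.

# Toy fuzzy rule-based classification: triangular membership functions give
# each antecedent a degree in [0, 1]; a rule fires with the product of its
# antecedent degrees times its weight, and the winning rule assigns the class.
def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Linguistic terms for a single normalized feature.
LOW    = lambda x: triangular(x, -0.5, 0.0, 0.5)
MEDIUM = lambda x: triangular(x,  0.0, 0.5, 1.0)
HIGH   = lambda x: triangular(x,  0.5, 1.0, 1.5)

# Each rule: (antecedent memberships per feature, class label, rule weight RW).
rules = [
    ((LOW, LOW),     "class_A", 0.9),
    ((HIGH, MEDIUM), "class_B", 0.8),
    ((HIGH, HIGH),   "class_B", 0.7),
]

def classify(example):
    best_class, best_degree = None, 0.0
    for antecedents, label, weight in rules:
        degree = weight
        for membership, value in zip(antecedents, example):
            degree *= membership(value)          # matching degree of the rule
        if degree > best_degree:
            best_class, best_degree = label, degree
    return best_class

print(classify((0.9, 0.6)))   # -> "class_B"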
9.6.1.3.2 Chi-FRBCS Algorithm

The Chi-FRBCS algorithm generates rules of the form25

Rule Rj: IF x1 IS Aj1 AND … AND xn IS Ajn THEN Class = Cj with RWj

It proceeds as follows:

• A fuzzy partition is built with equally distributed triangular membership functions.
• The RB is built by associating a fuzzy rule with each training example, so rules with the same antecedent can be created.
• Duplicated rules with the same antecedent and the same consequent are deleted.
• When rules share the same antecedent but have different consequents, only the rule with the highest weight is kept.

9.6.2 BIG DATA CLASSIFICATION WITH FUZZY MODELS

Big data involves uncertainty for many reasons, mainly the varied sources of data, its different formats, and its variety and veracity. This uncertainty, as well as the lack of data, is handled by fuzzy rule-based classification systems.

9.6.2.1 CHI-FRBCS-BIGDATA

Lopez et al.26 implemented a distributed version of the Chi et al. algorithm. Their algorithm was developed by adopting a MapReduce design based on the Chi et al. FRBCS.
As per the MapReduce paradigm, in the first phase the training dataset is split into small chunks, which are given as input to the mappers; each mapper then generates an RB from its chunk using the Chi et al. algorithm. In the second phase, a single reducer produces the final RB by joining the per-chunk RBs. Chi-FRBCS-BigData has two versions, which differ in the Reduce function used to build the fuzzy RB:

• Chi-FRBCS-BigData-Max
• Chi-FRBCS-BigData-Ave
9.6.2.1.1 Chi-FRBCS-BigData-Max

This method identifies the rules with the same antecedent and keeps only the rule with the highest weight in the final RB; it is not required to check whether the consequents are the same or not.

9.6.2.1.2 Chi-FRBCS-BigData-Ave

This method also identifies the rules with the same antecedent. The average weight of the rules with the same consequent is then calculated, and the final RB contains the rule with the greatest average weight.

9.7 BIG DATA AND WEB INTELLIGENCE

9.7.1 WEB INTELLIGENCE

The term "Web Intelligence" was first introduced by Ning Zhong, Jiming Liu, Y. Y. Yao, and S. Ohsuga27 in a conference paper in the year 2000. Web intelligence is based on applying information technology and artificial intelligence on the web and the internet. It is mainly aimed at developing products and frameworks, and it is also used to provide services through the web and the internet. Web intelligence spans various domains, including semantics and ontology engineering, filtering and retrieval of web information, web applications, social networks and intelligence, and knowledge grid and grid
intelligence. It is mainly focused on building wisdom web-based intelligent systems in which human capabilities are integrated.

9.7.2 WEB INTELLIGENCE USING BIG DATA METHODS

There are a great many web pages, and each page carries a lot of data, so traditional processing cannot retrieve the desired and relevant information. Big data analytics therefore aims at the use of efficient technologies. Web intelligence using Big data involves six methods—Look, Listen, Learn, Connect, Predict, and Correct.28

9.7.2.1 LOOK

The search for data on the web is carried out by indexing and ranking web pages. Indexing adds information about a web page to the search engine's index, which is basically a database holding information about the collected web pages. PageRank is a measure used to rank web pages according to their relative importance on the internet; the PageRank algorithm computes and maintains this relative importance so that the most important pages can be displayed first. When a search is made, the matched results are listed in order of their relative page rank.

9.7.2.2 LISTEN

The content that we search for is always listened to so that it can be worked upon further, and further analysis is carried out to produce sophisticated results. The visits made to different web pages are also used to understand the intention of the consumer, which in turn is used to place advertisements on the web; whatever web pages we visit are listened to. The web intelligence lies in understanding the intent of web consumers so as to know whether a customer will make a purchase or not. Language is the basis of advertisement, and natural language queries have made it easier to understand the intent of the consumer and to place advertisements accordingly. To understand intent, sentiment
224
Computational Science and Its Applications
mining is done. Sentiment mining extracts opinions from written text, which is mostly posted on social media; sentiment-mining engines are used to learn what consumers feel about a brand or product. The web is continuously listening to our intentions.

9.7.2.3 LEARN

In most cases, a machine needs to be trained in order to understand the sentiment or the text on web pages. Machine learning is of two types—supervised and unsupervised learning. In supervised learning, the machine is trained from labeled data, known as the training dataset, and a model is created; a test set of data is then used to check the accuracy of the model by matching the model's predictions against the known outputs. Classification problems are solved through supervised learning. In unsupervised learning, there is no prior information available about the data; clustering is one of the techniques of unsupervised learning.

9.7.2.4 CONNECT

Human beings have the capability to reason. Reasoning basically involves collecting facts and working on them with certain rules; it also involves logical inference, and logic is the basis of reasoning. The semantic web includes linked data, inference rules, queries, and engines, and it is expressed using the web ontology language. A semantic search engine extracts knowledge from the web. Reasoning is complex, and it can be implemented in machines using techniques such as resolution and unification.

9.7.2.5 PREDICT

A lot of data is generated through social media, and predictive analysis is used to predict future trends. Predictive analysis has become popular for predicting trends for businesses. The models used in predictive analysis use rules, classification techniques, and statistical models. In web analytics, the random indexing technique is used to predict the behavior of the user.
The user's browsing information is gathered, and predictions are made through analytics.

9.7.2.6 CORRECT

Feedback control is used to control errors; it can also be used to understand and model human actions.

9.8 BIG DATA AND IOT

9.8.1 INTRODUCTION

The Internet of Things (IoT) has become part of our daily lives and is gaining the attention of researchers. IoT's promising technologies are making our lives and devices smarter by enabling connectivity among people and things anytime and anywhere. IoT has been defined as ''Things having identities and virtual personalities operating in smart spaces using intelligent interfaces to connect and communicate within social, environmental, and user contexts."29,30 The widespread use of sensor devices has led to a rapid increase in the utilization of IoT technology. IoT devices keep generating data and have resulted in a massive explosion of data; the huge amount of data generated from these devices has contributed towards the growth of Big data, and a close relationship can be seen between the two. IoT will keep contributing to the three V's of Big data, that is, volume, velocity, and variety. It is estimated that there will be 41 billion IoT devices by the year 2027 and that their number will grow to 125 billion by 2030. The main challenge lies in collecting and storing the data coming from IoT devices; another major challenge for large applications lies in extracting hidden and relevant information from the captured data. Proper data analysis could cause revolutions in sectors of utmost importance such as healthcare. Big data allows industries to understand the needs of their customers and to design and create products as per those requirements. The Big data generated through IoT can be analyzed using proper data analytics tools and then applied in decision making. The integration of Big data and IoT is an upcoming field of research in which the associated challenges are being examined.
226
Computational Science and Its Applications
9.8.2 BIG DATA AND IOT

The increase in the number of internet users and the easy availability of resources have resulted in the growth of IoT-connected devices. New technologies and devices have given rise to IoT; it has stepped into our lives, and every day more and more devices connect to it. According to the Cisco annual internet report,31 the total number of connected devices by 2023 will be 29.3 billion, and the internet will reach 5.3 billion people. Traffic is expected to grow further: by 2023, 70% of the world's population is expected to have mobile-based network connectivity, and the number of mobile subscribers will grow to 5.7 billion. Big data and IoT complement each other: the increase in IoT devices produces Big data, and applications of Big data in the IoT field may lead to further research and development.

9.8.2.1 NEED FOR BIG DATA ANALYTICS IN IOT

IoT devices communicate with each other through the sensors fitted in them and generate a huge amount of data over the internet. IoT generates revenue but also produces a huge amount of data; therefore, organizations should work on collecting, managing, and analyzing the data generated from IoT devices. IoT Big data provides the platform required to handle and analyze this data. The IoT Big data framework is helpful in dealing with problems such as managing smart buildings, where sensors collect data and further decision making is carried out by performing data analytics. Traffic in smart cities is handled via sensors fitted in traffic signals that collect traffic statistics, together with an IoT-based system to manage the traffic. IoT-based systems also have a lot of potential to improve the working of the medical sector: sensors generate data about the patient, which can be analyzed in real time to make quick decisions and give better treatment. Data coming from IoT-enabled devices needs to be captured and stored for analysis to support quick and better decisions in real time.

9.8.3 CHALLENGES AND ISSUES IN IOT SYSTEMS

Big data generally needs a lot of storage and high processing power, and these challenges must be taken care of in Big data processing and computation across the layers of the IoT architecture. Apart from these, IoT also has issues with privacy,
security, interoperability, and availability. Analysis of the data generated from IoT devices requires effective Big data analytics tools, and these need to be implemented together with the end-user application, which in turn is a challenging task. The major challenges associated with IoT systems are the following:

• Security and privacy of IoT systems: Because IoT devices are connected over the internet, there are always privacy and security concerns about the data. Confidence among end users can be built only when the various threats to IoT systems are handled and proper mechanisms are in place. Security mechanisms must be present at the different layers of the IoT architecture, and methods are required to detect malicious activities. To maintain privacy in IoT systems, proper authorization and authentication must be enforced.
• Exchange of information: There are various concerns when exchanging data between IoT devices because of the varied technologies used in their deployment. To overcome this issue, varied functionalities are being provided for exchanging information among IoT systems.
• Regulatory issues: On the one hand IoT systems make our lives easy; on the other hand their usage raises challenges related to ethical use. It is important to maintain the security of the data and, at the same time, its privacy. Considering this issue is also important for building trust and confidence among users of IoT systems.
• Availability and scalability: IoT systems need to support different types of devices with different processing and storage capabilities, so availability and scalability are important issues. Resources need to be made available to authorized objects irrespective of the location and time of the request.
• QoS: Quality of service is a vital measure for assessing the different parameters of IoT systems; an IoT system must meet the QoS standards defined by its QoS metrics.

9.8.4 APPLICATION SECTORS OF BIG DATA IN IOT SYSTEMS

• Smart city: There has been steady growth in connected devices in recent years, and a large volume of data is being generated every day. Big data analytics makes it possible to understand and manage activities in a city, resulting in an efficient smart city. In smart cities, technologies
are being used to control the smart utilization of energy. Traffic is also detected and controlled through smart devices. • Agriculture: Agriculture is a vital sector to meet the food require ments of growing population. By utilizing the IoT systems and Big data analytics, it is possible to have more harvest. Greenhouse technology along with effective usage of IoT devices and sensors can be used to monitor and control the environmental factors in yielding more crops. • Smart Surveillance: Sensors and IoT enabled technologies and Big data analytics are being employed to improve safety monitoring in the industries as well as at homes. Potential disaster warnings can be given in advance by the utilization of these technologies. • Healthcare: Wearable watches, smart devices, and various healthcare based mobile apps are integrated to mIoT (medical Internet of Things). These devices generate data related to the health fitness. By data analytics, it’s possible to reveal data interpretation results. This could help in faster and improved healthcare. Insights obtained from Big data analysis can lead to protect many lives by quick decision making. Apart from applications discussed above, Big data analysis in IoT has lot more applications in various sectors like firefighting, mining industry, logistics, smart homes, etc. KEYWORDS • • • •
• big data
• Hadoop
• Apache
• data node
REFERENCES
1. Berman, J. J. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information; Elsevier: Amsterdam, The Netherlands, 2013.
2. GARTNER. 2013. Gartner IT Glossary: Big Data [Online]. http://www.gartner.com/it-glossary/big-data/.
3. Ularu, E. G.; Puican, F. C.; Apostu, A.; Velicanu, M. Perspectives on Big Data and Big Data Analytics. Database Syst. J. 2012, 3, 3–14.
4. Bhosale, S. H.; Gadekar, D. P. A Review Paper on Big Data and Hadoop. Int. J. Sci. Res. Pub. 2014, 4.
5. https://hadoop.apache.org
6. Sivarajah, U.; Kamal, M. M.; Irani, Z.; Weerakkody, V. Critical Analysis of Big Data Challenges and Analytical Methods. J. Bus. Res. 2017, 70, 263–286.
7. Big Data Working Group; Cloud Security Alliance (CSA). Top Ten Big Data Security and Privacy [Online], 2013, April. https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Expanded_Top_Ten_Big_Data_Security_and_Privacy_Challenges.pdf
8. Gentry, C. A Fully Homomorphic Encryption Scheme; PhD Thesis, Stanford University, 2009.
9. Ulusoy, H.; Colombo, P.; Ferrari, E.; Kantarcioglu, M.; Pattuk, E. GuardMR: Fine-Grained Security Policy Enforcement for MapReduce Systems. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security; ASIA CCS: Singapore, 2015; pp 285–296. DOI: 10.1145/2714576.2714624.
10. Colombo, P.; Ferrari, E. Access Control Enforcement within MQTT-Based Internet of Things Ecosystems. In Proceedings of the 23rd ACM Symposium on Access Control Models and Technologies, SACMAT 2018; Indianapolis, IN, 2018; pp 223–234. DOI: 10.1145/3205977.3205986.
11. Carminati, B.; Ferrari, E.; Heatherly, R.; Kantarcioglu, M.; Thuraisingham, B. M. A Semantic Web Based Framework for Social Network Access Control. In SACMAT; Carminati, B.; Joshi, J., Eds.; ACM: New York, 2009; pp 177–186.
12. Pasarella, E.; Lobo, J. A Datalog Framework for Modeling Relationship-Based Access Control Policies. In Proceedings of the 22nd ACM Symposium on Access Control Models and Technologies, SACMAT 2017; Indianapolis, 2017; pp 91–102. DOI: 10.1145/3078861.3078871.
13. Kindervag, J.; Balaouras, S.; Hill, B.; Mak, K. Control and Protect Sensitive Information in the Era of Big Data, 2012.
14. Bertino, E.; Kantarcioglu, M. A Cyber-Provenance Infrastructure for Sensor-Based Data Intensive Applications. In IEEE International Conference on Information Reuse and Integration, IRI 2017; San Diego, CA, 2017; pp 108–114. DOI: 10.1109/IRI.2017.91.
15. Sweeney, L. Discrimination in Online Ad Delivery. Commun. ACM 2013, 56, 44–54. DOI: 10.1145/2447976.2447990.
16. Juels, A.; Oprea, A. New Approaches to Security and Availability for Cloud Data. Commun. ACM 2013, 56, 64. DOI: 10.1145/2408776.2408793.
17. MIT, Big Data Privacy Workshop, Advancing the State of the Art in Technology and Practice: Workshop Summary Report, 2014. http://web.mit.edu/bigdatapriv/images/MITBigDataPrivacyWorkshop2014_final05142014.pdf
18. Voigt, P.; Bussche, A. V. D. The EU General Data Protection Regulation (GDPR): A Practical Guide; Springer Publishing Company, Incorporated, 2017.
19. Su, X.; Khoshgoftaar, T. M. A Survey of Collaborative Filtering Techniques. Adv. Artif. Intell. 2009. DOI: 10.1155/2009/421425.
20. Vaitsis, C.; Hervatis, V.; Zary, N. Introduction to Big Data in Education and Its Contribution to the Quality Improvement Processes. In Big Data on Real-World Applications; Ventura Soto, S., Ed.; InTech, 2016.
21. Dash, S.; Shakyawar, S. K.; Sharma, M.; Kaushik, S. Big Data in Healthcare: Management, Analysis and Future Prospects. J. Big Data 2019, 6, 54.
22. Hill, R. Computational Intelligence and Emerging Data Technologies. In Proceedings of the 2nd International Conference on Intelligent Networking and Collaborative Systems, INCOS 2010; pp 449–454.
23. Fernandez, C. C.; del Jesus, M.; Herrera, F. A View on Fuzzy Systems for Big Data: Progress and Opportunities. Int. J. Comput. Intell. Syst. 2016, 9, 69–80.
24. Hariri, R. H.; Fredericks, E. M.; Bowers, K. M. Uncertainty in Big Data Analytics: Survey, Opportunities, and Challenges. J. Big Data 2019, 6, 44.
25. Chi, Z.; Yan, H.; Pham, T. Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition. In Advances in Fuzzy Systems-Applications and Theory, 1996; p 10.
26. Río, S.; López, V.; Benítez, J. M.; Herrera, F. MapReduce Approach to Address Big Data Classification Problems Based on the Fusion of Linguistic Fuzzy Rules. Int. J. Comput. Intell. Syst. 2015, 8, 422–437. DOI: 10.1080/18756891.2015.1017377.
27. Zhong, N.; Liu, J.; Yao, Y. Y.; Ohsuga, S. Web Intelligence (WI). In Computer Software and Applications Conference, COMPSAC 2000, The 24th Annual International, 2000; p 469. DOI: 10.1109/CMPSAC.2000.884768.
28. Shroff, G. The Intelligent Web: Search, Smart Algorithms, and Big Data, 2014. ISBN 978-0-19-964671-5.
29. Atzori, L.; Iera, A.; Morabito, G. The Internet of Things: A Survey. Comput. Netw. 2010, 54, 2787–2805.
30. Gubbi, J.; Buyya, R.; Marusic, S.; Palaniswami, M. Internet of Things (IoT): A Vision, Architectural Elements, and Future Directions. Fut. Gen. Comput. Syst. 2013, 29, 1645–1660.
31. https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html
CHAPTER 10
Hybridization of Computational Intelligent Algorithms
ANUPAMA CHADHA
Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India
ABSTRACT
The main aim of combining (hybridizing) algorithms from different computational intelligence areas is to solve complex real-life problems. The concept of hybridization is becoming popular because hybrid algorithms perform better than individual computational intelligence algorithms applied in isolation. This chapter discusses various types of hybridized algorithms, their real-world applications, and their constraints.

10.1 HYBRID ALGORITHM
Definition: A hybrid algorithm is designed using more than one algorithm to provide a solution to a particular problem. The algorithms used to design a hybrid algorithm solve the same problem but differ in their characteristics. A hybrid algorithm may choose one algorithm or switch between algorithms depending on the input data. The purpose of a hybrid algorithm is to integrate the best features of various algorithms, resulting in a more efficient algorithm.
The most common applications of hybrid algorithms are implementations of algorithms that work on the divide-and-conquer and
decrease-and-conquer rules. In these types of algorithms, one algorithm initially works over the entire dataset, but as it moves deeper it switches to a different algorithm. For example, Insertion sort, which works efficiently on small data, is applied in the final step, after another sorting algorithm such as Merge sort or Quicksort has been applied.

10.2 MORE EXAMPLES OF HYBRID ALGORITHMS
Timsort, a popular hybrid algorithm, integrates Merge sort, Insertion sort, and Binary search. The Introsort algorithm begins with Quicksort but may switch to Heap sort later on. Similarly, Introselect begins with Quickselect but may switch to median of medians later on. Distributed algorithms are also examples of hybrid algorithms, where a master algorithm runs on a centralized server, divides the task into small tasks, and allocates them to machines on the same network. The outputs produced by the different machines are submitted to the master algorithm executing on the centralized server, which combines the results and publishes them. For example, distribution sorts submit subsets of the data to different processors or machines, which sort the subsets, and the algorithm running on the centralized server then merges the sorted subsets into a single sorted dataset. However, not all distributed algorithms are hybrid algorithms. For example, in the MapReduce model, the Map and Reduce algorithms solve two different problems, and their results are combined to give the solution to a third problem.
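To make this switching idea concrete, the following is a minimal, illustrative sketch in Python of a quicksort that hands small partitions over to insertion sort, in the spirit of Introsort and Timsort; the cutoff value of 16 and all helper names are illustrative choices rather than the tuning used by any particular library.

```python
import random

CUTOFF = 16  # below this size, insertion sort is usually faster in practice


def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place; efficient for small or nearly sorted runs."""
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def hybrid_quicksort(a, lo=0, hi=None):
    """Quicksort that switches to insertion sort on small partitions."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:          # small piece: hand over to insertion sort
        insertion_sort(a, lo, hi)
        return
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:                      # partition the range around the pivot
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i, j = i + 1, j - 1
    hybrid_quicksort(a, lo, j)
    hybrid_quicksort(a, i, hi)


data = [random.randint(0, 999) for _ in range(200)]
hybrid_quicksort(data)
assert data == sorted(data)
```

Production libraries tune the cutoff empirically; the point here is simply that one algorithm covers the bulk of the input while a second, cheaper algorithm finishes the small pieces.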
10.3 TYPES OF HYBRID ALGORITHMS

10.3.1 HYBRID EVOLUTIONARY ALGORITHMS
Before discussing hybrid evolutionary algorithms, let us discuss what evolutionary algorithms are. Evolutionary algorithms follow a heuristic-based approach to solve problems whose exact solution cannot be reached in polynomial time; NP-hard problems are an example of this kind of problem.

10.3.1.1 EVOLUTIONARY ALGORITHMS
The process of an evolutionary algorithm contains four phases: initialization, selection, genetic operators, and termination. Each of these phases corresponds to a step of natural selection. An evolutionary algorithm also follows a modular approach and is therefore easy to implement. In simple words, in an EA the fitter a member is, the lower its chance of dying out and the more it contributes to future generations. Figure 10.1 represents the EA process.
FIGURE 10.1
The phases of an evolutionary algorithm.1
Let us discuss these EA phases in detail. Before doing so, we first define the problem: find the values of the variables that maximize some predefined fitness function. Since the algorithm is iterative, it stops in either of two cases: the algorithm has run for a predefined maximum number of iterations, or a predefined fitness threshold has been reached.

Initialization
In this phase, we generate a pool of members randomly. Members are the possible solutions to the problem. Compared with the process of natural selection, these members represent a gene pool.
Selection
After the members are generated, the next step is to evaluate them according to a predefined fitness function. Creating the fitness function is often difficult because it must be specific to the problem and the data. The fitness of all members is calculated, and the top-scoring members are selected.

Multiple objective functions
Sometimes EAs use multiple fitness functions. This results in multiple optimal solutions instead of one, which complicates the process; a decider is then used to select a single solution.

Genetic Operators
Two substeps are included in this step: crossover and mutation. In the crossover step, the top two or more members are combined to create the next generation. To help reach the optimal solution, new genetic material is also introduced into the generation; this step is called mutation. The mutation process is governed by a probability distribution that decides both the chance and the severity of a mutation.

Termination
Eventually, the algorithm terminates in one of two situations: either it has run for the maximum number of iterations, or it has achieved the fitness threshold value.
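The four phases can be tied together in a few lines of code. The following is a minimal sketch, assuming a toy "OneMax" problem (maximize the number of 1-bits in a bit string); the population size, mutation rate, and fitness goal are illustrative values only.

```python
import random

random.seed(1)
BITS, POP_SIZE, MUT_RATE, MAX_GEN, FITNESS_GOAL = 30, 40, 0.02, 500, 30


def fitness(member):
    """Toy objective: count the 1-bits (the 'OneMax' problem)."""
    return sum(member)


# Initialization: a random pool of candidate solutions (bit strings).
population = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP_SIZE)]

for generation in range(MAX_GEN):
    # Selection: keep the top-scoring half of the population as parents.
    population.sort(key=fitness, reverse=True)
    best = population[0]
    if fitness(best) >= FITNESS_GOAL:          # termination on fitness threshold
        break
    parents = population[: POP_SIZE // 2]

    # Genetic operators: one-point crossover followed by bit-flip mutation.
    children = []
    while len(children) < POP_SIZE - len(parents):
        p1, p2 = random.sample(parents, 2)
        cut = random.randrange(1, BITS)
        child = p1[:cut] + p2[cut:]
        children.append([bit ^ 1 if random.random() < MUT_RATE else bit
                         for bit in child])
    population = parents + children            # next generation

print(f"stopped at generation {generation}, best fitness = {fitness(best)}")
```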
10.3.1.2 DRAWBACKS OF EVOLUTIONARY ALGORITHMS
An evolutionary algorithm does not guarantee the optimal solution; it provides "better" solutions compared with other algorithms, because an EA stops when the maximum runtime or number of iterations is reached. That is why evolutionary algorithms are best suited to problems for which optimal solutions cannot be guaranteed. Secondly, evolutionary algorithms lack the domain knowledge required to reach the optimal solution. This drawback can be overcome by hybridizing the evolutionary algorithms. Interestingly, several components of an evolutionary algorithm can be hybridized: the objective function, the survivor selection operator, and the parameter settings. The objective function is hybridized by incorporating local search heuristics into the existing heuristic function. The selection operator, responsible for selecting the top members from the set of candidate solutions, can be extended to deal with neutral solutions. Neutral solutions are solutions that give a similar value of the objective function but have a different representation; exposing neutral solutions helps explore the search space for new solutions. In the third type of hybridization, parameters such as the probability of mutation, the tournament size for selection, or the population size are considered. To analyze the behavior of hybrid algorithms for various parameters, a self-adaptation method is used: the parameters are encoded with the individuals so that they can adapt themselves when subjected to the same evolutionary pressures. Experiments by various researchers show that hybridization improves the accuracy of evolutionary algorithms.

10.3.2 MORE EXAMPLES OF HYBRIDIZED ALGORITHMS
An evolutionary algorithm can be hybridized with another evolutionary algorithm or with other computational intelligence techniques; examples include the use of genetic programming within a genetic algorithm, the use of neural networks in evolutionary algorithms, the use of fuzzy logic in evolutionary algorithms, and the use of particle swarm optimization (PSO) in evolutionary algorithms.
Clustering is a data mining technique that divides data into groups, placing data items similar to each other in the same group; the similarity measure varies with the dataset being grouped. K-Means is a clustering algorithm famous for its simplicity and ease of implementation, but its drawback is that it does not work well with high-dimensional data. To overcome this constraint, K-Means is combined with other techniques, such as Principal Component Analysis (PCA), for better and more accurate results. Another constraint of the K-Means algorithm is that it chooses the initial centroids randomly, and poorly chosen initial centroids may affect the quality of the clusters produced. To overcome this constraint, various algorithms have been proposed that hybridize K-Means with heuristic methods for choosing the initial centroids.2
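As an illustration of such hybrids, the sketch below chains PCA and K-Means in a scikit-learn pipeline and uses the k-means++ seeding heuristic instead of purely random initial centroids; the dataset and all parameter values are only examples, not the exact method of the cited work.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)   # 1797 samples, 64 features (8x8 digit images)

# Plain K-Means on the raw 64-dimensional data.
plain = KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0).fit(X)

# Hybrid pipeline: standardize, reduce to 10 principal components, then cluster.
hybrid = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=0),
    KMeans(n_clusters=10, init="k-means++", n_init=10, random_state=0),
).fit(X)

print("plain K-Means inertia  :", round(plain.inertia_, 1))
print("first 10 hybrid labels :", hybrid.named_steps["kmeans"].labels_[:10])
```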
In today's healthcare system, the diagnosis of various diseases relies on medical imaging techniques. However, in some cases a single imaging modality is not sufficient for diagnosing the disease. Therefore, creating a medical image by fusing multimodal medical images comes to the rescue in these cases. Various attempts have been made to combine the striking features of modalities such as computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), and single photon emission computed tomography (SPECT). The image produced by fusing these modalities is much clearer than the image produced by traditional algorithms.
Another example of hybridization is combining Lexisearch and genetic algorithms to solve optimization problems: Lexisearch algorithms give the exact optimal solution, whereas genetic algorithms take a heuristic approach.3

10.4 REAL LIFE APPLICATIONS OF HYBRID ALGORITHMS
A lot of data is generated in healthcare every hour of the day, and classical statistical methods cannot analyze this data and support the resulting decisions. Nature-inspired metaheuristic algorithms are becoming popular in every field because they provide optimal or near-optimal solutions. Genetic algorithms can be used by physicians in various medical specialties, such as radiology, radiotherapy, and oncology, for disease screening, diagnosis, treatment planning, and healthcare management.
A lot of data is also generated in today's digital world, and this data can be analyzed to make crucial business decisions. Organizations nowadays prefer to keep their data and analysis techniques on the cloud so that they are protected from technical failures or natural calamities. Job scheduling algorithms have a direct impact on the performance of cloud computing systems: when multiple jobs reach the server at the same time, job scheduling algorithms resolve the resulting conflicts. Many researchers have hybridized the traditional round robin algorithm to overcome its constraints.4

10.5 HYBRID INTELLIGENT SYSTEMS
Hybrid intelligent systems are systems in which two or more intelligent technologies are combined. For example, when a neural network is combined with a fuzzy system, a hybrid neuro-fuzzy system is generated.
10.5.1 NEURAL EXPERT SYSTEMS
Expert systems work on the basis of rules, often derived from decision trees, and try to model human thinking and logical ability. Neural networks, in contrast, focus on simulating the human brain and do this by parallel data processing. IF-THEN rules represent knowledge in expert systems, whereas in neural networks knowledge is represented as synaptic weights between neurons. In expert systems, knowledge can be represented as independent rules, whereas in neural networks it cannot be broken into pieces. The striking features of expert systems and neural networks can be combined to make a more efficient expert system called a neural expert system, as shown in Figure 10.2.
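Before looking at the architecture in Figure 10.2, here is a minimal sketch of this contrast, using hypothetical rules and facts of our own rather than an example from the cited text: each IF-THEN rule is stored as a weight vector and a threshold, so the same neuron-style computation can "fire" a rule.

```python
import numpy as np


def rule_neuron(inputs, weights, threshold):
    """Fires (returns 1) when the weighted evidence reaches the threshold,
    mimicking an IF-THEN rule stored as synaptic weights."""
    return int(np.dot(inputs, weights) >= threshold)


# Hypothetical knowledge base: each rule becomes one neuron over binary facts.
# Rule 1: IF engine_hot AND oil_low       THEN stop_machine
# Rule 2: IF engine_hot OR vibration_high THEN raise_alarm
facts = {"engine_hot": 1, "oil_low": 0, "vibration_high": 1}
x = np.array([facts["engine_hot"], facts["oil_low"], facts["vibration_high"]])

stop_machine = rule_neuron(x, np.array([1.0, 1.0, 0.0]), threshold=2.0)  # AND rule
raise_alarm = rule_neuron(x, np.array([1.0, 0.0, 1.0]), threshold=1.0)   # OR rule

print("stop_machine:", bool(stop_machine), "| raise_alarm:", bool(raise_alarm))
```

In a full neural expert system the weights are learned from examples and rules are extracted from the trained network; the key point here is only that a rule and a neuron can encode the same piece of knowledge.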
FIGURE 10.2
The neural expert system.5
As can be seen in Figure 10.2, the inference engine takes a central place in the neural expert system and controls the flow of information in the system.

10.5.2 NEURO-FUZZY SYSTEMS
For designing an intelligent system, fuzzy logic and neural networks act as basic building blocks, but in a complementary way. Neural networks work at a low level with unstructured data, whereas fuzzy logic works with data at a higher level after acquiring linguistic information from domain experts. Unlike neural networks, a fuzzy system lacks the ability to learn. We can integrate neural networks and fuzzy systems to develop neuro-fuzzy systems. A neuro-fuzzy system is helpful in situations where the output generated by the neural network cannot be used directly to make decisions: the neural network output is fed to the fuzzy system, which can produce a crisp output even when its inputs carry some fuzziness. The neuro-fuzzy system is shown in Figure 10.3, and a minimal sketch of the idea follows.
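The sketch below shows the idea with plain NumPy, using made-up membership functions and rules rather than any particular neuro-fuzzy toolkit: a small neural scoring step produces a value in [0, 1], and a fuzzy rule base then turns that value into a crisp action level.

```python
import numpy as np


def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    left = (x - a) / (b - a + 1e-9)
    right = (c - x) / (c - b + 1e-9)
    return np.maximum(np.minimum(left, right), 0.0)


# Neural part: a tiny (untrained) network scores a normalized sensor reading.
# The weights are placeholders; a real system would learn them from data.
w, b = np.array([0.8, -0.5]), 0.1
reading = np.array([0.9, 0.3])                        # e.g. [vibration, temperature]
risk = 1.0 / (1.0 + np.exp(-(w @ reading + b)))       # sigmoid output in [0, 1]

# Fuzzy part: interpret the network output through linguistic terms.
low, med, high = tri(risk, -0.4, 0, 0.5), tri(risk, 0, 0.5, 1), tri(risk, 0.5, 1, 1.4)

x = np.linspace(0, 1, 101)                            # universe for the "action" output
ignore, inspect, shutdown = tri(x, -0.4, 0, 0.5), tri(x, 0, 0.5, 1), tri(x, 0.5, 1, 1.4)

# Rules: IF risk is low THEN ignore; IF medium THEN inspect; IF high THEN shut down.
aggregated = np.maximum.reduce([np.minimum(low, ignore),
                                np.minimum(med, inspect),
                                np.minimum(high, shutdown)])

crisp_action = (x * aggregated).sum() / aggregated.sum()   # centroid defuzzification
print(f"network risk = {risk:.2f}, crisp action level = {crisp_action:.2f}")
```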
10.6 APPLICATIONS OF HYBRID INTELLIGENT SYSTEMS TO IOT
In today's smart digitized world, the Internet of Things (IoT) keeps physical objects globally connected via the internet. The number of objects in the IoT network is increasing day by day and has reached several trillion. These objects generate a huge amount of private information, which requires processing, dissemination, and storage. Researchers have explored and developed various algorithms, guided by factors such as performance, robustness against attacks, and end-to-end security, to protect information in IoT. They have also found that hybridizing some of the encryption and cryptography-based algorithms can increase the confidentiality of data transmission. One such example is a hybrid algorithm based on a combination of the Advanced Encryption Standard (AES), Elliptic Curve Cryptography (ECC), and the Message-Digest algorithm (MD5). In geo-encryption or location-based encryption techniques, hybrid algorithms can be integrated to increase confidentiality while transmitting data. The resulting hybrid algorithm performs better than classical algorithms with respect to the time taken to encrypt and decrypt. Some researchers have also suggested an encryption algorithm that hybridizes the Data Encryption Standard (DES) and the Digital Signature Algorithm (DSA); this hybrid algorithm enhances the security of equipment information by encrypting electronic tags in an IoT management system.7
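As a hedged illustration of this kind of hybrid (a conceptual sketch, not the exact scheme proposed in the cited work), the code below combines ECC-based key agreement, AES-GCM encryption, and an MD5 digest using the widely used Python cryptography package; names such as iot-session and the sample payload are arbitrary.

```python
import hashlib
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# ECC part: device and server each hold a key pair and derive a shared secret (ECDH).
device_key = ec.generate_private_key(ec.SECP256R1())
server_key = ec.generate_private_key(ec.SECP256R1())
shared_secret = device_key.exchange(ec.ECDH(), server_key.public_key())

# Derive a 256-bit AES session key from the shared secret.
aes_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
               info=b"iot-session").derive(shared_secret)

# AES part: encrypt the sensor payload; the MD5 digest is appended as a simple
# integrity tag (illustrative only -- MD5 is not collision resistant).
payload = b'{"temp": 21.5, "node": "sensor-07"}'
digest = hashlib.md5(payload).hexdigest().encode()
nonce = os.urandom(12)
ciphertext = AESGCM(aes_key).encrypt(nonce, payload + b"|" + digest, None)

# Receiver side: decrypt and re-check the digest.
decrypted = AESGCM(aes_key).decrypt(nonce, ciphertext, None)
plain, _, received_digest = decrypted.partition(b"|")
assert hashlib.md5(plain).hexdigest().encode() == received_digest
print("payload recovered and digest verified")
```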
FIGURE 10.3
The neuro-fuzzy system.6
In the IoT scenario, there is also a need to optimize the energy consumed in the network. Factors such as temperature, the load on the node acting as Cluster Head (CH), and the number of live nodes affect the amount of energy consumed by sensor nodes. A hybrid of the Whale Optimization Algorithm (WOA) and Moth Flame Optimization (MFO) has been proposed to select the optimal CH, which in turn helps optimize the aforementioned factors.8

KEYWORDS
• computational algorithms
• hybridization
• computational intelligent systems
• real life applications
• constraints
REFERENCES
1. https://towardsdatascience.com/introduction-to-evolutionary-algorithms-a8594b484ac#:~:text=An%20EA%20contains%20four%20overall,implementations%20of%20this%20algorithm%20category (accessed on 10-05-2021).
2. Karimov, J.; Ozbayoglu, M. Clustering Quality Improvement of k-Means Using a Hybrid Evolutionary Model. Procedia Comput. Sci. 2015, 61, 38–45.
3. Ahmed, Z. H. A Hybrid Algorithm Combining Lexisearch and Genetic Algorithms for the Quadratic Assignment Problem. Cogent Eng. 2018, 5 (1).
4. Abdalkafor, A. S.; Alheeti, K. M. A. A Hybrid Approach for Scheduling Applications in Cloud Computing Environment. Int. J. Electr. Comput. Eng. 2020, 10 (2), 1387–1397.
5. Negnevitsky, M. Artificial Intelligence: A Guide to Intelligent Systems, 3rd Ed.; Pearson Education.
6. Negnevitsky, M. Hybrid Neuro-Fuzzy Systems: Heterogeneous and Homogeneous Structures. In WRI World Congress on Computer Science and Information Engineering, 2009.
7. Xu, P.; Li, M.; He, Y.-J. A Hybrid Encryption Algorithm in the Application of Equipment Information Management Based on Internet of Things. In Proceedings of the 3rd International Conference on Multimedia Technology, 2013; pp 1123–1129.
8. Maddikunta, P. K. R.; Gadekallu, T. R.; Kaluri, R.; Srivastava, G.; Parizi, R. M.; Khan, M. S. Green Communication in IoT Networks Using a Hybrid Optimization Algorithm. Comput. Commun. 2020, 159, 97–107.
Index
A
Apache Chukwa, 209
Artificial intelligence (AI), 1, 180
application of, 3
blockchain, internet of things (IoT), 3
application of, 7–10
assurance, 5
cloud computing, 5–6
coherence, 5
dexterity, 5
remodeling of finances, 10
roles, 5
smartphone, 4
transforming business, 6–7
computational intelligence, 3
position chart, 2
self-corrective measures, 2
strong, 26
weak, 26
Artificial neural network (ANN), 19
Augmented analytics, 75
data
discovery, 76
Google Data Studio, 86
graphs generated, 86
Microsoft, 79–80, 86
PowerBI interface, 79–80, 81–86
preparation, 76
science and machine learning, 77
types/techniques, 79
visualization, 77
natural-language generation (NLG), 76
Automated machine learning, 27–28
benefits, 28
importance, 28
B
Banking sector
detection and prevention, 216
risk management, 217
Big data, 199–200
applications
product recommendation, 215–216
purchase patterns, 216
retail businesses, 215
banking sector
detection and prevention, 216
risk management, 217
data analysis tools
OpenRefine, 210
R Project, 210
RapidMiner, 210
Tableau, 210
data visualization tools, 210
datawrapper, 211
fusion chart, 211
Google Chart, 211
Microsoft Power BI, 211
Sisense, 211
educational data, 217
ETL tools (Extract, Transform, and
Load), 200
and fuzzy logic
BigData-Ave, 222
BigData-Max, 222
CHI-FRBCS, 221–222
Chi-FRBCS algorithm, 221
classifier, 220
computational intelligence, 219
FRBC has a Rule-base (RB), 221
rule-based systems, 220
Hadoop distributed file system (HDFS),
202
Apache Hadoop 2.0, YARN, 205
Apache HBase, 206
Apache Hive, 206–207
Apache Pig, 207
Apache Spark, 207–208
ApplicationManager, 205
architecture, 202–203
DataNodes, 204
Hadoop Ecosystem, 205–206
Map, 204–205
MapReduce, 204
NameNode, 203–204
reduce, 205
ResourceManager, 205
Sqoop, 207
healthcare, 217
importance of, 201
Internet of things (IoT), 225
agriculture, 228
analytics in, 226
application sectors, 227–228
challenges and issues, 226–227
healthcare, 228
smart city, 227–228
smart surveillance, 228
management
Apache Hadoop, 201–202
manufacturing industries, 218–219
marketing analytics, 218
security issues in
challenges, 213–214
data management, 213
data privacy, 213
infrastructure, 212–213
integrity and reactive security, 213
storage and management tools
Apache Chukwa, 209
Cassandra clusters, 209
Cloudera, 208–209
CouchDB, 209
MongoDB, 209
telecommunication, 218
tools and software
management challenges, 208
web intelligence, 222
connect, 224
correct, 225
learn, 224
listened, 223–224
look method, 223
predict, 224–225
Blockchain
Internet of Things (IoT), 3
application of, 7–10
assurance, 5
cloud computing, 5–6
coherence, 5
dexterity, 5
remodeling of finances, 10
roles, 5
smartphone, 4
transforming business, 6–7
C
Cassandra clusters, 209
Cloud computing, 33
automatic system, 37
availability, 38
date breaches, 38
easy maintenance, 38
economical, 38
great availability of resources, 37
large network access, 37
on-demand self-service, 37
pay as you go, 38
role in, 39
security, 38
security issues with, 38
service vendors, 34
shared dangers, 38–39
shared technology, 38–39
software as a service (SaaS), 36
types, 34–35
Cloud service models
infrastructure as a service (IAAS), 36
platform as a service (PaaS), 36
software as a service (SaaS), 36
Cloudera, 208–209
Conformal cyclic cosmology (CCC), 104
Convolutional neural networks (CNN), 140
CouchDB, 209
Customer relationship management
(CRM), 189, 190
D
Data analysis tools
OpenRefine, 210
R Project, 210
RapidMiner, 210
Tableau, 210
Data science
and big data, 86
Apache Hadoop, 90
distributed file systems, 90
Google file system, 90
graphs generated in, 89
hadoop ecosystem, 91–92
PySpark, 92–93
spark and storm, 92
data analytics techniques
descriptive analytics, 46
diagnostic analytics, 46–47
giant social networking site, 45
information, 45
marketing managers, 47
predictive analytics, 47
prescriptive analytics, 47
Rise of Big Data, 44
SINTEF, 44
Internet of Things (IoT)
potential benefits, 93
process, 47
actual versus predicted, 74
attribute information, 51
bar chart, 63
boxplot, 62
data collection/retrieval, 48
data modeling, 49
data preprocessing, 48–49
data set, 72
data visualization, 49–50
exploratory data analysis (EDA), 49,
54–56
gross income versus branch, 64
heat map, 65
model building, 68–70
model comparison, 75
payment methods, 62
profit generated, 66
quantity distribution, 67
random forest, 73
sale over time, 66
sales per hour, 61
total sales per product, 59–60
Data visualization tools, 210
data wrapper, 211
fusion chart, 211
Google Chart, 211
Microsoft Power BI, 211
Sisense, 211
Deep learning
applications
adding color, 195
facial recognition, 193
gaming, 194
handwriting recognition, 193
multi-agent learning, 193
object detection and recognition, 194
sales and marketing, 195
security and surveillance, 195
self-supervised learning, 193
silent videos, 195
speech recognition, 193
voice recognition, 193
artificial intelligence (AI), 180
computer version models, 179
concept
customer relationship management
(CRM), 189, 190
design and training, 185
accelerating deep learning models, 186
neural network, 186
fashion technology
apparel designing, 194
manufacturing, 194
languages
MATLAB, 190–191
performing deep learning, 191
single workflow, 191
videos and images, 191
visualizing and creating models
easily, 191
learning neural networks
autoencoders, 183–184
convolution neural network, 184
deep belief networks, 184
recurrent neural networks, 185
reinforcement learning, 185
machine learning (ML)
data conditions, 182
execution time, 183
feature designing, 182
hardware conditions, 182
interpretability, 183
problem-solving approach, 182–183
medications
disease diagnosis, 194
personalized treatments, 194
neural networks
learning algorithms, 188
processing elements, 187–188
topology, 188
Python
MATPLOTLIB, 192
NumPy, 192
PANDAS, 192
scikit-learn, 192
seaborn, 192
working of, 180–181
Deep learning (DL), 27
Differential evolution (DE), 150
E
Environment for Modeling Simulation and
Optimization (EMSO), 171
ETL tools (Extract, Transform, and Load),
200
Evolution strategy (ES), 150
Evolutionary algorithms (EAs)
characteristics of, 148–149
classical evolutionary algorithm
pseudocode of, 147–148
components of, 150
candidate solutions, 151
fitness function, 151
next generation, 152
parent recombination, 151–152
survivor selection, 152
evolutionary computation (EC), 145,
146
general implementation, 147
genetic algorithm (GA)
applications of, 160–161
binary encoding, 153–154
bit flip mutation, 157
chromosome, 152
elitism selection, 156
fitness, 152
fitness function, 153
inversion mutation, 158
multi-point crossover, 156
mutation, 153
one-point crossover, 156
permutation encoding, 154
population, 152
random resetting, 157
rank selection, 155
real value encoding, 154
recombination, 153
scramble mutation, 158
steady-state selection, 155–156
swap mutation, 157
termination criterion, 153
tournament selection, 155
traveling salesman problem (TSP),
152
tree encoding, 154
uniform crossover, 157
genetic programming (GP), 158
evolution operations, 159
fitness factor, 159
implementation of, 159
languages for, 160
primitive functions, 159
terminals, 159
termination, 159
in machine learning (ML), 160
Survival of the Fittest
theory, 146
types
differential evolution (DE), 150
evolution strategy (ES), 150
evolutionary programming, 150
genetic algorithm, 149–150
genetic programming (GP), 150
learning classifier system, 150
neuro-evolution (NE), 150
variant
applications of, 160–161
binary encoding, 153–154
bit flip mutation, 157
chromosome, 152
elitism selection, 156
fitness, 152
fitness function, 153
inversion mutation, 158
multi-point crossover, 156
mutation, 153
one-point crossover, 156
permutation encoding, 154
population, 152
random resetting, 157
rank selection, 155
real value encoding, 154
recombination, 153
scramble mutation, 158
steady-state selection, 155–156
swap mutation, 157
termination criterion, 153
tournament selection, 155
traveling salesman problem (TSP), 152
tree encoding, 154
uniform crossover, 157
Evolutionary computation (EC), 145, 146
Exploratory data analysis (EDA), 49
F
FRBC has a Rule-base (RB), 221
Fuzzy logic
BigData-Ave, 222
BigData-Max, 222
CHI-FRBCS, 221–222
Chi-FRBCS algorithm, 221
classifier, 220
computational intelligence, 219
FRBC has a Rule-base (RB), 221
rule-based systems, 220
G
Genetic algorithm (GA)
applications of, 160–161
binary encoding, 153–154
bit flip mutation, 157
chromosome, 152
elitism selection, 156
fitness, 152
fitness function, 153
inversion mutation, 158
multi-point crossover, 156
mutation, 153
one-point crossover, 156
permutation encoding, 154
population, 152
random resetting, 157
rank selection, 155
real value encoding, 154
recombination, 153
scramble mutation, 158
steady-state selection, 155–156
swap mutation, 157
termination criterion, 153
tournament selection, 155
traveling salesman problem (TSP), 152
tree encoding, 154
uniform crossover, 157
Genetic programming (GP), 158
evolution operations, 159
fitness factor, 159
implementation of, 159
languages for, 160
primitive functions, 159
terminals, 159
termination, 159
Google Data Studio, 86
H
Hadoop distributed file system (HDFS),
202
Apache Hadoop 2.0, YARN, 205
Apache HBase, 206
Apache Hive, 206–207
Apache Pig, 207
Apache Spark, 207–208
ApplicationManager, 205
architecture, 202–203
DataNodes, 204
Hadoop Ecosystem, 205–206
Map, 204–205
MapReduce, 204
NameNode, 203–204
reduce, 205
ResourceManager, 205
Sqoop, 207
Human visual system
digital image file format, 119–120
lossless compression, 120
lossy compression, 120
primary lens, 118–119
raster or bitmap image, 119
vector image, 119
Hybrid algorithm, 231
attempts, 236
distributed algorithms, 232
evolutionary algorithm, 234–235
hybrid intelligent systems, 236
applications, 238–240
fuzzy logic and neural networks, 238
neural expert systems, 237–238
K-Means algorithm, 235
particle swarm optimization (PSO), 235
Principle Component Analysis (PCA),
235
real life applications, 236
types
evolutionary algorithm, 232
evolutionary algorithms, 232
genetic operators, 234
initialization, 233
multiple objective functions, 234
selection, 234
termination, 234
Hybrid intelligent systems, 236
applications, 238–240
fuzzy logic and neural networks, 238
neural expert systems, 237–238
I
Image-processing
computer vision, 122–123
process
filtration, 124–125
data augmentation, 127
deep learning, 139
convolutional neural networks (CNN), 140
data labeling, 140–141
image classification, 140
RCNN, 140–141
recurrent convolutional neural
network (RCNN), 140
digital image representation
matrices and, 121
human visual system
digital image file format, 119–120
lossless compression, 120
lossy compression, 120
primary lens, 118–119
raster or bitmap image, 119
vector image, 119
image analysis
classification techniques, 136–137
feature extraction, 136
segmentation, 135–136
steps, 134–135
image segmentation
canny edge detection, 131
clustering, 133–134
discontinuities, 129
edge detection, 130
Prewitt operator edge detection, 131
region growing, 133
Robert edge detection, 132
similarities, 129–130
Thresholding, 132–133
machine learning (ML), 137–138
bag of visual words, 138
codebook construction, 139
feature extraction, 138
scale-invariant feature transform
(SIFT), 138
vector quantization, 139
phases, 121
acquisition, 122
color image processing, 122
filtering methods, 129
Gaussian noise, 128
image compression, 122
image enhancement, 122
image restoration, 122
morphological processing, 122
and multi-resolution processing, 122
object detection and recognition, 122
representation and description, 122
salt and pepper, 128
segmentation procedure, 122
wavelets, 122
preprocessing. methodology of, 128
standardization, 127
techniques, 126–127
types
binary, 120
8-bit color format, 120
16-bit color format, 120
black and white, 120
Infrastructure as a service (IAAS), 36
Internet of things (IoT), 225
agriculture, 228
analytics in, 226
application sectors, 227–228
challenges and issues, 226–227
healthcare, 228
smart city, 227–228
smart surveillance, 228
K
K-Means algorithm, 235
L
Learning neural networks
autoencoders, 183–184
convolution neural network, 184
deep belief networks, 184
recurrent neural networks, 185
reinforcement learning, 185
M
Machine learning (ML), 26, 137–138
advantages, 16
anomaly detection, 21
application
agriculture, 40
education, 40
energy, 40
feedstock, 40
financial services, 40
health care & life science, 39–40
manufacturing, 39
retail, 40
travel and hospitality, 39
and utilities, 40
artificial neural network (ANN), 19
bag of visual words, 138
cloud computing, 33
automatic system, 37
availability, 38
date breaches, 38
easy maintenance, 38
economical, 38
great availability of resources, 37
large network access, 37
on-demand self-service, 37
pay as you go, 38
role in, 39
security, 38
security issues with, 38
service vendors, 34
shared dangers, 38–39
shared technology, 38–39
software as a service (SaaS), 36
types, 34–35
cloud service models
infrastructure as a service (IAAS), 36
platform as a service (PaaS), 36
software as a service (SaaS), 36
codebook construction, 139
concept, 14
data conditions, 182
decision tree, 17
defined by Arthur Samuel, 14
disadvantage, 16
examples, 15–16
execution time, 183
feature designing, 182
feature extraction, 138
Gaussian mixture model, 20
hardware conditions, 182
Internet of Things (IoT)
application, 32
cloud computing, 30
devices and sensors, 29–30
gateway, 30
goals, 31
security, 32–33
user interface, 30
interpretability, 183
K-mean clustering, 21
knowledge of humans, 15
linear regression, 16–17
logistic regression, 17
models
classification, 22
clustering, 22, 25
dimensionality reduction, 25
model selection, 25
processing phase, 25
regression, 22
Naive Bayes classifier, 17–19
phases
learning and prediction, 15
problem-solving approach, 182–183
random forest, 19
scale-invariant feature transform (SIFT),
138
support Vector Machine (SVM), 20
types, 22, 23–24
vector quantization, 139
MongoDB, 209
N
Natural-language generation (NLG), 76
Neuro-evolution (NE), 150
O
One-point crossover, 156
OpenRefine, 210
HYDROFLO, 171–172
increased accuracy, 165
insight into dynamics, 164–165
ITHACA, 172
risk-free environment, 164
save money and time, 164
simulation approaches, 167
simulation software’s, 168–169
testing
verification and validation, 172–174
visualization, 164
Python
MATPLOTLIB, 192
NumPy, 192
PANDAS, 192
scikit-learn, 192
seaborn, 192
P
Q
Particle swarm optimization (PSO), 235
Platform as a service (PaaS), 36
PowerBI interface, 79–80, 81–86
Principle Component Analysis (PCA), 235
Process simulation, 163
application
control station simulation, 168
exact production planning, 168
logistic simulation, 167
machine scheduling, 168
personnel simulation, 168
in production, 168
supply chain simulation, 168
Aspen Plus, 169
case study, 174–176
chemcad, 170
chromworks, 169
COCO, 170–171
design II for windows, 170
disadvantages of, 174
Environment for Modeling Simulation
and Optimization (EMSO), 171
features of
animation and layout features, 166
model building features, 165
output features, 166–167
runtime environment, 166
handle uncertainty, 165
Q language, 113
QCL (Quantum Computer Language), 112
QPL (Qualified Public Liability Corporation), 113
Quantum computing
applications
astronomy, 101–102
cyber security, 99–100
financial markets, 101
healthcare, 101
quantum cryptography, 100
quantum simulation, 99
random number generation, 102
Shor’s algorithm, 100
classical bits and qubits, 98
computer and classical computer, 98–99
engineering challenges
hardware and algorithms, 108–110
interface, 111
matters internal, 111
normalize or optimize, 111–112
mathematical description, 102
conformal cyclic cosmology (CCC),
104
hypothesis 1, 104–107
mathematical tools, 107–108
postulates, 103–104
structure, 103
programming
functional, 113
imperative language, 112–113
Lambda calculus, 114
Q language, 113
QCL (Quantum Computer Language), 112
QPL (Qualified Public Liability Corporation), 113
Quantum Guarded Command Language (qGCL), 113
Quantum Guarded Command Language (qGCL), 113
R
R Project, 210
RapidMiner, 210
Recurrent convolutional neural network
(RCNN), 140
S
Scale-invariant feature transform (SIFT),
138
Security issues
challenges, 213–214
data management, 213
data privacy, 213
infrastructure, 212–213
integrity and reactive security, 213
Shor’s algorithm, 100
Software as a service (SaaS), 36
Storage and management tools
Apache Chukwa, 209
Cassandra clusters, 209
Cloudera, 208–209
CouchDB, 209
MongoDB, 209
Support Vector Machine (SVM), 20
T
Tableau, 210
Termination criterion, 153
Tournament selection, 155
Traveling salesman problem (TSP), 152
Tree encoding, 154
U
Uniform crossover, 157
V
Variant
applications of, 160–161
binary encoding, 153–154
bit flip mutation, 157
chromosome, 152
elitism selection, 156
fitness, 152
fitness function, 153
inversion mutation, 158
multi-point crossover, 156
mutation, 153
one-point crossover, 156
permutation encoding, 154
population, 152
random resetting, 157
rank selection, 155
real value encoding, 154
recombination, 153
scramble mutation, 158
steady-state selection, 155–156
swap mutation, 157
termination criterion, 153
tournament selection, 155
traveling salesman problem (TSP), 152
tree encoding, 154
uniform crossover, 157
Vector quantization, 139
W
Web intelligence, 222
connect, 224
correct, 225
learn, 224
listened, 223–224
look method, 223
predict, 224–225