Studies in Big Data 94
Parikshit Narendra Mahalle · Gitanjali Rahul Shinde · Priya Dudhale Pise · Jyoti Yogesh Deshmukh
Foundations of Data Science for Engineering Problem Solving
Studies in Big Data Volume 94
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series "Studies in Big Data" (SBD) publishes new developments and advances in the various areas of Big Data, quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams, and others. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are reviewed in a single blind peer review process. Indexed by SCOPUS, SCIMAGO and zbMATH. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/11970
Parikshit Narendra Mahalle · Gitanjali Rahul Shinde · Priya Dudhale Pise · Jyoti Yogesh Deshmukh
Foundations of Data Science for Engineering Problem Solving
Parikshit Narendra Mahalle
Department of Artificial Intelligence and Data Science, Vishwakarma Institute of Information Technology, Pune, India
Department of Computer Engineering, STES's Smt. Kashibai Navale College of Engineering, Pune, Maharashtra, India

Gitanjali Rahul Shinde
Department of Computer Engineering, Vishwakarma Institute of Information Technology, Pune, India

Priya Dudhale Pise
Dr. D. Y. Patil Biotechnology and Bioinformatics Institute, Pune, India

Jyoti Yogesh Deshmukh
Department of Computer Engineering, JSPM's Bhivarabai Sawant Institute of Technology and Research, Pune, India
ISSN 2197-6503 ISSN 2197-6511 (electronic)
Studies in Big Data
ISBN 978-981-16-5159-5 ISBN 978-981-16-5160-1 (eBook)
https://doi.org/10.1007/978-981-16-5160-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
Senses (Sense-Organs) are purified by water; Mind is purified by truth; Soul is purified by learning and penance; while intelligence is purified by knowledge. —Sanskrit Shubhashitani
The book Foundations of Data Science for Engineering Problem Solving is envisioned to present a detailed and comprehensive overview of the foundations of data science, including the evolution of data science, data collection, preparation, analysis of data using machine learning algorithms, data visualization, and how data science can give better insights into various use cases in science and engineering. Over the last decade there has been much advancement in very-large-scale integration technology and the semiconductor industry, making electronic wearable devices and all Wi-Fi-enabled devices cheaper and tinier, with functionalities of sensing, computing and communication. In addition, the Internet is available faster and at a more affordable cost than in the past. For these reasons, the data generated by these devices, and its posting to the cloud, is increasing at a very fast rate. The data has become big in terms of volume, variety, velocity and complexity, and all information technology leaders are facing the problem of how to deal with this big data. The book focuses on how data science can enrich applications in the science and engineering domains to make them smarter. The main objective of this book is to help readers understand how this evolving field of data science is going to be useful in forecasting, prediction, estimation and recommendation. The entire notion of the book is to explore the foundations of data science from the basics to applications, followed by case studies in science and engineering.

The entire book is mainly divided into three parts. The first part of the book deals with big data and its emergence in today's context, data science basics, its evolution and the need for it today, and various applications. Data collection and preparation is the main part of any data science application, and includes data exploration, various types of datasets, their classification based on sources and types, data pre-processing phases and the different tasks involved;
web scraping tools like Beautiful Soup, Scrapy and urllib are presented and discussed in this part of the book. The next part of the book covers the important topic of data visualization, its need, challenges, respective tools and the modelling of data. Visualization tools like Tableau, Matplotlib, Looker, Seaborn, PowerBI and IBM Cognos Analytics, and their functioning, are discussed in detail in this part of the book. The process of data modelling, the impact of modelling on outcomes, the decision-making process as well as the role of data science in engineering problem solving are also elucidated in this part of the book. The main objective of this part of the book is to focus on various emerging tools and techniques adopted in industry for data science applications to enhance business intelligence. The last part of the book discusses case studies of data science in the fields of information, communication and technology, civil engineering, mechanical engineering and health care. This part covers important use cases like structural engineering, geotechnical engineering, construction management, recommendation systems, clinical decision support systems, preventive health care, control engineering, solid mechanics, predicting mechanical failure, etc. The detailed use of data science in these areas, open research issues and the future outlook of data science are also presented and discussed. The main characteristics of this book are

• In-depth and detailed description of all the topics;
• Use case and scenario-based descriptions for easier understanding;
• Individual chapters covering case studies in prominent branches of engineering;
• Research and application development perspective with implementation details;
• Hands-on results and discussion;
• Numerous examples, technical descriptions and real-world scenarios;
• Simple and easy language so that it can be useful to a wide range of stakeholders, from laymen to educated users, from villages to metros and from national to global levels.
Data science and its applications to various branches of science, technology and engineering are now fundamental courses in all undergraduate programmes in Computer Science, Computer Engineering, Information Technology as well as Electronics and Telecommunication Engineering. Many universities and autonomous institutes across the globe have started an undergraduate programme titled "Artificial Intelligence and Data Science", as well as honours programmes in the same subject which are open to all branches of engineering. Because of this, the book is useful to all undergraduate students of these courses for project development and product design in data science, machine learning and artificial intelligence. This book is also useful to a wider range of researchers and design engineers who are concerned with exploring data science for engineering use cases. Essentially, this book is most useful to all entrepreneurs who are interested in starting start-ups in the field of applying data science to the civil, mechanical and chemical engineering and healthcare domains, as well as in related product development. The book is useful for undergraduates,
postgraduates, industry, researchers and research scholars in ICT, and we are sure that this book will be well received by all stakeholders.

Pune, India    Parikshit Narendra Mahalle
Pune, India    Gitanjali Rahul Shinde
Pune, India    Priya Dudhale Pise
Pune, India    Jyoti Yogesh Deshmukh
Contents
1 Introduction to Data Science
   1.1 What is Data Science?
   1.2 Evolution with a Need for Data Science
   1.3 Applications of Data Science
      1.3.1 Use of Data Science in D-Mart (E-commerce and Retail Management)
      1.3.2 Narrow Artificial Intelligence (AI)
      1.3.3 Trustworthy Artificial Intelligence (AI)
   1.4 Summary
   References

2 Data Collection and Preparation
   2.1 Types of Data
   2.2 Datasets
   2.3 Taxonomy of Dataset
   2.4 Statistical Perspective
   2.5 Dataset Pre-processing
   2.6 Data Cleaning
      2.6.1 Handling Missing Values
      2.6.2 Removing Noisy Data
   2.7 Data Transformation
      2.7.1 Normalization
      2.7.2 Encoding
   2.8 Data Reduction
      2.8.1 Attribute Feature Selection
      2.8.2 Dimensionality Reduction
      2.8.3 Numerosity Reduction
   2.9 Web Scrapping Tools
   2.10 Summary
   References

3 Data Analytics and Learning Techniques
   3.1 Data Analytics Overview
   3.2 Machine Learning Approaches
      3.2.1 Supervised Learning
      3.2.2 Unsupervised Learning
      3.2.3 Reinforcement Learning
   3.3 Deep Learning Approaches
   3.4 Data Science Roles
   References

4 Data Visualization Tools and Data Modelling
   4.1 Need of Visualization of Data
      4.1.1 Challenges of Data Visualization
      4.1.2 Steps of Data Visualization
   4.2 Visualization Tools
      4.2.1 Importance of Usage of Tools for Visualization
      4.2.2 MS Excel
      4.2.3 Tableau
      4.2.4 Matplotlib
      4.2.5 Datawrapper
      4.2.6 Microsoft PowerBI
   4.3 Summary
   References

5 Data Science in Information, Communication and Technology
   5.1 Introduction
   5.2 Motivation
   5.3 Case Study in Computer Engineering
      5.3.1 To Choose Fastest Route to Reach Destination
      5.3.2 To Get Food Recipe Recommendations of Our Interest
      5.3.3 The Famous Netflix Case Study
      5.3.4 Case Study of Amazon Using Data Science
      5.3.5 Case Study on KCC (Kisaan Call Center)
   5.4 Summary
   References

6 Data Science in Civil Engineering and Mechanical Engineering
   6.1 Introduction
   6.2 Motivation
   6.3 Case Studies in Civil Engineering
   6.4 Case Studies in Mechanical Engineering
   6.5 Summary
   References

7 Data Science in Clinical Decision System
   7.1 Introduction
   7.2 Motivation
   7.3 Case Study in Clinical Decision System
      7.3.1 Preventive Measures for Cardiovascular Disease Using Data Science
      7.3.2 Case Study on COVID-19 Prediction
   7.4 Summary
   References

8 Conclusions
   8.1 Conclusions
   8.2 Open Research Issues
   8.3 Future Outlook
   References
About the Authors
Dr. Parikshit Narendra Mahalle obtained his B.E. degree in Computer Engineering from Amravati University, his M.E. degree from SPPU, Pune, and his Ph.D. with specialization in Wireless Communication from Aalborg University, Denmark. He was a Postdoctoral Researcher at CMI, Aalborg University, Copenhagen. Currently, he is working as Professor and Head of the Department of Artificial Intelligence and Data Science at Vishwakarma Institute of Information Technology and is a recognized Ph.D. guide of SPPU, Pune. He has 20 years of teaching and research experience. He is on the Research and Recognition Committee at several universities. He is a Senior Member of IEEE and ACM and a Life Member of CSI and ISTE. He is a Reviewer and Editor for ACM, Springer and Elsevier journals and a Member of the Editorial Review Board for IGI Global. He has published 150+ publications with 1242 citations and an H-index of 14. He has edited 5 books, authored 13 books and has 7 patents to his credit. He has published a book on Data Analytics for the COVID-19 Outbreak. He has delivered 100+ lectures at national and international levels on IoT, big data and digitization. He has worked as BoS Chairman for Information Technology and is working as a BoS Member for Computer Engineering at SPPU and several other institutions. He received the "Best Faculty Award" from Sinhgad Institutes and Cognizant Technology Solutions.

Dr. Gitanjali Rahul Shinde has overall 12 years of experience and is presently working as an SPPU-approved Assistant Professor in the Department of Computer Engineering, Vishwakarma Institute of Information Technology, Pune. She completed her Ph.D. in Wireless Communication at CMI, Aalborg University, Copenhagen, Denmark, on the research problem statement "Cluster Framework for Internet of People, Things and Services"; the Ph.D. was awarded on May 8, 2018. She obtained her M.E. (Computer Engineering) degree from the University of Pune in 2012 and her B.E. (Computer Engineering) degree from the University of Pune in 2006. She has received research funding for the project "Lightweight group authentication for IoT" from SPPU, Pune. She has presented a research article at the World Wireless Research Forum (WWRF) meeting, Beijing, China. She has published 50+ papers in national and international conferences and journals. She is an author of 5+ books with publishers
like Springer Nature and CRC Taylor & Francis Group. She is also an editor of books with De Gruyter and Springer Nature Press. She is a reviewer for prominent journals from IGI and for IEEE Transactions.

Dr. Priya Dudhale Pise has 16 years of experience. She completed her Ph.D. in Cloud Computing and Big Data Security at JJTU, Rajasthan, with the title "Sensitive Data Sharing Securely in Big Data for Privacy Preservation on Recent Operating Systems"; the Ph.D. was awarded on November 25, 2018. She pursued her B.E. in Information Technology at MIT Kothrud (SPPU) and her Master's degree, M.E. in Computer Engineering, at MIT Alandi (SPPU) in 2012. She won the "Best Technical Paper Award" at 2 national and 1 international conferences. She received the "Backbone of Indian Technical Academics" award in December 2018. She is the author of a book on "Fundamentals of Data Structures." She also has one national and one international patent to her name. She recently presented her research article at an ACM international conference held at the University of Cambridge, London, UK. She has presented 50+ papers in national and international conferences and journals. She is an editorial member of a renowned journal.

Ms. Jyoti Yogesh Deshmukh has overall 11 years of experience and is presently working as an SPPU-approved Assistant Professor in the Department of Computer Engineering, JSPM's Bhivarabai Sawant Institute of Technology and Research, Wagholi, Pune-412207. She is pursuing her Ph.D. in Cloud Computing and Data Security at JJTU, Rajasthan, on the research problem statement "Message Privacy with Load Balancing Using Attribute-Based Encryption." She obtained her M.E. (Computer Engineering) degree from the University of Pune in 2015 and her B.E. (Information Technology) degree from STES's Smt. Kashibai Navale College of Engineering, Pune-41, in 2006. She has published 10+ papers in national and international conferences and journals.
Chapter 1
Introduction to Data Science
1.1 What is Data Science?

In this chapter, we introduce what data science is and how it is needed in our day-to-day life, with regard to real-life uses, engineering purposes and the medical field. Data science is nowadays a much-needed discipline, as people, different kinds of companies, commercial organizations, banking sectors, farmers, medical administration, along with different kinds of reporting and the educational sector, are all creating, storing, analysing, predicting and generating results in the form of data [1]. So we will have a look at what this data is and what exactly data science is. Data is being created in a variety of formats, like text, images or videos, by the big organizations listed above. Analysing this data from different perspectives, using scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data, and applying that knowledge and those insights across a broad range of business domains, is what data science is! A sample structure of the requirements of data science is shown in Fig. 1.1.

Data storage and use across the world has dramatically increased each year. The best part is that a dataset in the right hands can help predict and shape the future of companies that use a lot of data to run and expand their business [2]. Data science can help companies take better and faster decisions, which may lead to business expansion and keep them on top. The impact of data science on real-world life is to improve customer service, navigation and useful, timely recommendations, and also to convert voice to text; it also helps in image processing. The world of data science is different from any other computational world. It includes statisticians, data engineers and business analysts [3]. A remarkable difference between a data scientist and a data engineer is that data scientists optimize data processing, whereas data engineers are responsible for optimizing the data flow. Data scientists are also responsible for establishing collection methods and working with enterprise systems. A key difference between data scientists and statisticians is that data scientists work on data collection
and data cleaning, whereas statisticians work on surveys, polls and experiments. In a similar fashion, one needs to understand the difference between a business analyst and a data analyst, who works on database design. Close coordination between a data analyst and a business analyst makes it possible to extract and analyse huge amounts of information in less time and obtain an accurate analysis.

Fig. 1.1 Data science. Alt Text: Fig. 1.1 presents data science building blocks, i.e. scientific methods, visualization, statistical modelling, statistical computing, data technology, data research, data consulting and real-world applications

Data science has an immense impact on the healthcare domain, as patients' records from many past years can be utilized and visualized, which leads to better results than earlier. Data science also helps to study endangered species and understand their habitats deeply; predictive models and algorithms can create insights that give solutions to help them survive. In short, data science is helping the world redefine problems to get more accurate, desired results. To make the world a better place, data science helps to analyse and visualize solutions.
1.2 Evolution with a Need for Data Science

After 2012, data has been considered the newest form of currency. Recently, rapid enhancement in technology has tended to generate huge amounts of information. In short, data has become a strategic resource of our time, and this big captured data can help a business grow in an unimaginable fashion. It is data of such large scale and complexity that
it cannot be easily stored or handled by any conventional data processing tools. Big data is also data, but it refers to massive, complex collections of information growing at ever-increasing rates on a huge scale. Nowadays, we are used to different kinds of electronic gadgets for an easy and smart life. If we look at the information stored for even single and simple uses, it is many times more than earlier, so why not adapt the technology to make our lives better than before. One important aspect is to concentrate on the following examples, where data of big size is saved, archived, processed, analysed, visualized and used for the consistent growth of a business. Data stored at big size has immense importance for its business. Tentative data generated in one internet minute is as follows:

1. Facebook (1,47,000 photos uploaded)
2. WhatsApp (41.7 million messages shared)
3. YouTube (500 h of video uploaded)
4. Netflix (4,04,444 h of video streamed)
5. Instagram (3,72,000 stories uploaded)
6. ZOOM (2,08,333 participants in meetings)
7. Microsoft Teams (52,083 users connected)
8. Twitter (319 new users gained)
9. Amazon (6,659 packages shipped)
10. LinkedIn (69,444 jobs applied)
Smartphone logs are generating 1 TB each day, and imaging technologies produce huge numbers of images used for medical diagnosis. The above are a few important and crucial sources of data generation, which must be stored and used for various analytical purposes. To handle such a big volume of data, the techniques of data science are introduced so that all the information can be stored, understood and analysed, and decisions can be made to obtain excellent profit [4, 5]. Keeping this in view, the following are the important attributes of big data, as shown in Fig. 1.2.

Fig. 1.2 Big data Vs. Alt Text: Fig. 1.2 presents big data Vs, i.e. value, volume, velocity, variety and veracity

1. Volume: Volume refers to the unlimited and huge amount of information generated in a short unit of time by social media, smartphones, credit cards, files, applications, banks and so on. In big data, a distributed system is used to store data at various locations, which is then brought together by the Hadoop framework.
2. Variety: Commonly, the information generated by most electronic gadgets comes in multiple varieties, namely:
   a. Structured: Information that can be stored, accessed and processed in a fixed format is named "structured data". Such data has a defined data type and format, for example RDBMS and OLAP data.
   b. Semi-structured: Textual data files which are self-describing. Semi-structured data, as its name says, may contain both forms; commonly, information in an XML file is semi-structured.
   c. Unstructured: Unstructured information has no inherent format, e.g. text, PDF, audio and video. Information that exists with an unknown structure or format is classified as unstructured data. Such data presents many challenges in terms of deriving and processing values.
3. Value: As the name suggests, value relates to the worth of the data stored and/or processed. Big data implies enormous volumes of data. It used to be employee-created data; now, since data is generated by machines, networks and human interaction on systems like social media, the volume of data to be analysed is massive.
4. Velocity: Velocity plays a vital role and refers to the speed at which data is generated and also the speed at which we must provide information from it. It is a very crucial parameter of big data, as data is generated consistently, in many different forms, for common applications. Government and military domain: huge numbers of records are generated during processing. Healthcare domain: big data has started to show the difference between traditional and advanced techniques and their impact on people's lives in diagnostic and predictive analysis, resulting in the best personal healthcare service available at the doorstep. Telecommunication and multimedia domain: this is one of the prime applications of big data, as no less than zettabytes of data are generated every single day; big data technologies are needed to save and process such huge amounts of data in tourism and travel, banking and loans, the share market, air travel and so on.
5. Veracity: Veracity is universally about how correct or truthful a data set is. Within the context of big data, however, it takes on additional importance. Removing things like bias, irregularities or inconsistencies, duplication and volatility are some of the aspects that affect the accuracy of big data.
A traditional way of earning in previous days was based on whatever sources of employment people had available. But nowadays, for "attrition and retention of employees", big organizations or companies need to plan carefully [6], and here data science comes in. Consider an organization with six departments. Recruitment of employees has been done as per their qualification, age, experience, location, salary expectation and the expertise needed to be part of a team (organization). So the organization will have a dataset of all employees in the prescribed format of the organization (for data science analysis we need the dataset in a CSV format). The main problem is how to retain the employees, how to keep them satisfied and how to motivate them for the progress and profit of the organization as well as for their own growth [7, 8]. Depending on the requirements of the organization and the performance of the employee, the organization will try to retain the employee. Let us see how data science helps us to do so. We have, in all, three categories to handle this scenario as per the data science concept [9]; a brief code sketch of the descriptive step follows the list.

1. Descriptive: We (data scientists) will have a summary of the statistics of all the employees of the organization for the last 10 years [10]. For example: in the last 10 years, not a single employee has left unless that employee reached retirement age.
2. Predictive: In this category, machine learning algorithms or AI models use current data and previous experience or policies to improve retention of employees. For example: predict how many employees can be retained in the next 10 years, and how.
3. Prescriptive: In this category, a recommendation system is used to obtain improved results or benefits with the help of a prescriptive model [11]. For example: if an employee is willing to switch jobs for a reason that is not substantial, the organization can retain that employee by explaining some recommendations that are useful for the employee's future career.
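As a minimal, hypothetical sketch of the descriptive step on such an employee CSV (the file name employees.csv, the column names and the left_company label are illustrative assumptions, not details given in the book), the summary statistics could be computed with pandas as follows:

```python
import pandas as pd

# Load the organization's employee records (assumed file and columns:
# department, age, experience_years, salary, left_company as 0/1).
df = pd.read_csv("employees.csv")

# Descriptive: summarize the workforce recorded over the last 10 years.
print(df.describe())                               # numeric summaries
print(df.groupby("department")["salary"].mean())   # average salary per department

# Attrition rate per department: fraction of recorded employees who left.
attrition = df.groupby("department")["left_company"].mean()
print(attrition.sort_values(ascending=False))
```

The predictive and prescriptive steps would build on the same dataset, for example by fitting a classification model on the left_company label and by recommending retention actions for employees the model flags as likely to leave.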
Most of the younger generation, as well as users of all ages, nowadays have a Facebook account to connect socially with a number of completely known or partially known people (called Facebook friends). In earlier times, people sent letters, photos and important documents to a house or office postal address. Then, people used email addresses to send softcopy images, audio or videos as private attachments to a respective (particular or known) email address. But nowadays, in the world of big data and data science, we are used to using Facebook (a social site) to upload updated images, posts, videos or events. Facebook takes care of all its users by observing their online behaviour and predicting their future behaviour, encouraging them to access more features of Facebook by giving different recommendations related to more friends, memories and social groups matching the user's interests [12]. The social site clouds (particularly the datasets stored on the respective clouds) of Google, Facebook and WhatsApp work together and give their best recommendations to Facebook users.
• How many Facebook friends does a particular Facebook user have as of the last 1 year? (Descriptive)
• How many Facebook friends can that particular Facebook user have in the coming 1 year? (Predictive)
• How many likes or comments are there on a particular post or photo of that particular Facebook user?
• How long does that particular Facebook user spend accessing Facebook?
• How frequently does the user post events, photos, thoughts (text) or videos?
• How many mutual friend suggestions can be recommended to that particular Facebook user? (Prescriptive)

There are different purposes for using Facebook (a social site): business, educational, promotional, NGO work, helping farmers or needy people, providing an opportunity to share authentic details related to any field, etc. Nowadays many Facebook groups are created on Facebook to connect socially with a number of Facebook users [13]. Data science plays a very important role in analysing such socially connected users' online behaviour for the prediction and recommendation of particular products, activities, polls, reviews, framing of guidelines, etc.
1.3 Applications of Data Science

Data science is everywhere, as every one of us uses a number of electronic gadgets and smart devices and performs online activities on different e-commerce sites, online shopping sites and social websites. So there are a number of applications of data science. In this section, we will have a look at some of the applications that are useful in various domains for obtaining noticeable profit.
1.3.1 Use of Data Science in D-Mart (E-commerce and Retail Management)

Everyone in city areas knows about D-Mart and its slogan, "Daily Discount Daily Savings". How do they live up to this slogan? We will see in this application of data science. Mainly, D-Mart considers customers' expectations and choices, and as per those choices, recommended products are kept near the products they like. The customer's behaviour, and the frequency with which they choose any particular product, is observed. Seasonal products are also taken into consideration when recommending a particular product [14]. D-Mart has the best possible features, which attract all kinds of customers to shop at their retail stores, as listed below.
Fig. 1.3 D-Mart retail store features. Alt Text: Fig. 1.3 presents D-Mart retail store features like visual merchandizing, e-commerce and retail management
• Low prices
• Big stores
• Great discount and schemes
• Short distribution channel
• Wide variety of products and brands
• Offering value for money
• Stand-alone stores
• Spacious and located at prime locations.
Things in the store are placed in such a way that customers moving through the store can, at a glance, easily find the required items available on a particular rack or shelf; here a supervised learning approach is followed. This technique is called "visual merchandizing", and it reduces the time spent enquiring about and searching for a particular product. This strategy is observed and implemented through data analysis of customer requirements; a small illustrative sketch of this kind of analysis is given below. Generally, families of all categories buy biscuits, chocolates, ice-creams or fast-food kinds of products, so these products, in a variety of companies, brands and types, are kept near the entrance [15]. Customer requirements are also considered by keeping trolleys of various sizes at the entrance for convenience. There are small racks on which toys and other small items are kept, based on the analysis that customers waiting in the queue at the billing counter may glance at those products and decide to buy them as well. Packaging and pricing tags are available for each and every product, and staples are available in bulk, also with pricing tags, for a variety of brands.
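As an illustrative, hypothetical sketch of the kind of analysis behind such placement decisions (the transactions.csv file and its columns are assumptions made here for illustration, not details given in the book), one can count how often pairs of products appear in the same bill and place frequently co-purchased items near each other:

```python
import pandas as pd
from itertools import combinations
from collections import Counter

# Assumed data: one row per purchased item, tagged with its bill id.
items = pd.read_csv("transactions.csv")        # columns: bill_id, product

# Count how often each pair of products is bought in the same bill.
pair_counts = Counter()
for _, bill in items.groupby("bill_id"):
    for pair in combinations(sorted(set(bill["product"])), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidates for adjacent shelf placement.
for pair, count in pair_counts.most_common(10):
    print(pair, count)
```

A fuller treatment would use association-rule mining, but even these simple co-occurrence counts capture the intuition behind the visual merchandizing described above.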
Fig. 1.4 D-Mart own building. Alt Text: Fig. 1.4 presents D-Mart own building, i.e. factory to direct consumers through D-Mart
As per the season (summer, winter, rainy) or festivals (like Ganesh Sthapana, Diwali, Navratri, Gudhi Padwa, Rakshabandhan, Valentine's Day, Christmas, New Year or 31st December, Pongal, Eid, national days, etc.), D-Mart has a separate section with special boards to attract customers towards those counters. There are more categories like fruits and vegetables, a kids' section (as per brands and sizes), gents' and ladies' garments, different categories of home appliances, plastic containers, crockery, staples, footwear, toys, bags, and dairy or frozen products. By analysing all these requirements, D-Mart keeps modifying or adding its own brand's products. As a marketing strategy, they give discounts and different offers (like buy 2 get 1 free) on various products; for example, if people buy Christmas products, then D-Mart can give an offer on a small Santa Claus cap. Product segmentation is done for bulk inventory management. D-Mart has followed its own-building and short distribution channel model, as shown in Fig. 1.4. As a marketing strategy (analysed and finalized through data analysis of different categories of consumers), they issue offer coupons valid for a certain period of time so that customers keep visiting the store frequently. They target the audience as per consumers' income, interests and needs, and they carry out promotional activities for seasonal and festive offers through newspapers, different social sites and public places. Data science helps boost such retail businesses to grow in a smart way with remarkable profit.
1.3.2 Narrow Artificial Intelligence (AI)

One of the best applications of data science is narrow AI. It is dedicated to a specific task, or we can say it is very good at solving a specific problem, and its limitation is that it fails when the system needs to perform a task other than the assigned one. On the other hand, its best performance factor is that it can perform the assigned task very efficiently, without any queries or errors. It analyses and interprets data to give remarkably accurate results [17]. Narrow AI is depicted in short in Fig. 1.5.

Fig. 1.5 Narrow AI. Alt Text: Fig. 1.5 presents a narrow AI which is dedicated to specific tasks

Whenever we use the term narrow AI, it invites a comparison with general AI. Narrow AI is, in short, complicated or high-level mathematics based on computational statistics. To fulfil the requirements of a thinking machine, data scientists have invented many sorts of technologies; used together, these technologies give us narrow AI. As specified above, we are able to perform a specified single task using narrow AI, as data scientists are using various technologies. Narrow AI works well in Google search, where we search for something with only one sentence or a single word. It also works very efficiently in video recommendations, like those of YouTube and Netflix. Nowadays many people use Alexa and Siri to get a service with a single instruction, so the human effort of going through many procedures is reduced and life becomes very smart, with various tasks performed while sitting in one position [18]. Depending upon the usage, there are categories of narrow AI:

1. Symbolic AI: In AI we have different types of datasets. In this category, we use supervised data, i.e. we have a predictable environment and all the rules, or we can say instructions, are known. Nowadays we mostly use rule-based systems of this kind. So, we implement an intelligent machine which is familiar with the environment and knows all the rules to be followed.

2. Machine learning: In this type of narrow AI, intelligent systems are developed through examples. First the intelligent machine is designed, and then it is trained by giving it various examples. Machine learning algorithms process these examples, create a mathematical characterization and perform different prediction and classification tasks. It is used in many fields like face recognition, voice recognition and intelligent games like chess, due to its expertise in doing specific, rule-based tasks. Chatbots are machine bots which help people in a specific domain and are capable of simulating a conversation with a human as 24/7 customer care support. One of these chatbots is explained in detail in Chap. 5. A minimal training sketch is given below.
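As a small, hedged illustration of this learn-from-examples idea (the digits dataset and the choice of model here are illustrative assumptions, not an example taken from the book), a classifier can be trained on labelled examples and then asked to predict new cases:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Labelled examples: images of handwritten digits and their true labels.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The "intelligent machine" is trained by processing the examples.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# It then performs the single task it was trained for: classifying digits.
predictions = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, predictions))
```

The same pattern, with a suitable recognition model in place of the classifier above, underlies face recognition, voice recognition and recommendation systems; outside its single trained task the model has no competence, which is exactly the limitation of narrow AI described here.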
1.3.3 Trustworthy Artificial Intelligence (AI)

Suppose data scientists are building a trustworthy AI for a corporate company. It helps to keep a trustworthy work environment in the company. Everything must be clearly explained and follow good ethics, so that it does not face any legal issues. There will be proposals from ethics groups for building trustworthy AI models [19]. There are four major parts to cover when building trusted AI:

1. Assumptions and background: AI transforms a business at two extremes, depending on whether ethics are followed or not. It gives a number of benefits if we follow trusted AI, and it may bring risks if we do not or cannot follow it. So, it is very important that all the assumptions are legal, ethical and robust in order to build a trusted AI framework. One must be aware of the local laws, which need to be thoroughly followed every time. It also includes human values: the assumptions and legal laws followed in one country may change if we expand the business to another country, and in that case we must follow the human values which hold everywhere in the world. Because of that, the people working with the organization always find the work they do reliable and trustworthy. So, before moving ahead with a specific framework, one first implements or assumes a general framework for all AI systems and then fine-tunes it as per the requirements. A sample of the trustworthy AI principles to be followed in any kind of framework design is depicted in Fig. 1.6.

Fig. 1.6 Trustworthy AI. Alt Text: Fig. 1.6 presents trustworthy AI which is maintained using ethics of algorithm using respect for human autonomy, prevention of harm, fairness and explicability, ethics of data using human-centred, individual data control, transparency, accountability and equality and ethics of practice using responsibility, liability, and codes and regulations

2. Systems expectations: We need to consider the foundations of democracy, where individual human rights are followed. Suppose an employee working in a responsible position is connected with many other people from different organizations. For example, a consumer needs to apply for a loan and go through a proper procedure without receiving unfair, biased treatment; then everyone will be satisfied and motivated to follow the given procedure. One more example we can consider is patients waiting in a queue for treatment from a doctor: they need to follow the same procedure irrespective of disabilities and age group, so everyone gets equal rights and everyone is treated equally. There should be no power asymmetry in the work environment, and hence it leads to meaningful work in the organization. Human-centric, technical and legal measures have to be followed to resolve all kinds of tensions among all kinds of stakeholders for trusted AI. The system must be robust in the sense that it should prevent different attacks, and it should be accurate and reproducible. A system should be transparent in all aspects, meaning its behaviour should be independently auditable.

3. Implementation considerations: In this phase, we should gather relevant data. The data is cleaned and feature engineering is done. The data is then used to train various machine learning models using the best-performing parameters, and the test data is then validated on the same machine learning model [20]. The model should be built with all privacy constraints and proper access policies. Alerts should be raised on unexpected behaviour, and a code of conduct applies to all stakeholders. A brief illustrative sketch related to this phase is given at the end of this section.

4. Checklist for trustworthiness: Based on all these legal and ethical points, a proper trusted AI model/framework is built. Data scientists will build a validated checklist to build a trustworthy system.
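As a small, hedged illustration connected to the implementation phase and to the loan example above (the validation file, its column names and the grouping attribute are hypothetical assumptions, not details from the book), one simple trustworthiness check is to compare a trained model's approval rates across applicant groups and flag large gaps for review:

```python
import pandas as pd

# Assumed validation results: one row per loan applicant with the model's
# decision (1 = approved) and a sensitive attribute such as age group.
results = pd.read_csv("loan_validation_results.csv")   # columns: age_group, approved

# Approval rate per group; a trustworthy pipeline would audit such gaps.
rates = results.groupby("age_group")["approved"].mean()
print(rates)

gap = rates.max() - rates.min()
if gap > 0.10:   # threshold chosen only for illustration
    print(f"Warning: approval-rate gap of {gap:.2f} between groups; review for bias.")
```

Checks of this kind would sit alongside the privacy constraints, access policies and behaviour alerts mentioned in the implementation phase, and their outcomes would feed into the trustworthiness checklist.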
1.4 Summary

Data science is an interdisciplinary concept that extracts knowledge using scientific methods, applying algorithmic techniques to structured, semi-structured or unstructured data. This chapter gives an overview of what data science is and how it is useful for solving problems in various domains, and of how it helps in examining a lot of raw data to discover the machine learning models that will help improve an organization. The evolution of data science is discussed in this chapter from various aspects. As people nowadays use various electronic gadgets, a lot of data is generated through a number of applications and social sites; this big data is characterized by the 5 Vs: value, volume, velocity, variety and veracity. Nowadays we have a number of organizations, each with a number of employees, and these employees should be retained through proper analysis of their needs and the company's progress, which is done with the help of data science. Young people, people of all age groups and a variety of organizations use Facebook for various purposes; all these activities can be observed and analysed in a descriptive, predictive and prescriptive way to obtain quick analysis.
There are various applications of data science in various domains. In this chapter we have discussed the data science application of D-Mart, covering e-commerce, the analysis of customers, maintaining or improving the frequency of customers' visits to the store with various offers and discounts, retail management and, finally, how the profits are analysed. Narrow AI is one of the best applications of data science, helping to perform a single dedicated task very efficiently; some examples are discussed in this chapter. Symbolic AI and machine learning are the types of narrow AI, which face limitations only when an unexpected situation outside the rules arises. One more data science application discussed in this chapter is trustworthy AI. It mainly takes all human rights and all legal aspects into consideration, which guides and maintains an ethical and trustworthy environment in the organization and improves the ability of that organization's stakeholders to work positively for progress and profit.
References

1. Pappalardo, L., Grossi, V., Pedreschi, D.: Introduction to the Special Issue on Social Mining and Big Data Ecosystem for Open, Responsible Data Science. Springer (2021)
2. Benjamin, T., Hazena Christopher, A., Booneb Jeremy, D., Ezellc, L., Allison, J.-F.: Data Quality for Data Science, Predictive Analytics, and Big Data in Supply Chain Management: An Introduction to the Problem and Suggestions for Research and Applications. Elsevier (2014)
3. Anderson, P., Bowring, J., McCauley, R.: An Undergraduate Degree in Data Science: Curriculum and a Decade of Implementation Experience. ACM (2014)
4. Monroe, B.L.: The Five Vs of Big Data Political Science: Introduction to the Virtual Issue on Big Data in Political Science. Polit. Anal. (2013)
5. Bates, J., Cameron, D., Checco, A., Clough, P.: Integrating FATE/Critical Data Studies into Data Science Curricula: Where are we Going and how do we Get There? ACM (2020)
6. Kogan, M., Halfaker, A., Guha, S., Aragon, C.: Mapping Out Human-Centered Data Science: Methods, Approaches, and Best Practices. ACM (2020)
7. Wilkerson, M.H., Polman, J.L.: Situating Data Science: Exploring how Relationships to Data Shape Learning. Taylor & Francis (2020)
8. Tahsin, A., Hasan, M.M.: Big Data & Data Science: A Descriptive Research on Big Data Evolution and a Proposed Combined Platform by Integrating R and Python on Hadoop for Big Data Analytics and Visualization. ACM (2020)
9. Sarker, I.H., Kayes, A.S.M., Badsha, S., Alqahtani, H.: Cybersecurity Data Science: an Overview from Machine Learning Perspective. Springer (2020)
10. Willis-Shattuck, M., Bidwell, P., Thomas, S.: Motivation and Retention of Health Workers in Developing Countries: a Systematic Review. Springer (2008)
11. Mitchell, T.R., Holtom, B.C., Lee, T.W.: How to Keep your Best Employees: Developing an Effective Retention Policy. journals.aom.org (2001)
12. Kimmons, R., Rosenberg, J., Allman, B.: Trends in Educational Technology: What Facebook, Twitter, and Scopus can tell us about Current Research and Practice. Springer (2021)
13. George, S.R., Kumar, P.S., George, S.K.: Conceptual Framework Model for Opinion Mining for Brands in Facebook Engagements Using Machine Learning Tools. Springer (2021)
14. Landge, I., Satopay, H.: Data Science and Internet of Things for Enhanced Retail Experience. Springer (2021)
15. Kumar, M.R., Venkatesh, J., Rahman, A.M.J.M.Z.: Data Mining and Machine Learning in Retail Business: Developing Efficiencies for Better Customer Retention. Springer (2021)
16. Akter, S., Wamba, S.F.: Big Data Analytics in E-commerce: a Systematic Review and Agenda for Future Research. Springer (2016)
17. Sukhobokov, A.A., Gapanyuk, Y.E.: Consciousness and Subconsciousness as a Means of AGI's and Narrow AI's Integration. Springer (2019)
18. Davenport, T., Guha, A., Grewal, D.: How Artificial Intelligence will Change the Future of Marketing. Springer (2020)
19. Cohen, R., Schaekermann, M., Liu, S.: Trusted AI and the Contribution of Trust Modeling in Multiagent Systems. ACM (2019)
20. Kumar, A., Braud, T., Tarkoma, S.: Trustworthy AI in the Age of Pervasive Computing and Big Data. IEEE (2020)
Chapter 2
Data Collection and Preparation
Data science is the process that extracts insights from data, including big data. A large volume of data is generated due to the digital era, and analysing this data is crucial for business growth [1]. Nowadays various recommendation and forecasting/prediction systems are part of day-to-day life, and these systems are based on data analysis. A recent example of data analysis is COVID-19: the world is facing COVID-19 and trying to take precautionary measures by predicting the spread rate [2]. Before diving into the data analysis process, it is important to know exactly what data means and what the various types of data are. The data analysis process depends on the type of data we are dealing with.
2.1 Types of Data

We are living in the data age, and petabytes of data are generated every day due to the advancement of the digital world, i.e. the Internet, social media, mobile phones, the Internet of Things (IoT), etc. Data is generated in every domain, like educational firms, businesses, hospitals, society, etc. There are two types of data available: one is traditional data and the other is big data. Traditional data is usually in a structured format, like tabular data; however, big data may be in an unstructured format. Analysis of big data can help to progress the business of every field [3]. The type of data plays a crucial role in the data science process, as the analysis differs as per the type of data [4, 5]. A huge amount of heterogeneous data is generated due to online social networks and the Internet of Things. These data may be homogeneous, heterogeneous, structured or unstructured, and making sense of such data needs a systematic approach. Traditionally, data taken from computerized projects was in a structured format, like a tabular format; hence analysis of such data was not a complex task. However, data retrieved from websites can be of any form, i.e. it can be numerical, or in the form of images, videos, audio or HTML files; hence analysis of such data is complex.
Fig. 2.1 Types of data. Alt Text: Fig. 2.1 presents types of data, i.e. categorical and numerical. Categorical data are further classified into three categories, i.e. ordinal, nominal and binary. Numerical data is further categorized into two types, i.e. discrete and continuous
Data in data science can be categorized majorly into two types, numerical and categorical data [6], as depicted in Fig. 2.1; a short illustrative snippet follows the list.

• Numerical data: Data that can be measured is numerical data, e.g. weather data like the temperature in degrees, or students' marks in an examination. Most of the available data is of numerical type, and data analysis is mostly done on numerical data. Numerical data can be classified further into discrete and continuous values. Discrete data takes fixed values and continuous data can take any value. For example, consider the days in a week and the price of a house: the days in a week can have only seven values; however, the price of a house can have any value.
• Categorical data: Data that cannot be measured is categorical data; it is represented based on some shared features, e.g. gender or blood group. This can be further categorized into three types, viz. nominal, ordinal and binary. Nominal data doesn't have any specific ordering or sequencing; for example, gender data is classified based on labels, and the labels are formed without any qualitative or ordering measure. They are just for naming purposes. Ordinal data is classified based on labels that follow some specific ordering, e.g. the answers of a survey, i.e. strongly agree, agree and disagree. Binary data can take only two values, e.g. student performance in an examination can have only two values, either "pass" or "fail".
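As a brief, hypothetical sketch of these two families of data in code (the column values below reuse the chapter's own examples, but the frame itself is constructed only for illustration), pandas distinguishes numerical columns from categorical ones, and an ordered category can encode ordinal data:

```python
import pandas as pd

# A tiny frame mixing the types discussed above.
df = pd.DataFrame({
    "temperature_deg": [31.5, 29.8, 33.1],                     # numerical, continuous
    "day_of_week": [1, 5, 7],                                  # numerical, discrete
    "blood_group": ["A", "O", "B"],                            # categorical, nominal
    "survey_answer": ["agree", "strongly agree", "disagree"],  # categorical, ordinal
    "exam_result": ["pass", "fail", "pass"],                   # categorical, binary
})

# Mark the ordinal column with an explicit order of its labels.
df["survey_answer"] = pd.Categorical(
    df["survey_answer"],
    categories=["disagree", "agree", "strongly agree"],
    ordered=True,
)

print(df.dtypes)        # shows which columns are numeric vs. object/category
print(df.describe())    # summary statistics apply to the numerical columns
```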
2.2 Datasets

Table 2.1 Sample dataset

Name      Gender    Salary    Shopping status
Yashita   Female    80,000    Yes
Atharva   Male      80,000    No
Ruchita   Female    70,000    Yes
Vedant    Male      90,000    No
2.2 Datasets

In the above section we discussed the types of data; data may be categorical or numerical, and the format in which data is generated is not always structured. It may come as images, audio or video in different file formats. Traditionally, however, most data are available in a structured format. A structured dataset is simply a collection of records, also called observations, samples or events. Each observation is described by specific characteristics called features. In a dataset, rows are observations and columns are features or dimensions. As an example, in Table 2.1 four observations and four features are recorded: name, gender, salary and shopping status. Each record has different values for each feature, and we can draw some conclusions by analysing the dataset. Such a dataset might be obtained from a shopping mall's management software, which records details of buyers in the form of rows and columns. However, much data is not generated in the form of rows and columns but as images, videos, audio or raw files, and the analysis of such data requires different techniques. Furthermore, in this example there are only four records and four features, so the analysis is very easy; however, it may not be accurate. For accuracy, a large number of observations may be required. What if the number of features/columns is large? Then analysing the dataset becomes very difficult, as we need to establish relations between the features and the results/recommendations. Finding the features that play an important role in deciding the results/recommendations/predictions is difficult, and it cannot be done by simply scanning the dataset. Machine learning algorithms are required in such scenarios, where both the observations and the dimensions/columns/features of the dataset are large in number.
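As a quick illustration, the data from Table 2.1 can be held in a pandas DataFrame, where rows are observations and columns are features; the following is a minimal sketch (column names follow Table 2.1).

```python
import pandas as pd

# Rows are observations (buyers); columns are features (Table 2.1)
df = pd.DataFrame({
    "Name": ["Yashita", "Atharva", "Ruchita", "Vedant"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Salary": [80000, 80000, 70000, 90000],
    "Shopping status": ["Yes", "No", "Yes", "No"],
})

print(df.shape)   # (4, 4): four observations, four features
print(df.dtypes)  # numerical (Salary) versus categorical (the rest) columns
```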
2.3 Taxonomy of Dataset

In data science, an understanding of the dataset is required for analysis, since forecasts and recommendations are suggested based on that analysis. Hence understanding the taxonomy of datasets is required. A dataset can be categorized in various ways based on its source, features, structure etc. [7]. One should ask a few questions about the dataset before starting the analysis process, for example:
Fig. 2.2 Taxonomy of datasets. Alt Text: Fig. 2.2 presents the taxonomy of datasets. Dataset is categorized in various ways based on its sources, features, nature and structure
• How is the dataset generated?
• What is the nature of the features of the dataset?
• What is the structure of the dataset?
• What are the sources of the dataset?
• Does the dataset contain time series data?
Based on the answers to these questions the analysis process varies. The taxonomy of datasets is presented as follows (depicted in Fig. 2.2):
• Generation: Static and real-time datasets. A dataset can be generated from real-time data using web scraping tools, and such data changes at every moment; Twitter data, for example, receives new feeds continuously and therefore requires different processing than static data, such as the employee details of a private firm. The data cleaning and pre-processing required for real-time data are more complex than for static data.
• Features: Persistent and non-persistent. During analysis it should be taken into consideration that some features in the dataset may be non-persistent, i.e. they change with time, such as physiological features of human beings (for example, hair), while persistent features do not change, or change very little, with time; generally, after a specific age there are no visible changes in physiological features such as height. Hence, while processing a dataset, knowledge about the features may help to improve the accuracy of the model.
• Structure: Structured and unstructured. Traditionally data was in a structured format; however, nowadays, owing to advancements in digital communication, most of the generated data is unstructured. The structure of the dataset directly affects the type of model we select for analysis: the analysis process is straightforward for structured data, whereas unstructured data requires complex pre-processing and analysis.
• Sources: Mobile and Internet. The source from which the dataset is generated can affect the pre-processing and analysis process. A dataset fetched from mobile phones is more target-oriented, whereas a dataset fetched from the internet is usually not target-oriented and may contain more noise than mobile data. Hence internet data requires more pre-processing and cleaning than mobile data.
• Nature: Time series and non-time series. Time series data is available only for a specific time window, whereas non-time series data does not follow any specific time window. COVID-19 data, for example, is simply not available before 2019. Time series data may require a different type of analysis, and understanding the time window can play an important role in that analysis. In the early days of COVID-19 very few records were available, hence predicting the number of death cases or the spread rate was difficult, and duration was an important factor in the analysis of COVID-19 [4].
2.4 Statistical Perspective

Knowledge about the data helps in making it ready for analysis, and such understanding can be obtained by taking a closer look at the data with the help of statistics. Statistics help us find patterns in big data, which in turn helps in handling missing values and outliers (discussed in the next section). Statistical analysis is done in two ways, viz. quantitative and qualitative analysis. In quantitative analysis, data is represented in the form of graphs, and the pattern of the data can be understood from these graphs. In qualitative statistical analysis, data is considered in the form of text and hidden patterns are searched for. We can take a closer look at the data by answering a few questions such as:
• What are the features of the data?
• What are the values of these features?
• Are these attribute values discrete or continuous?
• How are these values spread?
• Can we identify outliers?
We can understand more about the values of a feature by calculating its mean, i.e. the average value of that feature; its median, i.e. the middle value of that feature; and its mode, i.e. the most frequently appearing value of that feature [8]. Knowledge of statistics may help to pre-process the data by filling missing values, reducing inconsistencies and identifying outliers. An outlier is a value that lies far from the other values of a particular feature. Data visualization helps to understand more about the data by plotting various graphs. Plots can show the central tendency of the data, whether it is symmetric or skewed, and the dispersion of the data [9].
The mean, median and mode are measures of the central tendency of data, i.e. the middle of the data values. The variance and standard deviation are measures of how the data is spread out, i.e. the data dispersion.

Mean: This is the most common and effective measure of data. It gives the average, or centre, value of a particular feature. Consider a dataset of employee salaries with records of N employees; we would like to know the average salary paid to an employee. This is calculated by applying the average formula to the salary column of the employee dataset. However, the mean often cannot describe the centre of the data when outliers are present. If the dataset has even a few outliers the mean can deviate; for example, there may be very few employees in senior positions with very high salaries, yet they make the calculated average larger than is typical. In such cases the median can be used to find the centre of the data.

Median: To calculate the median, the data values must be in sorted order, and the calculation depends on the number of values. For an odd number of records the middle value is the median; for an even number of records the average of the two middle values is taken as the median. The median can be the true middle value when the number of outliers is small; however, calculating the median can be expensive because of the sorting step.

Mode: This is another measure of the centre of data. It is the value that occurs with the highest frequency. For some features a single value may occur most frequently, or more than one value may occur with equally high frequency. Data with a single mode is called unimodal, data with two modes is called bimodal and data with multiple modes is called multimodal. For symmetric data, the mean, median and mode coincide at the centre; for positively skewed data the mode is smaller than the mean, and for negatively skewed data the mode is greater than the mean. Based on the mean, median and mode values, the symmetry of the data can be understood, which helps in handling missing values. The variance and standard deviation are measures of data dispersion: they describe how the data is spread out around its mean value and can help to find and handle outliers.

Variance and Standard Deviation: Variance is a measure of data dispersion and is calculated by taking the average of the squared differences between the values and the mean. From the variance it can be observed how far the data values are separated from the mean of the data: a low variance means the data values are close to the mean, while a high variance means the data is spread out away from its mean. The standard deviation is calculated by taking the square root of the variance.
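These measures of central tendency and dispersion can be computed directly with pandas; a minimal sketch on a small, hypothetical salary column (the value 250,000 plays the role of an outlier) is given below.

```python
import pandas as pd

# Hypothetical salary values; 250000 acts as an outlier
salary = pd.Series([39000, 42000, 45000, 45000, 48000, 52000, 250000])

print("mean:", salary.mean())            # pulled upwards by the outlier
print("median:", salary.median())        # robust middle value
print("mode:", salary.mode().tolist())   # most frequent value(s)
print("variance:", salary.var())         # average squared distance from the mean
print("std dev:", salary.std())          # square root of the variance
```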
2.5 Dataset Pre-processing

Data analysis is a step-by-step process in which data collection is the first step; the further steps are pre-processing, data exploration and formation of the model, and the last step is recommendation/forecasting based on the model, as depicted in Fig. 2.3. Data can be collected in various ways, such as interviews, surveys and downloading data from online social media, government sites, private offices and many more. As aforementioned, data obtained from such sources may be in different formats: structured or unstructured, complete or incomplete; it may consist of structured tables, video, audio or image files. Hence, extracting meaning from such raw data requires a systematic approach, and the data needs to be pre-processed to make it ready for analysis [10]. After pre-processing, data exploration is carried out for a clear understanding of the data, using tools such as charts and graphs. Based on domain knowledge and the pattern of the dataset, model formation is done. The model is built based on the answers to a few questions such as:
• What type of problem are we solving?
• What is the pattern of the data?
• Which learning algorithm is best suited to the data?
Data pre-processing is required as it provides the following benefits:
• Pre-processed data is more accurate than unprocessed data: a few values may be missing or incorrect owing to human error or other reasons, and pre-processing can make the data complete and accurate by handling missing and wrong values.
• The consistency of the data greatly affects the analysis results. Owing to duplicate and inconsistent data, the results of the analysis can degrade. Pre-processing makes the data consistent and improves the analysis results.
Fig. 2.3 Steps of data processing. Alt Text: Fig. 2.3 presents the steps of data processing. Data collection is the first step and then further steps are pre-processing, data exploration, the formation of the model and the last step is forecasting based on the model
Fig. 2.4 Data pre-processing. Alt Text: Fig. 2.4 presents steps of data pre-processing methods: data pre-processed by cleaning data, transforming data and data reductions methods
• Pre-processing makes data complete and consistent; owing to this, the data is easier to read and interpret.
Data scientists spend more time on data pre-processing than on building the actual model; according to a Forbes survey, data scientists spend about 80% of their time on data preparation. Data pre-processing is done in three steps, as depicted in Fig. 2.4. These steps are as follows [11–13]:
• Data cleaning
• Data transformation
• Data reduction
2.6 Data Cleaning

Data is retrieved from multiple heterogeneous sources and is huge in volume, and these are the basic reasons for inconsistency in data. Poor data quality can result in poor accuracy of the analysis; the analysis results depend on the completeness and correctness of the data. The aim of data cleaning is to prepare data for analysis by removing noise, filling missing values and removing all unnecessary data. Incompleteness arises from a lack of knowledge about the features at the data entry stage: it may be that features important for the analysis are not included as mandatory fields, hence those values are not provided by the users.
Privacy is another reason for incomplete data: a user may not want to reveal a few values that are sensitive. Data cleaning is done by filling in missing values and removing noisy data.
2.6.1 Handling Missing Values

A few values may be missing for some observations, and this can lead to inaccurate analysis. Hence handling missing values is a prerequisite of data analysis. The following are a few ways of dealing with missing values:
• Remove the record/observation: Missing values can be handled in various ways depending on how many feature values are missing for a particular observation. If most of the feature values are missing for a specific observation, it is advisable to delete that observation/record from the dataset instead of filling in the missing values. Removing records is a feasible option only if the number of records with missing values is small; otherwise, very little data will remain for analysis, and with less data the analysis can be inaccurate.
• Imputation: As removing records is not always a solution for handling missing values, imputation is often the better way to handle them. Which value is used to impute a missing value depends on the dataset; it can be some default value or the mean/mode/median [14].
– Manual filling: In this method, the missing values are filled manually. To do so, the complete dataset must be inspected and the missing values filled with appropriate values, for instance specific default values chosen by observing the values of the particular column. This is the simplest method; however, it is not feasible when datasets are huge and have many missing values.
– Using a measure of central tendency: The mean, mode and median are measures of the central tendency of data values. If the data is distributed symmetrically, missing values are filled using the mean; if the data is skewed, the median is used. In some datasets, filling values with the most frequently appearing value is most appropriate; in such cases the mode is used. It is very important to have domain knowledge before handling missing values, as one needs to check for which features missing values need to be filled and select substitution values accordingly.
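Both strategies, removing sparse records and imputing with a measure of central tendency, can be expressed with pandas; the sketch below uses hypothetical column names and assumes the symmetry or skewness of each column has already been checked.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 35, np.nan],
    "salary": [50000, 60000, np.nan, 80000, 90000],
    "grade":  ["A", "B", "B", None, "B"],
})

# 1. Remove observations in which most feature values are missing
df_dropped = df.dropna(thresh=2)   # keep rows having at least 2 non-missing values

# 2. Impute with a measure of central tendency
df_imputed = df.copy()
df_imputed["age"] = df_imputed["age"].fillna(df_imputed["age"].mean())              # symmetric -> mean
df_imputed["salary"] = df_imputed["salary"].fillna(df_imputed["salary"].median())   # skewed -> median
df_imputed["grade"] = df_imputed["grade"].fillna(df_imputed["grade"].mode()[0])     # categorical -> mode
```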
2.6.2 Removing Noisy Data

Meaningless data values are called noise. Duplicate values, incomplete data and outliers are considered noise in datasets. Various visualization methods, such as the histogram, scatter plot and box plot, let us take a closer look at the data; these methods represent data values graphically, and outliers are easier to identify in a graphical representation. Outliers can be regarded as noisy values in a dataset. Other ways to identify outliers are the standard deviation, percentiles and outlier-removal clustering: in clustering, if a few data points lie far from all clusters, they can be considered outliers. Whether to remove outliers or correct them depends on the type of dataset and on how the outliers affect the results. If the outliers do not affect the results, they can be ignored; otherwise, the outliers can be detected using a box plot or z-score and the detected values simply deleted. Outliers can also be corrected by replacing them with mean/median values. Duplicate values are also called noisy data, as they lead to inaccurate analysis. Owing to human error, values may be submitted twice, which creates duplicates in the dataset; duplicates are handled by removing them. The binning method is also used for handling noisy data: the data values are sorted and divided into a small number of bins, and in each bin all values are replaced by the mean/median/boundary of the bin. Using binning, data is smoothed and continuous data can be converted into categorical form.
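A minimal sketch of these ideas, removing duplicates, flagging outliers with the same quartile rule a box plot uses, and smoothing a continuous column by binning, is shown below; the salary values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"salary": [48000, 50000, 52000, 51000, 49000, 50500, 400000, 50000]})

# Duplicates: identical observations are dropped
df = df.drop_duplicates()

# Outliers: the IQR rule used by box plots (values beyond 1.5*IQR from the quartiles)
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = df["salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_clean = df[inliers]

# Binning: continuous salaries smoothed into a small number of categories
df_clean = df_clean.assign(
    salary_bin=pd.cut(df_clean["salary"], bins=3, labels=["low", "medium", "high"])
)
```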
2.7 Data Transformation

In data transformation, the data is converted into a form from which a machine learning model can learn. The data may have features with very different ranges, or it may be in categorical form, which makes analysis difficult. Most analysis algorithms work only on numerical data, hence data transformation becomes a necessary step. When the values of a dataset lie in different ranges, the data can be prepared for analysis by scaling/normalization; categorical data can be converted into numerical data by applying encoding methods.
2.7.1 Normalization

As discussed above, data may be available in different ranges, so comparing and analysing these data can be difficult and can lead to wrong results. For example, suppose that for some project we would like to analyse population figures and predict some results; for such an analysis the population of each city should be expressed in the same unit, e.g. millions or thousands. If the population data of some cities
are available in millions and those of other cities in thousands, analysing such data will be very difficult. Take another example: age and salary. Age ranges up to a maximum of about 70, whereas salary can run into lakhs, so how are these two features to be compared? Hence transforming the data into one common range is required, and that can be done using normalization. Three types of normalization can be used for data transformation: min–max normalization, z-score normalization and decimal scaling [15, 16].

Min–Max Normalization: In min–max normalization, values are rescaled to the range 0–1 based on the maximum and minimum values of the respective feature. Min–max normalization is done using the following formula [16]:

x' = (x − min(x)) / (max(x) − min(x))

where x is the dataset value to be transformed using the min–max method, x' is the new transformed value, min(x) is the minimum value of that specific feature/column and max(x) is the maximum value of that specific feature/column. Min–max normalization rescales data values; however, it may not handle outliers as well as the z-score.

Z-Score Normalization: Here normalization is done using the z-score, also called the standard score, which indicates how many standard deviations a value lies from the mean. To calculate the z-score we need to know the standard deviation and the mean of the particular feature. The z-score is calculated using the following formula:

z = (x − μ) / σ

where z is the z-score, x is the feature value, μ is the mean and σ is the standard deviation. Outliers are handled well by z-score normalization; however, the normalized data does not lie on a fixed 0–1 scale as it does with min–max normalization.

Decimal Scaling: This is the simplest way of rescaling data. The decimal point is moved based on the maximum absolute value of the respective feature: each value in the column is divided by 10^j, where j is the smallest integer such that the largest scaled absolute value is less than 1.
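The three rescaling schemes can be written down directly; the sketch below applies them to two hypothetical columns with very different ranges (scikit-learn's MinMaxScaler and StandardScaler offer the same min–max and z-score transformations).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, 32, 47, 51, 62],
                   "salary": [40000, 150000, 300000, 500000, 900000]})

# Min-max normalization: every feature rescaled to the range 0-1
minmax = (df - df.min()) / (df.max() - df.min())

# Z-score normalization: zero mean and unit standard deviation per feature
zscore = (df - df.mean()) / df.std()

# Decimal scaling: divide each column by 10^j so the largest absolute value becomes < 1
j = np.ceil(np.log10(df.abs().max()))
decimal_scaled = df / (10 ** j)
```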
2.7.2 Encoding

In a dataset, many features have categorical values, which are represented as strings, but most data analysis algorithms work only on numerical data. Categorical values are therefore converted into numerical ones using encoding methods. There are various methods of encoding, i.e. label encoding, dummy encoding, one-hot encoding, binary encoding, effect encoding, hash encoding, BaseN encoding and target encoding [17, 18].
• Label Encoding: If the data values of a specific feature are of ordinal type, label encoding is used. Here the order is important, hence ordered numbers are assigned to the data values. For example, if passing grades are given as distinction, first class, higher second class, second class and pass class, then in label encoding the grades are represented as 0–4, respectively.
• One-Hot Encoding: This type of encoding is used when the categorical data is nominal, i.e. there is no ordering. For each category one variable is created; if the record belongs to that category the value 1 is assigned, otherwise 0. Take the example of the house type in a property dataset: it may be row house, apartment or duplex.

Index  House type
0      Row house
1      Apartment
2      Duplex
3      Apartment
4      Apartment
5      Duplex
6      Row house
In the above table, we can see that there are three categories of house type. These can be encoded using one-hot encoding as given below:

Index  Row house  Apartment  Duplex
0      1          0          0
1      0          1          0
2      0          0          1
3      0          1          0
4      0          1          0
5      0          0          1
6      1          0          0
• Dummy Encoding: This is very similar to one-hot encoding. The difference is that in dummy encoding n−1 variables are formed for n categories, unlike one-hot encoding, in which n variables are formed for n categories. Take the same example used for one-hot encoding above, where three variables were created for the three categories row house, apartment and duplex. If the house type is neither a row house nor an apartment, it must be a duplex, as there are only three types; we can observe that for a duplex both the variables “Row house” and “Apartment” are 0. Hence the encoding can be done with n−1 variables, i.e. two variables only.

Index  Row house  Apartment
0      1          0
1      0          1
2      0          0
3      0          1
4      0          1
5      0          0
6      1          0
• Binary Encoding: In binary encoding, categorical data values are converted into binary values. It is like one-hot encoding, except that only about log2(n) variables are created for n categories, so it creates far fewer variables than one-hot encoding.
• Effect Encoding: This technique is similar to dummy encoding; however, in effect encoding three values are used: 0, 1 and −1.
• Hash Encoding: In this type of encoding, hash values are calculated for the categorical values. This method is useful when the number of categorical values is large, as hashing uses little memory.
• Target Encoding: This is also called mean encoding. For each category the mean of the actual target values in the dataset is taken. The approach is similar to label encoding, except that the mean of the target values is used as the new value for a specific category.
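The most common of these encodings are available out of the box in pandas; a minimal sketch on the house-type example above (and on hypothetical grade values for the ordinal case) is shown below.

```python
import pandas as pd

house = pd.Series(["Row house", "Apartment", "Duplex", "Apartment",
                   "Apartment", "Duplex", "Row house"], name="house_type")

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(house)

# Dummy encoding: n-1 indicator columns (the first category is dropped)
dummy = pd.get_dummies(house, drop_first=True)

# Label encoding for ordinal data: grades mapped to ordered integers 0-4
grade_order = ["Distinction", "First class", "Higher second class",
               "Second class", "Pass class"]
grades = pd.Series(["Pass class", "Distinction", "Second class"])
label_encoded = grades.map({g: i for i, g in enumerate(grade_order)})
```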
2.8 Data Reduction

As discussed in Chap. 1, big data is generated every moment owing to digital advancements, hence datasets may be huge in size. The number of features in such a dataset may be small, large or just sufficient. If the number of features/columns is too small, the model built may suffer from underfitting; if the number of features/columns is too large, the model may be overfitted. Hence
the number of features should be moderate when building an analysis model. This is where dimensionality reduction comes in: it is not about reducing the number of observations/rows, but about reducing the number of features/columns used for analysis. Processing and analysing a huge dataset with a large number of features is time-consuming and expensive, hence the dimension of the dataset needs to be reduced for efficient and faster analysis. This can be done in various ways, i.e. attribute feature selection, dimensionality reduction and numerosity reduction.
2.8.1 Attribute Feature Selection

If fewer features are used for prediction or building a model, the model will be faster; however, the analysis may then not be up to the mark. The speed of the analysis model should not be achieved at the cost of accuracy. Selecting appropriate features for building a model is therefore done using attribute feature selection. Here multiple features are combined to form a new feature, so the number of features becomes smaller while all the values are still taken into account when building the model; hence accuracy is not compromised.
2.8.2 Dimensionality Reduction

Reducing the number of features used for analysis is called dimensionality reduction. Some features may be redundant, i.e. they have a similar effect on the analysis (they are correlated); instead of keeping two or three similar features, we can keep only one of them. It may also be that some features do not contribute to the results, and removing such features can make the analysis efficient. Dimensionality reduction can be done in two ways, i.e. feature selection and feature extraction [19], as follows:
• Feature Selection: In the feature selection technique the redundant features are removed, thereby reducing the number of features. This can be done with variance and correlation thresholds. Features that do not contribute much to the analysis are identified by calculating their variance, and features whose variance falls below a threshold value are removed from the dataset. As discussed above, a few features may have a similar effect on the results, i.e. they are highly correlated with each other; among such features we can keep only one and remove the others. This is done by calculating the correlation between pairs of features and removing the feature that has the larger correlation with the others.
• Feature Extraction: In this technique a new, smaller set of features is created from the actual features. This can be done using principal component analysis (PCA) and linear discriminant analysis (LDA), described below.
– Principal Component Analysis (PCA): Linear combinations of all features are generated, and the new, uncorrelated features are selected based on variance. The combination with the highest variance is termed the first principal component, the combination with the next-highest variance is termed the second principal component, and so on. The number of components to keep for analysis can be decided from the cumulative variance against some threshold, say 90%. PCA is an unsupervised method, and the data should be normalized beforehand, as the feature with the largest scale can otherwise dominate the selection of the principal components.
– Linear Discriminant Analysis (LDA): This is a supervised method, hence it can be used with labelled data. It works on the separability between two or more classes: it separates the classes by finding a new axis that can distinguish the classes with a minimum number of features, which is done by looking for the maximum distance between the class means.
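A minimal PCA sketch with scikit-learn, on randomly generated stand-in data, looks as follows; note the normalization step before PCA and the use of a cumulative-variance threshold of 90% to choose the number of components.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in dataset: 10 correlated features driven by 3 latent factors
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(100, 10))

X_scaled = StandardScaler().fit_transform(X)   # normalize so no feature dominates

pca = PCA(n_components=0.90)                   # keep components explaining 90% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # far fewer columns than the original 10
print(pca.explained_variance_ratio_)           # variance explained by each component
```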
2.8.3 Numerosity Reduction

In this technique the data is represented in a smaller form, which can be done in two ways, i.e. parametric and non-parametric. In the parametric approach the original data is not stored; instead, parameters of the data are stored, for example using regression or a log-linear model. In the non-parametric approach a reduced form of the actual data is stored, which can be done using sampling, clustering, histograms etc.
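As a small illustration of the non-parametric variants, the sketch below keeps a random sample of a hypothetical large column and, alternatively, stores only histogram bin counts instead of the raw values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(100_000)})   # hypothetical large dataset

# Sampling: keep a 1% random subset of the observations
sample = df.sample(frac=0.01, random_state=42)

# Histogram: store only bin counts and edges instead of every raw value
counts, edges = np.histogram(df["value"], bins=10)
```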
2.9 Web Scraping Tools

In the sections above we discussed data pre-processing techniques. The processing varies based on the type of data available for analysis; as aforementioned, data retrieved from the internet may require more complex pre-processing than data taken from the repository of a private firm. A data scientist can collect data from the internet in two different ways: manual retrieval, and automated retrieval using APIs and tools/libraries. Manual data retrieval is usually not feasible, as it is tedious. Automated retrieval of data from websites is also called web scraping, web data extraction or web harvesting. There are various tools for web scraping, and several libraries are available in Python. Web scraping tools look for updated data and download it with easy access. Various web scraping tools are given below [20]:
• import.io: This tool is used to create a dataset by importing web pages into CSV format. Thousands of pages can be scraped in a minute using import.io. However, it is not free; one needs to pay to use it.
• Beautiful Soup: This is an open-source Python library. Using this library, data can be imported from HTML and XML pages; it supports screen scraping and automatically converts documents to Unicode.
• Scrapy: This is an open-source framework written in Python. It can be used for various applications such as automated testing, monitoring etc., and it provides the facility to write web spiders that import data from websites.
• urllib: This Python standard library is used to open a uniform resource locator (URL); the module urllib.request opens URLs and urllib.parse is used for parsing them. Data can be imported after opening the URLs.
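A minimal scraping sketch combining urllib and Beautiful Soup is given below; the URL is a placeholder, and the tags extracted (headings and links) are chosen only for illustration. The scraped text and links can then be assembled into a structured dataset (e.g. a CSV file) for the pre-processing steps described earlier.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://example.com"               # placeholder URL
html = urlopen(url).read()                # open the URL and download the page

soup = BeautifulSoup(html, "html.parser") # parse the HTML
headings = [h.get_text(strip=True) for h in soup.find_all(["h1", "h2"])]
links = [a["href"] for a in soup.find_all("a", href=True)]

print(headings)
print(links)
```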
2.10 Summary

The type of data and its sources play a crucial role in data science, since the data processing differs based on them. In line with this, the chapter presented the types and taxonomy of data. Data processing is a step-by-step process for extracting meaningful insights from big data. This chapter presented the data cleaning process with a detailed explanation of how to handle missing values and outliers. Data transformation is required because data may be on different scales, which can affect accuracy; various encoding and transformation techniques were presented. Finally, since data is collected from various sources and tools are required for this, various web scraping tools were explained.
References

1. Provost, F., Fawcett, T.: Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking. O'Reilly Media, Inc. (2013)
2. Shinde, G.R., Kalamkar, A.B., Mahalle, P.N., Dey, N.: Data Analytics for Pandemics: A COVID-19 Case Study. CRC Press (2020)
3. Waller, M.A., Fawcett, S.E.: Data science, predictive analytics, and big data: a revolution that will transform supply chain design and management (2013)
4. Shinde, G.R., Kalamkar, A.B., Mahalle, P.N., Dey, N., Chaki, J., Hassanien, A.E.: Forecasting models for coronavirus disease (COVID-19): a survey of the state-of-the-art. SN Comput. Sci. 1(4), 1–15 (2020)
5. Mahalle, P.N., Sable, N.P., Mahalle, N.P., Shinde, G.R.: Data analytics: Covid-19 prediction using multimodal data. In: Intelligent Systems and Methods to Combat Covid-19, pp. 1–10. Springer, Singapore (2020)
6. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques, 3rd edn. Morgan Kaufmann Ser. Data Manag. Syst. 5(4), 83–124 (2011)
7. Mahalle, P.N., Sonawane, S.S.: Internet of things in healthcare. In: Foundations of Data Science Based Healthcare Internet of Things, pp. 13–25. Springer, Singapore (2021)
8. Whitley, E., Ball, J.: Statistics review 1: presenting and summarising data. Crit. Care 6(1), 1–6 (2001)
9. Potter, K., Hagen, H., Kerren, A., Dannenmann, P.: Methods for presenting statistical information: the box plot. Visual. Large Unstruct. Data Sets 4, 97–106 (2006)
10. Famili, A., Shen, W.-M., Weber, R., Simoudis, E.: Data preprocessing and intelligent data analysis. Intell. Data Anal. 1(1), 3–23 (1997)
11. Alasadi, S.A., Bhaya, W.S.: Review of data preprocessing techniques in data mining. J. Eng. Appl. Sci. 12(16), 4102–4107 (2017)
12. Sukumar, P., Robert, L., Yuvaraj, S.: Review on modern data preprocessing techniques in web usage mining (WUM). In: 2016 International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS), pp. 64–69. IEEE (2016)
13. García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M., Herrera, F.: Big data preprocessing: methods and prospects. Big Data Analyt. 1(1), 1–22 (2016)
14. Zhang, Z.: Missing data imputation: focusing on single imputation. Ann. Transl. Med. 4(1) (2016)
15. Patro, S., Sahu, K.K.: Normalization: a preprocessing stage. arXiv preprint arXiv:1503.06462 (2015)
16. Saranya, C., Manikandan, G.: A study on normalization techniques for privacy preserving data mining. Int. J. Eng. Technol. (IJET) 5(3), 2701–2704 (2013)
17. Hancock, J.T., Khoshgoftaar, T.M.: Survey on categorical data for neural networks. J. Big Data 7, 1–41 (2020)
18. Seger, C.: An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing (2018)
19. Khalid, S., Khalil, T., Nasreen, S.: A survey of feature selection and feature extraction techniques in machine learning. In: 2014 Science and Information Conference, pp. 372–378. IEEE (2014)
20. Saurkar, A.V., Pathare, K.G., Gode, S.A.: An overview on web scraping techniques and tools. Int. J. Future Revol. Comput. Sci. Commun. Eng. 4(4), 363–367 (2018)
Chapter 3
Data Analytics and Learning Techniques
Data analysis is the process of making sense out of data, in particular big data. Data may be structured or unstructured, homogeneous or heterogeneous, text or images etc., and extracting meaning from such varied data is both crucial and difficult. As discussed in Chap. 2, data collection and pre-processing are prerequisites for data analysis, and the performance of a Machine Learning (ML) model greatly depends on choosing an ML algorithm suited to the type of dataset and on the features selected for the analysis. Along the same line, this chapter presents an overview of data analytics and various approaches and techniques of learning.
3.1 Data Analytics Overview

The terms data science and data analytics both deal with data, but in different ways. Data science looks at data in a generic way and tries to find relations between the attributes of the data, whereas data analytics is more specific: it has particular aims and finds relations between specific attributes [1]. Data science can be regarded as the whole, and data analytics as a part of it. There are four types of analytics, depicted in Fig. 3.1, as follows:
1. Descriptive Analytics: In this type of analysis the data is understood and summarized to extract meaningful information. The features of the data are presented with the help of visualization methods for clear understanding. The analysis is done on past data and conclusions are drawn from it; it is the type of analysis needed to find out what the pattern of events was in a given time. For example, descriptive analysis can be done on COVID-19 data to understand the number of deaths in each country, region, state, city etc.
Fig. 3.1 Types of data analytics. Alt Text: Fig. 3.1 presents the types of data analytics; there are four types of analytics, i.e. descriptive, diagnostic, predictive and prescriptive analytics
2. Diagnostic Analytics: This type of analysis is mainly required to find the reasons behind the occurrence of an event. A detailed analysis of the data pattern is done with a few aims in mind, e.g. loss of productivity or a price drop. This type of analysis is important in business for improving techniques/strategies to reduce losses and improve performance. For example, diagnostic analysis can be done on COVID-19 data to find out whether temperature, humidity and wind speed affect the spread rate of COVID-19 [2].
3. Predictive Analytics: This type of analysis focuses on predicting the future based on the analysis of past data. Unlike descriptive analytics, various machine learning techniques are used here. Significant advancements are happening in machine learning techniques day by day; hence the correctness of predictive analytics keeps improving. For example, predictive analysis can be done on COVID-19 data to predict the spread rate of COVID-19 [3].
4. Prescriptive Analytics: This type of analytics uses the above three types of analytics to take important business decisions; e.g. Netflix, Amazon, Facebook and Google use prescriptive analytics to make important decisions. Prescriptive analytics provides a plan of action that can help to improve the revenue of a business; hence it plays a crucial role in every business domain. Nowadays, every domain is working to understand clients'/users' preferences in order to improve its services. For example, prescriptive analysis can be done on COVID-19 data to find vaccination and lockdown plans of action that can slow down the spread of COVID-19 [4].
3.2 Machine Learning Approaches

Machine learning is one of the best approaches for data analytics [5–7]; ML techniques can be applied to datasets for prediction. Various ML techniques and algorithms are available, and one algorithm/technique may provide more accurate predictions on a given dataset than the others. Understanding which algorithms are best suited to a specific dataset is crucial, as the choice of algorithm greatly affects system performance. There are three types of machine learning [8], i.e.
• Supervised learning
• Unsupervised learning
• Reinforcement learning
These ML techniques are used for all four types of data analytics, i.e. descriptive, diagnostic, predictive and prescriptive analytics.
3.2.1 Supervised Learning

In supervised learning, the dataset has input and output variables. The input variables are the features on which the prediction/result of the analysis depends, and the output variable is the resulting feature, also called a label. Supervised learning techniques can be applied only to labelled datasets. The ML model is trained by providing a dataset with input/output variables; the trained model is then able to predict the output feature when new inputs are provided. The accuracy of the result depends on how the model is trained: a well-trained model/algorithm gives more accurate results than one that is not well trained [9–11]. The accuracy of an ML model depends on:
• the size of the dataset used for training the model
• the model used for learning
• the cleanness and completeness of the data
Let us take an example to understand supervised learning. In the dataset shown in Table 3.1, the years of experience and salary of employees are recorded. This dataset contains two features: years of experience is the input feature and salary is the output/predicted feature, i.e. the label. We need to find the relation between years of experience and salary in order to predict the salary of an employee when the years of experience are provided. It is always a good approach to visualize data for better understanding; this can be done by plotting various graphs (visualization techniques are discussed in Chap. 4). The dataset is visualized in the graph shown in Fig. 3.2, where we can see that salary increases linearly with years of experience.
Table 3.1 Sample dataset: salary of employees

Years of experience    Salary
1                      39,343
2                      43,525
3                      55,794
4                      60,150
5                      76,029
6                      93,940
7                      98,273
9                      113,812
10                     122,391
Fig. 3.2 Visualization of sample dataset. Alt Text: Fig. 3.2 presents the visual representation of the sample dataset; years of experience are plotted on the x-axis and the salary of the employee on the y-axis
A statistical approach can be used to find the equation relating years of experience and salary; however, the equation we estimate should provide high accuracy. With a statistical approach, finding the relation between two parameters from only a few samples is easy, but it may not provide high accuracy, so a large number of samples is required to estimate the relation between features. The statistical approach is better for smaller datasets; for large datasets, ML models are required. The ML algorithm estimates the line that passes through most of the sample points, which is called the best-fitted line, shown in Fig. 3.3.

Fig. 3.3 Best-fitted line. Alt Text: Fig. 3.3 presents the best-fitted line for the sample dataset; years of experience are plotted on the x-axis and the salary of the employee on the y-axis
Fig. 3.4 Prediction using the best-fitted line. Alt Text: Fig. 3.4 presents the prediction of salary of an employee who has 8 years of experience based on the best-fitted line
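A minimal sketch of this idea with scikit-learn, fitting a line to the Table 3.1 values and predicting the salary for 8 years of experience, is shown below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Years of experience (input feature) and salary (label) from Table 3.1
X = np.array([[1], [2], [3], [4], [5], [6], [7], [9], [10]])
y = np.array([39343, 43525, 55794, 60150, 76029, 93940, 98273, 113812, 122391])

model = LinearRegression().fit(X, y)   # estimate the best-fitted line

print(model.predict([[8]]))            # predicted salary for 8 years of experience
```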
The salary of an employee who has 8 years of experience can then be read off the best-fitted line, as shown in Fig. 3.4. Once we have the best-fitted line, predictions can be made with high accuracy; however, finding the best-fitted line is the crucial task. In the above sample dataset the output/prediction feature, i.e. salary, depends on only a single parameter, i.e. years of experience; in most applications, however, the predicted feature may depend on multiple features. Consider, for example, a dataset of house price details, a few samples of which are shown in Table 3.2; here the area of the house, the style of the house and the number of bedrooms are the input features, and the sale price is the output/prediction/label feature. If we have a dataset of house selling prices for the last 10 years, an ML model can be trained on it to predict the selling price of a house.

Table 3.2 House price dataset

Area      House style    Bedroom    Sale price
8450      2 Story        3          208,500
9600      1 Story        3          181,500
11,250    2 Story        3          223,500
9550      2 Story        3          140,000
14,260    2 Story        4          250,000
10,382    2 Story        3          200,000
6120      1.5 Fin        2          129,900
7420      1.5 Unf        2          118,000
11,200    1 Story        3          129,500
12,968    1 Story        2          144,000
10,652    1 Story        3          279,500
Fig. 3.5 Various patterns of dataset. Alt Text: Fig. 3.5 presents various patterns of datasets
The owner of a house can estimate how much he/she can get for the house with the help of this trained model. However, in this case finding the best-fitted line is more difficult, as there is more than one input feature. The house-style feature also needs encoding, as it is in an alphanumeric format that must be converted into numerical format; the various encoding techniques are discussed in Chap. 2. Moreover, not every dataset increases linearly; the sample points may follow a pattern other than a linear one, and a few such patterns are shown in Fig. 3.5. Once the best-fitted line/curve is found, a prediction can be made; the best-fitted lines/curves for the data in Fig. 3.5 are shown in Fig. 3.6. From the patterns shown in Fig. 3.6, we can say that there are two types of problems that can be solved using supervised learning, i.e. regression and classification [12].
• Regression: This is a supervised ML method in which the value of the dependent variable Y (the output variable) is predicted with the help of the independent variable X (the input variable). In this, the relation between X and Y is discovered. In regression a quantity is predicted, and this quantity is continuous data, e.g. the price of a house or the salary of an employee. In this type of dataset the values of the labels are of continuous type (refer to Sect. 2.1, types of data).
Fig. 3.6 Best-fitted line/curve. Alt Text: Fig. 3.6 presents the best-fitted line for various patterns of datasets
• Classification: In the classification method, a fixed set of outcomes is defined for the dataset, and for a given input the ML model predicts a value from among these outcomes. Consider, for example, the dataset of a vehicle insurance company, a few samples of which are shown in Fig. 3.8: the gender, age and driving licence of the owner, the vehicle age, whether the vehicle is damaged and the annual premium the owner needs to pay are the input features, and the last feature, the response, i.e. whether the person has taken insurance, is the label of the dataset. A classification ML model trained on this dataset will classify the data into two classes: people who have taken insurance and people who have not. Hence for this dataset there are only two labels, “Yes” and “No”. The trained classification model can then predict whether a person will buy an insurance policy, i.e. “Yes” or “No”, when the values of the input features are provided. In classification, labels are of discrete type.
Fig. 3.8 Vehicle insurance dataset sample (excerpt):

Age    Gender    Driving_License
44     Male      1
76     Male      1
47     Male      1
21     Male      1
29     Female    1
24     Female    1
23     Male      1
56     Female    1
24     Female    1
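A minimal classification sketch on data shaped like this insurance example is given below; the feature values echo the excerpt above, while the response labels are invented purely for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy rows shaped like the vehicle insurance dataset; the "response" labels are assumed
df = pd.DataFrame({
    "age":             [44, 76, 47, 21, 29, 24, 23, 56, 24],
    "gender":          ["Male", "Male", "Male", "Male", "Female",
                        "Female", "Male", "Female", "Female"],
    "driving_license": [1, 1, 1, 1, 1, 1, 1, 1, 1],
    "response":        ["Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No"],
})

X = pd.get_dummies(df[["age", "gender", "driving_license"]], drop_first=True)
y = df["response"]

clf = DecisionTreeClassifier().fit(X, y)   # train the classifier on the labelled data
print(clf.predict(X.iloc[[0]]))            # predict "Yes"/"No" for one record
```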