Data Management, Analytics and Innovation: Proceedings of ICDMAI 2020, Volume 1 [1st ed.] 9789811556159, 9789811556166

This book presents the latest findings in the areas of data management and smart computing, big data management, artificial intelligence and data analytics, and advances in network technologies.


English Pages XII, 476 [471] Year 2021


Table of contents :
Front Matter ....Pages i-xii
Front Matter ....Pages 1-1
Improving Microblog Clustering: Tweet Pooling Schemes (Nadeem Akhtar, M. M. Sufyan Beg)....Pages 3-18
An IoT-Based Smart Parking Framework for Smart Cities (Rekha Gupta, Neha Budhiraja, Shreya Mago, Shivani Mathur)....Pages 19-32
Open-Source Software Challenges and Opportunities (Ajeet Phansalkar)....Pages 33-42
“Empirical Study on the Perception of Accounting Professionals Toward Awareness and Adoption of IFRS in India” (Neha Puri, Harjit Singh, Vikas Garg)....Pages 43-61
On Readability Metrics of Goal Statements of Universities and Brand-Promoting Lexicons for Industries (Prafulla B. Bafna, Jatinderkumar R. Saini)....Pages 63-72
An Efficient Recommendation System on E-Learning Platform by Query Lattice Optimization (Subhadeep Ghosh, Santanu Roy, Soumya Sen)....Pages 73-86
DengueCBC: Dengue EHR Transmission Using Secure Consortium Blockchain-Enabled Platform (Biky Chowhan, Rashmi Mandal (Vijayvergiya), Pawan Kumar Sharma)....Pages 87-106
Online Credit Card Fraud Analytics Using Machine Learning Techniques (Akshi Kumar, Kartik Anand, Simran Jha, Jayansh Gupta)....Pages 107-120
Identification of Significant Challenges Faced by the Tourism and Hospitality Industries Using Association Rules (Prafulla B. Bafna, Jatinderkumar R. Saini)....Pages 121-129
Front Matter ....Pages 131-131
An Approach of Feature Subset Selection Using Simulated Quantum Annealing (Ashis Kumar Mandal, Mrityunjoy Panday, Aniruddha Biswas, Saptarsi Goswami, Amlan Chakrabarti, Basabi Chakraborty)....Pages 133-146
A Novel Framework for Data Acquisition and Retrieval Using Hierarchical Schema Over Structured Big Data (Neepa Shah)....Pages 147-170
Predicting the Cricket Match Outcome Using ANFIS Classifier for Viewers Opinions on Twitter Data (U. V. Anbazhagu, R. Anandan)....Pages 171-182
Extraction of Tabular Data from PDF to CSV Files (Gresha Bhatia, Abha Tewari, Grishma Gurbani, Sanket Gokhale, Naman Varyomalani, Rishil Kirtikar et al.)....Pages 183-193
Railway Complaint Tweets Identification (Nadeem Akhtar, M. M. Sufyan Beg)....Pages 195-207
Identification of Entities in Scientific Documents (Nadeem Akhtar, Hira Javed)....Pages 209-219
A Machine Learning Model for Review Rating Inconsistency in E-commerce Websites (Sunil Saumya, Jyoti Prakash Singh, Abhinav Kumar)....Pages 221-230
NLIDB Systems for Enterprise Databases: A Metadata Based Approach (M. N. Karthik, Garima Makkar)....Pages 231-241
A Data Science Approach to Analysis of Tweets Based on Cyclone Fani (Wazib Ansar, Saptarsi Goswami, Amit Kumar Das)....Pages 243-261
A GPU Unified Platform to Secure Big Data Transportation Using an Error-Prone Elliptic Curve Cryptography (Shiladitya Bhattacharjee, Divya Midhun Chakkaravarhty, Midhun Chakkaravarty, Lukman Bin Ab. Rahim, Ade Wahyu Ramadhani)....Pages 263-280
Front Matter ....Pages 281-281
An Empirical Analysis of Classifiers Using Ensemble Techniques (Reshu Parsuramka, Saptarsi Goswami, Sourav Malakar, Sanjay Chakraborty)....Pages 283-298
Force of Gravity Oriented Classification Technique in Machine Learning (Pinaki Prasad Guha Neogi, Saptarsi Goswami)....Pages 299-310
Machine Learning Classifiers for Android Malware Detection (Prerna Agrawal, Bhushan Trivedi)....Pages 311-322
Designing a Model to Handle Imbalance Data Classification Using SMOTE and Optimized Classifier (Shraddha Shivaji Nimankar, Deepali Vora)....Pages 323-334
An Analysis of Computational Complexity and Accuracy of Two Supervised Machine Learning Algorithms—K-Nearest Neighbor and Support Vector Machine (Susmita Ray)....Pages 335-347
A Survey on Application of Machine Learning Algorithms in Cancer Prediction and Prognosis (Deepti, Susmita Ray)....Pages 349-361
Augmented Reality Building Surveillance System (Chahat Bhatia)....Pages 363-378
Interpretation and Segmentation of Chart Images Using h-Means Image Clustering Algorithm (Prerna Mishra, Santosh Kumar, Mithilesh Kumar Chaube)....Pages 379-391
Front Matter ....Pages 393-393
Innovative Techniques for Student Engagement in Cybersecurity Education (Amishi Arora, Amlesh Mendhekar)....Pages 395-406
A Feasibility Study of Service Level Agreement Compliance for Start-Ups in Cloud Computing (T. Lavanya Suja, B. Booba)....Pages 407-417
Smart Waste Monitoring Using Internet of Things (Mitra Tithi Dey, Punyasha Chatterjee, Amlan Chakrabarti)....Pages 419-433
Academic Blockchain: An Application of Blockchain Technology in Education System (Sakthi Kumaresh)....Pages 435-448
QR Code Based Smart Document Implementation Using Blockchain and Digital Signature (Kunal Pal, C. R. S. Kumar)....Pages 449-465
Security Issue in Internet of Things (Ramesh Chandra Goswami, Hiren Joshi)....Pages 467-476

Advances in Intelligent Systems and Computing 1174

Neha Sharma · Amlan Chakrabarti · Valentina Emilia Balas · Jan Martinovic, Editors

Data Management, Analytics and Innovation Proceedings of ICDMAI 2020, Volume 1

Advances in Intelligent Systems and Computing Volume 1174

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/11156

Neha Sharma · Amlan Chakrabarti · Valentina Emilia Balas · Jan Martinovic
Editors

Data Management, Analytics and Innovation
Proceedings of ICDMAI 2020, Volume 1

Editors

Neha Sharma
Society for Data Science
Pune, Maharashtra, India

Amlan Chakrabarti
A.K. Choudhury School of Information Technology, Faculty of Engineering
University of Calcutta
Kolkata, West Bengal, India

Valentina Emilia Balas
Department of Automatics and Applied Software, Faculty of Engineering
University of Arad
Arad, Romania

Jan Martinovic
IT4Innovations
VSB-Technical University of Ostrava
Ostrava, Czech Republic

ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-981-15-5615-9 ISBN 978-981-15-5616-6 (eBook) https://doi.org/10.1007/978-981-15-5616-6 © Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

These two volumes constitute the proceedings of the International Conference on Data Management, Analytics and Innovation (ICDMAI 2020), held from 17 to 19 January 2020 at the United Service Institution of India, New Delhi. ICDMAI is the signature conference of the Society for Data Science (S4DS), a non-profit professional association established to create a collaborative platform that brings together technical experts across industry, academia, government labs and professional bodies to promote innovation around data science. ICDMAI is committed to creating a forum that brings data science enthusiasts onto the same page and envisions advancing the field through collaboration, innovative methodologies and connections across the globe. Planning for ICDMAI 2020 started around 14 months in advance, and the entire core team worked to surpass our own benchmark. The core committee took utmost care with every facet of the conference, especially the quality of the submissions. Out of 514 papers submitted to ICDMAI 2020, only 12% (62 papers) were selected for oral presentation after a rigorous review process. This year the conference welcomed participants from 8 countries, 25 industries, and 80 international and Indian universities (IITs, NITs, IISER, etc.). Besides paper presentations, the conference also featured workshops, tutorial talks, keynote sessions and a plenary talk by experts in the respective fields. We appreciate the bonhomie and support extended by IBM, Wizertech, Springer, Ericsson, NIELIT-Kolkata, IISER-Kolkata and REDX-Kolkata.

The volumes cover a broad spectrum of topics in data science and the relevant disciplines. The conference papers included in these proceedings are published post-conference and are grouped into four areas of research: Data Management and Smart Informatics, Big Data Management, Artificial Intelligence and Data Analytics, and Advances in Network Technologies. All four tracks of the conference were highly relevant to current technological advancements, and a Best Paper Award was given in each track. A very stringent selection process was adopted: from plagiarism check to technical chairs' review to double-blind review, every step was religiously followed. We compliment all the authors for submitting high-quality conference papers to ICDMAI 2020.

The editors would like to acknowledge all the authors for their contributions and also the efforts of the reviewers and session chairs of the conference, without whom it would have been difficult to select these papers. We appreciate the unconditional support from the members of the National and International Program Committee. It was really interesting to hear the participants of the conference highlight the new areas and the resulting challenges as well as opportunities. The conference has served as a vehicle for spirited debate and discussion on many challenges that the world faces today. Our heartfelt thanks go to our General Chairs, Dr. P. K. Sinha, Vice-Chancellor and Director, IIIT, Naya Raipur, India, and Prof. Vincenzo Piuri, Professor, Università degli Studi di Milano, Italy. We are grateful to other eminent personalities who were present at ICDMAI 2020: Alfred Bruckstein, Technion (Israel Institute of Technology); C. Mohan, IBM Fellow, IBM Almaden Research Center in Silicon Valley; Dinanath Kholkar, Tata Consultancy Services; Anupam Basu, National Institute of Technology, Durgapur; Biswajit Patra, Intel; Lipika Dey, Tata Consultancy Services, New Delhi; Sangeet Saha, University of Essex, UK; Aninda Bose, Springer India Pvt. Ltd.; Kranti Athalye, IBM India University Relations; Mrityunjoy Pandey, Cognizant; Amit Agarwal and Ishant Wankhede, Abzooba; Kaushik Dey, Ericsson; Prof. Sugata Sen Roy, University of Calcutta; Amol Dhondse, IBM Master Innovator; Anindita Bandyopadhyay, KPMG; Kuldeep Singh, ODSC, Delhi Chapter; Rita Bruckstein, Technion, Israel; Sonal Kukreja, TenbyTen; and many more who were associated with ICDMAI 2020.

A conference of this magnitude was possible only due to the consistent and concerted efforts of many good souls. We acknowledge the contribution of our advisory body members, technical programme committee, people from industry and academia, reviewers, session chairs, media and authors, who have been instrumental in making this conference possible. Our special thanks go to Janusz Kacprzyk (Editor-in-Chief, Springer, Advances in Intelligent Systems and Computing Series) for the opportunity to organize this guest-edited volume. We are grateful to Springer, especially to Mr. Aninda Bose (Senior Publishing Editor, Springer India Pvt. Ltd.), for the excellent collaboration, patience and help during the evolvement of this volume. We are confident that the volumes will provide state-of-the-art information to professors, researchers, practitioners and graduate students in the area of data management, analytics and innovation, and that all will find this collection of papers inspiring and useful.

Neha Sharma, Pune, India
Amlan Chakrabarti, West Bengal, India
Valentina Emilia Balas, Arad, Romania
Jan Martinovic, Ostrava, Czech Republic

Contents

Part I: Data Management and Smart Informatics (pages 3-129)
Part II: Big Data Management (pages 133-280)
Part III: Artificial Intelligence and Data Analysis (pages 283-391)
Part IV: Advances in Network Technologies (pages 395-476)

Chapter-level listings with authors and page numbers are given in the Table of Contents above.

About the Editors

Neha Sharma is working with Tata Consultancy Services and is a Founder Secretary, Society for Data Science, India. She has previously served as Director of ZIBACAR, Pune, India. Holding a Ph.D. from the prestigious Indian Institute of Technology, Dhanbad, she is a senior IEEE member and Executive Body member of the IEEE Pune Section. She has received a "Best Ph.D. Thesis Award" and "Best Paper Presenter at International Conference Award" at the national level from the Computer Society of India. Her research interests include data mining, database design, artificial intelligence, big data, cloud computing, blockchain and data science.

Amlan Chakrabarti is a Full Professor at the School of IT, University of Calcutta. He was a Postdoctoral Fellow at Princeton University, USA, from 2011 to 2012. With nearly 20 years of experience in engineering education and research, he is a recipient of the prestigious DST BOYSCAST Fellowship Award in Engineering Science (2011), JSPS Invitation Research Award (2016), Erasmus Mundus Leaders Award from the EU (2017), and a Hamied Visiting Professorship from the University of Cambridge (2018). He is an Associate Editor of the Journal of Computers and Electrical Engineering, senior member of the IEEE and ACM, IEEE Computer Society Distinguished Visitor, Distinguished Speaker of the ACM, Secretary of the IEEE CEDA India Chapter and Vice President of the Data Science Society.

Prof. Valentina Emilia Balas is currently a Full Professor at the Department of Automatics and Applied Software, "Aurel Vlaicu" University of Arad, Romania. The author of more than 300 research papers, her research interests include intelligent systems, fuzzy control, soft computing, smart sensors, information fusion, modeling and simulation. She is the Editor-in-Chief of the Inderscience journals IJAIP and IJCSysE. She is the Director of the Department of International Relations and Head of the Intelligent Systems Research Centre at Aurel Vlaicu University of Arad.

Jan Martinovic is currently Head of the Advanced Data Analysis and Simulation Lab at IT4Innovations National Supercomputing Center, VSB-TUO, Czech Republic. His research activities focus on HPC, cloud and big data convergence, HPC-as-a-Service, traffic management and data analysis. He is the coordinator of the H2020 ICT EU-funded project LEXIS (www.lexis-project.eu). He has previously coordinated contracted research activities with international and national companies and been involved in the H2020 projects ANTAREX and ExCAPE. He has been a HiPEAC member since January 2020.

Part I

Data Management and Smart Informatics

Improving Microblog Clustering: Tweet Pooling Schemes Nadeem Akhtar and M. M. Sufyan Beg

Abstract Performing machine learning and natural language processing tasks on Twitter data is challenging due to the short and noisy nature of tweets. These tasks perform well on long documents such as news articles and research papers but poorly when applied to short texts like tweets. One way of improving the results is tweet pooling, i.e., combining related tweets to form longer, more coherent input documents. In this work, several new tweet pooling schemes are proposed based on two kinds of tweet auxiliary information: user mentions and URLs. The proposed tweet pooling schemes are evaluated for clustering quality on the clusters/topics obtained using standard Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). Using the topic labels of the tweet dataset, purity and Normalized Mutual Information (NMI) measures are used to evaluate clustering quality. Empirical results show that the proposed tweet pooling schemes outperform the existing schemes by a significant margin when more than one kind of tweet auxiliary information is used for pooling.

Keywords Latent Dirichlet Allocation · Non-negative Matrix Factorization · Tweet pooling · Hashtag · Tweet mention · Tweet URL

1 Introduction

Understanding short texts like tweets, news headlines, and web snippets is important for many text analytics and natural language processing applications such as friend recommendation, event detection, author attribution and verification, question-answering, and sentiment analysis. In many such applications, unsupervised machine learning tasks such as clustering, topic modeling, etc. need to be applied. Clustering


and topic modeling uncover the structure of a document corpus by finding semantically similar groups of documents. Topic modeling [1] finds coherent hidden topics, i.e., groups of similar words, by projecting documents from word space to topic space. Latent Dirichlet Allocation (LDA) [2] views each document as a probability distribution over hidden topics, where each topic is a probability distribution over the words in the corpus vocabulary; LDA and its variants provide powerful tools to uncover hidden topics in a document corpus. Non-negative Matrix Factorization (NMF) [3] is another way of finding the inherent topics or clusters of words, by factorizing the document-word matrix. NMF is identical to Probabilistic Latent Semantic Analysis (pLSA) [4] when the error function used is the Kullback-Leibler divergence [5], and LDA is a generalization of pLSA that is similar to it when a uniform Dirichlet prior is used instead of a sparse one. LDA and NMF have provided great results on long documents but, unfortunately, fail to produce satisfactory results on short texts. Short texts are usually characterized by their small size and their informal, noisy nature. In this work, we focus on tweets posted on Twitter. Tweets are very short, i.e., only 140 characters or less (recently, the maximum length has been doubled to 280 characters). Tweets may also contain contextual data such as hashtags, Twitter mentions, and URLs, and the use of abbreviations, slang, and informal language is widespread. Due to the short size of tweets, the word co-occurrence relationships are very sparse, which results in poor clustering and topic modeling performance [6]. Since both LDA and NMF rely on word co-occurrence relationships, they find incoherent and random topics and clusters when applied to tweets. One straightforward method to handle the data sparseness problem is tweet pooling, i.e., merging related tweets using some criterion to produce longer documents and thus improve the word co-occurrence relationships. To find related tweets for merging, contextual and auxiliary information of the tweets is used. Hashtags are part of the tweet body and provide contextual clues about the tweet topic; Mehrotra [7] pooled tweets that contain the same hashtag into a single document. Auxiliary information such as the tweet's user, its posting time, and replied tweets has also been used to find related tweets: Hong [8] used tweet users for tweet pooling, Mehrotra [7] also used posting time for hour-based and burst-based pooling schemes, and Alvarez-Melis [9] used tweet replies to form conversation trees and merged the tweets of each conversation tree into a single document. These tweet pooling schemes provide improved results for topic modeling with Latent Dirichlet Allocation. The problem with the above tweet pooling schemes is that the reduction in data sparsity depends on the common elements (i.e., the tweets' contextual and auxiliary information) being available in abundance. For example, if only a small number of tweets in the dataset contain hashtags, hashtag pooling will have no significant effect on topic modeling results. Only the tweets containing hashtags will be pooled into merged documents, and the large number of tweets without hashtags will not be merged into any larger documents. They will still be treated as short individual


documents. Similarly, if the tweets are not frequently posted in conversations, the conversation-based tweet pooling scheme will have little effect on topic modeling results. To resolve the above common element unavailability problem, we propose multilevel tweet pooling schemes that produce longer documents even if the number of one kind of common element is small. Two or more common elements (i.e., contextual or auxiliary information) are used for pooling in multilevel tweet pooling scheme. At the first level, all the tweets are pooled using one common element. Next, the remaining unpooled tweets are pooled using the second common element. Remaining unpooled tweets may be pooled using the third common element. The number of merged documents is increased using multiple common elements for tweet pooling, improving the word co-occurrence relationships. We also propose to use more contextual and auxiliary information for tweet pooling. Besides hashtags, a tweet may also contain two more contextual information—tweet mentions and URLs. We propose to use tweet mentions and URLs for pooling related tweets. The motivation behind the URL-based tweet pooling is that the tweets referencing the same URL will have same topic. Similarly, tweets which mention same users will tend to have same topics. We show that the pooling schemes based on user mentions and URLs also provide better results. We compare and evaluate the proposed tweet pooling schemes against the userbased, hashtag-based, and conversation-based tweet pooling schemes proposed previously. Evaluation is performed in terms of clustering quality using standard LDA model and Non-negative Matrix Factorization (NMF). Clustering quality is measured using purity and normalized mutual information. The next section describes the literature work carried out for generating longer pseudo-documents for improving clustering and topic modeling for short texts. Section 3 describes the proposed tweet pooling schemes. Section 4 presents the experimental setup and experiments. Section 5 shows the results and discussion. Section 6 concludes and discusses future directions.

2 Related Works

Several remedies have been proposed to better model short tweets for topic modeling. Earlier work on improving the performance of topic models can be divided into three types:

• Term expansion of the short tweets by adding related words
• Modifying the standard LDA to better fit short tweets
• Pooling similar tweets into longer tweet documents

Term expansion techniques expand the short text by appending additional terms drawn either from an external data source or from the tweet corpus itself. External dictionaries or knowledge bases such as WordNet [10], Wikipedia [11], domain-specific knowledge bases, and the World Wide Web [12] have been used to enrich the


tweet text to reduce the sparsity. Sahami et al. represented tweet texts with Wikipedia concepts. They used Wikipedia redirections to handle synonymy problem and outlinks to address sparsity problem. Their results on tweet dataset outperformed both VSM and LDA model in the clustering task. The techniques from the Information Retrieval and Query expansion are used to find words similar to the tweets words and added to the tweets. Alternatively, additional terms are generated from using the co-occurrence term relationships in the tweet corpus. Co-Frequency Expansion (CoFE) [13] used the cooccurrence frequency of terms to enrich the tweets assuming the words that co-occur have high probability to fall in the same topic. Bicalho [14] defined a Distributed Representation-based Expansion (DREx) based on the concept of word embedding to model word similarities. They defined a new framework based on metric space to generate large pseudo-documents that are more suitable for topic modeling. Second method to improve the topic modeling for short texts is to modify the LDA in such a way that suits short text clustering requirements. Bi-term Topic Model (BTM) [15] considers bigrams for topic modeling instead of unigrams as considered in standard LDA. BTM aggregates all bigrams in the same pseudo-document reducing the sparsity problem. BTM is proved to be effective for short text topic modeling. Latent Feature Latent Dirichlet Allocation (LFLDA) [16] introduces a latent feature for each topic which is vector representation of topics learned after each Gibbs sampling iterations. To generate a word, either the topic distribution or its latent feature is chosen for modeling. Instead of having a multinomial mixture of topics as in Latent Dirichlet Allocation (LDA), Dirichlet Multinomial Mixture (DMM) [17], and Twitter-LDA [6] models assume short text document to have been sampled from a single topic distribution and this kind of assumption is reasonable for the short text, which solves the data sparsity problem to some extent. Another topic model tailored for short texts, GPU-DMM, use auxiliary word embeddings to enrich short text from large corpora combined with DMM model. GPU-DMM [18] promotes semantically related words under the same topic during the sampling process using Generalized Polya Urn (GPU) model [19]. Hashtag Graph Topic Model (HGTM) [20] used hashtag co-occurrence relationship to supervise the topic modeling. In HGTM, tweets are considered as distributions over hashtags, and each hashtag is assumed to be a distribution over hidden topics. In the Word Network Topic Model (WNTM) [21], a word co-occurrence network is formed and used for finding latent word groups. These latent word groups are treated as topics. For each word, the distribution over the latent word groups is learnt using inference. Some topic models combine related short text documents into a single one to increase the document length. Author Topic Model (ATM) [22] models each document to have originated from a single author. This leads to the aggregation of all tweets written by an author into a single document. Self-Aggregation-based Topic Model (SATM) [23] integrates clustering and topic modeling to present a general framework for short text aggregation during topic inference based on the topical affinity. SATM assumed each short document to have sampled from a long pseudo-document. 
Embedding-based Topic Model (ETM) [24] learns inherent topics from short texts using word embedding that finds semantically similar words in the corpus. ETM


solves the sparsity problem by aggregating short text into long pseudo-documents and uses Markov Random Field regularized model that gives correlated words a better chance to be put in same topic. Some methods have used a combination of first and second methods. Dual Latent Dirichlet Allocation (DLDA) [25] model uses topically related external long text as auxiliary data for topical modeling of short text. DLDA uses transfer learning to learn two sets of topics from short and auxiliary long text and couple the topic parameters to synchronize between clustering on short and long texts. In [26], Guo has enriched tweets using related newswire documents using NLP tasks. The third method to handle the sparsity problem in the short texts is to aggregate the short documents into larger documents using some heuristics based on contextual information in the short documents. In the main text, Tweets have some contextual informations-hashtags, user mentions, tweet URLs. Besides that some more contextual information can be retrieved from the social interaction of the tweets reply tweets, location, users. Tweets are aggregated on the basis of these contextual information and standard LDA [2] model is used with aggregated tweets. Three popular tweet aggregation methods are aggregation by hashtags [7], users [8], and conversations [9]. Firstly, Hong et al. use the tweet users to aggregate all the tweets of a user into a single document before applying the standard LDA. This method is similar to the Author Topic Model. Mehrotra et al. proposed burst-score wise pooling, temporal pooling, and hashtag-based pooling. Burst-score wise pooling aggregate tweets based on the bursts, i.e., when the number of tweets per unit time is very high. Temporal pooling aggregates tweets posted in a single hour of time. Hashtags-based pooling aggregates those tweets which share the same hashtags. Empirical evidences have shown that the hashtag-based aggregation method improves topic modeling results on standard LDA considerably, but burst-score wise and temporal pooling schemes do not provide significant improvement. Alvarez-Melis [9] aggregates those tweets into a single document which are part of a single conversation. They used in_reply_to_status_id field to identify the conversation tree and merged all the tweets in the conversation tree into a single document. Conversation-based pooling scheme is claimed to perform better than hashtag-based pooling scheme and take considerably less time to train.

3 Tweet Pooling Schemes

Tweet pooling not only provides longer documents but also provides topically coherent documents, which results in better performance when training topic models. Tweets have several types of contextual and auxiliary information associated with them: timestamp, user, user mentions, hashtags, URLs, replied tweet, and location. Other auxiliary information may be derived from these; for example, tweet bursts can be derived from timestamps, and conversations can be derived from replied tweets. The general idea of pooling is to use some contextual or auxiliary information to find related tweets and pool them.


Several ways of pooling related tweets are possible, depending on the choice of auxiliary information. Hashtags, users, and conversations have been used effectively for tweet pooling in earlier work. Other methods have also been proposed, but the empirical evidence of their benefit over unpooled tweets is not comparable. Mehrotra et al. [7] proposed binning tweets by timestamp on an hourly basis and pooling the tweets in the same bin; they also used automatic hashtag labeling to enrich hashtag labels and further improve the hashtag-based pooling scheme. We now define the tweet pooling schemes used in our work.
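As a concrete illustration of this general idea, the sketch below groups tweets by a single key function; the tweet dictionary fields ("text", "user") and the pool_tweets helper are illustrative assumptions, not code from the paper.

```python
from collections import defaultdict

def pool_tweets(tweets, key_fn):
    """Group tweets whose key_fn value matches into one pseudo-document.

    Tweets for which key_fn returns None (no common element available)
    are left unpooled, as described in the paper. Returns the pooled
    documents and the list of still-unpooled tweets.
    """
    pools = defaultdict(list)
    unpooled = []
    for tweet in tweets:
        key = key_fn(tweet)
        if key is None:
            unpooled.append(tweet)
        else:
            pools[key].append(tweet["text"])
    pooled_docs = [" ".join(texts) for texts in pools.values()]
    return pooled_docs, unpooled

# Example: user-based pooling (Sect. 3.2) keys every tweet by its author,
# assuming each tweet is a dict with "text" and "user" fields.
# documents, leftovers = pool_tweets(tweets, key_fn=lambda t: t.get("user"))
```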

3.1 Unpooled Tweets

In this scheme, tweets are considered individually without any kind of pooling. This method is considered as the baseline.

3.2 User-based Tweet Pooling

In this scheme [8], all the tweets posted by a user are pooled into a single document, creating as many documents as there are users. This pooling scheme provides results similar to the Author Topic Model (ATM) [22] and has been shown empirically to be better than unpooled tweets.

3.3 Hashtag-based Tweet Pooling

All the tweets containing a given hashtag are pooled into a single document, resulting in as many documents as there are hashtags in the dataset [7]. There may be more than one hashtag in a single tweet, so one tweet may be pooled into more than one document. Tweets that do not contain any hashtag are treated as individual documents. This scheme also shows better performance than the unpooled scheme.
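Because a tweet can carry several hashtags, hashtag pooling cannot key each tweet by a single value; a minimal sketch of the multi-membership variant, assuming each tweet dictionary exposes a "hashtags" list, could look as follows.

```python
from collections import defaultdict

def pool_by_hashtags(tweets):
    """Hashtag-based pooling (Sect. 3.3): a tweet carrying several hashtags
    contributes its text to every matching pseudo-document, while
    hashtag-free tweets stay individual documents."""
    pools = defaultdict(list)
    singles = []
    for tweet in tweets:
        tags = tweet.get("hashtags") or []
        if not tags:
            singles.append(tweet["text"])
        else:
            for tag in tags:
                pools[tag.lower()].append(tweet["text"])
    return [" ".join(texts) for texts in pools.values()] + singles
```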

3.4 Conversation-based Tweet Pooling

The conversation-based pooling scheme [9] combines all the tweets in a conversation tree. A conversation tree involves a seed tweet, all tweets written in reply to it, the replies to those replies, and so on. Tweets that have no replies and are not in reply to any tweet are treated as individual documents. This scheme has been reported to give considerable improvement in the performance of LDA and ATM over the user- and hashtag-based pooling schemes.
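One hedged sketch of this scheme is given below: it walks the in_reply_to_status_id links up to the seed tweet and merges every tweet that shares the same root. The dictionary field names are assumptions, and replies whose parents fall outside the collected dataset are simply treated as roots.

```python
def pool_by_conversation(tweets):
    """Conversation-based pooling (Sect. 3.4): merge all tweets that share
    the same conversation root. Tweets outside any conversation end up in
    a pool of size one, i.e. they remain individual documents."""
    by_id = {t["id"]: t for t in tweets}

    def root_of(tweet):
        seen = set()
        while True:
            parent = tweet.get("in_reply_to_status_id")
            if parent is None or parent not in by_id or parent in seen:
                return tweet["id"]
            seen.add(parent)
            tweet = by_id[parent]

    pools = {}
    for t in tweets:
        pools.setdefault(root_of(t), []).append(t["text"])
    return [" ".join(texts) for texts in pools.values()]
```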


3.5 URL-based Tweet Pooling

Users sometimes include URLs of external sources in the tweet body. The content of the URL itself may provide additional information about the topic of the tweet; for example, the URL https://parameterless.com/uber_chooses_expedia_bos… may be used to infer that the tweet topic is related to the Uber Expedia boss. However, such URLs are rare; most tweets carry URLs from which no additional information can be extracted, for example, http://dlvr.it/Pj5hvd and http://www.bbc.co.uk/news/business-41070364. The motivation for URL-based pooling is therefore to combine all the tweets sharing the same URL text, since tweets sharing the same URL are expected to be topically related to the topic of that URL. Tweets that do not contain any URL are treated as individual documents.

3.6 Mention-based Tweet Pooling

Tweets that include the same user mention are combined into a single document. There may be more than one user mention in a single tweet, so one tweet may be pooled into more than one document. Tweets that do not contain any user mention are treated as individual documents. The motivation for mention-based pooling is that if two or more tweets mention the same user name, those tweets are likely to be topically related. This is similar to the user-based pooling scheme, where all tweets from a user are pooled into the same document.

3.7 Mention-Hashtag Tweet Pooling

This scheme first uses mentions for tweet pooling; after that, the remaining unpooled tweets are pooled using hashtags. The motivation for this kind of serial tweet pooling is that after the first pooling step a large number of tweets still remain unpooled, and these may be combined using some other tweet auxiliary information.

3.8 Mention-URL Tweet Pooling

First, mentions are used for pooling. After that, URLs are used.


3.9 URL-mention Tweet Pooling

This scheme uses URLs for pooling first, and mentions are used next for pooling the remaining tweets.

3.10 Tag-URL Tweet Pooling

In this scheme, hashtags are used for pooling before URLs are used for pooling the remaining tweets.

3.11 Mention-Tag-URL Tweet Pooling

This scheme uses three kinds of auxiliary tweet information for pooling sequentially. First, user mentions are used for pooling; the remaining tweets are pooled using hashtags in the second step; the tweets still unpooled after the second step are pooled using URLs in the third step; and the tweets left unpooled after the third step are added as individual documents. The last five tweet pooling schemes are examples of the many possible pooling schemes that use two or more kinds of auxiliary information sequentially.
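A minimal sketch of this sequential (multilevel) pooling, reusing the single-key pool_tweets helper sketched at the start of this section, is given below; for brevity it keys each tweet by one element per level, whereas Sects. 3.3 and 3.6 also allow a tweet to join several pools.

```python
def multilevel_pool(tweets, key_fns):
    """Multilevel pooling (Sects. 3.7-3.11): pool with the first key
    function, then pool the still-unpooled tweets with the next one, and
    so on; whatever is left becomes individual documents.

    key_fns is an ordered list of key functions, for example (all field
    names are assumptions):
    key_fns = [lambda t: (t.get("user_mentions") or [None])[0],
               lambda t: (t.get("hashtags") or [None])[0],
               lambda t: (t.get("urls") or [None])[0]]
    """
    documents, remaining = [], list(tweets)
    for key_fn in key_fns:
        pooled, remaining = pool_tweets(remaining, key_fn)
        documents.extend(pooled)
    documents.extend(t["text"] for t in remaining)
    return documents
```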

4 Experimental Setup

This section presents the dataset characteristics, i.e., the distribution of the various kinds of contextual information. The tweet dataset is preprocessed first and divided into training and test sets. The training set is pooled according to the various pooling schemes, and LDA and NMF are then run on the pooled training documents. Finally, the results are evaluated using the purity and NMI clustering quality measures.

4.1 Dataset Description

The dataset is collected using the Twitter Streaming API on August 28, 2017. We choose ten disjoint trending topics on Twitter; for each trending topic, a certain number of tweets is fetched using its keyword. The total number of tweets in the dataset is 11238. The statistics for the dataset are shown in Table 1, and the topic labels with their corresponding tweet counts are shown in Table 2.

Table 1 Dataset description

S no | Tweet entity  | Number of unique entities | Number of tweets
1    | Hashtags      | 1451                      | 6683
2    | Users         | 9119                      | 11238
3    | Replies       | 182                       | 193
4    | User mentions | 862                       | 1769
5    | URLs          | 2453                      | 6565

Table 2 Tweet topic distribution in the dataset

S no | Topic label      | Number of tweets
1    | Arakan           | 43
2    | Bawanabypoll     | 1055
3    | Darakhosrow      | 787
4    | Doklam           | 929
5    | Houston          | 1976
6    | Jackkirby        | 578
7    | kbc9             | 1517
8    | Mondaymotivation | 1701
9    | Nabykeita        | 1445
10   | Ramrahim         | 1207

4.2 Preprocessing

First, the data collected from Twitter are processed to remove the quirks of the social nature of Twitter. A raw tweet may contain slang, punctuation, and hyphenation of all sorts. All contextual information, i.e., hashtags, URLs, and user mentions, is removed from every tweet. After tokenizing, all stop-words and expressions are removed from the tweet text, HTML characters are escaped, and slang terms are normalized using a slang dictionary. Only tweets that have more than three tokens after preprocessing are retained. During the preprocessing step, the contextual and auxiliary information of the tweets is extracted and stored along with the tweets.
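A minimal sketch of this cleaning step is shown below; the regular expressions, stop-word list, and slang dictionary are assumed inputs, not the exact resources used in the paper.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")

def preprocess(text, stopwords, slang_map):
    """Clean one tweet as described in Sect. 4.2: strip URLs, hashtags and
    mentions, decode HTML entities, expand slang via a small dictionary,
    tokenize, drop stop-words, and keep the tweet only if more than three
    tokens survive (otherwise return None)."""
    text = html.unescape(text.lower())
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = [slang_map.get(tok, tok) for tok in re.findall(r"[a-z]+", text)]
    tokens = [tok for tok in tokens if tok not in stopwords]
    return tokens if len(tokens) > 3 else None
```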

4.3 Data Split

The dataset is randomly divided into a 70% training set and a 30% test set, preserving the topic label distribution in both. The training set is used to train the Latent Dirichlet Allocation and Non-negative Matrix Factorization models.
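Such a stratified split can be obtained directly with scikit-learn, as in the sketch below (the random seed is an assumption).

```python
from sklearn.model_selection import train_test_split

def split_dataset(texts, labels, seed=42):
    """70/30 stratified split preserving the topic-label distribution (Sect. 4.3)."""
    return train_test_split(texts, labels, test_size=0.30,
                            stratify=labels, random_state=seed)
```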

Table 3 Average document length after tweet pooling

S no | Pooling scheme  | Average length
1    | Unpooled        | 6.68
2    | Hashtag         | 7.32
3    | URL             | 8.42
4    | User            | 7.78
5    | Mention         | 7.02
6    | Reply           | 6.69
7    | Mention-tag     | 7.34
8    | Mention-URL     | 7.35
9    | URL-mention     | 11.50
10   | Tag-URL         | 8.72
11   | Mention-tag-URL | 11.50

4.4 Tweet Pooling

Using the contextual and auxiliary information extracted in the preprocessing step, document sets are generated according to the tweet pooling schemes defined in Sect. 3. The average document length after tweet pooling for each scheme is shown in Table 3.

4.5 Algorithm Configuration

We use standard Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) to evaluate the various tweet pooling schemes for clustering quality. Test tweets are compared with the topics obtained by LDA using the cosine similarity measure and assigned to the topic with maximum similarity; similarly, test tweets are compared with the cluster centers obtained by NMF and assigned to the most similar cluster. We performed experiments for three different values of the number of topics: 10, 20, and 30. For each tweet pooling scheme, the LDA model is trained with 5000 iterations, of which 1500 are burn-in iterations; beta is set to 0.01 and alpha to 50/number of topics. For NMF, the maximum number of iterations is set to 50, epsilon is set to 0.00001, and the K-means algorithm is used for initialization.
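The sketch below illustrates the overall fit-and-assign procedure using scikit-learn as a stand-in; note that scikit-learn's variational LDA does not expose the Gibbs-sampling parameters quoted above (iterations, burn-in, alpha, beta), and its NMF does not offer K-means initialization, so the sketch mirrors only the structure of the experiment, not its exact configuration.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fit_and_assign(train_docs, test_tweets, n_topics=10):
    """Fit LDA and NMF on the pooled training documents and assign every
    test tweet to the topic/cluster whose word vector is most
    cosine-similar (Sect. 4.5)."""
    vec = CountVectorizer()
    X_train = vec.fit_transform(train_docs)
    X_test = vec.transform(test_tweets)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X_train)
    nmf = NMF(n_components=n_topics, max_iter=50, tol=1e-5, random_state=0).fit(X_train)

    # components_ rows are topic-word (LDA) or cluster-centre (NMF) vectors.
    lda_labels = cosine_similarity(X_test, lda.components_).argmax(axis=1)
    nmf_labels = cosine_similarity(X_test, nmf.components_).argmax(axis=1)
    return lda_labels, nmf_labels
```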


4.6 Evaluation Criterion

To avoid directly mapping the obtained clusters to the ground-truth clusters, as required by precision and recall, we use Purity and Normalized Mutual Information (NMI) [27]. Purity is computed by assigning each cluster to the most frequent class in the cluster; the accuracy of this assignment is the percentage of correctly assigned documents, averaged over the number of documents. For a set of clusters C and a set of gold labels L, purity is defined as

\mathrm{Purity}(C, L) = \frac{1}{N} \sum_{k} \max_{j} |c_k \cap l_j|

where c_k is the set of tweets in the k-th cluster, l_j is the set of tweets assigned to the j-th label, and N is the number of test tweets. The purity measure is biased toward a large number of clusters: the maximum purity value is achieved when each document is assigned to its own cluster. To trade off clustering quality against the number of clusters, Normalized Mutual Information (NMI) is used, defined as

\mathrm{NMI}(C, L) = \frac{I(C, L)}{\bigl(H(C) + H(L)\bigr)/2}

where I(C, L) is the mutual information,

I(C, L) = \sum_{k} \sum_{j} \frac{|c_k \cap l_j|}{N} \log \frac{N \, |c_k \cap l_j|}{|c_k|\,|l_j|}

and H(C) and H(L) are the entropies of the cluster set C and the label set L, respectively:

H(C) = -\sum_{k} \frac{|c_k|}{N} \log \frac{|c_k|}{N}, \qquad H(L) = -\sum_{j} \frac{|l_j|}{N} \log \frac{|l_j|}{N}

NMI thus normalizes the mutual information of the cluster set C and the label set L by the average entropy of C and L.
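The two measures can be computed directly from the definitions above, as in the sketch below, where clusters and labels are parallel lists holding the predicted cluster id and the gold topic label of each test tweet (scikit-learn's normalized_mutual_info_score with arithmetic averaging should give the same NMI).

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Purity: assign each cluster to its most frequent gold label and
    count the fraction of correctly assigned tweets."""
    n = len(labels)
    correct = 0
    for c in set(clusters):
        members = [labels[i] for i, ci in enumerate(clusters) if ci == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / n

def nmi(clusters, labels):
    """Normalized Mutual Information: I(C, L) divided by the average of
    the entropies H(C) and H(L)."""
    n = len(labels)
    c_counts, l_counts = Counter(clusters), Counter(labels)
    joint = Counter(zip(clusters, labels))
    mi = sum((nkj / n) * math.log(n * nkj / (c_counts[k] * l_counts[j]))
             for (k, j), nkj in joint.items())
    h_c = -sum((v / n) * math.log(v / n) for v in c_counts.values())
    h_l = -sum((v / n) * math.log(v / n) for v in l_counts.values())
    return mi / ((h_c + h_l) / 2)
```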


5 Results and Discussion

To give insight into how the proposed tweet pooling schemes affected Latent Dirichlet Allocation, the top words from the topics learned by LDA with the number of topics set to 10 are shown in Table 4. Topical words are shown for only three pooling schemes due to space limitations. Note that all the topics learned by LDA except 'arakan' can easily be labeled manually using the top words; LDA cannot learn the 'arakan' topic under any pooling scheme because the number of tweets for that topic is very small (30). The topical words in the unpooled category contain some unrelated words, which are removed by the tweet pooling strategies. The hashtag-based tweet pooling scheme also retains some unrelated words, e.g., 'retweet' in 'darakhosrow', 'god' in 'doklam', and 'start' and 'week' in 'ramrahim'. Mention-tag-based tweet pooling, which provides the best results, includes only highly relevant words for each topic.

Table 4 Top ten words discovered by LDA for each topic (top words per topic under the Unpooled, Hashtag, and Mention-Tag pooling schemes)

The purity and Normalized Mutual Information values for Latent Dirichlet Allocation and Non-negative Matrix Factorization are shown in Table 5. Both the LDA and NMF results are reported for three values of the number of topics/clusters: 10, 20, and 30. The reported purity and NMI values are averages. For Non-negative Matrix Factorization, purity increases with the number of topics/clusters, but the NMI values decrease.

20

30

Purity

10

20

30

NMI

Unpooled

0.6180

0.6914

0.6952

0.4946

0.5032

0.4624

Hashtag

0.7053

0.7384

0.7213

0.5506

0.5369

0.5001

URL

0.6693

0.6779

0.6949

0.5229

0.4991

0.4795

User

0.7173

0.7141

0.7251

0.5890

0.5272

0.4941

Mention

0.6325

0.6848

0.7059

0.4997

0.4737

0.4824

Conversation

0.6564

0.6908

0.7141

0.5288

0.4933

0.4790

Mention-tag

0.8364

0.7841

0.7800

0.6846

0.5710

0.5452

Mention-URL

0.6867

0.7768

0.8049

0.6029

0.5567

0.5635

URL-mention

0.6738

0.6835

0.7173

0.5419

0.5035

0.4919

Tag-URL

0.6331

0.6536

0.7103

0.5091

0.4918

0.5080

Mention-tag-URL

0.5556

0.6690

0.6467

0.4947

0.4626

0.4436

LDA Number of topics 10

20

30

Purity

10

20

30

NMI

Unpooled

0.6069

0.7075

0.7330

0.4732

0.5069

0.5056

Hashtag

0.6681

0.7198

0.7210

0.5437

0.5429

0.5039

URL

0.6388

0.6965

0.7466

0.5142

0.4963

0.5233

User

0.7128

0.6807

0.7459

0.5328

0.5042

0.5102

Mention

0.6835

0.7081

0.7267

0.5402

0.5148

0.5241

Conversation

0.6788

0.6640

0.7305

0.5182

0.4973

0.4994

Mention-tag

0.7705

0.7708

0.7614

0.6007

0.5724

0.5384

Mention-URL

0.7321

0.7087

0.7620

0.5928

0.5364

0.5318

URL-mention

0.7314

0.7264

0.7213

0.5796

0.5377

0.5046

Tag-URL

0.6813

0.7116

0.7387

0.5302

0.5139

0.5257

Mention-tag-URL

0.6536

0.6835

0.7207

0.5304

0.5016

0.5034

16

N. Akhtar and M. M. S. Beg

Both the purity and NMI values for all tweet pooling schemes are better than those of the unpooled tweets. For NMF, the hashtag- and user-based pooling schemes outperform the mention-, URL-, and conversation-based schemes when only one kind of auxiliary information is used, while the mention-tag and mention-URL pooling schemes perform better than the hashtag- and user-based schemes. Mention-based pooling performs worst among all pooling schemes for NMF, but when mentions are combined with hashtags the result is the best. For LDA, URL-based pooling performs worst on its own, but when URLs are combined with mentions the combination performs better than both the hashtag- and user-based schemes. One intuitive reason for these results is that the more tweets are combined by pooling, the better the purity and NMI that are obtained: in the hashtag- and user-based schemes, almost all the tweets in the dataset are pooled, and user-based pooling provides slightly better results than hashtag-based pooling.

For Latent Dirichlet Allocation, purity increases and NMI decreases with the number of topics/clusters, as is the case with NMF. For LDA, the tweet pooling schemes are only slightly better than the unpooled scheme. The user-based pooling scheme outperforms the others when only one kind of auxiliary information is used; the mention- and conversation (reply)-based schemes perform comparably to the hashtag-based scheme, and URL-based pooling performs worst. As with NMF, the mention-tag and mention-URL pooling schemes outperform all the other tweet pooling schemes. Strangely, when three common elements (mentions, hashtags, and URLs) are used for pooling, the results become worse; for NMF, the mention-tag-URL pooling scheme even gives poorer results than the unpooled scheme.

6 Conclusion

This paper presents several tweet pooling schemes performed as a preprocessing step to improve the results of topic modeling and clustering for tweets. Two new pooling schemes based on URLs and user mentions are proposed. Empirical results show that the proposed schemes perform poorer than the existing hashtag- and user-based pooling schemes but slightly better than the conversation-based pooling scheme for NMF. The mention-based pooling scheme performs slightly better than hashtag-based pooling but poorer than user-based pooling for LDA. We also show that several other tweet pooling methods can be constructed by combining more than one piece of tweet contextual and auxiliary information. Results show that the mention-tag and mention-URL pooling schemes outperform all tweet pooling schemes that use a single piece of tweet auxiliary information, with mention-tag performing better than mention-URL. For future work, more tweet pooling schemes can be evaluated by combining other tweet auxiliary information. Moreover, the co-occurrence relationships of contextual information can also be used for tweet pooling.



An IoT-Based Smart Parking Framework for Smart Cities Rekha Gupta, Neha Budhiraja, Shreya Mago, and Shivani Mathur

Abstract Urbanization has raised the pace of economic, social, and political change and has led to serious socioeconomic problems. Unplanned population growth and development undertaken to meet society's needs have made the situation worse. Urbanization has driven a sharp growth in the number of vehicles on the roads while parking space in Indian cities has remained constant or shrunk. The parking situation is worst in metropolises, where land is limited and expensive. A shortage of parking space leads to traffic congestion, a mismatch between demand and supply, saturated parking lots, inappropriate tariffs, unregulated on-street and off-street parking, cruising, and environmental pollution and degradation. Because of poor management and policies, parking facilities in India have become extremely overcrowded. The proposed smart parking system envisions smart cities by blending cutting-edge, low-cost Internet of Things (IoT) sensor technology with smartphone-enabled payment systems and real-time data analytics. The collected information can be suitably displayed on billboards, information boards, and other smart boards at crucial junctions of the city, thereby guiding, streamlining, and controlling traffic and congestion. The parking system allows users to reserve a slot in advance, pick a desirable empty location, and cancel the slot if required. The model also proposes to develop 3-dimensional imagery of the entire solution on display. The smart parking solution would help generate efficient city landscapes by creating a seamless urban mobility system for an effective user experience. However, the solution demands effective data visualization, standardization, and analytics, smartphone integration with IoT technology, and cooperation and coordination among stakeholders.

Keywords Internet of Things (IoT) · Smart parking IoT system · Smart cities

R. Gupta (B) · N. Budhiraja · S. Mago · S. Mathur Lal Bahadur Shastri Institute of Management, Dwarka, Delhi 110075, India e-mail: [email protected] N. Budhiraja e-mail: [email protected] S. Mago e-mail: [email protected] S. Mathur e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1174, https://doi.org/10.1007/978-981-15-5616-6_2

1 Introduction

With the advent of urbanization, small nuclear families, and better disposable income, ownership of a vehicle stands as a status symbol. This has led to the rapid motorization of metropolitan cities, with urban households often holding multiple vehicles. The accessibility of a city is often determined by the extent of the vehicles, roads, and parking space that support it. This development-induced motorization has created demand for facilities like parking. In metropolitan cities with high-density populations, parking lots often struggle with being too small. This has led to problems like accidents, traffic congestion, disproportionate demand, and environmental hazards. Parking tends to become a problem for people, and the resulting decrease in the quality of life for residents living nearby can hurt local businesses. Parking thus turns out to be an important instrument to regulate a city's accessibility and vehicle flow and also to reduce environmental pollution.

2 Problems of Inadequate Parking

The associated problems are twofold: physical and psychological. The physical problems affect the city's impression and accessibility in terms of its transport lines. The problems associated with a shortage of parking space [1] include limited parking spaces, unstructured costs, preference given to on-street parking over off-street parking, parking in residential areas, parking during special events, and the environmental impact of parking; parking at the workplace and safe parking are associated issues. The psychological problems affect mental well-being and, with it, a comfortable, sound, and healthy living style. The psychological impacts of parking problems are not far behind the physical ones: the frustration of drivers looking for empty slots, the endless circling of public spaces, rising tempers and feuds, and the petrol, money, and time wasted in the process depict their vicious effects. In a survey conducted by the British Parking Association, drivers said they wasted time looking for a parking slot, termed the experience stressful, felt frustrated at the lack of parking space, and felt angry with bad drivers for occupying multiple parking slots. Often, even sophisticated urban dwellers are engaged in feuds with neighbors over parking problems.


3 The Challenge

The concept of smart cities further increases the demand for an efficient and effective parking solution. The challenge ahead of smart cities is to remake cities into gentle, smart, and efficient places with sufficient mobility, improved infrastructure, and smooth public transport flow. Smart cities advocate smart electric, nonpolluting, autonomous vehicles for reducing traffic and noise pollution, along with ecological parking that ensures quick, safe parking with minimal energy consumption. They also advocate dynamic technological solutions that guide drivers toward their ultimate destination—the parking spot. Over the decades, more than two dozen strategies for tackling the parking menace have been devised. Most of these strategies are well established and practiced in metropolitan areas, but their potential is yet to be fully exploited, and no single strategy works best everywhere; hence, many of them have been at the forefront over the years. Notable mentions include shared parking, regulating and pricing policies, using off-site parking, encouraging alternative modes of transportation, improving the availability and accessibility of public transport, improving walking and cycling conditions, improving parking facility design and operation, devising campaigns (odd-even, one house-one car, car only when parking is available) in schools, colleges, TV, newspapers, etc., and heavy taxation on the purchase of new vehicles. Technology is now used extensively to better implement the above-mentioned parking strategies, with solutions ranging across controlling, monitoring, and surveillance systems. Notable mentions include parking guidance technology, parking count systems, parking entry barriers, parking area protectors, automatic number plate recognition (ANPR) systems, and long-range RFID systems. These solutions are standalone in their respective domain areas, automating some of the processes, but they do not represent a holistic intelligent parking solution.

4 Proposed Framework

Parking systems in smart cities allow operators to manage parking effectively and to generate more revenue. Benefits envisaged by the smart IoT parking solution [2, 3] are optimized parking, reduced traffic and pollution, consolidated payments, better safety, enhanced user experience, real-time trend analysis, and reduced operational cost. The authors thus present a framework for a smart parking system that involves the use of low-cost IoT-based sensor technology, real-time analytics, and systems that allow people to view parking spaces. As per [4], IoT can be defined as the interconnection of smart devices or objects that are fitted with sensors or embedded systems and can communicate and interact with other machines, the environment, and objects. The "thing" in our case refers to analog devices such as cars. The car information is collected by the smart sensor, which connects directly or via gateways to the cloud. The cloud passes the information to the car parking system, which comprises the central server and the analytics software. These perform real-time data management with data preprocessing, processing, and consolidation along with data modeling and visualization. The processed information is then sent to data centers for further advanced analytics, mining, and data storage. This entire flow essentially represents the data flow. The smart parking system can also be integrated with other mobile cloud payment apps, mobile cloud public transportation apps, and information apps for add-on value-added services. Nowadays, built-in IoT platforms are available for seamless integration of such value-added services. The central server is responsible for passing the control information to all its stakeholders, such as the driver's mobile app, parking lot controllers, peripheral display systems, and regulators like civic authorities. These peripherals connect directly with the server for all later communications (Fig. 1).

The framework uses technical components like the mobile app, sensors, smart parking meters or smart parking payment systems, a central server on the cloud, and analyzing software. The framework provides the user with a mobile app for using the smart parking solution. The app allows the user to gain parking information for a parking bay of his or her requirement. The options proposed are on-street parking, indoor parking, or multilevel parking. The framework also proposes a 3D view of the desired information. The app then facilitates the parking of a preferred/optimized slot for a desired time along with the payment details shown by the system. The payment can be made via the app itself, and the app would provide specific authorizations and discounts if available. The app would also facilitate advance booking and extension of a booking along with cancellations within stipulated time periods. The app would further guide the car toward the slot to assist in parking via an optimized routing method. Also, in the case of all slots being occupied, it would re-route the driver to secondary slots or public transportation channels for better management of traffic.
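To make the data flow above concrete, the following is a minimal sketch of the kind of occupancy event a gateway might forward to the cloud platform. It is an illustration only: the endpoint URL, payload shape, and field names are assumptions rather than part of the proposed framework, and the requests library is used simply as a convenient HTTP client.

```python
# Minimal sketch of the sensor-to-gateway-to-cloud data flow described above.
# Assumes the requests package; the endpoint URL and field names are hypothetical.
import time
import requests

CLOUD_ENDPOINT = "https://city-parking-cloud.example.org/api/v1/slot-events"  # hypothetical

def publish_slot_event(slot_id, occupied):
    """Send one occupancy event from the gateway to the cloud platform."""
    event = {
        "slot_id": slot_id,            # e.g. "A-17"
        "occupied": occupied,          # True when a vehicle is detected
        "timestamp": int(time.time()),
    }
    response = requests.post(CLOUD_ENDPOINT, json=event, timeout=5)
    response.raise_for_status()
    return response.json()             # hypothetical cloud acknowledgement

if __name__ == "__main__":
    publish_slot_event(slot_id="A-17", occupied=True)
```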

Fig. 1 Proposed framework


(a) Sensors—Sensors observe parking spaces and report their status to the gateway, which sends the live information to cloud platforms that show real-time parking status viewable on various devices. They generally weigh less than 500 g, have a load resistance capacity of 1200–1400 kg, and have a battery life of about 10 years. They can be either wired or wireless, although the latter is preferred as a simple and cost-efficient solution that can be deployed easily in parking slots without expensive cable replacements. Wireless sensors are easier to maintain, are more vandal-proof, fit better with the aesthetics of the city, and can be connected with a radio device that communicates with the Bluetooth of smartphone devices. The sensors come fitted with filters for removing interference or noise—such as current coming from power lines or obstruction from passing trains or traffic—that can prompt false parking events. They can also be installed for counting the vehicles in a lot. Sensors can be classified according to where they are positioned or on the basis of the technology they use. On the basis of position, there are flush-mounted, surface-mounted, or overhead-mounted sensors. The sensors available are generally flush-mounted IoT parking sensors affixed right into the ground, unnoticeable yet highly efficient, or surface-mounted parking sensors attached to surfaces such as sheathing, cabling layers, rooftops, and wharfs. Overhead indicators are used in multilevel parking systems to guide drivers toward their destination using high-visibility, colored LED indicators that can be controlled dynamically. Based on the type of technology, there are passive infrared, active infrared, ultrasonic, magnetic, and optical sensors [5–8]. Passive Infrared Sensors (PIR) use a pair of sensors to detect heat changes and notify when a signal differential occurs. Active infrared sensors have an emitter and a receiver and notify when the receiver is unable to receive the emitted beam because of obstruction; infrared sensors are sensitive, and hence their accuracy is an issue. Ultrasonic sensors make use of sound waves, which makes them suitable for stable detection of objects in the environment: these sensors emit sound waves, and a receiver processes the reflected waves and calculates the distance between the sensor and the vehicle. Optical sensors detect a change in light when the light is obstructed by a vehicle. Magnetic parking sensors [9] use the earth's electromagnetic field to ascertain the presence of a vehicle and report it to the central server; they are the most preferred ones. Some parking systems also deploy secondary mechanisms, such as the strength of the signals received from the sensors, to ascertain the parking status—the cloud receives a weak signal when a car is parked.

(b) Gateway—The wireless sensor network (WSN) is connected by a gateway [10] to the wireless connectivity beyond. The gateway acts as a connector between the WSN, digital boards, parking indicators, and the cloud. Gateways act as a common IoT gateway platform, flexible enough to include a wide range of connectivity requirements and rules in city environments, ranging from common Ethernet and Wi-Fi devices to standards like ZigBee/802.15.4 [11], LoRaWAN [12], 3G, 4G, and the upcoming 5G. They often use secure communication with the cloud over Ethernet, Wi-Fi, 3G, and 4G. They also have provisions for over-the-air configuration setup and firmware updates, which means that the devices are kept updated with the latest software versions. Once the smart parking system is in place, users have a compatible device with the potential for handling services like lighting, Wi-Fi, and surveillance.

(c) Cloud—These generally provide an intelligent IoT services platform [13–15] using open web interfaces. These clouds provide a holistic suite of parking solutions that include sensor and counter management and maintenance along with that of signage and digital displays. They also include parking operational services for organization, enforcement, guidance, payment, and other informational services. They further provide data management, visualization, and analytical reporting. Clouds nowadays are compatible with the OpenID and OAuth2 standards [16] for advanced Identity and Access Management (IAM) capability. The cloud supports both direct and federated user account login mechanisms, while also enforcing data and functional security. Often the data stored in the database is rule based and predictive, using backward-chaining and forward-chaining methods; the rule-based systems are self-learning models that incorporate newer knowledge into the database. All the data in the database, whether in flight or at rest, needs to be encrypted. Data availability, security, and resilience are to be ensured by the cloud-based IoT platform. The cloud adheres to OpenAPI standards [16] and comes with comprehensive, well-developed API libraries conforming to the standards. The APIs cover the entire range of cloud platform services, from low-level data events to advanced analytics events. The architecture [17] preferred for the cloud should support availability, security [18], dynamic scaling, optimal efficiency, and effective continuous delivery. The cloud should also provide a layer of abstraction that lets business managers and analysts define or redefine their business requirements on the spot rather than having to make changes at the developer level. The cloud also provides dashboards [19, 20], both general purpose and customized, for information visualization: the administrator is able to view all controls, the parking controller all slots, the driver the parking layout, the allotted bay, the traffic situation, etc., and the data manager the data layouts. These are often displayed using HTML5 on any modern web browser. Analyzing software [21] analyzes and condenses the data collected by the server and makes it available to the stakeholders. This information is then made available on varied types of devices such as computers, phones, and tablets and can be personalized for any specific stakeholder. Advanced data-analytic software [22–24] can also assist in cash management, sanctions, billing and revenues, violations, etc. Personalized messages can be curated for the maintenance department, notifying other terminals, and therefore making it feasible to direct work to the areas in which the current violations are greatest (Fig. 2).
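As a small illustration of the ultrasonic detection principle mentioned above—the distance to a vehicle is inferred from the reflected sound wave—the sketch below converts an echo time into a distance and an occupancy decision. The threshold value and the source of the echo time are assumptions for illustration, not parameters of any particular sensor.

```python
# Illustrative sketch of how an ultrasonic parking sensor infers occupancy:
# a sound pulse travels out and back, so distance = speed_of_sound * echo_time / 2.
# The occupancy threshold below is a hypothetical value for illustration.

SPEED_OF_SOUND_M_S = 343.0        # approximate speed of sound in air at 20 °C
OCCUPIED_THRESHOLD_M = 0.5        # hypothetical: an object closer than 0.5 m => occupied

def distance_from_echo(echo_time_s: float) -> float:
    """Distance to the reflecting object; the pulse travels out and back."""
    return SPEED_OF_SOUND_M_S * echo_time_s / 2.0

def slot_occupied(echo_time_s: float) -> bool:
    return distance_from_echo(echo_time_s) < OCCUPIED_THRESHOLD_M

# Example: an echo returning after 1.5 ms corresponds to ~0.26 m -> slot occupied.
print(slot_occupied(0.0015))   # True
```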


Fig. 2 The wireless sensor network for the proposed framework

5 Perceived Benefits

Smart parking solutions envisage benefits for all the stakeholders involved: the city administration, parking management companies, and drivers. For the city, they redefine accessibility and its well-articulated transportation lifelines, and they redefine the user experience by improving the city's brand, image, and impression. They help the city's administration by regulating traffic, optimizing parking spaces, and managing parking problems efficiently. For drivers, they reduce the search time for parking spaces: the navigation provided by the solution allows drivers to take stock of the traffic situation, and the optimized routing helps significantly reduce the driving distance, thereby saving time, effort, and money. Efficient parking systems drive better business and trade opportunities for a city by making it a preferred destination. These solutions also serve as a data basis for city planning and the development of future traffic strategies. Further, they serve the larger purpose of acting as a green solution that reduces carbon emissions. Finally, by enabling safe and guided parking they reduce stress and thereby contribute toward a healthier lifestyle for the residents of the city.


6 Challenges

I. Security—Like any other technical solution, smart parking solutions may face numerous challenges on the operational side. The most vulnerable facet identified by the authors is the security challenges [25, 26] the entire framework faces in its server/client-side operations. Notable mentions are weak server-side controls, mainly due to a lack of Secure Socket Layer and low-grade encryption. The mobile app represents the second vulnerable dimension, as such apps are susceptible to weaknesses associated with cross-platform development. This is a critical concern in mobile app security that could result in confidential data theft, brand and trust damage, frauds, revenue losses, etc. Binary hardening techniques could be utilized to counter this. The application should also follow secure coding techniques for jailbreak detection controls, checksum controls, certificate pinning controls, and debugger detection controls. Lack of secure data transportation and insecure storage are other associated issues: ensuring Transport-Layer Security mechanisms is key to solving the former, and the way to secure data storage across platforms is to build an additional layer of encryption over the base-level encryption provided by the OS. Poorly designed authorization and authentication could also prove fatal, leading to the loss of personal and other information. Some of the viable solutions are allowing login only in online mode, thereby avoiding offline authentication, and using longer passwords rather than the 4-digit PINs in use nowadays.

(a) Lack of Binary Protections—In the absence of binary protection, an adversary can reverse engineer the code of the app to inject malware or redistribute a pirated, possibly tampered, application. It is a critical concern in mobile app security as it can result in confidential data theft, brand and trust damage, frauds, revenue losses, etc. To avoid this, it is important to use binary hardening techniques, under which the binary files are analyzed and modified to protect against common exploits. This allows for the fixing of vulnerabilities in the legacy code itself without the need for source code. The application should also follow secure coding techniques for jailbreak detection controls, checksum controls, certificate pinning controls, and debugger detection controls.

(b) Insecure Data Storage—Another common mobile app security loophole is the lack of secure data storage. A common practice among developers is to depend upon client storage for the data. But client storage is not a sandbox environment where security breaches are impossible: in the event of the acquisition of the mobile device by an adversary, the data can be easily accessed, manipulated, and used. This can result in identity theft, reputation damage, and external policy violation (PCI). The best way to secure data storage across platforms is to build an additional layer of encryption over the base-level encryption provided by the OS. This gives a massive boost to mobile app security and reduces dependence on the default encryption.


(c) Insufficient Transport-Layer Protection—The transport layer refers to the route through which the data is transferred from the client to the server and vice versa. With an insufficient transport layer, a hacker can gain access to the data and modify or steal it at will, resulting in frauds, identity threats, etc. A common practice is to use SSL and TLS to encrypt the communication.

(d) Poor Authorization and Authentication—Poor or missing authentication allows an adversary to anonymously operate the mobile app or the backend server of the mobile app. This is fairly prevalent due to a mobile device's input form factor, which encourages short passwords that are usually based on 4-digit PINs. Unlike users of traditional web apps, mobile app users are not expected to be online throughout their sessions, and mobile internet connections are not as reliable as traditional web connections. Hence, mobile apps may require offline authentication to maintain uptime. This offline requirement can create security loopholes that developers must consider when implementing mobile authentication: an adversary can brute force through the security logins in the offline mode and perform operations on the app. In the offline mode, apps are usually unable to distinguish between users and allow users with low permissions to execute actions that are only allowed for admins or super admins. In order to prevent operations on sensitive information, it is best to limit login to the online mode. If there is a specific business requirement to allow offline authentication, then the app data can be encrypted such that it can be opened only with specific operations.

(e) Security Decisions via Untrusted Inputs—Developers generally use hidden fields, values, or functionality to distinguish between higher and lower level users. An attacker might intercept the calls and tamper with such sensitive parameters. Weak implementation of such hidden functionality leads to improper app behavior, resulting in higher level permissions being granted to an attacker; the technique used to exploit these vulnerabilities is called hooking. A mobile application maintains communication between clients and servers using an interprocess communication (IPC) mechanism. IPC is also used to establish communication between different apps and to accept data from various sources. An adversary can intercept this communication and interfere with it to steal information or introduce malware. Here are some tips related to IPC mechanisms:

• In order to satisfy a business requirement for IPC communication, the mobile application should restrict access to only selected whitelisted applications.
• User interaction should be required before performing any sensitive action through the IPC entry points.
• Strict input validation is necessary to prevent input-driven attacks.
• Avoid passing sensitive information through IPC mechanisms, as it can be susceptible to being read by third-party applications under certain scenarios.
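As a rough illustration of the layered-encryption idea from item (b) above—encrypting application data before it ever reaches client-side storage—the sketch below uses the Python cryptography package. It is a simplified, hypothetical example: key handling is collapsed into a single call, whereas a real mobile app would obtain the key from a secure platform keystore.

```python
# Minimal sketch of the "additional encryption layer" idea from item (b):
# application data is encrypted before it is handed to client-side storage.
# Assumes the `cryptography` package; key handling here is simplified for
# illustration -- a production app would fetch the key from a secure keystore.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice: retrieved from a keystore, never generated inline
cipher = Fernet(key)

def store_record(record: bytes) -> bytes:
    """Encrypt a record before writing it to local storage."""
    return cipher.encrypt(record)

def load_record(blob: bytes) -> bytes:
    """Decrypt a record read back from local storage."""
    return cipher.decrypt(blob)

blob = store_record(b'{"booking_id": "A-17", "user": "u123"}')
assert load_record(blob) == b'{"booking_id": "A-17", "user": "u123"}'
```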


II. Costing—Smart parking systems demand serious financial commitments from governments. The costing involves hardware, software, deployment, maintenance, and upgrade costs, and [27, 28] also manpower and technical skill costs. Such budgets require executive approval and commitment from top management for the approvals to be granted.

III. Scalability—The main novelty of the proposed approach lies in the context of scalability [29], and in particular the reduced costs of obtaining it and the ease of introducing a new element into the existing infrastructure. The sustainability of hardware and software components presents yet another challenge to smart parking systems. Yet, with advancements in technology and the availability of cheaper cloud storage and built-in IoT platforms, it looks like a reality that is not far from being widely replicated in our lives.

7 Case Study

Manipal Hospital is a super-specialty healthcare facility that delivers world-class health care at an appropriate cost. Its mission is to deliver economical, precise, and approachable health care to society, without bias, with renowned doctors from across the world, best-in-class infrastructure in radio-diagnosis and clinical practices, and the latest technology. The prominent problem it faces today is that of parking spaces within the hospital campus: the spaces are either insufficient for the requirements of people or not properly allocated. The hospital has tried all possible ways to tackle the issue of hospital parking. Parking in a hospital situation has to be improved, as parking issues result in user inconvenience, and parking of vehicles needs parking policies for safety and security measures. Figures 3, 4 and 5 show some scenarios of the parking space at Manipal Hospital, Dwarka. Parking in hospitals can be managed by improving the use of existing parking spaces, by intimating the user about available parking lots, and by directing them accordingly. With the increasing popularity and feasibility of internet-enabled smartphones and round-the-clock data availability, measures can be initiated to solve the parking problem: an Android smartphone enables users to virtually carry the internet with them and take measurable steps to solve the issue. We have therefore come up with the idea of developing a cost-effective and automated parking management system for the hospital. For this, we divide the parking area of the hospital into blocks (A, B, etc.) and slots (A-1, B-3, etc.). We will develop an Android application that can be used by the visitor (employee or patient), the operator, and the guard appointed at that parking location.


Fig. 3 The parking space

Fig. 4 The congested parking for ambulances

• It provides a 3-dimensional view of the parking area.
• The app will allot parking according to the car's size and dimensions, leading to space optimization.
• It will show which parking slots are available and which are booked.
• It will provide navigation to the selected parking slot.
• Charges will be applied according to the number of hours parked in the slot.
• The operator can see the available and unavailable parking slots and the whole parking view.


Fig. 5 The parking area of Manipal Hospitals at ground level

• The guard will make sure that the car is parked in the slot that has been allotted to it.

The intended system is an Android application which consists of two parts: Administrator and Booking. After logging in to the application, the user can select a parking slot that is closest to their destination. After the user books a selected spot, the admin updates the status of that parking slot to "RESERVED". If the user does not reach the parking area within the given span of time, the reservation is canceled and the status is reset to "EMPTY". The Intelligent Parking System (IPS) is based on the client–server model. It is economically favorable because it does not need any heavy infrastructure, and it is not sensitive to temperature change or air turbulence. The main purpose of the Intelligent Parking System (IPS) is to provide the following:

• An intelligent, pervasive, user-friendly, and automated parking system application that reduces the user's time and avoids traffic congestion in urban cities.
• Safe and secure parking slots in limited space, which is of the utmost urgency.

The modules of the application are accessed through the user interface, which allows users to register, log in, reserve, and make payments. If the user is not registered in the application, the user should register by providing their details; after registering, the user can sign in using the user-id and password. Once logged in, the user finds an available parking space and then confirms that parking space by making an online payment. The parking slot is then allotted based on the metrics of the car, its size, and the availability of slots. If the user tries to park the car outside the allotted slot, the IoT sensors send a notification regarding the same to the administrator module. IoT deployments can involve costs of between Rs. 465 and 650 per sensor plus setup costs, and large-scale environments may require budgets of Rs. 1,00,000 and above for IoT projects.
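The booking flow described above—mark a slot RESERVED on payment and fall back to EMPTY if the visitor does not arrive in time—can be roughly sketched as follows. The class, field names, and the 15-minute arrival window are illustrative assumptions rather than details of the actual application.

```python
# Rough sketch of the reservation logic described above: a booked slot is marked
# RESERVED and falls back to EMPTY if the user does not arrive within the allowed
# time window. All names and the window length are illustrative assumptions.
from dataclasses import dataclass
import time

ARRIVAL_WINDOW_S = 15 * 60   # hypothetical: 15 minutes to reach the parking area

@dataclass
class ParkingSlot:
    slot_id: str                 # e.g. "A-1", "B-3"
    status: str = "EMPTY"        # EMPTY or RESERVED
    reserved_at: float = 0.0

    def reserve(self):
        if self.status != "EMPTY":
            raise ValueError(f"Slot {self.slot_id} is not available")
        self.status = "RESERVED"
        self.reserved_at = time.time()

    def expire_if_overdue(self, now=None):
        """Cancel the reservation when the arrival window has elapsed."""
        now = now if now is not None else time.time()
        if self.status == "RESERVED" and now - self.reserved_at > ARRIVAL_WINDOW_S:
            self.status = "EMPTY"

slot = ParkingSlot("A-1")
slot.reserve()
slot.expire_if_overdue(now=slot.reserved_at + ARRIVAL_WINDOW_S + 1)
print(slot.status)   # back to EMPTY after the window passes
```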


8 Conclusion

The need to have a smart city planned with smart facilities is a top priority in today's technology-driven world. In the past years, a large amount of work has been done toward realizing this dream through intelligent technology. The growth of various IoT and cloud computing techniques and technologies has given rise to new ideas and new possibilities for the vision of smart cities. Smart parking and its required facilities are a concern in our country due to increasing pollution, and many projects and ideas have already been deployed for smart IoT parking systems. We have addressed the issue and presented a viable solution that could solve the problem in an optimized way using IoT and cloud-based services. The issues discussed in this paper do not stand in seclusion, and the system we have proposed provides real-time information on available parking slots. Users can log in remotely from the provided mobile application; when a slot is occupied, its updated status is conveyed to the admin and the guard on the premises, and users can remotely book a location of their choice through the application. The effort made in this paper is to provide people with a more convenient and easier way to find a parking spot and to improve the conditions and welfare of users, so that it becomes feasible for customers to park their cars in a safe and economical parking space hassle-free.

References 1. Nagar, A. Planning tank. Issues and challenges of parking in Indian metropolises [Online]. Retrieved May 15, 2019, from https://planningtank.com/dissertation/issues-challenges-parking-indian-metropolises. 2. Plasma. 10 benefits of a smart parking solution [Online]. Retrieved April 12, 2019, from http://www.plasmacomp.com/blogs/benefits-of-smart-parking-solution. 3. Basavaraju, S. R. (2015). Automatic smart parking system using internet of things (IOT). 5(12), 629–632. 4. Lin, T., Rivano, H., & Mouël, F. L. (2017). A survey of smart parking solutions. IEEE Transactions on Intelligent Transportation Systems, 18(12), 3229–3253. 5. Larisis, N., Perlepes, L., Kikiras, P., & Stamoulis, G. (2012). U-Park: Parking management system based on wireless sensor network technology. In Proceedings on International Conference on Sensor Technologies and Applications (pp. 170–177). 6. Moguel, E., Preciado, M., & Preciado, J. (2014). Smart parking campus: An example of integrating different parking sensing solutions into a single scalable system. ERCIM News smart cities (pp. 29–30). 7. Chinrungrueng, J., Sunantachaikul, U., & Triamlumlerd, S. (2007). Smart parking: An application of optical wireless sensor network. In Proceedings on International Symposium on Applications and the Internet Wksp (p. 66). 8. Data collection—the scope of parking in Europe. European Parking Association (2013). www.europeanparking.eu. 9. Kumar, P., & Siddarth, T. (2010). A prototype parking system using wireless sensor networks. In Proceedings on International Joint Journal Conference on Engineering and Technology.


10. Vishnubhotla, R., Rao, P., Ladha, A., Kadiyala, S., Narmada, A., Ronanki, B., & Illapakurthi, S. (2010). Zigbee based multilevel parking vacancy monitoring system. In Proceedings on IEEE International Conference on Electro/Information Technology (pp. 1–4). 11. Tang, V., Zheng, Y., & Cao, J. (2006). Intelligent car park management system based on wireless sensor networks. In Proceedings on International Symposium on Pervasive Computing and Applications (pp. 68–72). 12. Ji, Z., Ganchev, I., O’Droma, M., Zhao, L., & Zhang, X. (2014). A cloud-based car parking middleware for IoT-based smart cities: Design and implementation. Sensors, 14(12), 22372– 22393. 13. Pham, T., Tsai, M., Nguyen, D., Dow, C., & Deng, D. (2015). A cloud-based smartparking system based on internet-of-things technologies. Special Section in IEEE Access—Emerging Cloud-Based Wireless Communications and Networks, 3, 1581–1591. 14. Inaba, K., Shibui, M., Naganawa, T., Ogiwara, M., & Yoshikai, N. (2001). Intelligent parking reservation service on the internet. In Symposium on Applications and the Internet Wksp (pp. 159–164). 15. Dillon, T., Chen, W., & Chang, E. (2010). Cloud computing: Issues and challenges. In Proceedings on 24th IEEE International Conference on Advanced Information Networking and Applications (AINA) (pp. 27–33). 16. Liang-Jie, Z., & Zhou, Q. (2009). CCOA: Cloud computing open architecture. In Proceedings on IEEE International Conference on Web Services. IEEE. 17. Okuhara, M., Tetsuo, S., & Suzuki, T. (2010). Security architecture for cloud computing. Fujitsu Science Technology Journal, 46(4), 397–402. 18. Gounder, M. S., Iyer, V. V., & Al Mazyad, A. (2016). A survey on business intelligence tools for university dashboard development. In Proceedings on 3rd MEC International Conference on Big Data and Smart City (ICBDSC). IEEE. 19. Quintuna, X. (2014). System and method for implementing dynamic access control rules to personal cloud information. U.S. Patent No. 8 (pp. 441–914). 20. Khan, R. H., Ylitalo, J., & Ahmed, A. S. (2011). OpenID authentication as a service in OpenStack. In Proceedings on 7th International Conference on Information Assurance and Security (IAS). IEEE. 21. Abadi, D. J. (2009). Data management in the cloud: Limitations and opportunities. IEEE Data Engineering Bulletin, 32(1), 3–12. 22. Sakr, S., Liu, A., Batista, D. M., & Alomari, M. (2011). A survey of large scale data management approaches in cloud environments. IEEE Communications Surveys & Tutorials, 13(3), 311– 336. 23. Demirkan, H., & Delen, D. (2013). Leveraging the capabilities of service-oriented decision support systems: Putting analytics and big data in cloud. Decision Support Systems, 55(1), 412–421. 24. Singh, J., Pasquier, T., Bacon, J., Ko, H., & Eyers, D. (2015). Twenty security considerations for cloud-supported internet of things. IEEE Internet of Things Journal, 3(3) (pp. 269–284). 25. Botta, A., De Donato, W., Persico, V., & Pescapé, A. (2016). Integration of cloud computing and internet of things: A survey. Future Generation Computer Systems, 56, 684–700. 26. Ray, P. R. (2016). A survey of IoT cloud platforms. Future Computing and Informatics Journal, 1(2), 35–46. 27. Gubbi, J., Buyya, R., Marusic, S., & Palaniswami, M. (2013). Internet of things (IoT): A vision, architectural elements, and future directions. Future Generation Computer Systems, 29(7), 1645–1660. 28. Ren, J., Guo, H., Xu, C., & Zhang, Y. (2017). Serving at the edge: A scalable IoT architecture based on transparent computing. 
IEEE Network, 31(5), 96–105.

Open-Source Software Challenges and Opportunities Ajeet Phansalkar

Abstract The availability of certain software, namely Open-Source Software (OSS), that is free to use and modify brings with it a set of challenges and opportunities which need to be addressed with caution. This paper looks into these challenges and opportunities from the perspective of applying such software in general and specifically in the Data Management and Analytics domains. The paper looks at the various myths and definitions surrounding Open-Source Software and how these can be addressed before beginning the journey of adopting Open-Source Software within the enterprise Information Technology landscape and the development of software products. The paper then dwells on one such opportunity: building a platform, named Decision Fabric™ (DF), using Open-Source Software, that provides the ability to build industry-specific analytical applications. Decision Fabric™ is a solution which can stitch together various open-source components to build a very compelling analytical solution based on various industry requirements. Decision Fabric™ provides analytical solutions for both Human Centric Analysis and Machine Centric Analysis using a combination of deterministic and probabilistic methods.

Keywords Open-source software · Challenges · Opportunities · OSS platform · Automation

A. Phansalkar (B) Tata Consultancy Services (TCS), Mumbai, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1174, https://doi.org/10.1007/978-981-15-5616-6_3

1 Introduction

According to various studies, the worldwide losses due to the failure of Information and Communications Technology (ICT) programs run into billions of dollars [1]. Open-Source Software is built upon licenses which allow the usage and distribution of the software in specific ways; the licenses give different kinds of rights to the user to modify the software source code [1]. Open-Source Software has the ability to save trillions of dollars per year for the ICT industry by virtue of its business model, which lowers the cost of software production, deployment, and ongoing maintenance over the entire software lifecycle [1].


Thus, Open-Source Software has been a very significant innovation within the ICT industry. Open-Source Software has also created a new dynamic in which developers and users represent a continuum [1]. In addition, studies have indicated that the proprietary model limits innovation [1]. By democratizing innovation, Open-Source Software has used a social model instead of the industrial model to achieve its productive potential. Many studies have also found that far fewer defects are produced by Open-Source Software than by Closed-Source Software [1]. Thus, Open-Source Software has propelled innovation through community development and intellectual capital [1]. Interoperability is a major requirement for software to coexist in an ecosystem of other software in order to achieve greater collaboration, transparency, and growth. Open Standards are built on principles of openness and transparency. Contrary to what people assume, Open Standards are not the same as Open-Source Software: Open Standards define specifications, while Open-Source Software provides the best implementation mechanism for them. Thus, interoperability with Open-Source Software allows diverse technologies to talk to each other [2].

2 Background

2.1 Definition of Open-Source Software (OSS)

Open-Source Software is software created with open collaboration between various parties honoring the Open Source Initiative's license conditions [3].

2.2 Business Forces for Open-Source Software Adoption

Adoption of any software within a business depends on various factors, such as the features provided by the software vis-à-vis business requirements, the cost associated with the software, and the support provided for the software beyond the development cycle. The key business forces for making a decision on the usage of Open-Source Software in an enterprise are categorized as follows:

2.2.1 Financial

Lower Total Cost of Ownership (TCO)

The Total Cost of Ownership is the cost of building and maintaining software across the software lifecycle, inclusive of development, testing, and production. Typically, the operating model allows OSS to be used at no or nominal cost in the development and test environments, while the production environment is charged on a subscription model. This keeps the Total Cost of Ownership of the software low compared to commercial closed-source software, which requires licenses for the development, test, and production environments. These cost savings provide a compelling reason for companies to adopt the technology.

2.2.2 Technical

Security

OSS is maintained by a community of technologists working constantly to fix any defects found in the security of the software. This makes it more secure than closed-source software.

Transparency

OSS being available with source code to a larger community allows transparency for understanding and enhancing the codebase.

Perpetuity

The fact that the software remains available with its source forever is a comforting factor to the implementers of the software.

Interoperability

The ability of the software to interact with various other software in the ecosystem helps to increase the reach of Open-Source Software.

Minimum vendor lock-in

With the software source available for modification at reasonable prices, users are not locked into a particular vendor. Software built upon Open Standards can integrate with other systems, thus reducing dependency on vendors.

2.3 Categories of Open-Source Software Licenses [3]

The categories of Open-Source Software licenses as per the Open Source Initiative are Copyleft and Non-Copyleft [3]. Copyleft licenses allow derivative works, but the derivative work needs to use the same license as the original work. Examples of Copyleft licenses are the GNU General Public License and the Affero GPL [3].


Such a license requires that someone modifying the source of Copyleft-licensed software use the license of the original software while releasing the modified software [3]. A Non-Copyleft license allows the release of software under other licenses, including a proprietary license [3]. Examples of Non-Copyleft licenses include Apache and BSD; these licenses are also called Permissive Licenses [3]. Open-Source Licenses are Open Source Initiative licenses which allow the software to be freely explored for the user's needs. Some of the common licenses used are [3]:

• Apache License 2.0
• BSD 3-Clause, BSD 2-Clause
• GNU General Public License
• MIT License
• Mozilla Public License.

OSS software allows usage for commercial purpose provided it meets the license requirements. OSS software and free software are two terms for the same thing. The OSS name was given by Open Source Initiative foundation while the Free Software Foundation (FSF) coined “free”. The FSF uses a four-point definition for “free”.

3 Challenges

OSS adoption faces multiple challenges, which are categorized as follows.

3.1 Need for Legal Support [4]

Enterprises that are to adopt OSS need support from the legal and compliance teams to validate the terms and conditions associated with the varied licenses. The legal teams should be able to provide guidance on creating an enterprise-level policy for the adoption of OSS. They should also be able to support the monitoring of these policies at the project and portfolio levels. Organizations may also have concerns regarding intellectual property and rights. Firms like Open Logic and Black Duck provide help with such policy creation and monitoring so that the enterprise does not fail to comply with license terms.


3.2 Need for Commercial, Community Support [4]

Enterprises which procure software always need the software to be supported by the company making it. With Open Source that can be a challenge, given that the software is produced in open collaboration. However, there are firms who work in collaboration with the makers of the software and provide support for it; some firms which provide such services include Open Logic and Black Duck. If changes are made to the product and are not shared with the community, they need to be maintained by the party making the changes, which requires additional resources; maintenance of these forks becomes a responsibility of the developers of the changes. Making changes to the products and having them accepted by the community is also a challenge. The OSS development process mandates that the changes made be thoroughly scrutinized, and any change or proposal for change is subject to review. The result could be that the changes are not accepted, which would leave the maintenance to the developers of the change. It is also observed that community members may want more say in a product developed by a company, which challenges the control the company can have over its line of authority.

3.3 Design and Architecture [4]

OSS is continuously being worked upon to add new features. If a company uses a certain version of the product and the product changes over multiple versions, the new version may not be backward compatible. This would challenge the company using the product if it relies on a particular feature or API. An organization will have to address this problem by moving to the latest version of the product/component, yet a feature could be deprecated or there could be architectural changes. One possible way to mitigate this challenge is to use product versions backed by a supporting package company. It is also possible for the selected product to be architecturally incompatible with existing architectures, and there could be dependencies on conflicting libraries. Such architectural incompatibility can have serious effects on the cost and schedule of development. There could also be a mismatch between the programming language, software, and hardware required for running the OSS components and the existing architecture. One way to address this kind of integration issue is to use middleware, virtual machines, or Model-Driven architectures.


3.4 Migration [4]

Migration from existing products to OSS products, and the subsequent training of users/developers, is also a challenge which needs to be considered if the organization is adopting OSS instead of a Commercial Off-the-Shelf (COTS) product.

3.5 Handling Security Issues [4]

Another challenge is the perception that, given the free availability of the source code, it is easy for someone with bad intent to introduce security vulnerabilities into the software. Firms providing support address this challenge, and the community also addresses it by fixing the vulnerabilities. It has been observed that having a large community review the code allows vulnerabilities to be reduced earlier.

3.6 Product Selection [4]

Due to the large number of products available for use, identifying the right product for a given set of requirements is a challenge, and the quality of the selected products is also a question. Because of this, a variety of evaluation mechanisms have been built; some examples are Capgemini's Open Source Maturity Model (OSMM) [4], Navica's OSMM [4], and OpenBQR [4]. The usage of these models is limited in spite of their providing such a mechanism. Because of the number of products available, it is also difficult to evaluate all these software products for the features they offer, and many times the time available for the project limits how much product evaluation can be done. When an OSS community decides to fork a product due to differences in requirements, identifying which fork to use also becomes a challenge. There is significant pressure from the community not to fork, but if it happens, there are concerns related to the future of the product itself.

3.7 Documentation [4]

The unavailability of good documentation is a challenge. Good documentation is necessary for others to understand the design, architecture, and code. However, there is a feeling in the community that having code comments is enough. Even when documentation is available, given the rate at which features are added to the software, the documentation may be inadequate.


4 Opportunities and Methodology

The various challenges with respect to the adoption of OSS also provide an opportunity to build applications, products, and platforms by gainfully managing those challenges. Decision Fabric™ is a platform built leveraging OSS. The platform is a cognitive automation solution primarily focused on providing advanced automation capabilities leveraging artificial intelligence techniques. Applications for varied domains can be built using the components of DF in a plug-and-play mechanism, with the different components plugged in as per the business requirements. The components come from areas like ingestion of structured and unstructured data formats, advanced Optical Character Recognition (OCR), Optical Mark Recognition (OMR), a robust and feature-rich Natural Language Processing (NLP) engine providing reinforcement learning, and advanced analytics that supports schema-less storage with highly interactive, visualization-driven signal detection. Decision Fabric™ is a cognitive automation solution based on the different disciplines of Artificial Intelligence, where the solution has the ability to learn like a human being and assist in decisions. It is a fabric of different technological solutions that need to be brought together, involving people, processes, data, etc. Decision Fabric™ aims to establish the right balance between the deterministic and probabilistic approaches and hence is a fabric of both approaches. The building blocks of Decision Fabric™ contain components from different branches of Artificial Intelligence such as Perception, Natural Language Processing (NLP), Machine Learning and Deep Learning, Reasoning, and Knowledge Representation. Most of the components in these building blocks are plug-and-play components; hence, third-party components can also be integrated based on the need and convenience of the business. The following figure shows the conceptual view that forms the core of the Decision Fabric™ platform (Fig. 1).

L0: In technical terms, this is the Data Management layer. The data storage facilitates different formats.
L1: Observe and Learn, where decision engines process transactional or interaction-specific information.
L2: Understand/Interpret, where enrichment of entity information and correlation with the entity master happen.
L3: Evaluate, which contains Machine Centric and Human Centric decision-making components.
L4: Decide, used when the quality of information is poor and requires semi-supervised or unsupervised learning.

DF was applied to build a Data and Analytics application for the Life Sciences domain that provided multiple benefits. The challenges faced included:

a. Varying sources of data: The application required extracting data from varying sources of structured and unstructured data like forms, portals, call center applications, mobile applications, social media, publications, email/faxes, litigations, and peer and regulatory databases.


Fig. 1 Solution components (Decision Fabric™ is a trademark of TCS)

b. Accuracy and Precision: Processing required a high level of accuracy, precision, and consistency.
c. Faster Processing: Request processing times needed to remain fast in spite of increasing data volumes.
d. Learning: Continuous learning was needed to retain knowledge within the enterprise and to understand hidden patterns in the data being analyzed.
The DF solution to address the above challenges included the following:
a. Ingestion—various Readers: Components for reading data from various formats such as XML, documents, emails, and images.
b. Optical Character Recognition: Intelligent reading of images, including embedded and scanned images; extraction of tiles from images to localize text; extraction of text from images and tiles; and extraction into digitized XML documents.
c. NLP: A feature-rich open-source NLP engine with strong lexical analysis and support for Named Entity Recognition. It also provides an algorithm that imitates human reading.
d. Enrichment: Dictionary support for spelling correction, medical product identification, and domain-specific fields such as dates, person/organization names, reactions, drug details, narratives, locations, test reports, derived fields, causality assessment, multiple choices for specific terms, and E2B creation for capturing data across dimensions. Each enrichment and decision is supported by a judgement trail.
e. Ontologies: The platform is seeded with ontologies covering diseases, medical events and procedures, and drugs.


f. Continuous Learning: Machine-centric decision-making focuses on the correlation or grouping of entities. This involves supervised learning with sufficient datasets to build hypotheses. Typical hypotheses are built around unlisted events for drugs, the relation between an event and medical history, and causality assessment. A balance of rules, case-based reasoning, and deep learning is used to maximize accuracy. Knowledge graphs can be built to show these correlations so that verification and analysis can be done at a later point in time. New patterns are fed back into the ontologies and rules.
g. Front End: The front end enables monitoring of case-processing batches and gives a user interface for reviewing each processed case along with confidence indicators for the different information and decision components.

5 Results
The efficiency of the solution is based on the following results achieved for the Life Sciences use case:
a. Demonstrated that 42% of the load can be handled with Straight Through Processing, i.e., without manual intervention.
b. Achieved significantly high accuracy of >90% across the main domains and an F1 score of >90%, against manual processing benchmarks of around 80%.
c. Key challenges included getting the semantically right accuracy percentage and identifying Special Circumstances cases (25%).
d. Processing duration (except for litigation cases with 1000+ pages) was brought down to 20 s, leaving ample time for professionals to review where necessary.
e. The solution is committed to bring a cost saving of 20% within 1 year of operations.
The Decision Fabric™ engine's existing features and well-defined feature roadmap ensured the following benefits for the customer:
• Scalable and futuristic solution: Handles volume growth by managing growing volumes and diverse types of incoming data formats.
• Foundation roadmap: Lays a roadmap for use cases within and beyond the current application.
• Scalable solution: Cognitive automation based on Decision Fabric™ can improve the efficiency and quality of application results.
• Throughput: The solution also handles a larger number of cases, typically thousands, with lower turnaround times, thus increasing the throughput of the applications. When such a case comes to a life sciences company, it is typically outsourced to a Business Process Outsourcing team and handled by multiple stakeholders as part of a workflow, which requires accuracy in capturing user inputs and audit trails of the transactions. While handling the case, the system processes the case inputs, applying business rules using Artificial Intelligence techniques. This automates the manual processes with accuracy and reduces case turnaround times.


• Usability, Security, and Integration: The system can integrate with existing transaction systems in a plug-and-play and secure way. Once a case enters the system, multiple system users can act on it and record their findings. An audit trail of who did what on the case is kept, without compromising the data security aspects of the case.
• Backup and Recovery: All aspects of a mature transaction processing system are covered with respect to backup and recovery of data, in line with the practices of the pharmaceutical industry.
• Supported Data Types: The data is typically unstructured, without a fixed template. It can be forms filled in as per the case, telephonic records, emails, or manual inputs. The system handles this variety of data types using Human Reading Mimicking techniques, which mimic how a human reads varied data.

6 Conclusion
This paper highlights the various challenges in the adoption of OSS. With proper management of these challenges, OSS can be used for building platforms, products, and solutions at a low Total Cost of Ownership. With the right set of open-source components, transformative analytical applications can be built that accelerate the journey of digital transformation across companies. The Decision Fabric™ components will continue to be enhanced in order to provide transformative analytical application-building capabilities.
Trademarks: Decision Fabric™ (DF) is a trademark of Tata Consultancy Services (TCS).
Acknowledgements Thanks go to Mahesh Kshirsagar, Tata Consultancy Services (TCS), Mumbai, India, who mentored this project.

References 1. Tiemann, M. (2011). How open source software can save the ICT industry one trillion dollars per year. 2. Almeida, F., Oliveira, J., & Cruz, J. (2011). Open standards and open source: Enabling interoperability. 3. https://opensource.org/faq. 4. Stol K. J., & Ali Babar, M. (2010). Challenges in using open source software in product development: A review of literature.

“Empirical Study on the Perception of Accounting Professionals Toward Awareness and Adoption of IFRS in India” Neha Puri, Harjit Singh, and Vikas Garg

Abstract Indian corporations prepare financial statements in their own way, according to the nation's requirements, and bringing uniformity to accounting standards is a huge task. This is addressed by the implementation of IFRS, a popular accounting language for companies worldwide that is comprehensible and comparable across the globe. In the era of globalization and technology, in which the Indian financial system has also flourished, adopting IFRS would place Indian business enterprises on an equal footing with other global businesses and widen the scope of foreign financing in the Indian market. To converge with IFRS, India has also created the Ind-AS standards. This study examines the perception of accounting professionals toward the awareness and adoption of IFRS. Keywords IFRS · Awareness · Adoption · Accounting standards · Perception

1 Introduction
"International Financial Reporting Standards" (IFRS) have been developed and framed by the "International Accounting Standards Board" (IASB) and have been adopted by, or converged with, many nations around the globe. The IFRS disclosure requirements reduce financial information asymmetry and improve financial specialists' ability to offer more precise forecasts [12]. IFRS are a set of accounting standards developed by an independent, not-for-profit organization called the "International Accounting Standards Board" (IASB). The main agenda of IFRS is to provide a global framework for how public companies prepare and disclose their financial statements. IFRS provides broad guidance for preparing financial statements rather than creating rules for industry-specific reporting. IFRS is particularly relevant for large companies that have subsidiaries located in different countries. The adoption of a uniform set of worldwide standards (IFRS) will ease accounting procedures by permitting a company to use one reporting language in the preparation of financial statements. A uniform standard will also provide investors and auditors with a consistent view of investments. Around 100 countries permit or require IFRS for public companies, with more countries expected to transition to IFRS by 2015. India has not adopted IFRS Standards for reporting by domestic organizations and has not yet formally committed to embracing IFRS Standards. Hence, it is important for domestic companies to implement International Financial Reporting Standards (IFRS) in order to bring uniformity to the recording of financial transactions. Many studies have established that companies fail to comply with the disclosure requirements of IFRS [11, 19]. The salient features of IFRS have an impact on financial reporting, helping to make financial information comparable and supporting fair value measurement, reliability, transparency, accountability, and efficiency in the financial markets of different countries. The goal of IFRS is to make international comparisons as easy as possible. Although the standards have a critical impact on capital markets, students and investors know very little about them. The emerging business environment has brought the implications of IFRS for Indian business under scrutiny, so it is necessary to analyze the similarities and distinctions between IFRS and the Indian Accounting Standards (AS) [21]. The adoption of IFRS has raised the standards of maintaining financial statements. Financial reports incorporating fair value measurement have created new ways of communicating the growth and development of the organization to its stakeholders. Examining the extent to which organizations have adopted IFRS, [6] underlined that companies face both difficulties and opportunities with regard to IFRS and that steps are taken to make the transition smooth and transparent. In this study, data was collected from accounting professionals on their perception toward awareness and adoption of "International Financial Reporting Standards" (IFRS). Therefore, this study aims to examine and summarize the outcome of a questionnaire survey on the attributes that form respondents' perception toward awareness and adoption of IFRS. Various studies have been conducted on awareness and implementation of IFRS in India. It is essential to be well prepared for understanding and implementing IFRS, including participants' perceptions of the infrastructure for education, training, and IT. Accounting experts and users have similar views on the awareness and preparedness problems of IFRS implementation [23]. There has been a transitional shift among academicians relating to IFRS. The majority of the accounting academicians in Konya have awareness and knowledge of IFRS; thus, IFRS is gaining a lot of importance [18].
N. Puri (B) Amity College of Commerce and Finance, Amity University, Noida Campus, Noida, Uttar Pradesh, India e-mail: [email protected]
H. Singh Amity School of Business, Amity University, Noida Campus, Noida, Uttar Pradesh, India e-mail: [email protected]
V. Garg Amity University, Greater Noida Campus, Noida, Uttar Pradesh, India e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1174, https://doi.org/10.1007/978-981-15-5616-6_4


The incorporation of IFRS in over 100 nations a decade ago was a once-in-a-lifetime innovation of incredible promise and magnitude. There was good reason to expect success, based mainly on extensive enthusiasm for global norms and, behind that, recognition of the powerful forces of globalization. There were dangers, however, and a priori there was limited evidence to guide decision-makers. This is still the case a decade later. Globalization remains a potent political and economic force, driving the demand for globalization of accounting [4].

2 Related Background
2.1 Awareness and Adoption of IFRS
To assess the position of SMEs, with the evaluation carried out in the case of Romania, the authors of [9] conducted an empirical analysis of the primary components. Several items were identified to perform the evaluation; the authors noted that inflexible legislation and the link between accounting and tax norms make reporting compulsory at least for tax-related purposes. On the other hand, [10] carried out an empirical analysis of the technical elements of applying IFRS for small and medium-sized enterprises in Europe, again in the Romanian context. The authors found that the Romanian industry provides technical alternatives beyond the IFRS for SMEs proposal, one of the primary causes of these variations being the tax-accounting connection. Another study pointed out that, even among countries that have already adopted the same version of IFRS, two factors are important to consider: national culture and language translation. These can undermine the consistent interpretation and application of IFRS and lead to a lack of comparability across countries. The main objective of that article was to focus on two significant obstacles that impede the dependable interpretation and application of converged standards: the influence of national culture on the interpretation of standards and the difficulty of translating standards into other languages [25]. Exploratory research has examined the potential application of IFRS for Small and Medium Enterprises in Romania; the authors identified the existing emphasis on tax compliance as one of the primary barriers to implementation [2]. Another study focused on the "similarities and differences: a comparison of IFRS, US GAAP, and Indian GAAP," recognizing that there are indeed many advantages arising from convergence for various stakeholders [8]. Liu [15] studied the changes in the differences between IFRS and US GAAP, also aiming to identify the value relevance of the IFRS system. Ramanna [20] focused on evidence that the probability of IFRS adoption first increases and then decreases in the quality of countries' domestic governance institutions, consistent with IFRS being adopted when governments are capable of taking timely decisions and exploring opportunities.


A study entitled "IFRS in India: Challenges and Opportunities" [6] focused on the extent to which IFRS has been adopted by firms and on the challenges and opportunities companies face regarding IFRS, with an emphasis on awareness and adoption of IFRS in India. Mahender [16] worked in the field of "IFRS and India: its problems and challenges"; the main aim of that study was to analyze the information available on the IFRS adoption process in India, giving due consideration to the adoption procedure and the utility for India of adopting IFRS. Jones [13] drew results from a survey of employers related to the general awareness of global financial reporting standards, focusing on the ability of graduating seniors to define and describe IFRS, compare and contrast principles- and rules-based approaches to accounting standards, and understand IFRS financial statements well enough to reconcile them to U.S. GAAP. Khan [14] showed that most countries have set timelines for convergence with or adoption of IFRS in the near future. The paper focused on the Group of 20 Leaders (G20), who, at their September 2009 meeting in Pittsburgh, USA, called on international accounting bodies to step up their efforts to achieve this goal through their independent standard-setting processes. IFRS implementation will establish a single worldwide accounting language to guarantee the relevance, completeness, comprehensibility, reliability, timeliness, neutrality, verifiability, consistency, comparability, and accountability of financial statements, resulting in a qualitative shift in accounting reporting that will improve trust and empower investors and other users of accounting information around the world. Satyanarayana [22] discussed IFRS as a common global language for business transactions, so that financial transactions are easy to understand and can be compared across the world; the author emphasized that the introduction of "International Financial Reporting Standards" (IFRS) will bring a drastic change from a closed economy to an open economy. Sharad Sharma [23] studied the issues of execution in India and suggested solutions: cost, lack of knowledge among investors about IFRS, and lack of uniformity in accounting guidance issued by multiple regulators ("SEBI, IRDA, RBI") in India are the major implementation issues. The suggestions included adequate training for IFRS accounting experts and staff, and reform of the taxation system by the Government to match IFRS. The obligatory transition to "International Financial Reporting Standards" was much more than a modification of accounting regulations; the primary concern of companies was to know the extent to which accounting variations between domestic GAAP and IFRS could influence their reported performance. This problem has been tackled by providing empirical evidence of its nature [7].


2.2 Research Methodology
The study employs exploratory research and a quantitative approach to measure the perception of accounting professionals toward the adoption and awareness of IFRS. A Google Form was used as the medium to collect the respondents' information. The data was then exported from the Google Docs editor and used in SPSS for the analysis.

2.3 Research Instrument
The questionnaire was designed using simple and impartial words to make it convenient for the respondents to understand. The first variable, designed for the awareness of IFRS, has been adapted from Akhter [1] and Satyanarayana [22].

2.4 Sample Selection
Accounting professionals constitute the population of the study. The method used for sample selection is judgmental/purposive sampling. The details of the accounting professionals were gathered from different companies and from practicing chartered accountants. The purpose of conducting this research is to understand the relevance of the awareness and adoption of IFRS in India.

2.5 Data Analysis Methods
This research employs descriptive statistics, chi-square tests, and independent t-tests to study the data, using SPSS 20 software to validate and test the hypotheses.

2.6 Data Analysis
Descriptive Analysis The information for the current research was gathered from accounting experts; the sample consists of 199 responses. The demographic summary of the respondents is given in Table 1. The frequency distribution of respondents indicates that 67.8% of the respondents are male and 32.2% are female.


Table 1 Demographic overview of the respondents
Demographic variables | Categories | Frequency | Percentage
Gender | Male | 135 | 67.80
Gender | Female | 64 | 32.20
Age (in years) | 20–30 | 96 | 48.20
Age (in years) | 30 and above | 103 | 51.80
Education level | Graduate | 84 | 42.20
Education level | Post-graduate and above | 115 | 57.80
Monthly income | Less than ₹50,000 | 69 | 34.60
Monthly income | ₹50,000–₹100,000 | 48 | 24.10
Monthly income | ₹100,000–₹200,000 | 40 | 20.10
Monthly income | ₹200,000 and above | 39 | 21.20
Occupation | Business and professional | 98 | 49.24
Occupation | Service and other | 101 | 50.76

This suggests that male respondents are more aware of International Financial Reporting Standards (IFRS) than female respondents. The frequency distribution of age indicates that 48.2% of respondents belong to the 20–30 years category, 40.2% to the 30–40 years category, 9.0% to the 40–50 years category, and the remaining 2.6% are above 50 years of age. The frequency distribution of education level indicates that 42.2% of respondents have attained a graduate degree, whereas 57.8% have attained post-graduation and above. The frequency distribution of monthly income indicates that 34.6% of respondents earn less than ₹50,000, 24.1% earn ₹50,000–₹100,000, 20.1% earn ₹100,000–₹200,000, and the remaining 21.2% earn ₹200,000 and above. The frequency distribution of occupation indicates that 49.24% of respondents belong to the business and professional category and the remaining 50.76% to the service and others category.

3 Hypothesis Development
3.1 Hypothesis Testing for Determining Awareness of IFRS
H1: Gender does not affect the awareness level of International Financial Reporting Standards.
H2: Age does not affect the awareness of International Financial Reporting Standards.


H3: Educational Qualification does not affect the awareness of International Financial Reporting Standards.
H4: Monthly Income does not affect the awareness of International Financial Reporting Standards.
H5: Occupation does not affect the awareness of International Financial Reporting Standards.

3.2 Accounting Standards and IFRS Mean One and the Same Thing
H6: Gender does not affect the perception that Accounting Standards and IFRS mean one and the same thing.
H7: Age does not affect the perception that Accounting Standards and IFRS mean one and the same thing.
H8: Educational Qualification does not affect the perception that Accounting Standards and IFRS mean one and the same thing.
H9: Monthly Income does not affect the perception that Accounting Standards and IFRS mean one and the same thing.
H10: Occupation does not affect the perception that Accounting Standards and IFRS mean one and the same thing.

3.3 Awareness About the Number of IFRS Effectively Present
H11: Gender does not affect the awareness about the number of IFRS effectively present.
H12: Age does not affect the awareness about the number of IFRS effectively present.
H13: Educational Qualification does not affect the awareness about the number of IFRS effectively present.
H14: Monthly Income does not affect the awareness about the number of IFRS effectively present.
H15: Occupation does not affect the awareness about the number of IFRS effectively present.

3.4 One-Way Chi-Square Tests
Table 2 summarizes the chi-square tests of association between awareness of IFRS and the demographic variables.


Table 2 Results of chi-square tests to check the association between awareness of IFRS and demographic variables
S. No | Demographic variable | Chi-square statistic | P-value | Remark
1 | Gender | 27.639 | 0.000 | Significant association found
2 | Age (in years) | 1.113 | 0.291 | No significant association found
3 | Education level | 10.853 | 0.001 | Significant association found
4 | Monthly income | 35.092 | 0.000 | Significant association found
5 | Occupation | 1.432 | 0.231 | No significant association found

4 Association Between Awareness of IFRS and Demographic Variables
4.1 Association Between Gender and Awareness of IFRS
The two categories of awareness of IFRS considered for the present study, for both "male" and "female" respondents, were "Yes" and "No." Cross tabulation was performed to determine the row and column percentages, and the chi-square value was calculated. The chi-square value in Table 3 indicates that it is significant at the 0.01 level, which means that there is a significant difference between the genders. As far as awareness is concerned, 96.3% of male respondents are aware of IFRS as compared to 70.3% of female respondents, clearly indicating that the awareness level is higher among male respondents.

Table 3 Association between gender and awareness of IFRS (cross tabulation)
Awareness of IFRS (count, % within gender) | Male | Female | Total
Yes | 130 (96.3%) | 45 (70.3%) | 175 (87.9%)
No | 5 (3.7%) | 19 (29.7%) | 24 (12.1%)
Total | 135 (100%) | 64 (100%) | 199 (100%)
Chi-square test and p-value = 27.639 (0.000)
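The study reports chi-square values computed in SPSS 20. Purely as an illustration, the Pearson statistic in Table 3 can be reproduced from its counts; the following minimal Python sketch uses scipy (an assumption about tooling, not the authors' implementation), with Yates' continuity correction disabled so that the uncorrected Pearson chi-square is obtained.

    # Chi-square test of association between gender and awareness of IFRS,
    # using the counts from Table 3 (rows: aware Yes/No; columns: Male, Female).
    from scipy.stats import chi2_contingency

    observed = [[130, 45],   # aware of IFRS
                [5, 19]]     # not aware of IFRS

    # correction=False disables Yates' continuity correction for this 2x2 table.
    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(f"chi-square = {chi2:.3f}, dof = {dof}, p-value = {p:.2e}")
    # chi-square is approximately 27.64 with p on the order of 1e-7,
    # matching the value 27.639 (0.000) reported in Table 3.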


Table 4 Association between education level and awareness of IFRS (cross tabulation)
Awareness of IFRS (count, % within education) | Graduate | Post-graduate and above | Total
Yes | 69 (79.3%) | 106 (94.6%) | 175 (87.9%)
No | 18 (20.7%) | 6 (5.4%) | 24 (12.1%)
Total | 87 (100%) | 112 (100%) | 199 (100%)
Chi-square test and p-value = 10.853 (0.001)

4.2 Association Between Education Level and Awareness of IFRS
The two categories of IFRS awareness considered for the current research were "Yes" and "No," and the two educational levels were "Graduate" and "Post-Graduate and Above." Cross tabulation was performed to find the row and column percentages, and the chi-square value was calculated. The chi-square value in Table 4 indicates that it is significant at the 0.01 level, which means there is a significant difference between the views of graduate and post-graduate respondents. As far as awareness is concerned, 94.6% of respondents in the post-graduate and above category are aware of IFRS as compared to 79.3% of graduate respondents. This indicates that the awareness level is higher among post-graduate respondents and that the difference between the two percentages is significant.

4.3 Association Between Monthly Income and Awareness of IFRS
The two categories of IFRS awareness considered for this research were "Yes" and "No," and there were four levels of monthly income: "Less than ₹50,000," "₹50,000–₹100,000," "₹100,000–₹200,000," and "₹200,000 and above." Cross tabulation determined the row and column percentages, and the chi-square value was calculated. In the case of monthly income, it is found that respondents in the "₹50,000–₹100,000" and "₹200,000 and above" categories are more aware of IFRS.


The chi-square value in Table 5 indicates that it is significant at the 0.01 level, which means that there is a significant difference between the views of respondents in different income groups. As far as awareness is concerned, 100% of respondents in the "₹50,000–₹100,000" and "₹200,000 and above" categories are aware of IFRS, as compared to those in the "less than ₹50,000" and "₹100,000–₹200,000" categories (69.6% and 92.5%, respectively). This indicates that the awareness level is higher among respondents in the higher income categories and that the difference between the percentages is significant. Table 6 presents the results of the chi-square tests of association between the demographic variables and awareness about the number of IFRS effectively present.

Table 5 Association between monthly income and awareness of IFRS (cross tabulation)
Awareness of IFRS (count, % within income) | Less than ₹50,000 | ₹50,000–₹100,000 | ₹100,000–₹200,000 | ₹200,000 and above | Total
Yes | 48 (69.6%) | 48 (100%) | 37 (92.5%) | 42 (100%) | 175 (87.9%)
No | 21 (30.4%) | 0 (0.0%) | 3 (7.5%) | 0 (0.0%) | 24 (12.1%)
Total | 69 (100%) | 48 (100%) | 40 (100%) | 42 (100%) | 199 (100%)
Chi-square test and p-value = 35.092 (0.000)

Table 6 Results of chi-square tests to check the association between demographic variables and awareness about the number of IFRS effectively present
S. No | Demographic variable | Chi-square statistic | P-value | Remark
1 | Gender | 8.191 | 0.004 | Significant association found
2 | Age | 0.188 | 0.665 | No significant association found
3 | Education level | 14.948 | 0.000 | Significant association found
4 | Monthly income | 23.814 | 0.000 | Significant association found
5 | Occupation | 25.258 | 0.000 | Significant association found


5 Association Between Demographic Variables and Awareness About the Number of IFRS Effectively Present
5.1 Association Between Gender and Awareness About the Number of IFRS Effectively Present
The two categories of awareness considered for the present study, for both "Male" and "Female" respondents, were "Yes" and "No." Cross tabulation determined the row and column percentages, and the chi-square value was calculated. The chi-square value in Table 7 indicates that it is significant at the 0.01 level, which means that the opinions of male and female participants differ significantly. As far as awareness is concerned, 63.7% of male respondents are aware of the number of IFRS effectively present, as compared to 42.2% of female respondents. This clearly indicates that the awareness level is higher among male respondents and that the difference between the two percentages is significant.

5.2 Association Between Education Level and Awareness About the Number of IFRS Effectively Present
The two categories of awareness about the number of IFRS effectively present considered for the present study were "Yes" and "No." In addition, the two levels of education were "Graduate" and "Post-Graduate and Above." Cross tabulation determined the row and column percentages, and the chi-square value was calculated. In the case of education level, respondents in the post-graduate and above category have greater awareness of IFRS.

Table 7 Association between gender and awareness about the number of IFRS effectively present (cross tabulation)
Awareness about the number of IFRS effectively present (count, % within gender) | Male | Female | Total
Yes | 86 (63.7%) | 27 (42.2%) | 113 (56.8%)
No | 49 (36.3%) | 37 (57.8%) | 86 (43.2%)
Total | 135 (100%) | 64 (100%) | 199 (100%)
Chi-square test and p-value = 8.191 (0.004)


Table 8 Association between education level and awareness about the number of IFRS effectively present (cross tabulation)
Awareness about the number of IFRS effectively present (count, % within education) | Graduate | Post-graduate and above | Total
Yes | 36 (41.4%) | 77 (68.8%) | 113 (56.8%)
No | 51 (58.6%) | 35 (31.2%) | 86 (43.2%)
Total | 87 (100%) | 112 (100%) | 199 (100%)
Chi-square test and p-value = 14.948 (0.000)

The chi-square value in Table 8 indicates that it is significant at the 0.01 level, which means that there is a considerable difference between the views of graduate and post-graduate respondents. As far as awareness is concerned, 68.8% of respondents in the post-graduate and above category are aware of the number of IFRS effectively present, as compared to 41.4% of those in the graduate category. This indicates that the awareness level is higher among post-graduate respondents and that the difference between the two percentages is significant.

5.3 Association Between Monthly Income and Awareness About the Number of IFRS Effectively Present
The two categories of awareness considered for this research were "Yes" and "No," and the four monthly income levels were "Less than ₹50,000," "₹50,000–₹100,000," "₹100,000–₹200,000," and "₹200,000 and above." Cross tabulation was performed to determine the row and column percentages, and the chi-square value was calculated. In the case of monthly income, respondents in the "₹200,000 and above" category have the greatest awareness about the number of IFRS effectively present. The chi-square value in Table 9 indicates that it is significant at the 0.01 level, which shows a significant difference between the views of respondents in different income groups. As far as awareness is concerned, 78.6% of respondents in the ₹200,000 and above category are aware of the number of IFRS effectively present, as compared to 34.8% of those in the less than ₹50,000 category. This clearly indicates that the awareness level is higher among respondents in the ₹200,000 and above category and that the difference between the two percentages is significant.


Table 9 Association between monthly income and awareness about the number of IFRS effectively present (cross tabulation)
Awareness about the number of IFRS effectively present (count, % within income) | Less than ₹50,000 | ₹50,000–₹100,000 | ₹100,000–₹200,000 | ₹200,000 and above | Total
Yes | 24 (34.8%) | 32 (66.7%) | 24 (60%) | 33 (78.6%) | 113 (56.8%)
No | 45 (65.2%) | 16 (33.3%) | 16 (40%) | 9 (21.4%) | 86 (43.2%)
Total | 69 (100%) | 48 (100%) | 40 (100%) | 42 (100%) | 199 (100%)
Chi-square test and p-value = 23.814 (0.000)

5.4 Association Between Occupation and Awareness About the Number of IFRS Effectively Present
The two categories of awareness considered for the current research were "Yes" and "No," and the two occupational levels were "Business and Professional" and "Service and Others." Cross tabulation was performed to determine the row and column percentages, and the chi-square value was calculated. In the case of occupation, respondents in the business and professional category have greater awareness about the number of IFRS effectively present. The chi-square value in Table 10 indicates that it is significant at the 0.01 level, which indicates a significant difference between the views of business and professional respondents and service and other respondents. As far as awareness is concerned, 76.4% of respondents in the business and professional category are aware of the number of IFRS effectively present, as compared to 40.9% of those in the service and other category. This indicates that the awareness level is higher among business and professional respondents and that the difference between the two percentages is significant. Table 11 presents the results of the chi-square tests of association between the demographic variables and the perception that Accounting Standards and IFRS mean one and the same thing.


Table 10 Association between occupation and awareness about the number of IFRS effectively present (cross tabulation)
Awareness about the number of IFRS effectively present (count, % within occupation) | Business and professional | Service and other | Total
Yes | 68 (76.4%) | 45 (40.9%) | 113 (56.8%)
No | 21 (23.6%) | 65 (59.1%) | 86 (43.2%)
Total | 89 (100%) | 110 (100%) | 199 (100%)
Chi-square test and p-value = 25.258 (0.000)

Table 11 Results of chi-square tests to check the association between demographic variables and the perception that Accounting Standards and IFRS mean one and the same thing
S. No | Demographic variable | Chi-square statistic | P-value | Remark
1 | Gender | 12.504 | 0.000 | Significant association found
2 | Age | 8.249 | 0.004 | Significant association found
3 | Education level | 2.333 | 0.127 | No significant association found
4 | Monthly income | 12.921 | 0.005 | Significant association found
5 | Occupation | 0.181 | 0.671 | No significant association found

6 Association Between Demographic Variables and the Perception that Accounting Standards and IFRS Mean One and the Same Thing
6.1 Association Between Gender and the Perception that Accounting Standards and IFRS Mean One and the Same Thing
In the case of gender, male respondents more often hold the perception that Accounting Standards and IFRS mean one and the same thing.


Table 12 Association between gender and the perception that Accounting Standards and IFRS mean one and the same thing (cross tabulation)
Accounting Standards and IFRS mean one and the same thing (count, % within gender) | Male | Female | Total
Yes | 38 (28.1%) | 4 (6.3%) | 42 (21.1%)
No | 97 (71.9%) | 60 (93.8%) | 157 (78.9%)
Total | 135 (100%) | 64 (100%) | 199 (100%)
Chi-square test and p-value = 12.504 (0.000)

The chi-square value in Table 12 indicates that it is significant at the 0.01 level, which reflects a substantial difference of opinion between male and female participants. As far as this perception is concerned, 28.1% of male respondents consider Accounting Standards and IFRS to mean one and the same thing, as compared to 6.3% of female respondents. This clearly indicates that the perception is more common among male respondents and that the difference between the two percentages is significant.

6.2 Association Between Age and the Perception that Accounting Standards and IFRS Mean One and the Same Thing
The two categories considered for the present study were "Yes" and "No," and the two age groups were "20–30" and "30 and above." Cross tabulation was performed to determine the row and column percentages, and the chi-square value was calculated. In the case of age group, respondents aged 30 and above more often hold the perception that Accounting Standards and IFRS mean one and the same thing. The chi-square value in Table 13 indicates that it is significant at the 0.01 level, which shows a significant difference between the views of the two age groups. As far as this perception is concerned, 29.1% of respondents in the 30 and above age group consider Accounting Standards and IFRS to mean one and the same thing, as compared to 12.5% of those in the 20–30 age group. This clearly indicates that the perception is more common among respondents aged 30 and above and that the difference between the two percentages is significant.


Table 13 Association between age and the perception that Accounting Standards and IFRS mean one and the same thing (cross tabulation)
Accounting Standards and IFRS mean one and the same thing (count, % within age) | 20–30 | 30 and above | Total
Yes | 12 (12.5%) | 30 (29.1%) | 42 (21.1%)
No | 84 (87.5%) | 73 (70.9%) | 157 (78.9%)
Total | 96 (100%) | 103 (100%) | 199 (100%)
Chi-square test and p-value = 8.249 (0.004)

6.3 Association Between Monthly Income and the Perception that Accounting Standards and IFRS Mean One and the Same Thing
To test the hypothesis that "there is no association between monthly income and the perception that Accounting Standards and IFRS mean one and the same thing," a chi-square test was performed using cross tabulation. The two categories considered for this research were "Yes" and "No," and the four levels of monthly income were "Less than ₹50,000," "₹50,000–₹100,000," "₹100,000–₹200,000," and "₹200,000 and above." Cross tabulation was performed to determine the row and column percentages, and the chi-square value was calculated. In the case of monthly income, respondents in the ₹100,000–₹200,000 category most often hold the perception that Accounting Standards and IFRS mean one and the same thing. The chi-square value in Table 14 indicates that it is significant at the 0.01 level, which means that there is a significant difference between the opinions of respondents in different income groups. As far as this perception is concerned, 32.5% of respondents in the ₹100,000–₹200,000 category consider Accounting Standards and IFRS to mean one and the same thing, as compared to 7.2% of those in the less than ₹50,000 category. This clearly indicates that the perception is more common among respondents in the ₹100,000–₹200,000 category and that the difference between the two percentages is significant.

Table 14 Relationship of monthly income with the perception that Accounting Standards and IFRS mean one and the same thing (cross tabulation)
Accounting Standards and IFRS mean one and the same thing (count, % within income) | Less than ₹50,000 | ₹50,000–₹100,000 | ₹100,000–₹200,000 | ₹200,000 and above | Total
Yes | 5 (7.2%) | 12 (25%) | 13 (32.5%) | 12 (28.6%) | 42 (21.1%)
No | 64 (92.8%) | 36 (75%) | 27 (67.5%) | 30 (71.4%) | 157 (78.9%)
Total | 69 (100%) | 48 (100%) | 40 (100%) | 42 (100%) | 199 (100%)
Chi-square test and p-value = 12.921 (0.005)



7 Conclusion and Discussion
IFRS awareness and adoption play a substantial part in the growth and prosperity of business organizations. It has been observed in the study that 96.3% of male respondents working in different organizations have an awareness of IFRS. That awareness and adoption of IFRS is beneficial for attracting foreign investors was highlighted by Muniraju [17] in a previous study. Comparison of the financial health and performance of Indian companies becomes possible if uniform accounting standards (IFRS) are adopted by organizations. The implementation and its benefits, as well as the reasons for acceptance, will largely depend on the accounting and regulatory frameworks, context, and other factors of the countries concerned [5]. The "International Accounting Standards Board" (IASB) introduced the "International Financial Reporting Standards" (IFRS), aimed at enhancing transparency in financial statements, as identified by Sruthiya [24]. Consistency and comparability, identified by Alleyne [3], are significant factors in the preparation of financial reports; given the rapid growth of the Indian economy, this results in attracting more investors from domestic and foreign territories.

8 Societal and Managerial Implications
In the era of globalization, it is significant for companies to adopt a single set of high-quality international standards. These standards help maintain uniformity in the preparation of financial statements, which enables inter-firm as well as intra-firm comparison. IFRS awareness and adoption help management take economic decisions keeping in view the growth and prosperity of the organization. The adoption of IFRS provides a platform for the methodical review and estimation of the financial performance of multinational companies whose holding and subsidiary companies are located in different parts of the world. This helps prevent manipulation or fraud in the preparation of financial statements. These standards bring quality to the maintenance of the different financial statements, which brings reliability to their disclosure.

References 1. Akhter, A. (2013). Awareness of International Financial Reporting Standards (IFRS): A study of Post-Graduate Students of Commerce & Management in Kashmire. IOSR Journal of Business and Management (IOSR-JBM), 14(5), 16–24. 2. Albu, C. A.-V. (2010). IFRS for SMEs in Europe—Lessons for a possible implementation in Romania. In Proceedings of the 5th WSEAS International Conference on Economy and Management Transformation (Vol. 2, pp. 659–663). 3. Alleyne, T. A.-R. (2017). Indian Accounting standards and the transition to IFRS. International Education and Research Journal.


4. Ball, R. (2016). IFRS-10 years later. Accounting and Business Research, 46(5), 545–571. 5. Bhattacharyya, K. (2012). India’s adoption of International Financial Reporting Standards (IFRS): Advantages and challenges. Journal of Commerce & Management Thought, 3(3), 475–479. 6. Bhutani, A. S. (2012). IFRS in India: Challenges and opportunities. IUP Journal of Accounting Research & Audit Practices, XI(2), 6–32. 7. Cordazzo, M. (2013). The impact of IFRS on net income and equity: Evidence from Italian listed companies. Journal of Applied Accounting Research, 14(1), 54–73. 8. Datta, K. (2009). Similarities and differences a comparison of IFRS. Price Water House Coopers: US GAAP and Indian GAAP. 9. Deaconu, A., Popa, I., Buiga, A., & Fulop, M. (2008). Impact analysis of future accounting regulation for SMEs in Europe. Journal of International Business and Economics, 8(1), 128– 146 10. Deaconu, A. (2009). Accounting models in the post-communism Romanian history—An empirical investigation. In Working Paper, Babes-Bolyai University. 11. Gao, R. (2018). The impact of mandatory international Financial Reporting Standards adoption on investment efficiency: Standards, enforcement, and reporting incentives. Abacus, Accounting Foundation, 54(3), 277–318. https://doi.org/10.1111/abac.12127 12. Hodgdon, C. T. (2008). Compliance with IFRS disclosure requirements and individual analysts‘ forecast errors. 17(1), 1–13. 13. Jones, S. W. (2013). IFRS knowledge, skills, and abilities: A follow-up study of employer expectations for undergraduate accounting majors. Journal of Education for Business, 88, 352–360. https://doi.org/10.1080/08832323.2012.727889. 14. Khan, N. A. (2014). Global convergence of financial reporting. Researchers World-Journal of Arts Science & Commerce, 1(1), 40–45. 15. Liu, C. (2009, September). Are IFRS and US-GAAP already comparable? International Review of Business Research Papers, 5(5), 76–84. 16. Mahender, K. S. (2013). IFRS & India—Its problem and challenges. International Multidisciplinary Journal of Applied Research, 78–82. 17. Muniraju, M. (2016). A study on the impact of International Financial Reporting Standards convergence on Indian corporate sector. Journal of Business and Management, 4, 34–41 18. Naim Ata Atabeya, H. A. (2014). Awareness level and educational efforts of academicians relating to the International Financial Reporting Standards: A research on accounting academicians in Konya. Emerging Markets Queries in Finance and Business, 15, 1655–1662 19. Onali, E. (2017). Investor reaction to IFRS for financial instruments in Europe: The role of firm-specific factors. Finance Research Letters, 21, 72–77 20. Ramanna, K. (2009). Why do countries adopt International Financial Reporting Standards? Harvard Business School Accounting & Management Unit, 102. 21. Sambaru, M., & Kavitha, N. V. (2014). A study on IFRS in India. International Journal of Innovative Research & Development, 3(12), 362–367 22. Satyanarayana, A. J. (2016). Auditors awareness in convergence from Indian GAAP to IFRS. Indian Journal of Applied Research, 6(10), 330–332. 23. Sharad Sharma, M. J. (2017). IFRS adoption challenges in developing economies: an Indian perspective. Managerial Auditing Journal, 32(4/5), 406–426. https://doi.org/10.1108/MAJ-052016-1374. 24. Sruthiya, V. N. (2017). International Financial Reporting standards implementation in india: benefits and problems. IRA-International Journal of Management & Social Sciences, 6(2), 292–297. 25. Tsakumis, G. T., Campbell, D. R., & Doupnik, T. S. 
(2009). IFRS: Beyond the standards. Journal of Accountancy, 207(2), 34–36, 38–39, 12.

On Readability Metrics of Goal Statements of Universities and Brand-Promoting Lexicons for Industries Prafulla B. Bafna and Jatinderkumar R. Saini

Abstract The statements of vision and mission act as a guide for creating objectives and goals in an organization and provide a path for individuals. As both of these statement documents are meant to be read by external stakeholders and are supposed to present an image of the organization, it is imperative that they are easily comprehended by the reader. Further, mission statements are required to be designed in line with the vision statements. The current paper presents the results of work on the readability metrics of the vision and mission statements of 25 well-known universities of India and makes recommendations on the formulation of vision and mission statements for emerging universities. Additionally, the readability metrics for the mission statements of more than 100 well-known corporate houses were calculated and analyzed. Further, the most commonly used lexicons for branding purposes were identified using Term Frequency–Inverse Document Frequency (TF-IDF) through text mining. This paper examines the correlation between the found lexicons and the revenues generated by the considered companies. Technically, Pearson's correlation coefficient and the Flesch Readability Index (FRI) are deployed for the calculation of the various metrics that form the basis of our conclusions. Keywords Flesch readability index · TF-IDF · Recommendation · Significant lexicons · Statistical measures

1 Introduction
The mission and vision statements are very important and can best be described as destination- and scope- or goal-defining statements of the organization. Clearly understood vision and mission statements depict inimitability from other organizations and support the planning of all future activities. Readability measures the degree of ease of comprehension of the reading text. The written text is the only way to communicate ideas or concepts to the readers.
P. B. Bafna (B) · J. R. Saini Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed) University, Pune, Maharashtra, India e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1174, https://doi.org/10.1007/978-981-15-5616-6_5


If the language used in the write-up is not easy, it stops people's engagement [1]. This study assesses the readability index, word count, and TF-IDF values of the vision and mission statements of different universities and recommends different parameters for newly established as well as existing universities, for example the ideal average number of words to be present in mission and vision statements. This will help new organizations decide the total number of words to be used in vision and mission statements. The average FRI of the goal statements of the universities is below the ideal value, which suggests rephrasing these statements. As a consequence of the low FRI, the study further applies the TF-IDF approach [2] to derive the strength and consistency of the words used in goal statements through text analysis, and it recommends words/lexicons to be used in framing goal statements for new universities. The dataset consisting of companies and their mission statements along with their revenues was downloaded (https://www.missionstatements.com). The readability index of the missions of all companies was found in order to decide the threshold value for text analysis. Pearson's correlation coefficient was calculated between the readability index of each company's mission and its revenue. The Flesch–Kincaid Reading Index (FRI) lies on a scale of 0–100 [3]; a greater score indicates that the text is more easily readable. Companies having readability measures of more than 60 were considered for the next analysis. The relation with revenue was identified for these selected companies to find the most popularly used words for branding to generate more revenue. The approach will be useful to new companies in deciding the words present in their mission statements or taglines, or even in updating a current tagline to include the suggested words. The rest of the paper is organized as follows. The literature review is presented in the next section, while the third section depicts the methodology. Results and discussions are presented in the fourth section, followed by concluding remarks in the last section of the paper.
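The paper does not reproduce the formula behind the FRI. Assuming it is the standard Flesch Reading Ease score, which is the usual readability index reported on a 0–100 scale, it is computed as

    FRE = 206.835 - 1.015 × (total words / total sentences) - 84.6 × (total syllables / total words)

so longer sentences and more polysyllabic words both lower the score, and higher values indicate easier text.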

2 Related Work
This section presents a detailed literature review on readability metrics. It also reviews the research literature on vision and mission statements, along with text analytics basics. A readability measure expresses the level of schooling or education a reader needs to read a text easily. When a piece of text is too complex to understand, its meaning may not be grasped; on the other side, too simplistic writing may induce a feeling of boredom. Ultimately, people's engagement is influenced by the readability of the text that intends to convey some message [4]. Readability is determined by the characteristics of the text that influence its understanding. Readability scores measure whether the content is likely to be understood by the intended reader.


These scores consider various linguistic parameters such as complexity elements, the parts of speech used, and so on. The readability concept originated in the 1920s; it helped in identifying the level of topics to be taught at different levels of education [5]. Readability for poor-literacy readers has been assessed through a process of text simplification, where the readability level is related to the literacy level for the given text and levels are identified as rudimentary, basic, or advanced. Using classification, regression, and ranking, text is simplified and set according to the level of the target users; the model can be further improved using other cognitive dimensions such as mental models [6]. The readability of web pages has been estimated by proposing a statistical model [7]. Linguistic attributes of language are used to develop a unigram language model, and the contents of the language are associated with different readability levels; the EM algorithm finds the optimal parameters in this combined model, unlike existing readability approaches, and the parameters are tuned for different domains and applications. Web pages retrieved by Google, based on the topic and queries input for the search, have also been assessed for readability, with the searched topic and the position of a webpage in the search results used as parameters; medical and academic web pages have lower readability metrics than games and construction websites, and the top pages produced by Google have good readability metrics [8]. Medical documents need to have higher readability because they are associated with people's lives. Readability indices such as Kincaid, Fog, Flesch, and SMOG, along with linguistic dimensions such as the average number of words, verbs, and simple sentences used, have been employed; a comparative analysis of readability metrics for different medical webpages and Simple Wikipedia pages has been carried out, and the four indices are positively correlated with each other [9]. Clinical trial consent forms have been assessed for their readability: three methods were used to assess readability using the mean (±SD) and compared across specialties. Medical literacy for readability is not up to standard, and administrators and clinicians need to work on drafting forms for future studies [10]. Awareness of stakeholders, staff, and students about the goals, mission, and vision of a management institute has been quantified using a questionnaire; based on the profile of the person, the acceptance level of vision, mission, goals, and objectives was derived [11]. The importance of vision and mission statements and their impact on organizational goals has been analyzed [12]. Appropriately written vision and mission statements impact different activities in the organization. To achieve the expected success in an organization, vision and mission statements should be easily understood, and statements updated over time will sustain the organization as unique. Vision and mission statements are considered an important part of the strategic management process for the organization. This applies to all types of organizations, such as public, private, profit or non-profit making, multinational, and small- and medium-scale enterprises. Well-prepared vision and mission statements distinguish one organization from another by showing the unique characteristics that differentiate it from others [13] and will result in regular checking of the levels of compliance, their adequacy, and their contribution to the performance of the organization with respect to its goals [14].
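The indices named above (Flesch, Flesch–Kincaid, Fog, SMOG) can all be computed programmatically. Purely as an illustration, and not as the tooling used in any of the cited studies, the third-party Python package textstat exposes them directly; the sample statement below is invented for demonstration.

    # Compute several readability indices for a made-up goal statement.
    # Requires the third-party package "textstat": pip install textstat
    import textstat

    statement = ("To be a globally recognized university that nurtures innovation, "
                 "research and inclusive education for the benefit of society.")

    print("Flesch Reading Ease :", textstat.flesch_reading_ease(statement))
    print("Flesch-Kincaid grade:", textstat.flesch_kincaid_grade(statement))
    print("Gunning Fog index   :", textstat.gunning_fog(statement))
    print("SMOG index          :", textstat.smog_index(statement))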


The mission and vision statements are very important; the destination and scope of the organization are expressed in terms of the mission and vision statements. Clearly understood vision and mission statements depict inimitability from other organizations and support the planning of all future activities; unawareness of them makes for a directionless journey. Well-understood vision and mission statements inculcate a sense of ownership, and employees become more productive, resulting in organizational growth [15–17]. A vision and a mission statement act as a guide for creating objectives and goals in the organization and provide a path for individuals. Some organizations do not have vision and mission statements, or the two are mixed together and formulated as one, which produces confusion when setting the objectives of the organization. The vision statement is about future planning; in other words, it expresses the long-term objectives and conveys the drive of the institute to stakeholders, employees, and others. The current state of an organization is expressed in the mission statement, which depicts the tasks the organization performs and the ways it carries them out. Unlike the vision statement, it is short term in nature; the primary and short-term goals based on the vision statement are mentioned in the mission statement. Both statements inspire employees to achieve the targets of the organization. Vision and mission statements are inputs for setting priorities and allocating optimum resources in the organization, and they inspire employees to work toward common goals. Universities need to follow guidelines to design goal statements, and the words used in goal statements can be suggested through text analysis [18–20]. Preprocessing is the first step toward text analysis; it removes words that are not useful. There are different methods to preprocess the text, e.g., stemming and so on. After preprocessing, bag-of-words and TF-IDF techniques are applied to measure the weights of the extracted tokens [21]. To the best of our knowledge and literature review, the present paper is the first of its kind to deliberate upon the interplay of readability metrics on one side and the goal statements of universities or the brand statements of corporate houses on the other side. To summarize, this research work is unique because of the following:

I. Identification of the relation between the generated revenue and the readability index of brand statements of industries is carried out.
II. Text analysis is performed to suggest words to be used in the tagline to increase the branding of the product.
III. Analysis of the mission and vision statements of reputed universities is carried out to recommend various parameters.

3 Research Methodology

Figure 1 depicts the methodology of the research work. The vision and mission statements of 25 well-reputed universities were downloaded from their respective


Fig. 1 Diagrammed representation of research methodology

websites. The FRI and the average word count for each vision and mission statement were calculated. The FRI, or Flesch index, is then used to draw conclusions about the readability of the vision and mission statements. Range-wise FRI measures for all vision and mission statements are computed to get an overall idea about the readability of universities' goal statements, while the word count indicates the typical number of words to be used in vision and mission statements. Further, TF-IDF [22] measures of all terms present in the vision and mission statements are calculated to recommend the lexicons used for the branding of universities. In addition to this, a dataset consisting of 101 well-reputed companies and their mission statements along with their revenues was downloaded. The readability index of the mission statements of all companies is computed; a greater score indicates that the text is more easily readable. Pearson's correlation coefficient was calculated between the readability index of the mission of each company and its revenue. The Flesch–Kincaid reading index lies on a scale of 0–100 [23, 24]. Companies having readability measures of more than 60 were considered for the next analysis. The TF-IDF measure was applied to the missions of these selected companies to find out the most popularly used words for branding to generate more revenue. The approach will be useful to new companies to decide the words present in their mission statements or taglines, or even to update the current tagline to include the suggested words. Table 1 depicts the algorithm used to carry out the research study. The dataset used in Step 1 is synthesized using university vision and mission statements. The R packages quanteda and tm are used to calculate the FRI and word count, and to apply the preprocessing steps along with the TF-IDF measures. TF-IDF threshold values are used to recommend frequent lexicons. Table 2 states the word count of each vision and mission; further, the total word count of visions and missions is calculated. The applied statistical measures will be helpful to newly starting universities to decide the total number of words to be used while formulating vision and mission statements.

Table 1 Algorithm for lexicon recommendation
1. Collect the data
2. Apply FRI and word count, preprocessing, and TF-IDF on dataset 1
3. Interpret measures
4. Recommend lexicons and other statistical parameters on goal statements
5. Apply preprocessing, correlation coefficient, FRI, and TF-IDF on dataset 2
6. Recommend lexicons and other statistical parameters for the branding of universities and industries
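For illustration, the following is a minimal Python sketch of the two core measurements used above: the Flesch Reading Ease score (FRI) and the word count of a goal statement. The paper itself uses the R packages quanteda and tm; this Python analogue, including the rough syllable counter and the sample statement, is an assumption made only for illustration and is not the authors' implementation.

import re

def count_syllables(word):
    # Very rough heuristic syllable counter (an assumption for illustration only):
    # count maximal groups of vowels in the lowercased word.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    # Standard Flesch Reading Ease formula:
    # 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_sent, n_words = max(1, len(sentences)), max(1, len(words))
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (syllables / n_words)

def word_count(text):
    return len(re.findall(r"[A-Za-z]+", text))

# Example: score one hypothetical vision statement.
vision = "To be a globally recognized university fostering quality research and inclusive education."
print(word_count(vision), round(flesch_reading_ease(vision), 2))

Note that the formula is not clamped to the 0–100 range, so long sentences built from polysyllabic words can yield negative scores, which is exactly the behaviour seen for several of the goal statements reported later in Table 3.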

Table 2 Word count of vision and mission statements

Sr. No | Word count of visions | Word count of missions
1 | 64 | 35
2 | 25 | 79
3 | 25 | 62
4 | 71 | 72
5 | 35 | 78
… | … | …
24 | 22 | 29
25 | 13 | 33

4 Results and Discussions

Table 3 depicts the Flesch Index for each vision and mission statement. Table 4 shows the count of universities based on the range of FRI of mission and vision statements, from −30 to 90 in steps of 30. It is clear that only 2 universities have a readability measure greater than 60, which leads to the conclusion that most of the universities have poor readability for their mission and vision statements. The maximum, minimum, and average values of the FRIs of visions and missions of universities are shown in Table 5. The average readability index for missions is a negative value, and for visions it is a positive but very small value, that is, 0.2.

Table 3 FRI for mission and vision statements

University Id | FRI for mission | FRI for vision
1 | 0.08 | −31.9186
2 | 38.4933537 | −4.66
3 | 6.2915455 | −38.5
4 | −22.085 | −22.085
5 | 10.8938953 | 23.86429
… | … | …
24 | 16.9517241 | 3.768636
25 | 6.7036364 | 24.44

Table 4 Range-wise count of goal statements of universities

Sr. No | Range of FRI | Number of mission statements of universities | Number of vision statements of universities
1 | −30–0 | 8 | 7
2 | 0–30 | 12 | 14
3 | 30–60 | 3 | 3
4 | 60–90 | 2 | 2

Table 5 FRI measures of goal statements

Sr. No | Measures | FRI for mission | FRI for vision
1 | Max | 65 | 70
2 | Min | −61.5 | −38.5
3 | Average | −0.04 | 0.2

These goal statements need to be rephrased. The same statistical parameters are applied to the word counts of missions and visions and are stated in Table 6. This will be helpful to newly starting universities to decide the total number of words to be used while formulating vision and mission statements. To confirm the consistency of the words used to construct goal statements, the TF-IDF measure is applied to the entire dataset of university goal statements and significant terms are recognized. Terms with a weight greater than a 50% threshold are termed significant. Tables 7 and 8 present the TF-IDF measures of the significant words for universities' mission and vision statements, respectively. A greater TF-IDF measure (e.g., 0.9) not only depicts an individual word's significance but also the higher strength of that significance across the entire dataset. It means that universities have used consistently appropriate words for their brand statements, but sentence reframing is needed to increase the FRI and, in turn, the readability. The words mentioned in the tables can be recommended for use by newly establishing universities to formulate their goal statements. The readability index and revenue have a positive correlation, so readability may be one of the factors affecting revenue. Companies having a readability index greater than 60 were considered, and frequent terms were extracted using the TF-IDF value. Table 9 shows examples of brand statements of corporate houses with their FRI. To recognize the most significant words used by a company, the preprocessing of brand statements is carried out and the TF-IDF measure is computed. The terms having more than 50% weightage of the maximum TF-IDF value are considered frequent terms. These terms are recommended to be used for branding purposes.

Table 6 Statistical measures of word count for vision and mission statements of universities

Sr. No | Measures | Word count mission | Word count vision
1 | Max | 72 | 209
2 | Min | 9 | 29
3 | Average | 48 | 72

Table 7 TF-IDF measures for significant tokens of mission statements

Sr. No | Term | TF-IDF measure
1 | Quality | 0.91
2 | Dynamic | 0.87
3 | Flexible | 0.66








Table 8 TF-IDF measures for significant tokens of vision statements

Sr. No | Term | TF-IDF measure
1 | Research | 0.92
2 | Faculty | 0.81
3 | Academic | 0.72







Table 9 Brand statements and their FRI

Sr. No | Brand/mission statement | FRI
1 | “We help people pursue more from life” | 84.9
2 | “Exxon Mobil Corporation is committed to being the world’s premier petroleum and petrochemical company. To that end, we must continuously achieve superior financial and operating results while simultaneously adhering to high ethical standards” | −2.18
3 | “Apple designs Macs, the best personal computers in the world, along with OS X, iLife, iWork and professional software. Apple leads the digital music revolution with its iPods and iTunes online store. Apple has reinvented the mobile phone with its revolutionary iPhone and App store, and is defining the future of mobile media and computing devices with iPad” | 34.05
… | … | …

Table 10 Terms and TF-IDF weights for brand statements

Sr. No | Term | TF-IDF measure
1 | Support | 0.67
2 | Help | 0.44
3 | Serve | 0.43







For example, ‘Support’, ‘help’, ‘serve’, ‘people’, ‘trust’, ‘love’, and so on are the most frequent lexicons used in the mission statements of the selected companies. The company having the highest readability and revenue has its mission statement as “We help people pursue more from life”. Table 9 lists the brand statements along with their FRI, while Table 10 depicts sample terms extracted from the mentioned mission/brand statements, i.e., terms recommended for branding, along with their TF-IDF weights.
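To make the dataset-2 analysis concrete, the short Python sketch below illustrates the two computations described above: Pearson's correlation between the FRI of company mission statements and revenue, and TF-IDF-based selection of frequent terms whose weight exceeds 50% of the maximum. The tiny made-up data, the use of the textstat package as a stand-in for the readability computation, and scikit-learn's TfidfVectorizer are illustrative assumptions, not the authors' exact pipeline.

import textstat
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical (mission statement, revenue) pairs standing in for the 101-company dataset.
companies = [
    ("We help people pursue more from life", 290.0),
    ("We serve and support people with products they trust and love", 210.5),
    ("Delivering superior petrochemical and operating results while adhering to high ethical standards", 85.3),
]
fri = [textstat.flesch_reading_ease(text) for text, _ in companies]
revenue = [rev for _, rev in companies]

# Pearson's correlation between readability and revenue.
r, _ = pearsonr(fri, revenue)
print("Pearson r:", round(r, 3))

# Keep only companies whose mission FRI exceeds 60, then weight terms with TF-IDF.
readable = [text for (text, _), score in zip(companies, fri) if score > 60]
vec = TfidfVectorizer(stop_words="english")
weights = vec.fit_transform(readable)
max_w = weights.max()
# Terms carrying more than 50% of the maximum TF-IDF weight are treated as frequent.
frequent = sorted(term for term, col in vec.vocabulary_.items()
                  if weights[:, col].max() > 0.5 * max_w)
print(frequent)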

5 Conclusions

Recommendations about the vision and mission statements of well-known universities of India are presented using readability metrics and the word count of the vision and mission statements. It was observed that the average FRI for goal statements is very low; however,


the high value of TF-IDF proves that the tokens used in framing the goal statements are appropriate; only sentence reframing of the goal statements is required. The significant tokens and average word count extracted from the goal statements will act as a guideline for newly forming universities. Additionally, the most commonly used lexicons were suggested for branding purposes using the TF-IDF measure. Threshold values of FRI and revenue were used to select the top corporate houses from among more than 100. To the best of our knowledge, this is the first work to consider the correlation between the lexicons and the revenues generated by the selected companies. Technically, Pearson's correlation coefficient and the Flesch Readability Index (FRI) are deployed for the calculation of the various metrics that form the basis of our conclusions.

References 1. De Guzman, M. J. J., Estira, K. L. A., Arquillano, N. E., & Ventayen, R. J. M. (2018). Acceptability and awareness of vision and mission of the university, institutional objective and program objective of BS business administration. Asian Journal of Business and Technology Studies, 1(1). 2. Gurley, D. K., Peters, G. B., Collins, L., & Fifolt, M. (2015). Mission, vision, values, and goals: An exploration of key organizational statements and daily practice in schools. Journal of Educational Change, 16(2), 217–242. 3. Raulji, J. K., & Saini, J. R. (2019). Sanskrit lemmatizer for improvisation of morphological analyzer. Journal of Statistics and Management Systems, 22(4), 613–625. 4. Saini, J. R. (2014). Estimation of comprehension ease of policy guides of matrimonial websites using gunning fog, Coleman-Liau and automated readability indices. IUP Journal of Information Technology, 10(4), 19. 5. Antunes, H., & Lopes, C. T. (2019, June). Readability of web content. In 2019 14th Iberian Conference on Information Systems and Technologies (CISTI) (pp. 1–4). IEEE. 6. Bafna, P. B., Shirwaikar, S., & Pramod, D. (2016). Multi-step iterative algorithm for feature selection on dynamic documents. International Journal of Information Retrieval Research (IJIRR), 6(2), 24–40. 7. Bafna, P., Pramod, D., & Vaidya, A. (2016, March). Document clustering: TF-IDF approach. In 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) (pp. 61–66). IEEE. 8. Bafna, P., Shirwaikar, S., & Pramod, D. (2016, October). Semantic clustering driven approaches to recommender systems. In Proceedings of the 9th Annual ACM India Conference (pp. 1–9). ACM. 9. Saini, J. R. (2014). Web text mining through readability metrics for evaluation of understandability of policy guides of matrimonial websites. Submitted for Publication. 10. Si, L., & Callan, J. (2001, October). A statistical model for scientific readability. In CIKM (Vol. 1, pp. 574–576). 11. Sobolewski, J., Bryan, J. N., Duval, D., O’Kell, A., Tate, D. J., Webb, T., et al. (2019). Readability of consent forms in veterinary clinical research. Journal of Veterinary Internal Medicine, 33(2), 350–355. 12. Štajner, S., Evans, R., Orasan, C., & Mitkov, R. (2012). What can readability measures really tell us about text complexity. In Proceedings of Workshop on Natural Language Processing for Improving Textual Accessibility (pp. 14–22). 13. Taiwo, A. A., Lawal, F. A., & Agwu, E. (2016). Vision and Mission in organization: Myth or heuristic device? The International Journal of Business & Management, 4(3).


14. Tamariz, L., Gajardo, M., Still, C. H., Gren, L. H., Clark, E., Walsh, S., & SPRINT Research Group. (2019). The impact of central IRB’s on informed consent readability and trial adherence in SPRINT. Contemporary clinical trials communications, 15, 100407 … For Flesch index. 15. Teixeira Lopes, C., & Ribeiro, C. (2019, March). Interplay of documents’ readability, comprehension and consumer health search performance across query terminology. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval (pp. 193–201). CM. 16. Alshameri, F., Greene, G. R., & Srivastava, M. (2012). Categorizing top fortune company mission and vision statements via text mining. International Journal of Management & Information Systems, 16(3), 227. 17. Aluisio, S., Specia, L., Gasperin, C., & Scarton, C. (2010, June). Readability assessment for text simplification. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 1–9). Association for Computational Linguistics. 18. Bafna, P., Pramod, D., & Vaidya, A. (2017, August). Precision based recommender system using ontology. In 2017 International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS) (pp. 3153–3160). IEEE. 19. Bafna, P., Shirwaikar, S., & Pramod, D. (2019). Task recommender system using semantic clustering to identify the right personnel. VINE Journal of Information and Knowledge Management Systems, 49(2), 181–199. 20. Bafna, P., Pramod, D., Shrwaikar, S., & Hassan, A. (2019). Semantic key phrase-based model for document management. Benchmarking: An International Journal. 21. Baser, P. N., Saini, J. R. (2015). Agent based stock clustering for efficient portfolio management. International Journal of Computer Application (IJCA), 116(3), 35–41, Digital Library ISSN: 0975–8887; ISBN: 973-93-80886-14-5; Foundation of Computer Science, USA. 22. Baser, P. N., Saini J. R. (2013). An intelligent agent based framework for an efficient portfolio management using stock clustering. International Journal of Information & Computation Technology, 3(2), 49–54. ISSN: 0974-2239; International Research Publications House, New Delhi, India. 23. Baser P. N., Saini J. R. (2013). A comparative analysis of various clustering techniques used for very large datasets. International Journal of Computer Science and Communication Networks, 3(5), 271–275. ISSN: 2249-5789; Technopark Publications, Vadapalani, Chennai, India. 24. Baser, P. N., & Saini, J. R. (2014). An optimum cluster size identification for k-means using validity index for stock market data. International Journal of Data Mining and Emerging Technologies, 4(2), 107–110. ISSN: 2249-3212 (eISSN: 2249-3220), Indian Journals, New Delhi, India.

An Efficient Recommendation System on E-Learning Platform by Query Lattice Optimization Subhadeep Ghosh, Santanu Roy, and Soumya Sen

Abstract This research work is on optimizing the number of query parameters required for product recommendation on an e-learning platform. This paper proposes a new methodology for efficient implementation by forming a lattice on the query parameters. This lattice structure helps to correlate the different query parameters, which in turn form association rules among them. The proposed methodology is conceptualized on an e-learning platform with the objective of formulating an effective recommendation system to determine associations between the various products offered by the e-learning platform by analyzing a minimum set of query parameters. Keywords Recommendation system · E-Learning · Lattice of queries · Association rule mining

S. Ghosh
Tata Consultancy Services Ltd., Kolkata, West Bengal, India
e-mail: [email protected]

S. Roy
Department of Computer Applications, Future Institute of Engineering and Management, Kolkata, West Bengal, India
e-mail: [email protected]

S. Sen (B)
A. K. Choudhury School of Information Technology, University of Calcutta, Kolkata, India
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2021
N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1174, https://doi.org/10.1007/978-981-15-5616-6_6

1 Introduction

A recommendation system is inevitable in any e-commerce platform. It refers to a system of recommending products to the customers. A recommender system is a filtering system that suggests a product based on the users' preferences. Nowadays, recommender systems are applicable to every possible selling product and are used in a variety of business domains such as hotels, electronic goods, household products, news, music, movies, any service-based product, etc. [1]. In this paper, we will consider the E-Learning Platform or Online Learning Hub. Learners also refer to


such learning platforms simply as online tutorials. The objective of this work is to propose a data mining methodology that can effectively recommend products in relation to a specific product in an e-learning platform. The following section explores a few required concepts and terminologies.

2 Concepts and Terminologies Required

2.1 Lattice

A lattice is a Partially Ordered Set, or POSET, in which each pair of elements has a unique Least Upper Bound (LUB) and Greatest Lower Bound (GLB). A lattice is very powerful in modeling many real-life problems; it has the capability to represent all possible combinations of associated parameters or data items. A lattice structure comprising 3 parameters is depicted in Fig. 1.
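As a toy illustration of this idea, the following Python sketch (an assumption added for illustration, not part of the paper) enumerates all non-empty combinations of three parameters, i.e., the nodes of the lattice in Fig. 1 ordered by subset inclusion.

from itertools import combinations

params = ["Q1", "Q2", "Q3"]

# Enumerate every non-empty subset of the parameters, largest first,
# mirroring the levels of the 3-parameter lattice in Fig. 1.
for size in range(len(params), 0, -1):
    level = [set(c) for c in combinations(params, size)]
    print(f"subsets of size {size}:", level)

# In the subset lattice, the LUB of two subsets is their union and the
# GLB is their intersection, which is what makes the structure a lattice.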

2.2 Lattice of Queries (Query Lattice)

In this work, we will focus on query results based on a set of query-predicate values. We then perform a reverse calculation, or analysis, and obtain all those courses that are also results of queries involving the same query-predicate value set. The list of such products is inferred as related courses that can be recommended along with our target course. All possible combinations of the < Query Parameter = Predicate Value > pairs generate a structure similar to that of a lattice.

Fig. 1. A lattice with three parameters








2.3 Web Analytics and Web Page Visit Logs

Web analytics is the measurement, collection, analysis, and reporting of web data to understand and optimize web usage [2]. Earlier it was used for measuring web traffic, but now it is used in many ways to carry out surveys and analyses for business and market research. This not only helps the better usage of the website but also contributes to analyzing the relevant products, which leads to business intelligence. Web analytics gives data such as the number of visitors to a website, the number of page views, the time spent viewing a web page, redirections from that web page, etc. This helps to identify popularity, and therefore ranking can be done [3]. A weblog keeps track of each visitor to a website; the type of information viewed (an image, a URL, or an HTML page) and redirections are recorded in the weblog. Thereafter, log analyzer tools analyze these records to derive useful information as requested by the users [1].

3 Related Work Study

Several works have been implemented to analyze e-commerce business, and several of these works have also addressed the e-learning platform in particular. E-learning systems have gained huge popularity in the past few years, and especially the use of web-based education systems has increased exponentially in recent times. A number of research works have been done to investigate different data mining methodologies to improve e-learning platforms. These data mining methods help to infer new knowledge based on the students' usage data. The main purpose of these research works has been to show the current state-of-the-art e-learning systems, which are based on data mining methodologies that keep track of the interactions between the relevant areas, as well as different real-life case studies of e-learning systems [4]. Several papers have presented how web mining techniques are applicable to improving the services provided by e-commerce-based enterprises [5]. A study was commissioned by eBay [6], one of the leading e-commerce platforms in the world, and was conducted as in-house research. This study shows how data mining played a significant role in creating a great experience at eBay and on other e-commerce platforms, as data mining is a systematic way of extracting information from data fed by the consumers themselves. Techniques like pattern mining, trend discovery, and prediction can hence be easily applied. Different data mining techniques have been applied for data analytics on e-learning. A decision tree [7] has been applied for the predictive analysis of students' performance in a big data environment. In [8], machine learning based data analytics has been proposed to extract the required information and to find valuable patterns from the collected data. As visualization is becoming a very important aspect of data mining, it is also applied in e-learning. In [9], the concept of visual


analytics is introduced in terms of a Group Analysis Component (GAC) and a Case Analysis Component (CAC). Visual analytics helps the users by providing a practical education model with a clear insight into the entire process. The concept of visual analysis has been extended further in [10] to discover patterns for mobile learning in a big data environment. As users search frequently for their preferences in the system, a large amount of data log or history is generated in the system; this helps in data mining for the e-learning platform. Process modeling, behavior analytics, and group performance assessment of e-learning logs have been done using the fuzzy miner algorithm [11]. In order to do these different types of data analytics on an e-learning platform, faster access to data and query optimizations are desirable. An abstract algebraic model called the lattice has been used in many data-based applications by optimizing the lattice structure. In data warehouses, optimization of the lattice of cuboids has been done in [12, 13]. A heuristic search algorithm [14] has also been applied for quick search of a query path. Optimization of the lattice structure is also very useful in the apriori algorithm, which is a very popular approach in data mining; it has been used in many applications of market basket analysis [15–17] and share market analysis [18]. In this research work, the concept of lattice optimization is used for the parameter optimization of queries. A query lattice is formed for this purpose, and a methodology is proposed to identify the minimum number of query parameters needed to recommend suitable products to the users of an e-learning platform.

4 Case Study for Data Mining on E-Learning

Our work is based on a case study of a live and operational e-learning platform. The features of the platform are discussed and then the constraints are identified; our proposed methodology will address these limitations. The e-learning platform is referred to as I-Digital Learning Hub, or IDLH as an acronym, in this paper. IDLH is conceptualized as a collaborative learning marketplace that integrates online courses, assessments, communities, and events to ensure superior learning outcomes. It supports learners across various segments with different types of courses.

4.1 Typical Workflow of Case Study E-Learning Platform

The typical workflow of IDLH is depicted below:
• User visits the IDLH home page.
• User searches for the product, i.e., the course that they want to enroll in or are interested in.
• The searched product or course's stamp is retrieved from the database and displayed on the Catalog page.


• The IDLH platform offers different ways to find the preferred course for a learner:
  – Search box in the index.
  – Filters present on the Catalog page.
• The user, after finding the course of their choice, can simply visit the Product Microsite page that contains a detailed description of the course.
• The user can subscribe to the product, that is, buy the product online.
• After buying the product, the course is subscribed to the user. The user can then launch the course from the dashboard.
• After completion of the course, that is, fulfilling the requirements of course completion, the user can print the certificate that certifies successful completion of the course by the user.

4.2 Problems or Constraints Faced by Case Study E-Learning Platform

(a) Related Courses Recommendation is a Fixed List—Not Dynamic.
(b) Measurement of ‘Popularity’ of the Products.
(c) Inability to Correlate Choices of Different Learners.

5 Problem Definition and Objective

An e-learning platform must have a robust system of product recommendation that is based on detailed analysis of its product purchases, web page visits, product searches, etc. This research work and the proposed methodologies are to serve as a proof of concept, which may translate into future investments in a more automated and process-driven manner for the IDLH platform being researched in the case study or any other e-learning platform with similar constraints as listed above. For effectiveness, a process has to be built where existing sales, subscriptions, product information, and publisher information are integrated and analyzed to identify potential cross-sell and up-sell recommendations. The objective of the proposed methodology is to enable an e-learning platform to do the following:
• Dynamically form a set of related products based on product purchases.
• Recommend products to the user based on historical analysis of wish lists, carts, and product purchases.
• Correlate customer profiles with product purchases and recommend products based on the profiles of the customers who have purchased products in the past.


6 Fundamentals of the Proposed Methodology

The methodology proposed in this work is based on the application of data mining on the query parameters of the various queries that can be executed on the e-learning platform's database. This paper proposes a data mining technique that will analyze the various queries and their respective predicate values that yield a particular product or course name as their result. The proposed methodology formulates an algorithm that puts forward a new technique for association mining between various products based on their purchase, web page hits, and search patterns. The methodology identifies a similar set of queries that yield a set of products and then tries to minimize the number of query parameters and predicate values to find an optimal query that correctly predicts a set of recommended products. In this proposed methodology, we will analyze the query and predicate value combinations from various sources and formulate item sets, or sets of courses, that are complementary to one another, that is, correlated or associated with one another. While analyzing these queries and their predicate values that give rise to a ‘similar’ kind of correlated products/courses, our continuous aim will be to minimize the number of query parameters needed to identify the correlation pattern between the courses. That is, the aim of the methodology is not only to identify similar products through the similarity of their queries, predicate values, and query results, but also to put forward a formal and efficient data mining technique by which this correlation between products can be identified with the minimum number of query parameters and predicate value sets. We are trying to design a recommendation system based on data mining. So, the basic objective is that if a user is interested in one product, say P1, then how do we recommend a set of other products, {P2, P3, …, Pn}, based on the user's choice of that product. So, for a particular course, say, for example, “Fundamentals of Java Programming”, the methodology will analyze all possible queries and their predicate values on various data sources of the e-learning platform that yield the course “Fundamentals of Java Programming” as their query result. For instance, as an elementary example, we can have a set of query parameters such as the following: [Product Type, Product Audience Category, Product Domain Category, Product Specialization Category, Publisher Name]. If we apply suitable predicate values for each of these query parameters, the query result yields the course named ‘Fundamentals of Java Programming’. Now, the same < Query Parameter = Predicate Value > set can also yield other courses as its query outcome, e.g., ‘Python for Beginners’ and ‘C# Programming Crash Course’. So, we can say from our analysis of the query results that the two courses, ‘Python …’ and ‘C# …’, are actually related courses of the ‘… Java …’ course, and we can recommend these two as related courses of ‘Java’.


Fig. 2. Lattice of queries formed with four query parameters

In Fig. 2, we represent such a query lattice that initially starts with 4 query parameters. The notation Q represents a single (query parameter = predicate value) pair. For example, we write a query as below:

select product_name
from purchase_details
where product_type = 'course'
and audience_category = 'professional'
and domain_category = 'computer science'
and specialization_category = 'software development';

In this query, we have four query parameters: product_type, audience_category, domain_category, and specialization_category. Hence, Q1 is product_type = 'course', Q2 is audience_category = 'professional', Q3 is domain_category = 'computer science', and Q4 is specialization_category = 'software development'. Thus, < Q1, Q2, Q3, Q4 > represents one query with 4 parameters. Similarly, < Q1, Q2, Q3 > represents another query with 3 query parameters and their corresponding predicate values. The proposed methodology will construct a lattice of < Query Parameter = Predicate Value > sets at each level, consisting of 'n' query parameters, where the value of 'n' decreases as we drill down from one level to the next, until we find the optimal set of < Query Parameter = Predicate Value > pairs that is able to recommend a set of related items or products most accurately. Each of these queries will return a result set, or a set of products related to the target product. The objective of the proposed methodology is to optimize this lattice of queries and find the minimum number of query parameter and predicate value combinations with which the recommended


set of related products can be predicted for a particular product. This lattice structure is represented in Fig. 2. In this diagram, at Level-1, we have a single query that consists of all 4 query parameters and their corresponding predicates. At Level-2, we decrease the number of query parameters in each query to 3; hence we have 4 different queries formed out of combinations of the Level-1 query parameters taken 3 at a time. At Level-3, we have 6 query sets with 2 parameters in each set. At Level-4, we consider queries consisting of 1 parameter each. In the last level, we obtain an optimal set of query parameters and predicate values that combines the query parameters of the preceding levels and yields a result set of products that are related to each other.
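The level structure described above can be generated mechanically from the parameter set. The short Python sketch below (an illustrative assumption, not the authors' code) builds the lattice levels for the four < parameter = value > pairs of the running example; Level-1 holds the single 4-parameter query and each subsequent level drops one parameter per query.

from itertools import combinations

# Q1..Q4 of the running example as (parameter, predicate value) pairs.
query_params = {
    "Q1": ("product_type", "course"),
    "Q2": ("audience_category", "professional"),
    "Q3": ("domain_category", "computer science"),
    "Q4": ("specialization_category", "software development"),
}

def build_levels(params):
    # Level k contains all queries with (len(params) - k + 1) parameters.
    names = sorted(params)
    return [list(combinations(names, size)) for size in range(len(names), 0, -1)]

def to_sql(combo, params, table="purchase_details"):
    # Render one node of the lattice as a SQL query string.
    conds = " and ".join(f"{params[q][0]} = '{params[q][1]}'" for q in combo)
    return f"select product_name from {table} where {conds};"

levels = build_levels(query_params)
for i, level in enumerate(levels, start=1):
    print(f"Level-{i}: {len(level)} queries")   # 1, 4, 6 and 4 queries, as in Fig. 2
print(to_sql(levels[1][0], query_params))       # one of the 3-parameter queries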

6.1 Optimization Factors or Convergence Factors for Lattice

The recommendation system will work faster if the recommendation can be done based on a smaller number of query parameters. This means that if a recommendation is done based on one parameter, such as < Q1 >, instead of two parameters, such as < Q1, Q2 >, then the system will be faster. Hence, our objective will be to answer (recommend) from the smallest possible number of query parameters. For optimization of the above-illustrated lattice of queries, and for finding the minimum set of query parameters, i.e., the best optimal query on the data source that is sufficient to predict the products related to a given target product, our proposed methodology defines two important conditions that together help the lattice and the subsequent algorithm converge towards the objective of finding the correct set of related products. These conditions are given as follows.

First Condition: Threshold Value
This value will be chosen keeping in mind the business requirements of the e-learning platform. The condition that the algorithm spells out at the very beginning is that the related product set will only include those products whose result count is greater than the threshold value, for any number of query parameters in the queries of a particular iteration.

Second Condition: Higher Product Count in Result Set with Fewer Query Parameters
If the product count, i.e., the number of recommendations, in the result set of an iteration is more than that of the previous iteration, then the result set of the present iteration will be given higher preference, provided that the first condition is satisfied. For instance, suppose the superset of queries {Q1, Q2, Q3, Q4} returns 5 elements or preferences in the result set. If the subset {Q1, Q2, Q3} returns 4 products, then it will be rejected as a candidate for the optimal query set, and all its subsequent subsets will also not qualify as candidates for optimal query sets. But only if it returns 5 or more elements in the result set will it be considered a candidate for the optimal query set.


7 Proposed Algorithm

Step-1 Input P = number of distinct predicate value sets in the data source. Initiate the counter of outer iterations, I = 1, where 1 ≤ I ≤ P. Input Q = number of query parameters in the data source. Input T = threshold value percentage. N = number of query parameters at each step, where 1 ≤ N ≤ Q.
Step-2 Get the Ith set of distinct predicate value combinations with respect to each query parameter identified from the data source. Initiate the counter of inner iterations, L = 1, where 1 ≤ L ≤ (Q − 1). (L represents a level of the lattice.)
Step-3 Get the query sets, i.e., combinations of {(query parameter = predicate value)}, where each set consists of N query parameters.
Step-4 Run each query set on the data source and get their respective result sets. Add an element into the result set only if it has a row count ≥ T (first condition given above).
Step-5 For each query set in the level, consider A = number of elements in its result set and B = number of elements in the result set of its superset of queries.
Step-6 If A ≥ B, then consider the query set as a candidate for the optimal query set (according to the second condition described above).
Step-7 If A < B, then reject the query set and all its subsequent subsets for consideration as candidates for optimal query sets.
Step-8 Get the optimal query sets for the level L.
Step-9 L = L + 1 (increment the inner iteration counter by 1). N = N − 1 (decrement the number of query parameters by 1; this will be the length of the subsets for the next iteration).
Step-10 Perform Step-3 to Step-9 until L = Q − 1, or until none of the query sets and result sets satisfy the first and second conditions defined above.
Step-11 Set I = I + 1 (increment the outer iteration counter by 1). Perform Step-2 to Step-10 until I = P.
Step-12 Exit from the outer iterations when there are no more distinct combinations of predicate values left in the data source.

At the end, the algorithm takes the sets of query parameters that have satisfied the optimality conditions as the optimal query sets; i.e., these sets contain the minimum number of query parameters by which related products can be predicted for a given product.
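A compact Python sketch of this pruning logic is given below. It is an illustrative rendering under simplifying assumptions: the data source is a list of record dicts, run_query counts matching rows per product, and the two convergence conditions of Sect. 6.1 are applied per superset query; it is not the authors' implementation.

from itertools import combinations

def run_query(records, query, recommend_field="product_name"):
    # Count, per product, the rows matching every (parameter = value) pair of the query.
    counts = {}
    for row in records:
        if all(row.get(p) == v for p, v in query.items()):
            counts[row[recommend_field]] = counts.get(row[recommend_field], 0) + 1
    return counts

def optimal_query_sets(records, full_query, threshold_pct):
    # First condition: a product enters a result set only if its row count
    # reaches threshold_pct percent of all records.
    min_rows = threshold_pct / 100.0 * len(records)

    def result_set(query):
        return {name for name, c in run_query(records, query).items() if c >= min_rows}

    current = [dict(full_query)]            # Level-1: the single full query
    while len(current[0]) > 1:
        parent_size = {frozenset(q): len(result_set(q)) for q in current}
        survivors = {}
        for query in current:
            for combo in combinations(sorted(query), len(query) - 1):
                sub = {k: query[k] for k in combo}
                # Second condition: a subset survives only if it returns at
                # least as many qualifying products as its superset query.
                if len(result_set(sub)) >= parent_size[frozenset(query)]:
                    survivors[frozenset(sub)] = sub
        if not survivors:
            break
        current = list(survivors.values())  # smaller queries that still qualify
    return current

# Hypothetical usage on a toy purchase_details table:
records = []
for name in ["Python for Beginners"] * 3 + ["Fundamentals of Java Programming"] * 3:
    records.append({"product_type": "course", "user_qualification": "B. Tech.",
                    "qualification_subject": "Computer Science", "product_name": name})
full_query = {"product_type": "course", "user_qualification": "B. Tech.",
              "qualification_subject": "Computer Science"}
print(optimal_query_sets(records, full_query, threshold_pct=20))

On this toy data, every shorter query still returns both products above the threshold, so the sketch drills all the way down to single-parameter queries, mirroring how the lattice converges toward the minimum parameter sets.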

8 Explanation of the Algorithm with Example

Consider a data source with N query parameters consisting of the N attributes of the data source. For simplicity, let us consider a data source with 4 such attributes that can be used as query parameters. The product name (that is, the recommendation


parameter) attribute will not be considered as a query parameter; instead, it will be considered as a result parameter. Assume the threshold value to be T%; that is, each element of the query result must have a count greater than or equal to T% of the total number of records. Going back to the above lattice example query, assume that the following query returns more than 20 results consisting of our target product.

SELECT PRODUCT_NAME FROM PURCHASE_DETAILS WHERE
PRODUCT_TYPE = 'Course' → Q1
AND USER_QUALIFICATION = 'B. Tech.' → Q2
AND QUALIFICATION_SUBJECT = 'Computer Science' → Q3
AND PROFESSIONAL_STATUS = 'Employed'; → Q4

Figure 3 shows the lattice of queries obtained for the above set of query parameters along with how the algorithm is applied to them. In Iteration-1, for the set < Q1, Q2, Q3, Q4 >, assume that 2 other products are returned, each with a result count of more than or equal to 20. Say these two products are ‘A’ and ‘B’. In Iteration-2, we reduce the number of query parameters per query to 3 and get 4 possible combinations out of Q1, Q2, Q3, and Q4. For these combinations, assume the following results, where A, B, C are other product names in the data source:

< Q1, Q2, Q3 > → {A, B, C}
< Q1, Q2, Q4 > → {A, B}
< Q1, Q3, Q4 > → {A}
< Q2, Q3, Q4 > → {A, B, C}

For the next iteration, the algorithm will reject the queries < Q1, Q2, Q4 > and < Q1, Q3, Q4 > because the second condition proposed in Sect. 6.1 above is not satisfied, i.e., the result sets returned by these two queries are {A, B} and {A}, respectively, whereas the other two queries return 3 products in their result sets. For the next iteration, the algorithm will take combinations of two query parameters and predicate values in each query, and so on. This working of the algorithm is shown schematically in Fig. 3, where we depict how the algorithm finds the optimal path for one set of predicate values, i.e., we take one outer iteration. Similarly, the lattice will be formed for the other outer iterations as well. In Iteration-3 (i.e., the third level) of Fig. 3, each query has 2 parameters. The rejected query paths are shown as dotted arrows, whereas the accepted paths are shown as solid arrows. In the query result sets, the product names which have a count less than the threshold value are shown in smaller-sized letters, e.g., < Q1, Q3 > returns {A, B, C, d}, where ‘D’ is the result element or the product name that


Fig. 3. Finding optimal queries from query lattice

appears in fewer than the threshold number of rows when results are returned by the query comprised of < Q1, Q3 >. Similarly, for the query < Q2, Q4 >, the result set returned is {A, B, C, d, e}; here also the row counts of ‘D’ and ‘E’ are less than the threshold value in the rows returned by the query comprised of < Q2, Q4 >. Hence, these two result sets are completely rejected by the algorithm since they do not satisfy the first condition proposed in Sect. 6.1. In Iteration-4 (i.e., the fourth level in Fig. 3), each query now has 1 parameter. But a smaller number of query parameters means a higher number of results, since the two are inversely proportional. At this level, all the query results are rejected since none of them satisfy the first condition of convergence. Hence, at the end of this level, there is no need to perform any more iterations with the initial set of predicate values. Thus, we derive two optimal query parameter sets at the end, < Q1, Q2 > and < Q2, Q3 >. To find the set of related products for the target product, we need to at least retrieve those records from the data source that satisfy the query with


comprising either Product_Type and User_Qualification or User_Qualification and Qualification_Subject. This means that, in order to find the related products for the target product from this data source, it is enough to run data queries with the query-predicate sets < Q1, Q2 > or < Q2, Q3 >. So, to find the related products of, say, ‘Python’, we need to find the product names that have the same Product_Type and User_Qualification, or those that have the same User_Qualification and Qualification_Subject. The most vital step of this algorithm is where it accepts a result set with a higher count than the previous iteration, with a smaller number of < Query Parameter = Predicate Value > pairs than the previous iteration, while keeping the count of each element in the result set greater than or equal to the threshold value. That is, a subset of query parameters must return the same or a higher number of query results than its superset of query parameters. This is how the algorithm converges toward a minimum set of query parameters to identify related products for the target product. This means that, throughout the database, the queries and predicate values will be analyzed from various data sources, but while predicting or determining the correlation or association between two or more products, the algorithm will try to narrow down its queries and specifically identify those optimized {Query Parameter = Predicate Value} sets that will be enough to correctly correlate or associate one course of the e-learning platform with another course available within the same platform.

9 Conclusion and Future Work

If we consider the constraints and shortcomings currently faced by the platform, as elaborated in Sect. 4.2, we can easily see that the entire situation can be improved by a comprehensive analytical solution and the use of a formal data mining based recommendation system. The benefits of the proposed methodology, identified after a close analysis of the e-learning platform in our case study, can be summarized as below.
• Enable the platform to recommend products to customers based on their past purchase history.
• Ability of the platform to recommend products to customers based on their user or customer registration profile.
• Correlate product detail parameters with customer profile parameters for predicting product recommendations.
• Since it is a query-based approach to finding product associations, recommended product sets can be formed in real time and displayed to the users.
The algorithm can be improved to cover multiple data sources. Currently, the algorithm can run on only 1 data source and recommend products based on it. In the


future, the methodology should have the ability to run on 2 or more data sources and find related products. As a natural extension of this work, we can propose a methodology for Search Engine Optimization (SEO) of the Catalog page of the e-learning platform by focusing on web page visit logs and relating the searches of various products with one another. User profiles can be correlated to web page visits and searches. This can be ideal groundwork for devising an algorithm for a Web Page Recommender System, which is currently a hot topic in the realm of web data mining technologies.

References 1. Ricci, F., & Rokach, L., Shapira, B. (2011). Recommender systems handbook. Springer, Berlin. 2. Jansen, B. J. (2009). Understanding user-web interactions via web analytics. In Synthesis Lectures on Information Concepts, Retrieval, and Services. 3. G. Zheng, S. Peltsverger. (2015) “Web Analytics Overview”, in book, “Encyclopedia of Information Science and Technology”, 3rd Edition, IGI Global, 2015. 4. Morales, C. R., Ventura, S. (2005). “Data Mining in E-Learning” WIT Transactions on Stateof-the-art in Science and Engineering Book Series 4, Transaction Vol. 4. 5. Raghavan, S. (2005). Data mining in e-commerce: A survey. N.R. Sadhana, 30(2–3). 6. Hu. J. (2010). Data mining and e-commerce, study conducted for eBay. 7. Vyas, M. S., & Gulwani, R. (2017). Predictive analytics for E learning system. In International Conference on Inventive Systems and Control (ICISC). 8. Moubayed, A., Injadat, M., Nassif, A. B., Lutfiyya, H., & Shami, A. (2018). E-learning: Challenges and research opportunities using machine learning & data analytics, (in English). IEEE Access, 6, 39117–39138. 9. Li, X., Zhang, X., Fu, W., & Liu, X. (2015). E-Learning with visual analytics. In IEEE Conference on e-Learning, e-Management and e-Services (IC3e). 10. Zhou, D., Li, H., Liu, S., Song, B., & Hu T. X. (2017). A map-based visual analysis method for patterns discovery of mobile learning in education with big data. In IEEE International Conference on Big Data. 11. Premchaiswadi, W., Porouhan, P., & Premchaiswadi, N. (2018). Process modeling, behavior analytics and group performance assessment of e-learning logs via fuzzy miner algorithm. In 42nd Annual Computer Software and Applications Conference (COMPSAC). 12. Sen, S., Chaki, N., & Cortessi, A. (2009). Optimal space and time complexity analysis on the lattice of cuboids using galois connections for data warehousing. In. 4th International Conference on Computer Sciences and Convergence Information Technology (ICCIT). 13. Sen, S., Roy, S., Sarkar, A., Chaki, N., & Debnath, N. C. (2014). Dynamic discovery of query path on the lattice of cuboids using hierarchical data granularity and storage hierarchy. Elsevier Journal of Computational Science, 5(4). 14. Roy, S., Sen, S., & Debnath, N. C. (2018). Optimal query path selection in lattice of cuboids using novel heuristic search algorithm. In 33rd International Conference on Computers and their Applications (CATA). 15. Ding, Q., Ding, Q., & Perrizo, W. (2008). PARM—an efficient algorithm to mine association rules from spatial data. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(6). 16. Chapman, C., & Feit, E. M. (2019). Association rules for market basket analysis. In R For Marketing Research and Analytics. Use R!. Springer, Cham. 17. Faridizadeh, S., Abdolvand, N., Harandi, S., & Rajaee, S. (2018). Market basket analysis using community detection approach: A real case. In M. Moshirpour, B. Far, & R. Alhajj


(Eds.), Applications of data management and analysis., Lecture notes in social networks Cham: Springer. 18. Maji, G., Sen, S., & Sarkar, A. (2017). Share market sectorial Indices movement forecast with lagged correlation and association rule mining. 16th International Conference on Computer Information Systems and Industrial Management Applications (CISIM).

DengueCBC: Dengue EHR Transmission Using Secure Consortium Blockchain-Enabled Platform Biky Chowhan, Rashmi Mandal (Vijayvergiya), and Pawan Kumar Sharma

B. Chowhan (B) · P. K. Sharma
IEEE, NIELIT, Gangtok, Sikkim, India
e-mail: [email protected]

P. K. Sharma
e-mail: [email protected]

R. Mandal (Vijayvergiya)
NIELIT, Kolkata, West Bengal, India
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2021
N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1174, https://doi.org/10.1007/978-981-15-5616-6_7

Abstract In the last decade, the world has faced a lot of trouble in terms of epidemics. Dengue infection is one of the cruelest of such epidemics and has already taken a large number of human lives. Citizens who survive after getting infected by the dengue virus often suffer from severe side effects that largely impact their lives, minds, and social interactions. Researchers have imparted various methodologies toward the care, cure, and management of such patients. But, due to some social obligations and lack of awareness, the information about dengue infection in a particular geolocation remains hidden from the citizens. Administration and associated agencies seem to be reluctant in facilitating dengue-related information, thus creating a dangerous situation for the dengue-affected zone. In most cases, the information about dengue-affected patients is leaked, damaged, or lost while being transferred from hospitals, clinics, or diagnostic centers to the local, regional, national, or global administration agencies. In this paper, we propose a novel system model to minimize the loss or leakage of such valuable information, in the form of electronic health records, by using a consortium blockchain platform. The proposed system model, i.e., DengueCBC, is envisaged to leverage possible improvement in the current scenario of loss or damage of dengue-specific electronic records. The proposed model enables health clinics to securely transfer dengue health records in electronic form to the administration health agency. The process of such transmission is proposed to be highly secure, robust, tamper-proof, and immutable in nature. Stakeholders having decentralized ledger technologies can get associated with the proposed model to get transparent information about the dengue-based electronic data transmission. Thus, the system model encompasses a new way of catering to dengue data transmission in a more advanced fashion. We further perform a comparison between DengueCBC and


existing literature to prove the effectiveness of this model. Lastly, we present some open issues related to the proposed system model and other related works that could be mitigated by following the prescribed future directions.

Keywords Consortium blockchain · Dengue · Electronic health record · Tamper-proof record transmission

1 Introduction

Nowadays, due to contamination and overpopulation in towns and cities, many new diseases and problems are emerging. Consider, for example, an outbreak of dengue: awareness in society takes place only after a large number of cases have occurred. If we can design an apparatus by which the very first case is registered and flashed instantly, the outbreak of such a disease can be prevented. This can be done in many ways: by the patient's family, by the clinic, or through the medical shop from where the medicines are issued. Most of the time, even the administration is ignorant of the situation and the menace gradually spreads; hence, a system is needed through which the information is automatically updated and corrective measures are taken. Dengue, a viral disease caused by the bite of the mosquito ‘Aedes aegypti’, can develop over 5–6 days and generally occurs in two forms: (1) Dengue Fever and (2) Dengue Hemorrhagic Fever (DHF). Dengue fever is severe, but DHF is more severe and can lead to death. All age groups are under threat from this deadly disease, which has no specific cure till date. Exact figures are also not provided by the administration due to a deficiency of information, and sometimes the hospitals are not equipped to deal with the situation. In order to contain the disease, if we can design an apparatus that provides fast accessibility to information about a dengue outbreak in a locality, then the impact of the disease can be measured. Considering the death rate caused by dengue since 2009, it can be observed that there has been a hike of about 300%, with the maximum recorded in 2017. In the last decade, the menace of dengue has been widespread, as stated by the National Vector Borne Disease Control Programme (NVBDCP) and the National Health Profile 2018. As per the report of 2017, the total number of reported cases was 188,401, which is approximately 300% of the roughly 60,000 cases reported in 2009, and also an approximately 250% hike over the 75,808 cases reported in 2013 (Fig. 1).



Fig. 1. Analysis of Global Dengue report. Source Global Dengue report by World Health Organization (WHO)

healthcare institutions need to exploit the data while keeping the data intimate to the other third parties [2]. Blockchain is a relatively new technology that can be used in many fields [3]. As mentioned above, it can be used in the field of medical bionetwork as well. Since the e-health data must be reserved very secure, so as no other third-party members can have an admittance to it. Thus, raising the need of dominant of the data confidentiality. Due to immutability feature, once an e-health data is inserted on a block of medical blockchain, it cannot be removed. Decentralization and decentralize ledger are the core benefits which may be used to progress the current e-healthcare service scenario with the help of the block chain. In real life, most of the patients face indulgence of the superficial bureaucracy by administration or private agencies as well as third-party interventions. Medical services, thus sometimes are interrupted by other non-linear factors and unpredictable issues [4]. By virtue, the blockchain is a perfect tool for providing such a platform that does not need any intermediaries and that can function with many different stakeholders who are selected via consensus mechanisms. Blockchains can be divided into three main types [5], such as (i) Public: it allows any stakeholder, such as miner, generic users to access the blocks and transactions, (ii) Private: in this type of blockchain stakeholders need to be granted prior consent to join the blockchain, thus more restricted, and (iii) Consortium: it is suitable for enterprises or large business applications where a group of people can grant or revoke access over other about the access of the blockchain. CBC comprises four units namely (i) trusted authority, (ii) miner (who administrates the data), (iii) data supplier (who provides the data), and (iv) service supplier and CBC also uses mechanisms such as proof of authorization in order to ensure the efficiency and security of the data being transferred.



In this paper, we propose a novel system model based on consortium blockchain centric secure EHR messaging transmission. The EHR transmission will be done between the hospital (public/private) and the Govt. Health Agency. The transmission shall take place under the aegis of higher trust-less, transparent, and decentralized consortium-based blockchain platform. The contributions of the work may be summarized as follows: (i)

To propose a novel system model comprising hospital, administration health agency, patients, and health professionals. (ii) To compare the proposed system model against the existing literature. (iii) To discuss and identify current challenges and future direction. The rest of the article is organized as follows. Section 2 presents the proposed system model, i.e., DengueCBC. Section 3 illustrates the comparative analysis between the proposed system model and existing literature. Section 4 depicts key issues and future direction in the study. Section 5 concludes the paper.

2 Proposed System Model Dengue-based EHR mitigation is a challenging task. Most of the time, the problem occurs due to the lack of awareness, casual mentality, and socioeconomic breakdown, especially for administration agencies. To solve this issue, we propose a novel dengue EHR-based consortium blockchain system model, i.e., DengueCBC. The proposed system model comprises three major components, that includes (i) dengue infected patient, (ii) DengueCBC engine, and (iii) design flow of dengue-based EHR under consortium blockchain ecosystem. Upon infection of dengue virus, a patient feels eye pain, headache, rash on skin, nausea, omitting, joint pain, bone pain, and muscle pain. These symptoms are usually augmented with high fever, thus making dengue patients’ health condition miserable. Such patients are provided with a dengue patient’s channel to perform interactive communication with the DengueCBC engine. Dengue patient’s channel plays a crucial role to facilitate the required services to the dengue infected patients while providing Short Messaging Service (SMS), chatbot, and web services to send some query and get appropriate results. DengueCBC engine is located as the center of the proposed system model that is responsible for performing all types of encryption, decryption, and dengue EHR related information transmission from the hospitals/health clinics to the administration agencies, especially health departments. Such dengue information is permanently stored in the administration health departments for serving various citizencentric health services, that includes current status of possibility of dengue epidemic outbreak. The dengue infected patient firstly gets clinical check-up by using the Enzyme-Linked Immunosorbent Assay (ELISA) mechanism, when found positive, confirmation about dengue infection is generated. The patient is then either given

DengueCBC: Dengue EHR Transmission Using …

91

medications and advised rest, or admitted to the clinic, depending on his or her health condition. Medical professionals including doctors, paramedics, nurses, and support staff usually take part in this phase of activity. The medical professionals generate the dengue EHR of a dengue-infected patient and encrypt it using the designated health department's public key. The encrypted file is uploaded to the decentralized consortium blockchain network, where a Secure Hash Algorithm (SHA) is applied over the encrypted dengue EHR file and the corresponding hash value is stored in the same blockchain network. Thus, each block of the consortium blockchain network contains two vital pieces of information for a dengue-infected patient: (i) the encrypted dengue EHR file and (ii) the hashed dengue EHR value. Such blocks are generated by the health clinics, but validation of the EHR is performed by a set of trusted consortium miners present within the blockchain network (Fig. 2). Validation takes place upon successful mining of a dengue EHR block by one of these miners using an existing consensus algorithm. Once the block is validated and added to the distributed ledgers, the miner is rewarded with a predefined amount of cryptocurrency. Such distributed ledgers are available to both the clinic and the health department. The health department may access the dengue EHR stored in DengueCBC by decrypting it with its own private key. If the hash value of the decrypted EHR block matches the appended pre-computed hash value, the EHR information is used by the health department to carry out several operations. Dengue patients are also given a stake in the system and can send queries to the DengueCBC engine and get appropriate responses from it. The top layer of the proposed system model shows the block generation, validation, appending, and access mechanisms. It starts by identifying hospitals and patients by their unique ids. Later, the dengue EHR is stored in the distributed ledgers located at each stakeholder of the system.

Fig. 2. DengueCBC: Dengue EHR-based consortium blockchain system model


Equivalent micro-payments are made to reward the miners for validating the dengue EHRs. Proof of Dengue Disease (PoDD) may be used as an effective consensus algorithm to drive the whole dengue information filing process. On the other hand, dengue EHRs must be recognized first before further processing can start; once this is done, the EHR is entered in the distributed ledgers of both the clinic and the administration agency nodes. The next job is to perform the actual dengue EHR transfer from the clinic to the agency. The whole process is closed by payment of cryptocurrency to the miners and by reflecting the balance in all the ledgers located at the various stakeholders. The activities described above are supported by the following sequential operations: (i) joining the DengueCBC platform; (ii) controlling the dengue EHR files and fetching relevant information from them; (iii) reserving, sharing, and consuming dengue EHRs; (iv) sharing dengue EHRs with health departments; and (v) validating the whole work by paying the requisite amount of cryptocurrency and returning the needed information from the health departments. Thus, the DengueCBC system model leverages several key aspects of securing dengue EHRs while they are transmitted from the clinics to the administration health agencies. The most crucial aspect of this system model is that it provides (i) transparency about the EHR files to the stakeholders of the DengueCBC system, (ii) decentralized dengue data access, (iii) always-available dengue file servicing, (iv) dual-layer security augmentation, (v) immutable dengue information, (vi) anonymity of dengue patients, and (vii) consistency of dengue information. It is envisaged that the proposed DengueCBC system, upon implementation, would bring dengue patients, clinics, and administration health agencies closer together to serve the following: better dengue care, prevention, dengue-related information sharing, and dengue-specific awareness. The system model would also support dengue medicine and pharmaceutical dispensing while integrating health insurance vendors to offer patients and their caregivers a promising health service. Moreover, the proposed system model aspires to minimize the existing data loss, data tampering, and data unreliability associated with dengue EHRs. Thus, less corrupt health services would become possible from both the clinic and administration perspectives.
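The dual-layer protection described above (the dengue EHR encrypted with the health department's public key, plus a SHA-256 digest of the ciphertext stored in the block) can be sketched as follows. This is a simplified stand-in, not the authors' implementation: it uses RSA-OAEP from the third-party cryptography package on a small illustrative record, whereas a real deployment would use hybrid encryption for large EHR files and submit the payload through an actual blockchain client. The key and field names are assumptions.

import hashlib, json, time
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Hypothetical key pair standing in for the health department's keys.
dept_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
dept_public_key = dept_private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# A small illustrative dengue EHR entry created at the clinic.
ehr = json.dumps({"patient_id": "P-001", "test": "ELISA", "result": "positive"}).encode()

# Layer 1 (confidentiality): only the health department can decrypt.
ciphertext = dept_public_key.encrypt(ehr, oaep)

# Layer 2 (integrity): the hash kept in the block exposes any tampering in transit.
block_payload = {
    "encrypted_ehr": ciphertext.hex(),
    "ehr_hash": hashlib.sha256(ciphertext).hexdigest(),
    "timestamp": time.time(),
}

# At the health department: re-check the hash, then decrypt with the private key.
received = bytes.fromhex(block_payload["encrypted_ehr"])
assert hashlib.sha256(received).hexdigest() == block_payload["ehr_hash"]
print(dept_private_key.decrypt(received, oaep) == ehr)  # True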

3 Related Works and Comparative Study

This section illustrates how blockchain can be used in various ways to provide access control over the data stored in EHRs. In [6], an EHR system consisting of more than one authority was illustrated. Figure 3 enumerates the entities that take part in this architecture. • The EHRs server acted as a remote storage server, such as a cloud, responsible for storing and transferring the EHR data.


Fig. 3. EHRs system consisting of more than one authority

• Various parties, such as hospitals, insurers, and medical research institutes, were responsible for the exchange of patient information. • The creation, management, control, and signing of the EHR data were performed by the patient, and the data were verified by a data verifier for originality. Figure 4 presents the cloud-assisted EHR sharing via consortium blockchain [7]. • Patients visiting the hospitals seeking medical assistance are the data owners. • The health data collected from the data owners are stored in an Electronic Health Record (EHR). • Administrative officials of the hospitals then encrypt and upload the EHR data to the cloud. • The uploaded data can be accessed by both the data provider and the data owner. • Data requesters are the bodies willing to procure the data, admit the patient, and provide the necessary medical facilities to the owner. In [8], a scalability-oriented blockchain was used for healthcare organizations. Figure 5 illustrates the indexed form of all the health records of the data owners (patients), maintained by the data handlers. The index acts like the catalogue used in libraries: it contains all the information, such as the health data and the location from which it was retrieved, and the health blockchain works in the same manner. Transactions in the blocks of the blockchain include a unique id for each user, a link to the encrypted health data record, and


Fig. 4. Cloud-assisted EHR sharing via consortium blockchain

Fig. 5. Scalability-based blockchain used in Health Care


a timestamp to record the instance at which it was created. All the data related to medical assistance were kept in a repository called a data lake. Data lakes are very flexible in storing varieties of data, irrespective of their types, and are highly scalable too. Health investigators and scholars may use this data lake as a tool for analysing health issues and for developing preventive medicines as a lifesaving mechanism. The data lakes support queries, text analysis, and many other advanced features. To ensure the privacy, security, and soundness of the data, all the data deposited in the data lakes were encrypted and digitally signed. Whenever a diagnosis was made on a patient, the resulting documents, such as prescriptions, were digitally signed to validate their originality. The data of the data owner would then be transferred to the repository for storage. Each time data was stored in the repository, a pointer referring to the user's unique id would be created in the blockchain and conveyed to the patient. In the same manner, the patient could add health data by signing it digitally using mobile applications or wearable sensors. In [9], an EMR integrity management blockchain platform is proposed. Figure 6 shows that the medical blockchain preserves a complete, up-to-date history of all medical data, covering the EMR, visits, drugs, billing, and IoT data, which would track an individual user for life. The medical data lake was an independent data source stored off-blockchain. It would be a valuable tool for a variety of analyses not confined to hospital usage, for example, in health insurance, disease prevention, and research. In [10], a hybrid blockchain-edge architecture for EHRs is illustrated. From Fig. 7, we note the following:

Fig. 6. EMR integrity management-based blockchain platform


Fig. 7. Hybrid blockchain-edge used for EHR

• The patient was the owner of the health data to be retrieved and might also need to control access to the data in the EHR that he or she owns. • Doctors and other hospital staff, referred to as healthcare providers, were the ones who needed to collect the data from the EHRs owned by the patients. Healthcare providers therefore had to request access from the patients, and only authorized healthcare providers had access to the data. • For the collection of health data from the patients, devices with advanced technical capabilities, such as X-ray and MRI machines, were used, and the data were sent to an edge node. • For better storage capability, access control, and tamper-proof data access, the edge nodes were backed by blockchain technology. Table 1 clearly illustrates that the existing EHR-based blockchain models do not provide dual-layer confidentiality and safety for the data being transmitted. The table also shows that consortium blockchain technology has not been used to date for the transmission of dengue EHR records.


Table 1. Comparison among the proposed models for data transmission and security through EHR using blockchain

Author/year | Objectives | Type of model | Blockchain platform | Type of blockchain | Dengue EHR support
Rifi et al. [61], 2017 | To arrange a secure and scalable solution for medical data exchange in order to achieve the best possible performance | Architecture-based model | No | Private blockchain | No
Azbeg et al. [62], 2018 | To develop a decentralized application for handling data and registering devices in order to avoid the addition of malicious devices | Architecture-based model | Yes | Private secure blockchain | No
Talukder et al. [63], 2018 | To obtain all the requirements of the medication needed to minimize the effect of the disease | System-based model | No | Private blockchain | No
Yang et al. [64], 2018 | A secure implementation of blockchain that can be extended irrespective of platform and ensures the protection of data in any EHR | Architecture-based model | No | Permissioned blockchain | No
Guo et al. [6], 2018 | To guarantee the validity of EHRs anchored in blockchain | Architecture-based model | Yes | Public blockchain | No
Wehbe et al. [65], 2018 | To offer a platform that leverages blockchain and artificial intelligence (AI) for (i) secure EHR management, (ii) well-organized data integration, and (iii) consistent computer-aided diagnoses | Schematic-based model | No | Private blockchain | No
Vora et al. [66], 2018 | A blockchain-based framework for efficient storage and preservation of EHRs | System-based model | No | Private blockchain | No
Shah et al. [67], 2018 | To store and manage the Electronic Health Records (EHR) of all patients, enabling continuous access to important, real-time patient data while avoiding the problem and cost of data compromise | Architecture-based model | Yes | Private blockchain | No
Liu et al. [68], 2019 | To improve the electronic health systems of health centers with a protective private blockchain | Architecture-based model | No | Private blockchain | No
Wang et al. [7], 2019 | To ensure authorization, safety, and privacy by using a searchable and conditional proxy-enabled encryption method | System-based model | No | Consortium blockchain | No
Tang et al. [69], 2019 | To develop a system that helps minimize the centralization problems of cloud-based EHRs | System-based model | No | Public blockchain | No
Nortey et al. [70], 2019 | To assure data owners of the security and accessibility of the data stored in EHRs when it is distributed over the blockchain | System-based model | Yes | Private secure blockchain | No
Proposed work, DengueCBC, 2019 | Secure transmission of the records of dengue-infected patients from health clinics to the health department | System-based model | Yes | Consortium blockchain | Yes


4 Opportunities and Challenges

Despite the huge prospects of blockchain technology for dengue care and management, it falls short in a few areas because it is still an emerging technology. Being nascent in a competitive technology domain, blockchain must offer ways around a set of challenges. We describe a few opportunities and challenges in blockchain-based dengue care, management, and prevention in this section.

A. Opportunities

(i) Transparency: Dengue management is a difficult process. Sometimes a virulent strain of the dengue virus can cut short a life in its prime, so information about the spread of dengue is very important. Currently, no technique is available that can provide transparent information to citizens about the epidemic status of dengue spread at a given point in time. Blockchain could be a key enabler of this aspect, supporting immediate and transparent dengue information sharing among its stakeholders. DengueCBC is such a system model: it provides transparent dengue care to the affected patient and ensures tamper-proof dengue EHR transmission from the clinics to the health departments in a transparent way [11–20].

(ii) Reduced transaction time: Conventional paper-based information takes days to weeks to travel from one administration office to another. The situation is severe in underdeveloped or developing countries, resulting in inadequate healthcare service provisioning for dengue-affected citizens [21–25]. Blockchain, as used in the proposed DengueCBC, could resolve this issue by reducing transaction time. On the Ethereum platform, such a consortium blockchain can achieve transaction times of less than five minutes, enabling better dengue-based health services.

(iii) High security and privacy: Health data is considered highly security- and privacy-sensitive. In the past, several attacks have been carried out against client–server network infrastructures, leading to data leakage or data loss. Blockchain is inherently tamper-proof in practical situations [26–32]. Although a few attacks have been carried out on applications deployed over blockchain architectures, no one has yet been able to tamper with data in a private or consortium blockchain. Thus, DengueCBC could preserve the security and privacy of dengue-related files while they are transmitted from clinics to health departments.

(iv) Cost efficiency: Conventional health-related data, especially dengue-specific information, is costly to generate because of the various pathological tests and administrative work involved [33–38]. Patients face large monetary losses while seeking dengue-related care or management. DengueCBC would address this gap by minimizing the cost of the operations incurred in dengue care, cure, and management processes.

(v) Irreversible transactions: Denial of service is a well-known attack that has been carried out many times in many sectors, digital and physical; it amounts to denying that a transaction has taken place. In the medical domain this can cause great chaos and lead to an unstable and unreliable medical service sector. Blockchain is proven to resist such denial-of-service behaviour [39–45]. For a dengue patient, a doctor might take fees but later refuse to acknowledge the fact. DengueCBC could protect the irreversible transactions conducted during dengue treatment, making the health service more rigorous.

(vi) Immutability: Health data must be non-editable while it is in the doctor's keeping, be it a clinic, hospital, or diagnostic center. Sometimes administration agencies appear to suppress such vital information to claim that there are no reports of dengue infection [46–52], since it is important for such agencies to present a clean image of their activities to the public. The inclusion of DengueCBC would change this scenario by making dengue files and EHRs immutable in all forms; only trusted and authorized personnel can modify the content of the dengue EHRs, while identical records are kept at all stakeholders' locations.

B. Challenges

(i) Interoperability: Existing systems rarely present themselves as interoperable, owing to business-model or policy-related problems. The healthcare industry currently faces this challenge: every clinic, hospital, diagnostic center, and agency in each country or region acts autonomously. This creates an interoperability problem and reduces the opportunity to provide patients with good-quality health services. DengueCBC is proposed to provide interoperable services while bringing all types of stakeholders into a closely connected periphery [53–57].

(ii) Scalability: If a system is not scalable, it will not survive long in its course of action. The same is true for the healthcare industry and its associates. Dengue is a viral disease and is therefore prone to mutation at any point in time, which can cause death or severe life-threatening situations. DengueCBC could help make a dengue care and management system more scalable and dynamic at the same time [58].

(iii) Storage: Medical data, i.e., EHRs, can take many forms, for example, image, text, audio, graph, or video, so the size of each EHR can differ when it is stored on digital media. DengueCBC offers a way of storing any type of dengue EHR on any type of stakeholder device, i.e., a blockchain node, with synchronous assimilation of EHR storage in the distributed ledgers. The Ethereum platform supports various block sizes that can be used to store the required amount of dengue data.

(iv) Social inertia: Citizens are normally inclined toward existing solutions, and the healthcare service domain is no exception [59]. This creates a barrier to introducing novel solutions and alternatives in this application domain. DengueCBC is envisaged to mitigate the social inertia toward using and deploying the implied technologies for the betterment of dengue care, cure, and management.


(v) Need for standardization: Blockchain is an emerging technology and has not yet been standardized in many applications. Dengue care is one such domain, where no such work has been done or even considered. DengueCBC could be a key system model to be incorporated into dengue-centric health services, especially for EHR transmission from clinics to administration agencies. The process of such transmission must be standardized across all regions of operation, be it national or global [60]. Caregivers, medical professionals, administration agencies, and health insurance providers must adopt and include the system model in their basket of operations to make dengue care, cure, and management standardized.

5 Conclusion

Dengue care, cure, and management have become critically important nowadays. Each year a large number of people are affected by the dengue virus in many parts of the globe, and the mortality rate due to this infection is also high. This raises a new challenge: tackling the scenario by incorporating advanced Information and Communication Technology (ICT). In this paper, we have proposed DengueCBC, a novel decentralized, consortium blockchain-based, highly secure dengue EHR transmission system model. DengueCBC brings together an existing blockchain platform, dengue-infected patients, specialized medical caregivers, administration agencies, and clinics. The proposed model has been compared against alternative approaches to showcase its efficiency and significance in managing secure dengue-related data transmission from the diagnostic clinics to the administration health agencies. The model allows all the aforementioned stakeholders to take an active part in the dengue EHR transmission and to stay updated about the real-time decentralized EHR transmission data blocks. We further investigated the key issues associated with the proposed model and provided a list of possible ways to improve the overall dengue EHR data transmission in a more secure, reliable, and better way.

References

1. Yanambaka, V. P., Mohanty, S. P., Kougianos, E., & Puthal, D. (2019). PMsec: physical unclonable function-based robust and lightweight authentication in the internet of medical things. IEEE Transactions on Consumer Electronics (TCE). 2. Puthal, D., & Mohanty, S. P. (2019). Proof of authentication: IoT-friendly blockchains. IEEE Potentials Magazine, 38(1), 26–29. 3. Mishra, P., Puthal, D., Tiwary, M., & Mohanty, S. P. (2019). Software defined IoT systems: properties, state-of-the-art, and future research. IEEE Wireless Communications Magazine (WCM). 4. Valentina, G., Fabrizio, L., Claudio, D., Pranteda, C., & Santamara, V. To blockchain or not to blockchain: That is the question. IT Professional, 20(2), 62–74.


5. Wang, S., Ouyang, L., Yuan, Y., Ni, X., Han, X., & Wang, F. Blockchain-enabled smart contracts: Architecture, applications, and future trends. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 6. Guo, R., Shi, H., Zhao, Q., & Zheng, D. (2018). Secure attribute-based signature scheme with multiple authorities for blockchain in electronic health records systems. IEEE Access, 6, 11676–11686. 7. Wang, Y., Zhang, A., Zhang, P., & Wang, H. (2018). Cloud-assisted EHR sharing with security and privacy preservation via consortium blockchain. In IEEE Access. 8. Laure, A. L., & Martha B. K. (2018). Blockchain for health data and its potential use in health it and health care related research. 9. Lei, H., Eunchang, C., & Do-Hyeun, K. (2019). A novel EMR integrity management based on a medical blockchain platform in hospital. Electronics, 8(4). 10. Hao, G., Wanxin, L., Mark, N., Chien-Chung, S. Access control for electronic health records with hybrid blockchain-edge architecture. In 2019 IEEE 4th international conference on blockchain. 11. Lanxiang, C., Wai-Kong, L., Chin-Chen, C., Kim-Kwang, R. C., Nan, Z. (2019). Blockchain based searchable encryption for electronic health record sharing. Future Generation Computer Systems, 95, 420–429. ISSN 0167-739X. 12. Liehuang, Z., Yulu, W., Keke, G., Kim-Kwang, R. C. (2019). Controllable and trustworthy blockchain-based cloud data management. Future Generation Computer Systems, 91, 527–535. ISSN 0167-739X. 13. Shaimaa, B., Ibrahim, G., Emad, A.-E. (2018). Multi-tier blockchain framework for IoT-EHRs systems. Procedia Computer Science, 141, 159–166. ISSN 1877-0509. 14. Sheng, C., Gexiang, Z., Pengfei, L., Xiaosong, Z., Ferrante, N. (2019). Cloud-assisted secure eHealth systems for tamper-proofing EHR via blockchain. Information Sciences, 485, 427–440. ISSN 0020-0255. 15. Ramzi, A., & David, R. (2019). Chapter Five: Blockchain applications in healthcare and the opportunities and the advancements due to the new information technology framework. In S. Kim, G. C. Deka, P. Zhang (Eds.), Advances in computers, vol. 115, pp. 141–154. Elsevier. 16. Ray, P. P., & Majumder, P. (2019). On chaining the epigenetic blocks. Current Science, Indian Academy of Sciences. 17. Ray, P. P., Thapa, N., & Dash, D. (2019). Implementation and performance analysis of interoperable and heterogeneous IoT-Edge gateway for pervasive wellness care. IEEE Transactions on Consumer Electronics. 18. Ray, P. P. (2014). Internet of things based physical activity monitoring (PAMIoT): An architectural framework to monitor human physical activity. In Proceeding of IEEE CALCON, Kolkata, pp. 32–34. 19. Ray, P. P., Thapa, N., Dash, D., & De, D. (2019). Novel implementation of IoT based non-invasive sensor system for real-time monitoring of intravenous fluid level for assistive e-Healthcare. Circuit World: Emerald Publishing. 20. Ray, P. P., Dash, D., & De, D. (2019). Edge computing for internet of things: A survey, ehealthcare case study and future direction. Journal of Network and Computer Applications. 21. Ray, P. P., Dash, D., & De, D. (2019). Implementation of IoT supported smart embedded web server for generic fever classification: A pervasive e-Healthcare approach. Brazilian Archives of Biology and Technology. 22. Ray, P. P., Dash, D., & De, D. (2019). A systematic review and implementation of IoT-based sensor-enabled pervasive tracking system for dementia patients. Journal of Medical Systems. 23. Ray, P. P. (2019). Energy packet networks: An annotated bibliography. SN Computer Science. 
24. Ray, P. P., & Majumder, P. (2019). An introduction to pervasive biomedical informatics. CSI Communication Magazine, Computer Society of India, 62. 25. Ray, P. P., Dash, D., & De, D. (2019). Analysis and monitoring of IoT assisted human physiological galvanic skin response factor for smart E-Healthcare. Sensor Review. 26. Majumder, P., Ray, P. P., Ghosh, S., & Dey, S. K. (2019). Potential effect of tobacco consumption Through smoking and chewing tobacco on IL1beta Protein expression in chronic periodontitis

104

27.

28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48.

49.

50.

B. Chowhan et al. patients: In silico molecular docking study. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Ray, P. P., Dash, D., & De, D. (2019). Internet of things-based real-time model study on e-Healthcare: Device, message service and dew computing. Computer Networks, 149(11), 226–239. Ray, P. P., Dash, D., & De, D. (2018). Approximation of fruit ripening quality index for IoT based assistive e-Healthcare. Microsystem Technologies. Ray, P. P., & Thapa, N. (2018). A systematic review on real-time automated measurement of IV fluid level: Status and challenges. Measurement, 129, 343–348. Ray, P. P. (2019). Minimizing dependency on internetwork: Is dew computing a solution? Transactions on Emerging Telecommunications Technologies, 30(1). Ray, P. P. (2018). Internet of things based approximation of sun radiative-evapotranspiration (ET0) models. Journal of Agrometeorology, 20(2), 171–173. Ray, P. P. (2018). Continuous glucose monitoring: A review of sensor systems and prospects. Sensor Review, 38(4), 420–437. Ray, P. P. (2017). An introduction to dew computing: definition, concept and implications. IEEE Access, 6, 723–737. Ray, P. P., Dash, D., & De, D. (2017). A systematic review of wearable systems for cancer detection: Current state and challenges. Journal of Medical Systems, 41, 180. Ray, P. P., Mukherjee, M., & Shu, L. (2017). Internet of things for disaster management: State-of-the-art, challenges, and future road map. IEEE Access, 5(1), 18818–18835. Ray, P. P. (2017). Internet of things for smart agriculture: Technologies, practices and future road map. Journal of Ambient Intelligence and Smart Environments, 9, 395–420. Ray, P. P. (2017). Understanding the role of internet of things towards providing smart eHealthcare services. Bio Medical Research, 28(4), 1604–1609. Ray, P. P. (2017). An IR sensor based smart system to approximate body core temperature. Journal of Medical Systems, 41, 123. Ray, P. P. (2017). A survey on visual programming languages in internet of things. Scientific Programming. Ray, P. P. (2017). Data analytics: India needs agency for health data. Current Science, 112(6), 1082. Ray, P. P. (2016). Internet of things cloud enabled MISSENARD index measurement for indoor occupants. Measurement, 92, 157–165. Ray, P. P. (2016). Communicating through visible light: internet of things perspective. Current Science, 111(12), 1903–1905. Ray, P. P. (2016). Internet of robotic things: Concept, technologies and challenges. IEEE Access, 4, 9489–9500. Ray, P. P. (2017). Obligations behind quantum internet dream. Current Science, 112(11), 2175– 2176. Ray, P. P. (2016). Creating values out of internet of things: An industrial perspective. Journal of Computer Networks and Communications. Ray, P. P. (2018). A survey on internet of things architectures. Journal of King Saud University Computer and Information Sciences, 30(3), 291–319. Ray, P. P. (2016). A survey of IoT cloud platforms. Future Computing and Informatics Journal, 1(1–2), 35–46. Ray, P. P., Chettri, L., Thapa, N. (2018). IoRCar: IoT supported autonomic robotic movement and control. In IEEE 7th international conference on computation of power, energy, information and communication, ICCPEIC-2018, Melmaruvathur, India. https://doi.org/10.1109/iccpeic. 2018.8525216. Ray, P. P. (2018). Digital India: perspective, challenges, and future direction. In IEEE 4th International Conference on Power Signals Control & Computation (EPSCICON), Thrissur, Kerala, India. https://doi.org/10.1109/epscicon.2018.8379594. 
Ray, P. P. (2016). IoT based fruit quality measurement system. IEEE International Conference on Green Engineering and Technologies (IC-GET). https://doi.org/10.1109/get.2016.7916620.


51. Ray, P. P. (2016). Towards internet of things based society. In IEEE International Conference on Signal Processing, Communication & Embedded Systems (SCOPES), Paralakhemundi, Odisa, India, pp. 345–352. https://doi.org/10.1109/scopes.2016.7955849. 52. Ray, P. P. (2016). Internet of things cloud based smart monitoring of air borne PM2.5 density level. IEEE International Conference on Signal Processing, Communication & Embedded Systems (SCOPES), Paralakhemundi, Odisa, India, pp. 995–999. https://doi.org/10.1109/sco pes.2016.7955590. 53. Ray, P. P., & Agarwal, S. (2016). Bluetooth 5 and internet of things: Potential and architecture. In IEEE International Conference on Signal Processing, Communication & Embedded Systems (SCOPES), Paralakhemundi, Odisa, India, pp. 1461–1465. https://doi.org/10.1109/sco pes.2016.7955682. 54. Ray, P. P., Rai, R., Chettri, L., & Bishunkey, K. (2016). B4Heal: Bio-inspired biopolymer based biocompatible biosensor for smart healthcare. In 1st International Conference on Nanocomputing & Nanobiotechnology (NanoBioCon), p. 74. 55. Ray, P. P. (2016). An internet of things based approach to thermal comfort measurement and monitoring. In IEEE International Conference on Advances in Computing and Communications (ICACCS), pp. 1–7. https://doi.org/10.1109/icaccs.2016.7586398. 56. Ray, P. P. (2015). Towards an internet of things based architectural framework for defence. In IEEE International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), Kumaracoil, pp. 411–416. https://doi.org/10.1109/iccicct. 2015.7475314. 57. Ray, P. P. (2015). A generic internet of things architecture for smart sports. In IEEE International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), Kumaracoil, pp. 405–410. https://doi.org/10.1109/iccicct.2015.7475313. 58. Ray, P. P. (2015). Internet of things based smart measurement and monitoring of wood equilibrium moisture content. In IEEE International Conference on Smart Structures and Systems (ICSSS). https://doi.org/10.1109/smartsens.2015.7873612. 59. Ray, P. P. (2015). Internet of Things for Sports (IoTSport): An architectural framework for sports and recreational activity. In Proceeding of IEEE International Conference on Electrical, Electronics, Signals, Communication and Optimization (EESCO), Vizag, pp. 79–83. https:// doi.org/10.1109/eesco.2015.7253963. 60. Ray, P. P. (2014). Home health hub internet of things (H3IoT): An architectural framework for monitoring health of elderly people. In Proceeding of IEEE ICSEMR, pp. 1–4. https://doi.org/ 10.1109/icsemr.2014.7043542. 61. Rifi, N., Rachkidi, E., Agoulmine, N., & Taher, N. C. (2017). Towards using blockchain technology for eHealth data access management. In 2017 fourth international conference on advances in biomedical engineering (ICABME), Beirut, pp. 1–4. 62. Azbeg, K., Ouchetto, O., Andaloussi, S. J., Fetjah, L., & Sekkaki, A. (2018) Blockchain and IoT for security and privacy: A platform for diabetes self-management. In 2018 4th international conference on cloud computing technologies and applications (Cloudtech), Brussels, Belgium, pp. 1–5. 63. Talukder, A. K., Chaitanya, M., Arnold, D., & Sakurai, K. 
(2018) Proof of disease: A blockchain consensus protocol for accurate medical decisions and reducing the disease burden, In 2018 IEEE SmartWorld, ubiquitous intelligence & computing, advanced & trusted computing, scalable computing & communications, cloud & big data computing, internet of people and smart city innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), Guangzhou, pp. 257–262. 64. Yang, G., & Li, C. (2018). A design of blockchain-based architecture for the security of Electronic Health Record (EHR) systems. In 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Nicosia, pp. 261–265. 65. Wehbe, Y., Zaabi, M. A., & Svetinovic, D. (2018). Blockchain AI framework for healthcare records management: constrained goal model. In 2018 26th Telecommunications Forum (TELFOR), Belgrade, pp. 420–425.


66. Vora J., et al. (2018). BHEEM: A blockchain-based framework for securing electronic health records. In 2018 IEEE Globecom Workshops (GC Wkshps), Abu Dhabi, United Arab Emirates, pp. 1–6. 67. Shah, B., Shah, N., Shakhla, S., & Sawant, V. (2018). Remodeling the healthcare industry by employing blockchain technology. In 2018 international conference on circuits and systems in digital enterprise technology (ICCSDET), Kottayam, India, pp. 1–5. 68. Liu, X., Wang, Z., Jin, C., Li, F., & Li, G. (2019). A blockchain-based medical data sharing and protection scheme. IEEE Access, 7, 118943–118953. 69. Tang, F., Ma, S., Xiang, Y., & Lin, C. (2019). An efficient authentication scheme for blockchainbased electronic health records. IEEE Access, 7, 41678–41689. 70. Nortey, R. N., Yue, L., Agdedanu, P. R., & Adjeisah, M. (2019). Privacy module for distributed electronic health records (EHRs) using the blockchain. In 2019 IEEE 4th International Conference on Big Data Analytics (ICBDA), Suzhou, China, pp. 369–374.

Online Credit Card Fraud Analytics Using Machine Learning Techniques Akshi Kumar, Kartik Anand, Simran Jha, and Jayansh Gupta

Abstract Every year, companies and financial institutions worldwide lose billions due to credit card fraud. With the advancement in electronic commerce and communication technology and the development of modern technology, there has been an increase in the usage of credit cards and also the risks associated with it. Financial fraud detection systems are being implemented everywhere to minimize these losses. In the real world, fraudulent and real transactions are scattered all around and it is extremely difficult to distinguish between them. There is a class imbalance problem due to the minority of fraudulent transactions. The enormity of imbalanced data poses a challenge on how well Machine Learning techniques can efficiently detect fraud with a high prediction accuracy, and at the same time reduce misclassification costs. In this paper, experiments are performed to compare the most commonly used ML algorithms and investigate their performance when applied to a massive and highly imbalanced dataset. We compare six Supervised Machine learning techniques, namely, Logistic Regression, K-Nearest Neighbours, Naive Bayes, Decision Trees, Random Forest and Linear Support Vector Machine (SVM). Keywords Credit card fraud · Fraud detection · Class imbalance · Supervised machine learning · Logistic regression · K-nearest neighbours · Naïve Bayes · Decision trees · SVM · Random forest

1 Introduction

Credit card fraud is a broad term for theft carried out by using a credit card to gain a source of legitimate funds through a given transaction. Credit card fraud detection is the process of identifying fraudulent transactions and classifying them into two classes, i.e. the legit class and the fraud class. As the use of the credit card as a source of payment becomes more widespread, many new computational methodologies have
come into focus to handle the fraud issue. However, credit card fraud is difficult to detect and solve. The primary reason is the limited amount of data available for public access, making it challenging to match a pattern for a given dataset. Secondly, research results are often hidden or censored, which makes them unavailable and makes it difficult to create a benchmark for the models built. In addition, security concerns limit the exchange of ideas in fraud detection of credit card transactions. The Credit Card Fraud Detection Problem includes analysis and modelling of past card transactions, including the fraud transactions. The model is then tested on new transactions in order to predict the status of each new transaction (fraud or not). This paper aims to conduct a comparative analysis of credit card fraud detection by evaluating six approaches: Logistic Regression, K-Nearest Neighbours, Naive Bayes, Decision Trees, Linear Support Vector Machine and Random Forest. Finally, the results are collated to find which model performs the best. The paper begins with a general description of credit card fraud detection followed by a discussion on the background and similar past work. In the next section, an overview of the system's architecture is presented followed by the proposed work. Here we explain the dataset that is being used along with the corresponding evaluation criteria and also describe the applied methodology. Further, an explanation of all the supervised learning techniques is given in detail. The results depict the performance of each technique and determine whether the predicted fraudulent transactions match the respective fraudulent transactions in the dataset. The paper concludes with the summary and the research's future scope.

2 Related Work Kamaruddin and Ravi [1] developed an architecture that combined Particle Swarm Optimization and Bi-Directional Neural Networks for classification of a single class in the Spark framework for fraud detection. Santiago et al. [2] used Support Vector Machine to detect whether a transaction in a credit card dataset is legit or not. They succeeded in the identification of 40–50% of the fraudulent transactions over a period of one month. Neural Networks were used by Gomez et al. [3] to reduce the imbalanced data and detect fraudulent credit card transactions. Bhattacharyya et al. [4] used Random Forest, Support Vector Machine (SVM) and Linear Regression to detect fraud. The overall accuracy of Random Forest was better than the other 2 techniques. Quah et al. [5] worked in the detection of credit card fraud using Self-Organizing Maps’ (SOM) clustering and filtering capabilities. Panigrahi et al. [6] developed a system for detection of credit card fraud using a combination of four techniques, i.e. filters based on rules, database of transaction history, Bayesian Learning and Dempster–Shafer adder. The suspicious transactions were first segregated by calculating their spread from the expected pattern. Then these cases were combined to measure the initial belief. Transactions are then categorized according to their extent of fraudulent nature and on the basis of history, the belief is accepted or discarded


using Bayesian Learning. Halvaiee et al. [7] applied and analysed different classification techniques for credit card fraud prediction and concluded that the performance of Tree and Meta classifiers is better than the rest of the classification groups. Malini et al. [8] identified the pros and cons of KNN, Artificial Neural Network, Decision Tree, Hidden Markov Model, Logistic Regression, and Support Vector Machine algorithm. The author classifies fraudulent and non-fraudulent transactions using KNN. Details of outlier detection algorithms with their types, supervised and unsupervised outlier detection were also given. They have shown that the KNN algorithm provides accurate and efficient results and the outlier detection algorithm works faster and provides better results for large datasets online. KNN has less memory consumption and performs extremely well in fraud detection. Despite an exorbitant amount of work already done in fraud detection, there has been a paucity of comparative analysis of many supervised classifiers to identify and detect online fraud. Generally, a machine learning algorithm is applied on a dataset and the results are compiled by the authors. However, no work has been done in comparing and examining different performance metrics on a single platform.

3 Background 3.1 Class Imbalance Problem The datasets in many practical domains including real-time bidding in marketing, fraud detection in banking, intrusion detection in networks, detection of an anomaly, diagnosis of medical conditions, detecting oil spills, facial recognition, etc., have a common problem. In all these datasets, the total number of examples of a class of data (negative) by far exceeds the total number of examples of another class of data (positive), i.e. each class does not make up an equal portion of the dataset. This situation is termed as the class imbalance problem [9]. Class imbalance problem has become a paramount issue in the field of machine learning and data science. In Fig. 1, the red points are greatly outnumbered by the blue. Standard machine learning algorithms like Decision Tree (ID3), KNN, SVM, etc. are biased towards the majority class, and hence generally ignore the minor class and the classifier predicts that everything belongs to the majority class. This poses a major problem in the field of classification and the majority of intelligence algorithms are not suited to handle this problem. Fortunately, there are techniques that are capable of reducing the negative effects of such biases. These methods are explained as follows: • Collecting More Data The approach of collecting more data is almost always overlooked. If possible, more data should be gathered for the problem. A large dataset might expose a different


Fig. 1 Illustrating class imbalance problem

and perhaps more balanced perspective on the classes. However, data collection is often not possible and hence, this scenario is skipped. • Using the right performance evaluation metrics in the imbalanced domain If we use accuracy to measure a model's goodness, a model that classifies all test samples as 'not fraud' will have an excellent accuracy (99.9%), but this model will obviously not provide us with valuable information. Hence, applying inappropriate evaluation metrics like accuracy for imbalanced data can be dangerous. Instead, the Confusion Matrix, which contains the actual and the predicted class information (Fig. 2), is used to evaluate the performance of an algorithm. Some other metrics can be applied in this case, such as precision, recall (sensitivity), and the F1 score; specificity, TN / (TN + FP), is sometimes reported as well:

Precision = TP / (TP + FP)

(1)

Recall (Sensitivity) = TP / (TP + FN)

(2)

F1 score = 2 ∗ Precision ∗ Recall / (Precision + Recall)

(3)

ROC (Receiver Operating Characteristic): The ROC evaluates a classifier's performance on the basis of the trade-off between the true positive rate (Y-axis) and the false positive rate (X-axis).

Fig. 2 Confusion matrix


AUC (Area under ROC Curve): relationship between true-positive rate and false positive rate • Resampling the data set Another approach is to resample the dataset to make it balanced. This can be done in 2 ways: increase the examples of the minority category (oversampling) or reduce the examples of majority category (undersampling). This paper employs the following techniques for resampling: SMOTE (Synthetic Minority Over-Sampling Technique) and Random Under-Sampling which are discussed and explained in the later sections.
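As a minimal, self-contained illustration of the points above (accuracy is uninformative under heavy imbalance, whereas the confusion matrix, precision, recall, F1 and ROC-AUC are not), the sketch below trains a plain logistic regression on a synthetic set with roughly 1% positives using scikit-learn. It is not the authors' code, and the data are generated rather than taken from the Kaggle dataset.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Synthetic data mimicking the fraud-style class imbalance (about 1% positives).
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.99, 0.01],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)
prob = clf.predict_proba(X_te)[:, 1]

print(confusion_matrix(y_te, pred))              # rows: actual class, columns: predicted class
print("accuracy :", accuracy_score(y_te, pred))  # high even if the minority class is missed
print("precision:", precision_score(y_te, pred, zero_division=0))  # TP / (TP + FP)
print("recall   :", recall_score(y_te, pred))    # TP / (TP + FN)
print("f1 score :", f1_score(y_te, pred))
print("roc auc  :", roc_auc_score(y_te, prob))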

4 System Architecture

Fraud detection primarily involves the identification of various types of fraud that can occur offline or online on the social web. In the presented work, fraud detection was applied to a dataset of online transactions. Alongside genuine transactions, there can be transactions that cause monetary loss and result in the degradation of the economy; it is, therefore, essential and crucial to identify such transactions. In order to perform the work, various supervised machine learning techniques were used on the credit card dataset. The techniques have succeeded in distinguishing authentic transactions from false ones. Figure 3 shows the system architecture, which comprises four phases for each workflow component.

4.1 Input Preprocessing The system is fed with the dataset of credit card transactions in the first stage. Changes are made to the dataset’s feature space in order to improve the results of the algorithms to be applied in the next stage.

4.2 Processing The input preprocessing phase output is taken and used as an input for this phase. The label for each transaction is calculated for each and every point of data in the input dataset. Each transaction is defined by the label as genuine or fraudulent. For each machine learning technique, the above process is done.


Fig. 3 System architecture

4.3 Matching the Results From the previous phase of processing, labels are obtained and used in this phase. Compared to the ground truth, these labels are then used to assess the accuracy of each algorithm.

4.4 Detection of Fraud The optimal technique is selected on the basis of the classifiers' performance. The label (fraud or not fraud) for each transaction is then reported.


5 Materials and Methodologies Used

5.1 Dataset The dataset of credit card transactions (obtained from kaggle.com) includes the details of online transactions carried out by European credit card holders over a period of 2 days in September 2013. Out of a total of 284,807 transactions, 492 are fraudulent. The dataset contains 28 unlabelled columns resulting from a Principal Component Analysis (PCA) feature transformation. In view of confidentiality concerns, the background details of these features cannot be presented. In addition, there are three labelled columns: Amount, Time and Class. The transaction amount or value is represented by the 'Amount' feature. The 'Time' variable stores the time elapsed from the first transaction in the dataset to the current one, in seconds. The 'Class' feature maps the output (0 for not fraud and 1 for fraud), determining the transactions as authentic or fraudulent (Fig. 4).

5.2 Data Acquisition It is the primary step before we begin analysing the data. The credit card dataset is obtained from kaggle.com. This dataset presents credit card transactions that took place over two days. There are 492 fraudulent ones out of a total of 284,807 transactions. It is important to note that, due to confidentiality reasons, the data was anonymized and variable names were renamed to V1, V2, ..., V28. Moreover, most of the data was scaled, except for the Time, Amount and Class variables, the latter being our binary target variable.

5.3 Exploratory Data Analysis The first and foremost step after data acquisition was to get a sense of our data. Remember, except for the transaction time and amount features we don't have any information about what the other columns are (due to confidentiality issues). The only thing we know about the columns labelled V1, V2, ..., V28 is that those columns

Fig. 4 Data distribution


have been scaled already. Hence, to gather an overview and understanding of the dataset, exploratory data analysis was performed. Some of the gathered insights are mentioned below: • The transaction amount is relatively small. The mean of all the amounts made is approximately USD 88. • There are no ‘Null’ values, so we don’t have to work on ways to replace values. • Maximum transactions were Non-Fraud (99.83%), while Fraud transactions occur 0.17% of the time in the dataset. • The distributions for Amount and Time were analysed. By seeing the distributions, we got an idea of how skewed these features are. • Moreover, a Z-test was also performed with the valid transactions as the population. This test was performed for 99% significance level and hence, a z-score of at least 2.326 must be achieved. z-score = (x − μ)/S.E

(4)

where x = sample mean, μ = population mean, S.E = Standard Error. We obtained a z-score of 3.008 which signifies that there is a 99% chance that the amount spent on fraudulent transactions are on average significantly higher than legit transactions. • Similarly, a 2-tailed hypothesis test with a 99% significance level was performed for each of the 28 features and hence, a z-score of at least 3.37 must be achieved. This was done to see if the feature value for fraud data is significantly different from the valid transactions or not.
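The one-sample z-test in Eq. (4) is easy to reproduce; the figures below are placeholders rather than the paper's exact numbers, and the standard error is assumed to be the population standard deviation divided by the square root of the sample size.

import math

def z_score(sample_mean, population_mean, population_std, n):
    # Eq. (4): z = (sample mean - population mean) / standard error,
    # with the standard error taken as sigma / sqrt(n).
    standard_error = population_std / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error

# Placeholder values: mean amount of the 492 fraud transactions vs. the overall mean.
z = z_score(sample_mean=122.2, population_mean=88.3, population_std=250.0, n=492)
print(round(z, 3), z > 2.326)  # compare against the 99% one-tailed critical value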

5.4 Preparation and Cleaning of Data The main problem with this dataset is that it is highly unbalanced, with only 0.172% of transactions labelled as 'fraud'. Of the 284,807 transactions, there are only 492 frauds: a very large number of genuine cases and very few fraud cases. To mitigate this high imbalance ratio, the following techniques (Figs. 5 and 6) were used to enable the models to see sufficient examples of fraud during training. The Synthetic Minority Oversampling Technique (SMOTE) is used to generate new cases of fraud (the minority class) by construction from k-nearest neighbours. To oversample means to artificially create observations in our dataset belonging to the class that is under-represented in our data. In the process used by us, SMOTE performs the following steps: • It first finds the k-nearest neighbours of the observations of the minority class. • Then it arbitrarily chooses one of the neighbours and uses it to create a similar, but slightly tweaked, new observation.


Fig. 5 Oversampling illustration

Fig. 6 Under sampling illustration

• Finally, the number of positive samples is increased to 199,016 using this technique. Undersampling works by sampling the dominant class to reduce the number of samples. One simple way of undersampling is randomly selecting a handful of samples from the class that is overrepresented. Using this technique, the number of majority-class (genuine) samples was decreased to 492, matching the number of frauds, as sketched below.
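Both resampling strategies are available off the shelf in the imbalanced-learn package; the short sketch below applies them to stand-in arrays rather than the credit card data, so the class counts differ from those quoted above, and in practice resampling would be fitted on the training split only.

import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Stand-in feature matrix with 50 minority (fraud) rows out of 5000.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))
y = np.array([1] * 50 + [0] * 4950)

# Oversampling: SMOTE synthesises new minority rows from their nearest neighbours.
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)

# Undersampling: randomly drop majority rows until both classes are the same size.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

print(np.bincount(y), np.bincount(y_over), np.bincount(y_under))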

5.5 Classification Techniques Used The following techniques for classification were used: • Logistic Regression: 'Logistic regression analysis studies the association between a categorical dependent variable and a set of independent (explanatory) variables and measures the probability of a binary response'. The response variable is evaluated using a mathematical equation relating it with the predictor variables. Equation 5 represents the calculation. y = 1 / (1 + e^(−(a + b1∗x1 + b2∗x2 + b3∗x3 + ...)))

(5)


The parameters used are described below. The response variable is y, the predictor variable is x, a and b are the numerical constant coefficients. In this paper, an instance of a fraudulent transaction is represented by class ‘1’ and genuine transaction is represented by class ‘0’. • K-Nearest Neighbours: ‘K-Nearest Neighbour is defined as a non-parametric method that relies on the category labels where an output is a class membership’. K-NN identifies the class of an item based on the majority votes of its neighbours. K-nearest neighbour algorithm uses the query data’s K closest samples. Each sample belongs to a known class C i. The C m class is assigned to the query data Dq which has the most incidences among the K samples. The value of the k, topological distribution of samples over the feature space and the number of samples affect the performance of KNN classifiers. In this paper, for every sample, the k-nearest neighbour of the instance is located using Euclidean distance. If maximum is labelled as ‘1’, the model concludes they are fraudulent. Similarly, if maximum is labelled as ‘0’, the model concludes they are non-fraudulent. • Decision Tree: In Decision Trees, the data is continuously split on the basis of a certain parameter called the decision node. Two key units—decision nodes and leaves describe the tree. The leaves are the final results or the outcomes after taking a particular decision and the decision nodes help in splitting the data based on the feature values. The target variable type basically decides the type of decision tree and can be of two types, namely, Categorical Variable Decision Tree, Continuous Variable Decision Tree. In this paper, Categorical Variable Decision Tree is implemented in which the test sample is classified as fraudulent ‘1’ or not fraudulent ‘0’ based on the leaf node. The tree can be divided into branches in many ways. The tree structure is optimized by the model during its training. In this study, splitting is done on the basis of entropy which chooses the split with the highest information gain. Entropy ( S ) = Σ − pi log2 pi

(6)

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) (|Sv| / |S|) ∗ Entropy(Sv)

(7)

• Random Forest: ‘The Random Forest is an ensemble classifier based on Decision Trees which makes a prediction about the class, not simply based on one decision trees, but by an (almost) unanimous prediction, made by ‘K’ decision trees’. Every tree votes for the most popular class at input x. A new random vector is generated that is independent of the previous vectors having the same distribution and the training set generates a tree. For the classification of a new sample, the same process is repeated across all the trees in the forest, thereby training the Forest Classifier. The vote of each tree in the forest must be recorded to classify and label the new instance. Out of all the votes by the


trees, the class with the largest number of votes is declared as the class of the new sample to be classified. This is termed as the Forest RI process. • Naïve Bayes Classifier: Naive Bayes classifiers, a collection of algorithms that are based on Bayes’ Theorem, can work upon any number of independent variables whether continuous or categorical. Bayes’ Theorem calculates the probability of the occurrence of an event given the probability that another event has already taken place. Mathematically, we can represent the Bayes’ Theorem as follows: P(A/B) = P (B/A) ∗ P(A) /P(B)

(8)

The dataset is composed of two components: the feature matrix and the response vector. – The feature matrix is made up of all the rows of the dataset, wherein each row comprises the values of the input features. – The response vector holds the value of the class variable (output) corresponding to each row of the feature matrix. • Support Vector Machine: 'The Support Vector Machine (SVM) is described by a decision plane which builds decision boundaries for separating groups of instances as different class members by building a hyper-plane or a set of hyper-planes that is further utilized for classification'. The kernel performs the function of taking data as input and simulating the projection of the initial data in a feature space with a higher dimension. Linear kernel: The prediction for a new input using the dot product is as follows: F(x) = B0 + Σ (ai ∗ (x · xi))

(9)

where x is the input, x i is the support vector. The learning algorithm estimates the coefficients B0 and ai (for each input) from the training data. Radial Basis Function (RBF) kernel: It is one of the most commonly used kernels in SVC. K (x, x) = e−x−x2 /2 ∗ σ 2

(10)

Where ||x−x  ||2 represents the square of Euclidean distance between the data points x and x  . An SVC classifier that uses an RBF kernel has two parameters, namely, Gamma and C.



• Gamma: The first important parameter of the RBF kernel. Gamma can be viewed as the spread of the kernel and therefore of the decision region. A low gamma gives a decision boundary with little curvature and hence a very broad decision region, while a high gamma gives a highly curved boundary, creating islands of decision boundary around the data points.
• C: The second important parameter of the SVC learner, denoting the penalty for misclassifying a data point. A small C means the classifier tolerates misclassifications (high bias, low variance), whereas a large C imposes a heavy penalty on misclassifications (low bias, high variance).
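A minimal sketch of fitting an RBF-kernel SVC with explicit gamma and C values is shown below; the synthetic data and the particular parameter values are assumptions standing in for the real transaction features, not the paper's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the transaction data (features are not reproduced here).
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", gamma=0.1, C=1.0)   # gamma and C are placeholder values
clf.fit(scaler.transform(X_train), y_train)
print(clf.score(scaler.transform(X_test), y_test))
```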

6 Results Here, the results obtained after applying all the above techniques to the dataset are analysed. The performance of each technique is measured and compared using the following metrics: Precision, Recall, Accuracy, F1 Score, and AUC. The ROC curves for unscaled data are shown in Fig. 7. Tables 1 and 2 report the results on scaled data with undersampling and oversampling, respectively.
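The evaluation metrics listed above can be computed as in the following sketch; the label and score arrays are illustrative placeholders, not the paper's predictions.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Illustrative placeholders: 1 = fraudulent, 0 = genuine.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.2]  # predicted probability of class '1'

print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```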

Fig. 7 ROC Curves for unscaled data



Table 1 Scaled data with undersampling

Metrics     LR      KNN     RF      SVM     NB      DT
Precision   0.9858  1.0000  0.9931  1.0000  1.0000  1.0000
Recall      0.9586  0.9103  0.9728  0.9932  0.8912  1.0000
Accuracy    0.9730  0.9527  0.9831  0.9966  0.9459  1.0000
F1 Score    0.9720  0.9530  0.9828  0.9966  0.9424  1.0000
AUC         0.9726  0.9789  0.9993  0.9965  0.9830  1.0000

Table 2 Scaled data with oversampling

Metrics     LR      KNN     RF      SVM     NB      DT
Precision   0.0733  0.0097  0.9225  0.0752  0.1827  0.4762
Recall      0.8782  0.9317  0.8207  0.8947  0.7660  0.7843
Accuracy    0.9795  0.8207  0.9996  0.9802  0.9940  0.9981
F1 Score    0.1352  0.0192  0.8686  0.1388  0.2951  0.5926
AUC         0.9289  0.9441  0.9431  0.9375  0.9310  0.8913

7 Challenges Faced A number of problems faced by fraud detection systems are listed below. All of them need to be addressed by an effective fraud detection technique in order to achieve the best performance:
• Data Imbalance: The dataset of credit card transactions is imbalanced in nature, meaning that only a small fraction of the transactions are fraudulent. This makes the detection of fraudulent transactions difficult and imprecise.
• Difference in the importance of misclassification: Different misclassification errors carry different costs in the fraud detection task. Incorrectly classifying a fraudulent transaction as genuine is more harmful than classifying a normal transaction as fraudulent, since the latter mistake will undergo further investigation.
• Data Overlap: Many normal transactions can be classified as fraudulent (false positives) and, similarly, many fraudulent transactions can be labelled as genuine (false negatives). The key challenge is therefore to obtain low false positive and false negative rates in fraud detection.
• Failure to adapt: Classification algorithms usually struggle with new types of genuine or fraudulent patterns. Detecting new patterns of fraudulent and of normal behaviour is a challenge for unsupervised and supervised models, respectively.
• Cost of detecting fraud: The system should weigh both the cost of detecting fraudulent behaviour and the cost of fraud prevention. For example, stopping a fraudulent transaction of a very small amount yields no revenue.



8 Conclusion and Future Scope The models cannot be generalized to other fraud datasets, as they are learnt from this specific dataset of credit card transactions with a particular set of features. Since the dataset is highly imbalanced, other preprocessing methods, including different SMOTE and ADASYN oversampling variants, Tomek’s link, and AIKNN, could be used to create a balanced dataset. Different resampling ratios (e.g., 2:3 or 1:2) could be tried instead of the 1:1 minority-to-majority ratio used here. More generalizable models can be obtained with validation techniques such as k-fold cross-validation instead of the 70–30 split that was used. The hyper-parameters of the models were left at their default values and could be tuned to improve performance; for example, a different decision threshold could be used for logistic regression instead of the default value. Our models were built treating the transactions as time dependent, i.e. sequential; the time feature of the dataset could instead be ignored to build an alternative prediction model. The specificity and sensitivity of the prediction models could be improved by using the entire dataset with data augmentation for the minority fraud class and by applying deep learning models (a CNN, or an RNN if the time feature is used). Different patterns could also be sought in the input features for the detection of fraudulent transactions.
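A minimal sketch of the kind of follow-up the conclusion suggests, combining SMOTE oversampling with k-fold cross-validation, is shown below; the synthetic data, classifier choice, and parameter values are assumptions, not the authors' setup.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data as a stand-in for the credit card transactions.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.98, 0.02], random_state=0)

# SMOTE is placed inside the pipeline so oversampling is applied only to the training folds.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", LogisticRegression(max_iter=1000))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(pipe, X, y, scoring="f1", cv=cv))
```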


Identification of Significant Challenges Faced by the Tourism and Hospitality Industries Using Association Rules Prafulla B. Bafna and Jatinderkumar R. Saini

Abstract The tourism and hospitality sector is expanding enormously, generating revenue and employment. Several challenges facing the tourism industry are stated in the literature. In this paper, a novel approach is presented to generate association mining rules, identifying the important challenges that must be addressed for the tourism industry to be profitable, which in turn will improve the economies of countries. Thirty research papers published in the area of tourism during the last decade were analyzed and ten core challenges were identified from them. The proposed approach selects the top four critical factors as the challenges affecting tourism practices. The Apriori algorithm was used to generate rules and, in turn, the significant factors. Keywords Apriori · Data mining · Feature selection · Tourism industry

1 Introduction Feature Selection is used to identify the significant features. Decision-making is improved by feature selection. Data mining methods are used as different feature selection methods. FS is an important part of preprocessing which is generally performed before applying the data mining algorithm. The outliers, noisy and useless data, get identified and attributes get reduced. This unwanted data needs to be removed to increase the performance and efficiency of the algorithm. The optimum set of features get chosen by the FS algorithm [1–3]. The Apriori algorithm is used for choosing significant challenges faced by the Tourism industry. Multiple challenges are focused on the literature [4, 5]. The Apriori algorithm is executed on the dataset of challenges. The challenges were found out from various research papers focusing on challenges faced by Tourism industries. The research papers are classified based on the challenges they showcase P. B. Bafna (B) · J. R. Saini Symbiosis Institute of Computer Studies and Research, Symbiosis International (Deemed) University, Pune, Maharashtra, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1174, https://doi.org/10.1007/978-981-15-5616-6_9




and a matrix is prepared in which a research paper takes up a row and the challenge is placed in a column. The main factors are selected by the proposed technique based on binary values in the matrix. These factors are generally dependent on each other. The factors that are really not significant in the identification of challenges can be ignored to save efforts and time.

2 Related Work Today, our planet has shrunk into a global village and there has been a tremendous increase in tourism. The number of people traveling across the planet every day has increased manyfold. Consequently, the tourism industry is one of the major industries in any country. It plays an important role in the economy, as it is a source of income for many people and regions, and the service industry benefits greatly from it. It is therefore necessary for the tourism industry to be profitable and to function well [5]. Since 2010, tourism has been growing steadily worldwide, despite several terror attacks and political turmoil. Economic progress: tourism is beneficial to the country, as it generates foreign currency. Each year, a huge number of travelers visit India and other destinations, and their spending at shops and events contributes a substantial amount of foreign currency. Global recession does not affect Indian tourism; in fact, it grows by 7% each year [6]. Tourism supports employment, as it creates jobs in sectors such as hospitality, entertainment and hotels. Development of infrastructure: when a place is declared a tourist destination, it changes tremendously with respect to its infrastructure and construction, with new roads and increased interconnectivity between places. Societal progress: tourism is considered a delightful way of exchanging social and cultural themes. It boosts societal development, as sightseers learn to express respect, patience and affection for others when they stay in new places. It is important to establish a connection between tourism and the physical and social environments. Strong tourism growth in the past means that tourism has produced negative impacts that must be mitigated, not only for the good of the physical and social environments but also for the sustainability of the industry itself [7]. Past misconceptions about tourism as an environmentally friendly industry have led to slow integration of responsible environmental and social considerations into tourism planning and development [8]. Several challenges faced by this industry are responsible for down-marketing, and solving them would be beneficial in multiple respects, such as economic and cultural development [1]. According to Witten and Frank (2004), the exponential growth of data is an effect of technology, which has increased the dimensions and sample size of data from all perspectives



(Witten and Frank, 2004). Sometimes because of huge data, significant attributes get ignored during the process of decision-making results [9, 10]. Feature selection is the most commonly used technique by using Linear Discriminant Analysis (LDA), Singular Value Decomposition (SVD) and so on. A feature subset is formulated to reduce redundancy. To evaluate selected features, different feature selection measures can be used like database index and so on [11]. Preprocessing is generally performed before applying the data mining algorithm. The outliers, noisy and useless data, get identified and attributes get reduced. This unwanted data needs to be removed to increase the performance and efficiency of the algorithm. The optimum set of features get chosen by the FS algorithm [1–3]. It will be advantageous to select top N challenges. Before applying any algorithm on data, the feature selection technique should be used. It increases the performance of the learning of an algorithm and also lessens the cost. Attribute reduction is achieved due to feature selection. FS has also been used by [12, 13] for classification in the poetry domain, identification of challenges in the sports domain, respectively.

3 Data Collection and Preprocessing We have made use of secondary data. The collected dataset is from different research papers focusing on challenges faced by the tourism industry. Different challenges act as columns and research papers represent rows. If the challenge is present in the paper, then respective entry is marked as 1, otherwise 0. 30 research papers from the last decade were downloaded from almost ten publications related to tourism and hospitality [14–43]. Ten core and commonly occurring challenges were identified. Table 1 shows the sample table input to the Apriori algorithm the description of C1 to C10 present in Table 1 is listed in Table 2. Table 1 Research papers versus challenges Research Papers

C1

C2

C3

C4

C5

C6

C7

C8

C9

C10

RP1

1

1

0

0

0

0

1

1

1

1

RP2

0

1

0

1

0

1

1

0

0

1

RP3

1

0

0

1

0

1

1

0

0

0

RP4

1

1

1

0

1

1

1

0

0

0

RP5

1

0

1

0

1

1

1

0

0

0

RP6

1

1

0

0

1

1

1

0

0

0

RP7

0

1

0

0

0

1

1

0

0

0

RP8

0

0

0

1

0

1

1

0

0

1 (continued)

124

P. B. Bafna and J. R. Saini

Table 1 (continued) Research Papers

C1

C2

C3

C4

C5

C6

C7

C8

C9

C10

RP9

1

1

0

1

1

1

1

0

0

1

RP10

1

1

1

1

1

1

1

0

0

1

RP11

1

0

0

0

1

0

1

0

1

0

RP12

1

0

0

0

1

1

0

0

1

0

RP13

0

1

1

1

0

0

1

1

1

0

RP14

1

1

1

1

1

1

0

0

1

0

RP15

1

1

1

1

0

0

1

0

1

1

RP16

1

0

0

1

1

0

0

1

1

1

RP17

1

0

0

1

1

1

1

0

0

1

RP18

0

1

1

1

1

0

0

0

1

1

RP19

1

1

1

1

1

0

0

1

0

0

RP20

1

1

1

1

1

1

0

1

0

1

RP21

0

1

1

0

0

1

1

0

0

1

RP22

1

1

1

0

0

1

1

0

0

1

RP23

1

1

0

0

1

0

0

0

0

1

RP24

0

1

1

0

1

0

0

0

0

1

RP25

0

1

1

1

0

0

0

0

1

RP26

1

1

1

1

1

0

0

0

0

1

RP27

1

1

1

1

0

0

1

1

1

1

RP28

1

1

0

0

0

0

1

1

1

1

RP29

1

1

0

0

0

0

1

1

1

1

RP30

1

1

0

0

0

0

1

1

1

1

Table 2 Description of challenge Id

Challenge Id   Description
C1             Inflation
C2             Information Technology
C3             Fluctuation in currency exchange
C4             Seasonal dependence
C5             Taxation
C6             Environmental and climate change
C7             Lack of infrastructure
C8             Security issues
C9             Regulatory issues
C10            Cultural factors



4 Data Analysis Using Tools We carried out an extensive literature survey of different association algorithms. Association algorithms depend on the nature of data. The Apriori algorithm was applied to which input is a binary-valued table, using the R programming platform. It builds association rules and identifies significant challenges. Figure 1 specifies the steps implemented while performing the experiment. The steps performed to implement the Apriori algorithm are depicted in Fig. 2 in the form of a screenshot. Table 3 shows the top 4 significant critical factors faced by the tourism industry and found by the Apriori algorithm. Fig. 1 Steps in data analysis
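The paper executes the Apriori algorithm on the binary matrix using the R platform. As a rough illustration of the same step, the Python sketch below runs Apriori and rule generation with mlxtend on a toy paper-versus-challenge matrix; the matrix values, column subset, and thresholds are assumptions, not the study's actual data or settings.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy version of the paper-vs-challenge matrix; True = the challenge appears in that paper.
data = pd.DataFrame(
    [[1, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 0], [1, 0, 1, 1]],
    columns=["C1", "C5", "C7", "C9"],
).astype(bool)

frequent = apriori(data, min_support=0.4, use_colnames=True)      # frequent challenge itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```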



Fig. 2 Steps of implementation of the Apriori algorithm

5 Conclusions The proposed approach is useful in any application where the differentiation and selection of features or factors is significant. Extracting significant factors helps speed up activities such as data collection and decision-making. In this work, thirty recent research papers were studied to identify ten core challenges, which were further scaled down to four. Working on these top four challenges will benefit the profitability of tourism and hospitality businesses. The Apriori algorithm was used to generate rules and, in turn, the significant factors.



Table 3 Top four critical factors

Sr. No.  Factors identified in research papers   Important factors identified by the Apriori algorithm
1        Inflation                               Lack of infrastructure
2        Information Technology                  Taxation
3        Fluctuation in currency exchange        Inflation
4        Seasonal dependence                     Regulatory issues
5        Taxation
6        Environmental and climate change
7        Lack of infrastructure
8        Security issues
9        Regulatory issues
10       Cultural factors

References 1. Liu, Huan, & Lei, Yu. (2005). Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4), 491–502. 2. Mamitsuka, H. (2002). Principles of data mining and knowledge discovery, lecture notes in computer science. Springer. 3. He, X., Ji, M., Zhang, C., & Bao, H. A Variance minimization criterion to feature selection using laplacian. 4. Witten, L. H., & Frank, E. Data miming: practical machine tools and techniques. 5. Xiang, Z., Woeber, K., & Fesenmaier, D. R. (2008). Representation of the online tourism domain in search engines. Journal of Travel Research, 47(2), 137–150. 6. Technofunc. (2012). Challenges in the tourism industry. Tavel Domain. 7. Reisinger, Y., & Turner, L. (1999). A cultural analysis of Japanese tourists: challenges for tourism marketers. European Journal of Marketing, 33(11/12), 1203–1227. 8. Kasim, A. (2006). The need for business environmental and social responsibility in the tourism industry. International journal of hospitality & tourism administration, 7(1), 1–22. 9. Farahat, A. K. (2005). Data mining (ICDM), an efficient greedy method for unsupervised feature selection. IEEE. 10. Liu, H. T. (2005). Evolving feature selection, intelligent systems. IEEE. 11. Ramaswami, M., & Bhaskaran, R. (2009). A study on feature selection techniques in educational data mining. Journal of computing. 12. Kaur, J., & Saini, J. R. (2020). Designing Punjabi poetry classifiers using machine learning and different textual features. International Arab Journal of Information Technology 17(3). 13. Bafna, P. B., & Saini, J. R. (2019). Identification of significant challenges in the sports domain using clustering and feature selection techniques. In Proceedings of ICETET-SIP-19. IEEE. 14. Kumar, N., & Kumar, R. R. (2019). Relationship between ICT and international tourism demand: A study of major tourist destinations. Tourism Economics, 1354816619858004. 15. Yin, P., Zheng, X., Duan, L., Xu, X., & He, M. (2019). A study of the contribution of information technology on the growth of tourism economy using cross-sectional data. Journal of Global Information Management (JGIM), 27(2), 39–58. 16. Kumar, N., & Kumar, R. R. (2019). Relationship between ICT and international tourism demand: A study of major tourist destinations. Tourism Economics, 1354816619858004. 17. Kalbaska, N., & Cantoni, L. (2018). The use of eLearning strategies among travel agents in the United Kingdom, India and New Zealand. Journal of Teaching in Travel & Tourism, 18(2), 138–158.



18. Agrifoglio, R., & Metallo, C. (2018). Knowledge management and social media in tourism industry. In Social media for knowledge management applications in modern organizations (pp. 92–115). IGI Global. 19. Huang, H., Liu, Y., & Lu, D. (2019). Proposing a model for evaluating market efficiency of OTAs: Theoretical approach. Tourism Economics, 1354816619853114. 20. Gretzel, U., Sigala, M., Xiang, Z., et al. (2015). Smart tourism: foundations and developments. Electronic Markets, 25(3), 179–188. 21. Gretzel, U., Werthner, H., Koo, C., et al. (2015). Conceptual foundations for understanding smart tourism ecosystems. Computers in Human Behavior, 50, 558–563. 22. Hernández, J. M., Kirilenko, A. P., & Stepchenkova, S. (2018). Network approach to tourist segmentation via user generated content. Annals of Tourism Research 73, 35–47. 23. Huang, C. D., Goo, J., Nam, K., et al. (2017). Smart tourism technologies in travel planning: the role of exploration and exploitation. Information & Management, 54(6), 757–770. 24. Ivars-Baidal, J. A., Celdrán-Bernabeu, M. A., & Mazón J. N., et al. (2017). Smart destinations and the evolution of ICTs: a new scenario for destination management? Current Issues in Tourism, 1–20. 25. Kim, J., Fesenmaier, D. R., & Johnson, S. L. (2013). The effect of feedback within social media in tourism experiences (pp. 212–220). Berlin: Springer. 26. Kim, J. J., & Fesenmaier, D. R. (2017). Sharing tourism experiences. Journal of Travel Research, 56(1), 28–40. 27. Kim, J.-H., Ritchie, J. R. B., & McCormick, B. (2012). Development of a scale to measure memorable tourism experiences. Journal of Travel Research, 51(1), 12–25. 28. Kotler, P., Bowen, J. T., Makens, J., & Baloglu, S. (2017). Marketing for hospitality and tourism. Pearson Education, Boston, MA. 29. Becken, S. (2013). Developing a framework for assessing resilience of tourism sub-systems to climatic factors. Annals of Tourism Research, 43, 506–528. 30. Bekele, T. M., & Weihua, Z. (2011). Towards collaborative business process management development current and future approaches. In 3rd International Conference on Communication Software and Networks (ICCSN). IEEE. 31. Bennett, N., Lemelin, R. H., Koster, R., & Budke, I. (2012). A capital assets framework for appraising and building capacity for tourism development in aboriginal protected area gateway communities. Tourism Management, 33(4), 752–766. 32. Biggs, D. (2011). Understanding resilience in a vulnerable industry: the case of reef tourism in Australia. Available at: www.ecologyandsociety.org/vol16/iss1/art30/. Accessed 30 April 2018. 33. Biggs, D., Hall, C. M., & Stoecki, N. (2012). The resilience of formal and informal tourism enterprises to disasters: reef tourism in Phuket, Thailand. Journal of Sustainable Tourism, 20(5), 645–665. 34. Boley, B. B. (2011). Sustainability in tourism and hospitality education: towards an integrated curriculum. Journal of Tourism and Hospitality Education, 23(4), 22–31. 35. Brown, N. A., Rovins, J. E., Feldmann-Jensen, S., & Orchiston, C. (2017). Exploring disaster resilience within the hotel sector: a systematic review of literature. International Journal of Disaster Risk Reduction, 22, 362–370. 36. Cai, Z., Wang, Q., & Liu, G. (2014). Modelling the natural capital investment on tourism industry using a predator-prey model. In H. Jeong, M. Obaidat, N. Yen, & J. Park (Eds.), Advances in computer science and its applications. Lecture notes in electrical engineering, vol. 279, pp. 751–756. Springer, Berlin, Heidelberg. 37. 
Calgaro, E., Lloyd, K., & Dominey-Howes, D. (2014). From vulnerability to transformation: a framework for assessing the vulnerability and resilience of tourism destinations. Journal of Sustainable Tourism, 22(3), 341–360. 38. Collins, A. (1999). Tourism development and natural capital. Annals of Tourism Research, 26(1), 98–109. 39. Espiner, S., & Becken, S. (2014). Tourist towns on the edge: conceptualising vulnerability and resilience in a protected area tourism system. Journal of Sustainable Tourism, 22(4), 646–665.



40. Murai, R., & Matsuno, F. (2018). Field experiment of feasibility for offering service by an mobile robot in hotel and airport. Journal of the Robotics Society of Japan, 36(4), 279–285. 41. Murphy, J., Gretzel, U., & Hofacker, C. (2017). Service robots in hospitality and tourism: investigating anthropomorphism. Paper presented at the 15th APacCHRIE Conference, 31 May–2 June 2017. 42. Bali, Indonesia. Retrieved from: http://heli.edu.au/wpcontent/uploads/2017/06/APacCHRIE 2017_Service-Robots_paper-200.pdf. 43. Mazanec, J. A., Wober, K., & Zins, A. H. (2007). Tourism destination competitiveness: From definition to explanation? Journal of Travel Research, 46(1), 86–95.

Part II

Big Data Management

An Approach of Feature Subset Selection Using Simulated Quantum Annealing Ashis Kumar Mandal, Mrityunjoy Panday, Aniruddha Biswas, Saptarsi Goswami, Amlan Chakrabarti, and Basabi Chakraborty

Abstract Feature selection is one of the important preprocessing steps in machine learning and data mining domain. However, finding the best feature subsets for large datasets is computationally expensive task. Meanwhile, quantum computing has emerged as a new computational model that is able to speed up many classical computationally expensive problems. Annealing-based quantum model, for example, finds the lowest energy state of an Ising model Hamiltonian, which is the formalism for Quadratic Unconstrained Binary Optimization (QUBO). Due to its capabilities in producing quality solution to the hard combinatorial optimization problems with less computational effort, quantum annealing has the potentiality in feature subset selection. Although several hard optimization problems are solved, using quantum annealing, not sufficient work has been done on quantum annealing based feature subset selection. Though the reported approaches have good theoretical foundation, they usually lack required empirical rigor. In this paper, we attempt to reduce classical benchmark feature evaluation functions like mRMR, JMI, and FCBF to QUBO formulation, enabling the use of quantum annealing based optimization to feature selection. We then apply QUBO of ten datasets using both Simulated Annealing (SA) A. K. Mandal · B. Chakraborty (B) Iwate Prefectural University, Iwate, Japan e-mail: [email protected] A. K. Mandal e-mail: [email protected] M. Panday Calcutta University, Kolkata, India e-mail: [email protected] A. Biswas · S. Goswami · A. Chakrabarti Calcutta University, Kolkata, India e-mail: [email protected] S. Goswami e-mail: [email protected] A. Chakrabarti e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1174, https://doi.org/10.1007/978-981-15-5616-6_10




and Simulated Quantum Annealing (SQA) and compared the results. Our empirical results confirm that, for seven of the ten datasets, SQA produces feature subsets that are at most equal in size to, and often smaller than, those produced by SA. SQA also yields stable feature subsets for all datasets. Keywords Quantum machine learning · QUBO · Feature selection · Filter method · Quantum annealing model

1 Introduction Feature subset selection is the process of selecting subsets of meaningful features from the original feature set. It is one of the dimensionality reduction techniques that can improve classification accuracy, improves interpretability, and reduce the training time of a machine-learning model. Over the last few decades, significant advances have been made for feature selection, which are primarily categorized as wrapper based, filter based, and embedded approaches [1]. While wrapper and embedded approaches use classifier in their processes, filter approach does not need classifier involvement. Filter approach uses various measures for feature subset evaluation other than classification accuracy. All these approaches have their respective benefits and limitations, but the underlying search techniques which require substantial computation play crucial role for feature subset selection [2]. Survey papers [3] and [4] provide a detailed review on the different aspects of feature selection approaches. During the era of big data, the increasing number of features has resulted in numerous challenges in machine learning and data mining field, such as scalability and stability challenges to feature selection algorithms. One of the possible approaches to effectively address the challenges could be exploiting quantum computing based feature selection, which is yet to be explored thoroughly. Quantum computing performs computation based on the principles of quantum mechanics such as superposition and entanglement. This computing can gain enormous processing power due to its ability in processing multiple states simultaneously, a difference from classical computer that executes one state at a time. Quantum computing, therefore, potentially provides significant speed up on solving problems, which are computationally costly for classical machines. There are two major computational models in quantum computing: annealing model and gate model. They are recently used for quantum machine, such as D-Wave’s annealing-based quantum hardware and IBM Q’s gate-based quantum hardware [5]. While theoretically one model can be converted into another with only polynomial overhead, they are specialized for different types of problem [6]. Nielsen et al. in [7] provide an extensive introduction to quantum computing and its potentialities in computational speed up. Quantum annealing has been very effective in quickly finding a good enough solution from a large potential solutions space, thus being suitable for solving combinatorial optimization problem. It generates the solution vector of the problem that can be converted into QUBO or Ising model [8]. It is, however, challenging to appropriately



map the problems into QUBO and quantum systems [9]. A kind of quantum annealing called Simulated Quantum Annealing (SQA) is an emulation of quantum annealing in a classical machine. It has recently been proven to be exponentially faster than its classical counterpart Simulated Annealing (SA) [10, 11]. Although there have been numerous combinatorial optimization problems that have been solved using quantum annealing, little attention is given to the problem of feature selection that can be executed on the quantum machine. In this paper, our aim is to develop a novel quantum-based approach to address the feature subset selection problem. Well-recognized filter-based feature subset selection methods including Minimum Redundancy Maximum Relevance (mRMR) [12], Joint Mutual Information (JMI) [13], and Fast Correlation-Based Filter (FCBF) [14] are chosen to transform them into QUBO formulation for the first time. This QUBO objective represents the Hamiltonian of traverse field Ising model. QUBOtransformed mRMR is then simulated on ten benchmark public datasets in different domain using SA and SQA solvers separately, which produce feature subsets of datasets. To test the effectiveness of SQA, we examined the selected feature subsets it produces and compared the subsets with that produced by SA.

2 Quantum Machine Learning (QML) Quantum machine learning (QML) is a comparatively new area of machine intelligence that leverages the properties of quantum mechanics such as superposition and entanglement to solve the machine-learning problem in quantum processors. As the dimension of data is increased rapidly, one promising answer is developing quantum version of classical machine learning algorithms. These algorithms are expected to provide the quantum speed up to the computationally expensive learning algorithms. For instance, machine-learning problems frequently require computation of the complex linear algebra, such as Fourier transforms, finding eigenvectors and eigenvalues, which are computationally expensive in classical computer. Theoretically, in those cases, quantum machine produces exponential speed ups [15]. This is because a Qubit circuit can perform a multiplication of an exponentially large matrix with a similarly large vector [16]. Besides, many real world problems can be combinatorial in nature, requiring computationally extensive search to get optimum or nearoptimum result. Feature subset selection falls under this category as the problem is searching for the best feature subsets. It is like finding the lowest energy state from a vast dimensional energy landscape. Adiabatic quantum computing holds this property and efficiently finds the ground state—desired global minima of problem. It is so effective for optimization problem that quantum annealing machine like D-Wave quantum annealer was developed primarily for this type of problem [17]. Although true potential of quantum machine learning is yet to be established, in future, it is promising for several application domains including designing drug and chemicals, pattern recognition, material discovery, mapping our brain circuitry, and understanding genetic makeup [15].



3 Ising Model The Ising model is a mathematical model of ferromagnetism in statistical mechanics. It was initially proposed by Ernst Ising [18]. It consists of discrete variables that represent magnetic dipole moments of atomic spins that can be in one of the two states (+1 or −1). The spins are arranged in a graph, usually a lattice, allowing each spin to interact with its neighbors. Ising model Hamiltonian (H (t)) with longitudinal and transverse fields can be shown mathematically in Eq. 1: 

H(t) = −J Σi σi^z σi+1^z − h Σi σi^z − Γ(t) Σi σi^x = H0 − Γ(t) Σi σi^x    (1)

H(t) is the Hamiltonian of the Ising model, H0 is the initial-state Hamiltonian, and σi^z and σi^x are the Pauli z and x operators on the ith particle (qubit). The time-dependent Hamiltonian of the system can be represented in terms of the Hamiltonian of the initial state and a time-dependent transverse field applied to all particles through the Pauli x operator. The longitudinal term is added to remove degeneracy. The dynamics are governed by the time-dependent Schrodinger equation. J is the interaction strength: if neighboring spins point in the same direction, J is negative; otherwise, it is positive. For non-neighboring spins, J is zero, as they do not contribute to the total energy of the system. Here, Γ(t) is the tunneling field strength, which causes tunneling across the energy landscape toward the desired ground state. This term is in effect an external magnetic field (quantum fluctuations) that behaves very much like temperature in SA. If this term dominates all others, the spins align with the external transverse field so as to reach the lowest possible energy state. At high temperature in SA, all solutions are equally probable (a disordered state), and this behavior is exactly analogous to the linear superposition of all solutions in quantum annealing [19]. By changing the temperature, we can force the system to climb over barriers by partially accepting inferior solutions; in quantum annealing, the same effect is obtained by tuning (lowering to zero) the transverse field under a specific annealing schedule. All of this motivates mapping feature subset selection optimization problems onto the Ising model and using SQA over SA as the solver. Note that the Ising Hamiltonian can also be written as a QUBO Hamiltonian if the discrete variables are represented as Boolean variables [17].
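For intuition, the classical (longitudinal) part of Eq. (1) can be evaluated directly for a small spin chain, as in the sketch below; the coupling, field, and chain length are illustrative assumptions, and brute-force enumeration is used only because the chain is tiny (annealing is what replaces it at realistic sizes).

```python
import itertools
import numpy as np

def ising_energy(spins, J=1.0, h=0.1):
    """Classical part of Eq. (1): E = -J * sum_i s_i * s_{i+1} - h * sum_i s_i."""
    s = np.asarray(spins)
    return -J * np.sum(s[:-1] * s[1:]) - h * np.sum(s)

n = 6  # illustrative chain length
configs = list(itertools.product([-1, 1], repeat=n))
energies = [ising_energy(c) for c in configs]
print("ground-state energy:", min(energies))
print("configuration      :", configs[int(np.argmin(energies))])
```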

4 Quantum Feature Selection Approaches Feature selection method with quantum annealing machine was proposed by Milne et al. in [20], where 1QBit Quantum-Ready Software Development Kit was used as intermediary layer for the assistance of solving the problem either in D-wave hardware or in simulated Ising hardware. The prime target was to select the appropriate feature



subsets for German Credit Data using quantum annealing framework. Correlationbased feature selection is mapped into QUBO form followed by minimizing the objective function using quantum solver. The first part of the QUBO objective function is the influence of each feature with target output, whereas the second one characterizes the independence of features. That is, desired feature subsets are likely to strongly correlate with target values but less correlated with one another. This QUBO feature selection produces a smaller feature subset in comparison with two classical feature selection approaches, named Recursive Feature Elimination (RFE) and RFE wrapped with Cross-Validation (RFECV). Logistic Regression was used for model building process in which the accuracy result of QUBO feature is relatively better than other two feature selection approaches. Although promising results were obtained, the QUBO feature selection approach is only employed on the specific type of dataset. In [21], Kotaro Tanahashi et al. presented mutual information based strategy for feature selection on quantum annealing machine. In this approach, QUBO formulation of Mutual Information based Feature Selection (MIFS) is developed that can be solved in Ising annealing machine like D-wave. The QUBO objective function is built in such a way that it minimizes the redundancy of features but increases the relevance and complementarity of feature. Public dataset named a1a was used for experiment. QUBO matrix of the dataset was employed on D-wave solver along with two different classical solvers. Results indicate that in terms of reducing number of features with maintaining high accuracy, D-wave outperforms other solvers, implying that MIFS can be embedded in quantum environment and solved in Ising machine. Another way to solve quantum feature selection has been proposed based on quantum circuit by He et al. [19]. In this approach, classical wrapper-based feature selection is converted into quantum gate model. Quantum Least-Squares Support Vector Machine (LSSVM) is incorporated for evaluating the feature subsets. Features are added and removed from the candidate feature set, respectively, using both forward selection and backward elimination. The modified Grover’s algorithm, which is used for the searching process, accelerates the appropriate feature selection from candidate feature subset. Although empirical results are not highlighted, complexity theory presented in the paper indicates quadratic acceleration of the quantum wrapper-based feature selection over the classical one. Quantum feature selection literature indicates that until recently there has been little effort to solve feature selection on quantum frameworks. The above-mentioned approaches are the preliminary exploration in the quantum feature selection domain.

5 Digital Versus Quantum Annealing 5.1 Simulated Annealing Simulated Annealing (SA) [22], a metaheuristic technique, optimizes the objective function based on physical annealing process that probabilistically accepts some



worse solutions to escape from the local optimum. SA starts with a random generated initial solution and iteratively improves the solution quality. Current solution is replaced by the neighboring solution when it is better than the current one. However, worse neighbor is not    always rejected but accepted with probability, which is a   f s − f (s) ), where f s is neighboring solution, f (s) is current function of exp(− T solution, and T is a parameter known as temperature. Temperature parameter plays an important role, which is initialized with high value and gradually decreased according to cooling schedule. At higher temperature, probability of acceptance of worse solution is more frequent, indicating exploration of solution space, but the worse solution is likely to be rejected frequently with the gradual reduction of temperature, indicating exploitation of the solution space. In the annealing process, it is likely that temperature with high value and slow cooling rate guides solution to reach near the lower energy ground state.
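The Metropolis-style acceptance rule and geometric cooling described above can be sketched in a few lines, as below; the random QUBO matrix, step count, and temperature settings are assumptions for illustration (the paper later mentions a geometric schedule with α = 0.9996), not the authors' exact implementation.

```python
import numpy as np

def qubo_energy(x, Q):
    return float(x @ Q @ x)

def simulated_annealing(Q, n_steps=20000, t0=1.0, alpha=0.9996, seed=0):
    """Minimise x^T Q x over binary vectors x using the SA acceptance rule."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=Q.shape[0])
    e, t = qubo_energy(x, Q), t0
    best_x, best_e = x.copy(), e
    for _ in range(n_steps):
        cand = x.copy()
        cand[rng.integers(Q.shape[0])] ^= 1             # flip one randomly chosen bit
        e_new = qubo_energy(cand, Q)
        if e_new < e or rng.random() < np.exp(-(e_new - e) / t):
            x, e = cand, e_new                          # accept better (or sometimes worse) moves
            if e < best_e:
                best_x, best_e = x.copy(), e
        t *= alpha                                      # geometric cooling schedule
    return best_x, best_e

rng = np.random.default_rng(1)
Q = rng.normal(size=(8, 8)); Q = (Q + Q.T) / 2          # small random symmetric QUBO
print(simulated_annealing(Q))
```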

5.2 Simulated Quantum Annealing (SQA) Nishimori and Kadowaki [23] introduced the theory of Quantum annealing, which is used as a key framework for combinatorial optimization problem in quantum machine. Like SA, quantum annealing converges to the lower energy state, but main difference is that quantum annealing uses quantum tunneling rather than thermal bouncing process, resulting in returning the ground state quicker than SA. SQA, however, is mainly the mapping of the quantum annealing in classical computer using Markov Chain Monte Carlo (MCMC) algorithm that samples the equilibrium thermal state of a quantum annealing Hamiltonian [10]. It is stated that the performance of quantum Monte Carlo implementation of quantum annealing and physical dynamics of quantum annealing are quite alike [23]. Therefore, using SQA, we can acquire the quantum power in the classical machine.

6 Materials and Methods In this section, we describe the proposed approach of feature subset selection. First, three filtering approaches of feature selection including mRMR, JMI, and FCBS are mapped into QUBO formulation of the objective function. This function is represented as mathematical model so that it can preserve QUBO property (i.e., Ising model). We then select ten datasets containing numerous features, and we generate QUBO matrix based on QUBO function of mRMR approach for each of the dataset. In this paper, mRMR feature selection method is used for experimental evaluation. Finally, QUBO matrices of the datasets are employed into SQA solver and annealing process generates feature subsets for the datasets. In order to evaluate the significance of quantum annealing machine, we also solve the feature subset selection problem with classical SA solver. In our experiment, both solvers are executed on PC with



Intel core i7 CPU, 8 GB RAM, Nvidia GPU GForce GTX660 and Ubuntu 18.04.2 LTS OS.

6.1 Dataset Description Datasets used in the experiment are illustrated in Table 1.

Table 1 Description of datasets

Datasets       No. of features  No. of classes  No. of instances  Description
CMC            9                3               1473              Survey data about married women; the task is to predict the contraceptive method the women are likely to use
Dermatology    34               6               366               Instances of erythemato-squamous diseases
Wisconsin      10               2               699               Breast cancer dataset; the aim is to predict whether the cancer is malignant or benign
E. coli        7                6               336               Protein localization sites
Iris           4                3               150               Classification of three species of Iris plant based on the length and width of the sepals and petals
Lung cancer    56               3               32                Three types of pathological lung cancers; it has more attributes than instances
Lymphography   18               2               148               The original dataset has four classes, but two are used as the others are considered outliers; the task is to perceive the current status of lymphoma
Vehicle        18               4               846               Features of four types of vehicles extracted from images taken at different angles
WBDC           32               2               570               Features extracted from images of breast mass; the task is to classify breast cancer as malignant or benign
Wine           13               3               178               The task is to classify three types of wine from 13 constituents

This table shows a short description of the ten datasets along with the number of features, the number of



classes, and the number of instances. All data are taken from the public data repository UCI [24]. Datasets are chosen in such a way that they are from different application domains with diverse data characteristics.

6.2 Formulating the Cost Functions as QUBO Form

The standard Ising model Hamiltonian in Eq. 1 can be written as a QUBO problem, defined as

H0(X) = Σi Qi,i Xi + Σi,j Qi,j Xi Xj = X^T Q X    (2)

where X = (X1, X2, ..., Xn) is a vector of Boolean variables and Q is a symmetric n × n QUBO matrix. In our feature subset selection problem, we consider this vector X as a feature vector, with 0 indicating that the specific feature has been selected and 1 signifying that it has been discarded. The first term in Eq. 2 (i.e., before the addition sign) occupies the diagonal of the QUBO matrix and represents the feature vector itself, while the second term captures the relationship strength between features. The aim is to find the feature vectors (i.e., feature subsets) that minimize the QUBO objective function. In the following step, we convert the classical filter-based feature selection approaches into QUBO objective functions. The following notation is used: MI = Mutual Information, D = diagonal matrix (feature-class relevance), CMI = Conditional Mutual Information, Xk = feature subset, Y = outcome variable (class labels), |S| = cardinality of the feature set, Xj ∈ S − Xk, and X^T and X are the Boolean row vector and column vector, respectively.

mRMR objective function. The classical objective of mRMR [10] is as follows:

J_mRMR(Xk) = MI(Xk; Y) − (1/|S|) Σ Xj∈S MI(Xk; Xj)    (3)

This has a relevance term and a redundancy term. We observe that the summation can be represented as matrix multiplication and the relevance vector as a diagonal matrix, which transforms the first term to D inside the QUBO form. The second term is already in matrix form, giving us

X^T [ D − (1/|S|) MI ] X    (4)

JMI objective function. Extending the mRMR QUBO formulation, we add the complementarity term. Considering the classical JMI [11] formulation

J_JMI(Xk) = MI(Xk; Y) − (1/|S|) Σ Xj∈S MI(Xk; Xj) + (1/|S|) Σ Xj∈S MI(Xk; Xj | Y)    (5)

this can be formulated as

X^T [ D − (1/|S|) MI + (1/|S|) CMI ] X    (6)

FCBF objective function. FCBF [12] in essence breaks the mRMR objective into two stages of optimization, first optimizing the relevance and then the redundancy.
Step 1 (objective: maximize relevance):

X^T [ D ] X    (7)

Step 2 (objective: minimize redundancy):

X^T [ (1/|S|) MI ] X    (8)

Conditional likelihood maximization. The above objectives can be considered special cases of the conditional likelihood maximization objective [9] below, where for mRMR β = 1/|S|, γ = 0, and for JMI β = 1/|S|, γ = 1/|S|:

J(Xk) = MI(Xk; Y) − β Σ MI(Xk; Xj) + γ Σ MI(Xk; Xi | Y)    (9)

(9)

We formulate mRMR, JMI, and FCBF using MI. Our preliminary study indicates that mRMR formulation with MI provides stronger convergence of SQA. Therefore, in this paper, we highlight the experiments where mRMR objective function with MI is used as a QUBO form. We measured the QUBO matrix for all ten datasets that are later used as input for annealing systems. Figure 1 shows the QUBO matrix for E. coli Dataset, containing seven features F1 to F7. Here diagonal values are linear coefficients and others are quadratic coefficients for the QUBO equation.
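As a rough sketch of how such a QUBO matrix can be assembled in the spirit of the mRMR form in Eq. (4), the snippet below places the negated feature-class relevance on the diagonal and the scaled pairwise redundancy off the diagonal. Iris is used only as a small stand-in for the benchmark datasets, the features are binned so pairwise mutual information can be estimated, and |S| is approximated by the total number of features; this is not the authors' exact implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)
Xd = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile").fit_transform(X)

n = Xd.shape[1]
relevance = mutual_info_classif(Xd, y, discrete_features=True, random_state=0)  # MI(X_k; Y)

Q = np.zeros((n, n))
for i in range(n):
    Q[i, i] = -relevance[i]                          # relevance term on the diagonal
    for j in range(i + 1, n):
        mi = mutual_info_score(Xd[:, i], Xd[:, j])   # redundancy MI(X_i; X_j)
        Q[i, j] = Q[j, i] = mi / (2 * n)             # split the 1/|S| weight over both triangles
print(np.round(Q, 3))
```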

6.3 Feature Subset Generation Using SA and SQA We used Sqaod solver as SQA. This solver is specially designed to address Ising problems in classical CPU and CUDA (Nvidia GPU) [25]. The solver takes QUBO


F1 F2 F3 F4 F5 F6 F7


F1 -0.311 0.038 0.000 0.000 0.011 0.044 0.018

F2 0.038 -0.339 0.000 0.000 0.007 0.032 0.010

F3 0.000 0.000 -0.149 0.000 0.000 0.000 0.000

F4 0.000 0.000 0.000 0.003 0.000 0.000 0.000

F5 0.011 0.007 0.000 0.000 0.209 0.021 0.024

F6 0.044 0.032 0.000 0.000 0.021 0.113 0.128

F7 0.018 0.010 0.000 0.000 0.024 0.128 0.438

Fig. 1 QUBO matrix of E. coli dataset

feature matrix of a dataset as an input and anneals the system until a minimum energy state is obtained. This ground state indicates the desired feature subset. We set the sample time 100 in order to reduce the probability of sub-optimal solution from sampled solutions if they exist. Note that in quantum annealing the solutions are probabilistic and may vary with different run. Apart from Sqaod tools, we also used classical solver in comparison. Initially, we programmed SA for that purpose. SA starts with randomly selected feature subset of a particular dataset and iteratively minimizes the QUBO objective function until significant quality solutions are obtained. The main parameter Temperature is updated at each step using temperature scheduling Tn+1 = αTn , where α is set as 0.9996. Our SA annealing was run 100 times separately, and each time generated feature subsets were recorded.

7 Experimental Result and Discussion Our experiments for finding feature subsets for datasets were run on SA and on SQA. Table 2 summarizes the performance of both approaches in terms of generating feature subset solution. As stating the original features’ name for an individual subset is difficult for representation in a table, we highlight the selected feature of a subset by its position value. For example, subset value {F3, F6} indicates that third and sixth features of original dataset are only selected in the subset. It is apparent from the table that SA has produced more than one feature subsets for most of the datasets, whereas SQA has resulted in only one subset for all the datasets. Due to multiple solutions of SA, we count the number of times feature subsets of a dataset is obtained. In this table, we only highlight the top three frequent subsets for each dataset. It is also observed from the table that the number of feature selected by SQA is less than the best occurrence subset produced by SA in most cases. Figure 2 describes the distribution of the number of feature generation of ten datasets after executing SA and SQA methods. In Fig. 2, it can be seen that frequently (seven out of ten cases), the number of selected features in the subset produced by



Table 2 Experimental results that compares classical SA with SQA for feature subset selection Datasets

CMC

Dermatology

Feature selection by classical SA

Feature selection by

(Top three feature subsets)

SQA

Feature subset with position of feature (Fi = ith feature of the dataset)

Subset size

Occurrence Of Feature Subset

Feature name

Subset size

F3, F6

2

30

F2–F3, F5–F6

4

F2–F3, F6

3

22

F2, F6

2

9 12

F1–F2, F4, F7, 13 F11, F15–18, F28, F30–F32

48

F1, F4, F7, F11

F1, F4, F7, F11, F15–F18, F28, F30–F32

35

F15–F18, F28, F30–F32

12

F1, F4, F7, F9, 13 F11, F15–18, F28, F30–F32

5

Wisconsin

F1–F8

8

100

F1–F3, F5–F7

6

E. coli

F1–F4

4

54

F1–F4

4

F1–F3

3

46

Iris

F3–F4

2

100

F4

1

Lung cancer

F4–F6, F12, F19–F21, F26, F31, F37–F40, F42–F56

28

15

30

F4–F6, F19–F21, F26, F31, F37–F40, F42–F56

27

14

F1, F3–F5, F11, F18–F20, F25, F30–F31, F36–F39, F41–F55

F2, F4–F6, F19–F21, F26, F31, F37–F40, F42–F56

28

6

F1, F4, F5–F10, F12, F14–F18

14

66

F1, F4–F10, F14–F18

13

F1, F4, F6–F10, F12, F14–F18

13

10

F1, F3–F10, F12, F14–F18

15

9

F6, F8, F15, F18

4

53

F6, F8, F18

3

F6, F8, F18

3

31

F8, F15, F18

3

10

Lymphography

Vehicle

(continued)



Table 2 (continued) Datasets

WBDC

Wine

Feature selection by classical SA

Feature selection by

(Top three feature subsets)

SQA

Feature subset with position of feature (Fi = ith feature of the dataset)

Subset size

Occurrence Of Feature Subset

Feature name

Subset size

F1–F4, F6–F8, F11, F13–F14, F16, F18, F21–F29

21

3

F1, F3–F4, F6–F8, F11, F13–F14,

17

F1–F4, F6–F8, F11, F13–F14, F16–F18, F21–F29

22

3

F18, F21–F28,

F1–F4, F6–F8, F11, F13–F14, F17–F18 F21–F29,

22

1

F4, F8

89

2

F4

11

1

F4

1

Fig. 2 Statistical box plot of the experimental results for SA (left columns) and SQA (right columns) for each dataset



SQA is comparatively lower than the median value of subsets obtained by SA. It is noted that, in the case of SQA, the box plot is always a straight line. This is because SQA generates only one subset sample, also the evidence of capabilities of generating lower feature subset of SQA over SA.

8 Conclusion Usually, numerous search strategies have been utilized for finding quality feature subsets. However, this task can be solved in different ways using quantum annealing concept, which can provide computationally effective quality solutions. Literature study reveals that attempts to solve feature selection using quantum annealing are not adequately explored. In this paper, we have formulated three important and popular filter-based feature selection objectives such as mRMR, JMI, and FCBF to QUBO, and then solved mRMR MI objective with QUBO matrices of ten datasets on SA and SQA. The experimental results indicate that, for all the datasets, SQA is able to produce feature subsets, which are stable, computationally less expensive, and have the ability to converge toward global optimum in comparison with SA. In the future, we will experimentally evaluate other QUBO formulation we have derived. The main contribution of this paper is to map multiple feature selection objectives to QUBO and empirically compare SA and SQA over ten publically available datasets. Acknowledgements This is to acknowledge that the present work is supported by DST-JSPS (Indo-Japan) Bilateral Open Partnership Joint Research Project.

References 1. Kumar, V., & Minz, S. (2014). Feature selection. SmartCR, 4, 211–229. 2. Goswami, S., Chakrabarti, A., & Chakraborty, B. (2018). An empirical study of feature selection for classification using genetic algorithm. International Journal of Advanced Intelligence Paradigms, 10, 305–326. 3. Tang, J., Alelyani, S. & Liu, H. (2014). Feature selection for classification: A review. Data Classification: Algorithms and Applications, 37. 4. Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., et al. (2017). Feature selection: A data perspective. ACM Computing Surveys, 50, 1–45. 5. Mueller, F., Dreher, P., & Byrd, G. (2019). Programming quantum computers: a primer with IBM Q and D-Wave exercises. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, p. 451. 6. Yu, H., Huang, Y. & Wu, B. (2018). Exact equivalence between quantum adiabatic algorithm and quantum circuit algorithm. Chinese Physics Letters, 35, 110303. 7. Nielsen, M. A., & Chuang, I. (2002). Quantum computation and quantum information. American Journal of Physics, 70, 558–559. 8. Havlíˇcek, V., Córcoles, A. D., Temme, K., Harrow, A. W., Kandala, A., Chow, J. M., et al. (2019). Supervised learning with quantum-enhanced feature spaces. Nature, 567, pp. 209–212.



9. Ushijima-Mwesigwa, H., Negre, C. F. A. & Mniszewski, S. M. (2017). Graph partitioning using quantum annealing on the D-wave system. In: Proceedings of the Second International Workshop on Post Moores Era Supercomputing. Denver, CO, USA. 10. Crosson, E. & Harrow, A. W.(2016). Simulated quantum annealing can be exponentially faster than classical simulated annealing. In: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 714–723. 11. Brown, G., Pocock, A., Zhao, M.-J., & Luján, M. (2012). Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research, 13, 27–66. 12. H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information: criteria of maxdependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 1226–1238, 2005. 13. Yang, H. & Moody, J. (1999). Feature selection based on joint mutual information. In: Proceedings of International ICSC Symposium on Advances in Intelligent Data Analysis, pp. 22–25. 14. Yu, L. & Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In: Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 856–863. 15. Ciliberto, C., Herbster, M., Ialongo, A. D., Pontil, M., Rocchetto, A., Severini, S., et al. (2018). Quantum machine learning: A classical perspective. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474, 20170551. 16. Biamonte, J., Wittek, P., Pancotti, N., Rebentrost, P., Wiebe, N., & Lloyd, S. (2017). Quantum machine learning. Nature, 549, 195. 17. Neukart, F., Compostella, G., Seidel, C., von Dollen, D., Yarkoni, S. & Parney, B. (2017). Traffic flow optimization using a quantum annealer. Frontiers in ICT, 4. 18. Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31, 253–258. 19. He, Z., Li, L., Huang, Z., & Situ, H. (2018). Quantum-enhanced feature selection with forward selection and backward elimination. Quantum Information Processing, 17, 154. 20. Milne, A., Rounds, M., & Goddard, P. (2018). Optimal feature selection using a quantum annealer. In: High-Performance Computing in Finance, (Ed): Chapman and Hall/CRC, pp. 561– 588. 21. Tanahashi, K., Takayanagi, S., Motohashi, T., & Tanaka, S. (2018, 2019). Global mutual information based feature selection by quantum annealing. https://www.dwavesys.com/sites/def ault/files/qubits2018_mifs_ver2.pdf. 22. Van Laarhoven, P. J., & Aarts, E. H. (1987). Simulated annealing. In: Simulated Annealing: Theory and Applications, (Ed). Springer, pp. 7–15. 23. Kadowaki, T., & Nishimori, H. (). Quantum annealing in the transverse Ising model. Physical Review E, 58, 5355–5363. 24. Dua, D., Casey, G. (2017, 2019). UCI machinle learning repository. https://archive.ics.uci.edu/ ml/index.php. 25. Shinmorino, S. M. (April 15 2019). Solvers/annealers for simulated quantum annealing on CPU and CUDA(NVIDIA GPU). https://github.com/shinmorino/sqaod.

A Novel Framework for Data Acquisition and Retrieval Using Hierarchical Schema Over Structured Big Data

Neepa Shah

Abstract Big data systems typically involve many parameters in the retrieval process. It is well recognized that these parameters have defined hierarchies, such as "monthly–quarterly–half-yearly–yearly" for the time parameter. The standard way to model and manage such data is the star/snowflake schema approach; however, that approach does not exploit the implicit hierarchies of the data. The fact table of a star schema grows very large, and applying "join" operations over it becomes a complex process. This paper proposes the Summary model, which is based on a combination of hierarchical and relational schemas. The hierarchical schema gives a standard storage form to the dimensions (hierarchies) of the star schema and to pre-calculated data derived from their implicit hierarchies. The model also permits structural heterogeneity of dimension hierarchies. Sophisticated queries such as ROLLUP, DRILLDOWN, SLICE, and DICE, involving hierarchies and pre-calculations on the implicit hierarchies of star schema data, are supported in this model. The use of set-theoretic operations speeds up query processing compared with a complex sequence of "join" operations. These operations are given as algorithms with their complexity expressed in big O notation.

Keywords Complex query analysis · Data analysis · Hierarchical schema · Metadata structure · OLAP systems · Pre-calculated data · Relational schema · Sequence key

1 Introduction

The retrieval process of data depends on the logical and physical views of the data. The logical view gives guidelines about the structure of the data and its relationships with other data in the database [1]. The physical view shows the internal representation of the data, that is, the way the data has been stored on secondary storage. Knowledge about metadata and the internal storage helps us to retrieve data from the storage media [2–4].


This paper is about a model that uses a hierarchical storage system, because a decision support system (DSS) works on large, voluminous data. This data implicitly has a hierarchical form from which summaries are generated for decision-making [5, 6]. The model stores pre-calculated data that reduces the time needed to prepare query results for processes such as slicing and dicing and for preparing summaries over multiple dimensions as part of a DSS for voluminous data [7, 8]. The model merges qualities of the relational and hierarchical schema models to handle very large data and to speed up the summarization of data with many dimensions. The model has been named the "Summary model". The Summary model retains most of the advantages of obtaining summarized data that the star/snowflake schema model offers [9]. The combination of hierarchical and relational schema approaches in this model stays close to real-world examples and their relationships, and it handles large data summaries consistently. The model is based on set theory and relational theory to achieve a high degree of independence [10].

Daily transactions of an OLTP system generate a large number of rows (tuples). This large volume of data, generated through heterogeneous sources, is stored in the Summary model using a relational schema. However, the system that serves as the decision support system (DSS) in the organization uses summarized or pre-calculated data, which is stored in a hierarchical schema [6, 11, 12]. The relational schema is efficient for the OLTP system: changes arising from transactions are easily applied to the raw data (data which is not finely finished) of the relational schema [13]. Summarized entities are hierarchical entities; they consist of finished data generated by applying various functions to the raw data. Hierarchical entities are made up of many levels, each of which may have further levels beneath it. These hierarchical entities are refreshed periodically through offline changes to the summarized data based on changes made to the raw data (an illustrative sketch of this division of labour is given after the list of contributions below).

Using these two different schemas, this paper makes the following contributions:

• Hierarchical schema design for summarized data for efficient retrieval—the retrieval process depends on the storage system, so logical and physical schema designs are proposed here.
• Retrieval of summarized data as well as management of raw transactional data down to a finer level of data granularity.
• Semantics of the relational-hierarchical retrieval request, with algorithms mainly for the hierarchical schema.
• Advantages over the existing star/snowflake schema model for data warehouse systems.
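To make this division between the relational and hierarchical parts concrete, the short Python sketch below shows one possible offline refresh step: raw OLTP rows are kept in a flat relational form, while pre-calculated totals are derived for each level of the implicit time hierarchy. The field names, hierarchy levels, and aggregation function are hypothetical illustrations for this sketch, not part of the proposed model itself.

```python
from collections import defaultdict
from datetime import date

# Raw OLTP rows as they would sit in the relational schema
# (hypothetical fields: transaction date and sales amount).
raw_transactions = [
    {"day": date(2020, 1, 15), "amount": 120.0},
    {"day": date(2020, 2, 3),  "amount": 80.0},
    {"day": date(2020, 7, 9),  "amount": 200.0},
]

def refresh_summary(rows):
    """Offline refresh: derive pre-calculated totals for every level of the
    implicit time hierarchy (daily -> monthly -> quarterly -> half-yearly -> yearly)."""
    levels = ("daily", "monthly", "quarterly", "half_yearly", "yearly")
    summary = {level: defaultdict(float) for level in levels}
    for row in rows:
        d, amt = row["day"], row["amount"]
        summary["daily"][d.isoformat()] += amt
        summary["monthly"][(d.year, d.month)] += amt
        summary["quarterly"][(d.year, (d.month - 1) // 3 + 1)] += amt
        summary["half_yearly"][(d.year, 1 if d.month <= 6 else 2)] += amt
        summary["yearly"][d.year] += amt
    return summary

# The DSS reads only the hierarchical summary, never the raw relational rows.
summary_entity = refresh_summary(raw_transactions)
print(summary_entity["quarterly"][(2020, 1)])   # 200.0 (January + February transactions)
print(summary_entity["yearly"][2020])           # 400.0
```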

2 Survey of Literature

There are many approaches to data management; some of them have been very successful. The relational schema model for the database management system (DBMS) is successful for online transaction processing (OLTP) systems [13]. Other models, such as the network schema model and the hierarchical schema model, have been less successful, although they are indeed very effective for specific requirements: one-to-many relationship management through the hierarchical schema model and many-to-many relationship management through the network schema model. Similarly, when data becomes very large, the OLTP (relational schema) model is less efficient for fast retrieval. Voluminous data of terabyte or petabyte size needs special management, unlike what the relational schema provides. Most database software packages have their own individual models for managing such huge structured data [14, 15].

The conventional database approach for structured data, the star/snowflake schema model, is well known for data warehouse and very large database systems [9, 16]. This model offers various utilities for managing large databases, including the conversion of relational data into a non-relational schema. In this model, obtaining pre-calculated data along with the implicit hierarchy stated in the dimension entity of the star schema is less effective for one reason or another [17, 18]. Many other models have also been developed to manage calculated data for the same purpose. One of them is SQL (H), which provides an extension to the retrieval language [19]; it is defined for structural and schematic heterogeneity in dimension hierarchies, where storage is a relational table. Another such model is TAMP, based on Hadoop and HadoopDB with the use of MapReduce [20]; it uses distributed processing over multiple nodes, and the dimensional hierarchy is implemented by using an indexing method.

3 Introduction to Summary Model

The Summary model is mainly concerned with the storage and retrieval of large database content. As discussed earlier, a large amount of data collected from heterogeneous sources is stored in a conventional relational database system. This data is not prepared for a DSS, which requires a higher level of granularity over voluminous data. The Summary model is an approach that provides data with higher granularity using a hierarchical schema model. Data is stored in summary entities. A summary entity has summary attributes that are generated by pre-defined functions, also known as pre-calculated data. Let us first understand a few terms and definitions of the Summary model.

Pre-calculated data: In a conventional system, when data is collected from heterogeneous sources into the dimension objects of a star schema [9, 17], there exist implicit hierarchies within that data. Analysis of the data is performed by applying functions with reference to the levels of these hierarchies. For example, a hierarchy related to time carries the timestamp of the transaction [21]. The timestamp hierarchy keeps summary information ranging from the date with hours and minutes of the transactions up to the year of the transactions. This gives the levels of the time hierarchy: daily transactions level, weekly transactions level, monthly transactions level, quarterly transactions level, half-yearly transactions level, and yearly transactions level. This is the way summarized data is generated, where the top level is the "Year level" and the leaf level is the "Daily level". Applying mathematical functions to the values of the data to produce higher granularity with respect to these hierarchical levels is what yields pre-calculated data.

To understand the relationship between pre-calculated data and the hierarchical summary entity, let us first understand the summary entity itself. Here is a definition of a summary entity.

Summary Entity: Let S be a summary entity such that

S(s) ⊆ D1 × D2 × D3 × ⋯ × Dn

One segment (level) of a summary entity is represented as S(A1, A2, A3, …, An), where A1, …, An are the attributes of a segment of the summary S and each attribute belongs to a domain Di, that is, Ai ∈ Di for 1 ≤ i ≤ n.
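The following minimal Python sketch illustrates the set-theoretic reading of this definition, under the assumption that each segment corresponds to one hierarchy level and that attribute values are checked against their domains Di. The attribute names and values are hypothetical, and the set comprehension at the end merely hints at how a slice over a segment can be answered without a join.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    """One level of a summary entity S: a set of tuples (A1, ..., An),
    where each attribute value must belong to its declared domain Di."""
    level: str       # e.g. "monthly"
    domains: dict    # attribute name -> allowed type, standing in for Di
    tuples: set      # set of attribute tuples, per the set-theoretic view

    def validate(self) -> bool:
        # Check Ai in Di for every attribute of every tuple.
        names = list(self.domains)
        return all(
            isinstance(value, self.domains[name])
            for tup in self.tuples
            for name, value in zip(names, tup)
        )

# Hypothetical "monthly" segment with attributes (year, month, total_sales).
monthly = Segment(
    level="monthly",
    domains={"year": int, "month": int, "total_sales": float},
    tuples={(2020, 1, 120.0), (2020, 2, 80.0), (2020, 7, 200.0)},
)
assert monthly.validate()

# A slice over the segment is a plain set comprehension, not a join:
q1_2020 = {t for t in monthly.tuples if t[0] == 2020 and t[1] <= 3}
print(sum(t[2] for t in q1_2020))   # 200.0
```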