Data Management, Analytics and Innovation: Proceedings of ICDMAI 2020, Volume 2 [1st ed.] 9789811556180, 9789811556197

This book presents the latest findings in the areas of data management and smart computing, big data management, artific

359 15 20MB

English Pages XII, 462 [454] Year 2021

Report DMCA / Copyright

DOWNLOAD PDF FILE

Table of contents :
Front Matter ....Pages i-xii
Front Matter ....Pages 1-1
Non-parametric Distance—A New Class Separability Measure (Sayoni Roychowdhury, Aditya Basak, Saptarsi Goswami)....Pages 3-19
Digital Transformation of Information Resource Center of an Enterprise Using Analytical Services (Preeti Prasad Ramdasi, Aniket Kolee, Neha Sharma)....Pages 21-34
Prediction of Stock Market Performance Based on Financial News Articles and Their Classification (Matthias Eck, Julian Germani, Neha Sharma, Jürgen Seitz, Preeti Prasad Ramdasi)....Pages 35-44
Front Matter ....Pages 45-45
Automatic Standardization of Data Based on Machine Learning and Natural Language Processing (Ananya Banerjee, Kavya SreeYella, Joy Mustafi)....Pages 47-54
Analysis of GHI Forecasting Using Seasonal ARIMA (Aditya Kumar Barik, Sourav Malakar, Saptarsi Goswami, Bhaswati Ganguli, Sugata Sen Roy, Amlan Chakrabarti)....Pages 55-69
Front Matter ....Pages 71-71
Scoring Algorithm Identifying Anomalous Behavior in Enterprise Network (Sonam Sharma, Garima Makkar)....Pages 73-81
Application of Bayesian Automated Hyperparameter Tuning on Classifiers Predicting Customer Retention in Banking Industry (Akash Sampurnanand Pandey, K. K. Shukla)....Pages 83-100
Quantum Machine Learning: A Review and Current Status (Nimish Mishra, Manik Kapil, Hemant Rakesh, Amit Anand, Nilima Mishra, Aakash Warke et al.)....Pages 101-145
Survey of Transfer Learning and a Case Study of Emotion Recognition Using Inductive Approach (Abhinand Poosarala, R Jayashree)....Pages 147-161
An Efficient Algorithm for Complete Linkage Clustering with a Merging Threshold (Payel Banerjee, Amlan Chakrabarti, Tapas Kumar Ballabh)....Pages 163-178
Analytics for In Silico Development of Inhibitors from Neem (Azadirachta Indica) Against Pantothenate Synthetase of Mycobacterium Tuberculosis (Saurov Mahanta, Bhaskarjyoti Gogoi, Bhaben Tanti)....Pages 179-200
A Machine Learning Model for Forecasting Wind Disasters for Farmers (M. Iyyanar, M. Usha, C. Birundha, M. Anbuselvi, V. Haritha)....Pages 201-209
Prediction of Most Valuable Bowlers of Indian Premier League (IPL) (Arnab Santra, Aniket Mitra, Abhirup Sinha, Amit Kumar Das)....Pages 211-223
Categorizing Text Documents Using Naïve Bayes, SVM and Logistic Regression (Shubham Kumar, Anmol Gulati, Rachna Jain, Preeti Nagrath, Nitika Sharma)....Pages 225-235
Machine Learning to Diagnose Common Diseases Based on Symptoms (Suvasree S. Biswal, T. Amarnath, Prasanta K. Panigrahi, Nrusingh C. Biswal)....Pages 237-245
Efficacy of Oversampling Over Machine Learning Algorithms in Case of Sentiment Analysis (Deb Prakash Chatterjee, Sabyasachi Mukhopadhyay, Saptarsi Goswami, Prasanta K. Panigrahi)....Pages 247-260
Study and Detection of Fake News: P2C2-Based Machine Learning Approach (Pawan Kumar Verma, Prateek Agrawal)....Pages 261-278
Demand Forecasting Framework for Optimum Resource Planning (Sreeja Ashok, N. P. Amal Das, Kanu Aravind)....Pages 279-292
Determination of Ozone Density Applying Artificial Intelligence (Saumen Ghosh, Himadri Bhattacharyya Chakrabarty, Moheeyoshee Moitra)....Pages 293-299
Development of LSTM Neural Network for Predicting Very Short Term Load of Smart Grid to Participate in Demand Response Program (Surekha Deshmukh, Jayashri Satre, Miss Suruchi Sinha, Dattatray Doke)....Pages 301-311
Answering Predictive Questions in Natural Language Based on Given Data for Forecasting (U. Abhinand Varma, D. V. Rakesh Reddy, Prasanth Paraselli, Joy Mustafi)....Pages 313-327
Detecting Duplicate Bi-Lingual Mash-up Question Pairs Using Siamese Multi-layer Perceptron Network (Seema Rani, Avadhesh Kumar, Naresh Kumar)....Pages 329-338
Glowing Window-Based Feature Extraction Technique for Object Detection (Shalu Gupta, Y. Jayanta Singh)....Pages 339-351
Improving Siamese Networks for One-Shot Learning Using Kernel-Based Activation Functions (Shruti Jadon, Aditya Arcot Srinivasan)....Pages 353-367
Prediction of Mental Disorder Using Artificial Neural Network and Psychometric Analysis (D. D. Sapkal, Chintan Mehta, Mohit Nimgaonkar, Rohan Devasthale, Shreyas Phansalkar)....Pages 369-377
Design Thinking for Class Imbalance Problems Using Compound Techniques (Rajneesh Tiwari, Aritra Sen, Kaushik Dey)....Pages 379-400
A Hybrid Graph Centrality Based Feature Selection Approach for Supervised Learning (Abhirup Banerjee, Saptarsi Goswami, Amit Kumar Das)....Pages 401-419
Modeling Earthquake Damage Grade Level Prediction Using Machine Learning and Deep Learning Techniques (Sai Dhanuj Nukala, Vipul Kumar Mishra, Gopala Krishna Murthy Nookala)....Pages 421-433
Comparative Study of Machine Learning Models to Classify Gene Variants of ClinVar (V. Venkata Durga Kiran, Sasumana Vinay Kumar, Suresh B. Mudunuri, Gopala Krishna Murthy Nookala)....Pages 435-443
Bio-Molecular Event Extraction Using Classifier Ensemble-of-Ensemble Technique (Manish Bali, P. V. R. Murthy)....Pages 445-462
Recommend Papers

Data Management, Analytics and Innovation: Proceedings of ICDMAI 2020, Volume 2 [1st ed.]
 9789811556180, 9789811556197

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

Advances in Intelligent Systems and Computing 1175

Neha Sharma Amlan Chakrabarti Valentina Emilia Balas Jan Martinovic   Editors

Data Management, Analytics and Innovation Proceedings of ICDMAI 2020, Volume 2

Advances in Intelligent Systems and Computing Volume 1175

Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Nikhil R. Pal, Indian Statistical Institute, Kolkata, India Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba Emilio S. Corchado, University of Salamanca, Salamanca, Spain Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil Ngoc Thanh Nguyen , Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/11156

Neha Sharma Amlan Chakrabarti Valentina Emilia Balas Jan Martinovic •





Editors

Data Management, Analytics and Innovation Proceedings of ICDMAI 2020, Volume 2

123

Editors Neha Sharma Society for Data Science Pune, Maharashtra, India Valentina Emilia Balas Department of Automatics and Applied Software, Faculty of Engineering University of Arad Arad, Romania

Amlan Chakrabarti A.K. Choudhury School of Information Technology University of Calcutta Kolkata, West Bengal, India Jan Martinovic IT4Innovations VSB-Technical University of Ostrava Ostrava, Czech Republic

ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-981-15-5618-0 ISBN 978-981-15-5619-7 (eBook) https://doi.org/10.1007/978-981-15-5619-7 © Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

These two volumes constitute the proceedings of the International Conference on Data Management, Analytics and Innovation (ICDMAI 2020) held during 17–19 January 2020 at United Service Institution of India, New Delhi. ICDMAI is a signature conference of Society for Data Science (S4DS) which is a non-profit professional association established to create a collaborative platform for bringing together technical experts across Industry, Academia, Government Labs and Professional Bodies to promote innovation around Data Science. ICDMAI is committed to creating a forum which brings data science enthusiasts to the same page and envisions its role towards its enhancement through collaboration, innovative methodologies and connections throughout the globe. Planning towards ICDMAI 2020 started around 14 months back, and the entire core team ensured that we surpass our own benchmark. The core committee had taken utmost care in each and every facet of the conference, especially regarding the quality of the submissions. Out of 514 papers submitted to ICDMAI 2020, only 12% (62 papers) were selected for an oral presentation after a rigorous review process. This year, the conference witnessed participants from 8 countries, 25 Industries and 80 International and Indian Universities (IITs, NITs, IISER, etc.). Besides paper presentations, the conference also showcased workshops, tutorial talks, keynote sessions and plenary talks by experts of the respective field. We appreciate the bonhomie and support extended by IBM, Wizertech, Springer since, Ericsson, NIELIT-Kolkata, IISER-Kolkata and REDX-Kolkata. The volumes cover a broad spectrum of Data Science and all the relevant disciplines. The conference papers included in these proceedings are published post conference and are grouped into the four areas of research such as Data Management and Smart Informatics, Big Data Management, Artificial Intelligence and Data Analytics, and Advances in Network Technologies. All the four tracks of the conference were very relevant to the current technological advancements and received Best Paper Award in each track. Very stringent selection process was adopted for paper selection, from plagiarism check to technical chairs' review to double-blind review, every step was religiously followed. We compliment all the authors for submitting high quality to ICDMAI 2020. The editors would like to acknowledge all v

vi

Preface

the authors for their contributions and also the efforts taken by reviewers and session chairs of the conference, without whom it would have been difficult to select these papers. We appreciate the unconditional support from the members of the National and International Program Committee. It was really interesting to hear the participants of the conference highlight the new areas and the resulting challenges as well as opportunities. This conference has served as a vehicle for a spirited debate and discussion on many challenges that the world faces today. Our heartfelt thanks to our General Chairs, Dr. P. K. Sinha, Vice Chancellor and Director, IIIT, Naya Raipur, India, and Prof. Vincenzo Piuri, Professor, Università degli Studi di Milano, Italy. We are grateful to other eminent personalities who were present at ICDMAI 2020 like Alfred Bruckstein, Technion—Israel Institute of Technology; C. Mohan, IBM Fellow, IBM Almaden Research Center in Silicon Valley; Dinanath Kholkar, Tata Consultancy Services; Anupam Basu, National Institute of Technology, Durgapur; Biswajit Patra, Intel; Lipika Dey, Tata Consultancy Services, New Delhi; Sangeet Saha, University of Essex, UK; Aninda Bose, Springer India Pvt. Ltd.; Kranti Athalye, IBM India University Relations; Mrityunjoy Pandey, Cognizant; Amit Agarwal and Ishant Wankhede, Abzooba; Kaushik Dey, Ericsson; Prof. Sugata Sen Roy, University of Calcutta; Amol Dhondse, IBM Master Innovator; Anindita Bandyopadhyay, KPMG; Kuldeep Singh, ODSC, Delhi Chapter; Rita Bruckstein, Technion, Israel; Sonal Kukreja, TenbyTen and many more who were associated with ICDMAI 2020. The conference of this magnitude was possible due to the consistent and concerted efforts of many good souls. We acknowledge the contribution of our advisory body members, technical programme committee, people from industry and academia, reviewers, session chairs, media and authors, who have been instrumental in making this conference possible. Our special thanks go to Janus Kacprzyk (Editor in Chief, Springer, Advances in Intelligent Systems and Computing Series) for the opportunity to organize this guest-edited volume. We are grateful to Springer, especially to Mr. Aninda Bose (Senior Publishing Editor, Springer India Pvt. Ltd.), for the excellent collaboration, patience and help during the evolvement of this volume. We are confident that the volumes will provide state-of-the-art information to professors, researchers, practitioners and graduate students in the area of data management, analytics and innovation, and all will find this collection of papers inspiring and useful. Pune, India West Bengal, India Arad, Romania Ostrava, Czech Republic

Neha Sharma Amlan Chakrabarti Valentina Emilia Balas Jan Martinovic

Contents

Data Management and Smart Informatics Non-parametric Distance—A New Class Separability Measure . . . . . . . Sayoni Roychowdhury, Aditya Basak, and Saptarsi Goswami Digital Transformation of Information Resource Center of an Enterprise Using Analytical Services . . . . . . . . . . . . . . . . . . . . . . . Preeti Prasad Ramdasi, Aniket Kolee, and Neha Sharma Prediction of Stock Market Performance Based on Financial News Articles and Their Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthias Eck, Julian Germani, Neha Sharma, Jürgen Seitz, and Preeti Prasad Ramdasi

3

21

35

Big Data Management Automatic Standardization of Data Based on Machine Learning and Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ananya Banerjee, Kavya SreeYella, and Joy Mustafi Analysis of GHI Forecasting Using Seasonal ARIMA . . . . . . . . . . . . . . Aditya Kumar Barik, Sourav Malakar, Saptarsi Goswami, Bhaswati Ganguli, Sugata Sen Roy, and Amlan Chakrabarti

47 55

Artificial Intelligence and Data Analysis Scoring Algorithm Identifying Anomalous Behavior in Enterprise Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sonam Sharma and Garima Makkar

73

Application of Bayesian Automated Hyperparameter Tuning on Classifiers Predicting Customer Retention in Banking Industry . . . . Akash Sampurnanand Pandey and K. K. Shukla

83

vii

viii

Contents

Quantum Machine Learning: A Review and Current Status . . . . . . . . . 101 Nimish Mishra, Manik Kapil, Hemant Rakesh, Amit Anand, Nilima Mishra, Aakash Warke, Soumya Sarkar, Sanchayan Dutta, Sabhyata Gupta, Aditya Prasad Dash, Rakshit Gharat, Yagnik Chatterjee, Shuvarati Roy, Shivam Raj, Valay Kumar Jain, Shreeram Bagaria, Smit Chaudhary, Vishwanath Singh, Rituparna Maji, Priyanka Dalei, Bikash K. Behera, Sabyasachi Mukhopadhyay, and Prasanta K. Panigrahi Survey of Transfer Learning and a Case Study of Emotion Recognition Using Inductive Approach . . . . . . . . . . . . . . . . . . . . . . . . . 147 Abhinand Poosarala and R Jayashree An Efficient Algorithm for Complete Linkage Clustering with a Merging Threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163 Payel Banerjee, Amlan Chakrabarti, and Tapas Kumar Ballabh Analytics for In Silico Development of Inhibitors from Neem (Azadirachta Indica) Against Pantothenate Synthetase of Mycobacterium Tuberculosis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 Saurov Mahanta, Bhaskarjyoti Gogoi, and Bhaben Tanti A Machine Learning Model for Forecasting Wind Disasters for Farmers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 M. Iyyanar, M. Usha, C. Birundha, M. Anbuselvi, and V. Haritha Prediction of Most Valuable Bowlers of Indian Premier League (IPL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 Arnab Santra, Aniket Mitra, Abhirup Sinha, and Amit Kumar Das Categorizing Text Documents Using Naïve Bayes, SVM and Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 Shubham Kumar, Anmol Gulati, Rachna Jain, Preeti Nagrath, and Nitika Sharma Machine Learning to Diagnose Common Diseases Based on Symptoms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 Suvasree S. Biswal, T. Amarnath, Prasanta K. Panigrahi, and Nrusingh C. Biswal Efficacy of Oversampling Over Machine Learning Algorithms in Case of Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 Deb Prakash Chatterjee, Sabyasachi Mukhopadhyay, Saptarsi Goswami, and Prasanta K. Panigrahi Study and Detection of Fake News: P2C2-Based Machine Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261 Pawan Kumar Verma and Prateek Agrawal

Contents

ix

Demand Forecasting Framework for Optimum Resource Planning . . . . 279 Sreeja Ashok, N. P. Amal Das, and Kanu Aravind Determination of Ozone Density Applying Artificial Intelligence . . . . . . 293 Saumen Ghosh, Himadri Bhattacharyya Chakrabarty, and Moheeyoshee Moitra Development of LSTM Neural Network for Predicting Very Short Term Load of Smart Grid to Participate in Demand Response Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 Surekha Deshmukh, Jayashri Satre, Miss Suruchi Sinha, and Dattatray Doke Answering Predictive Questions in Natural Language Based on Given Data for Forecasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313 U. Abhinand Varma, D. V. Rakesh Reddy, Prasanth Paraselli, and Joy Mustafi Detecting Duplicate Bi-Lingual Mash-up Question Pairs Using Siamese Multi-layer Perceptron Network . . . . . . . . . . . . . . . . . . . 329 Seema Rani, Avadhesh Kumar, and Naresh Kumar Glowing Window-Based Feature Extraction Technique for Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339 Shalu Gupta and Y. Jayanta Singh Improving Siamese Networks for One-Shot Learning Using Kernel-Based Activation Functions . . . . . . . . . . . . . . . . . . . . . . . 353 Shruti Jadon and Aditya Arcot Srinivasan Prediction of Mental Disorder Using Artificial Neural Network and Psychometric Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369 D. D. Sapkal, Chintan Mehta, Mohit Nimgaonkar, Rohan Devasthale, and Shreyas Phansalkar Design Thinking for Class Imbalance Problems Using Compound Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379 Rajneesh Tiwari, Aritra Sen, and Kaushik Dey A Hybrid Graph Centrality Based Feature Selection Approach for Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401 Abhirup Banerjee, Saptarsi Goswami, and Amit Kumar Das Modeling Earthquake Damage Grade Level Prediction Using Machine Learning and Deep Learning Techniques . . . . . . . . . . . 421 Sai Dhanuj Nukala, Vipul Kumar Mishra, and Gopala Krishna Murthy Nookala

x

Contents

Comparative Study of Machine Learning Models to Classify Gene Variants of ClinVar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435 V. Venkata Durga Kiran, Sasumana Vinay Kumar, Suresh B. Mudunuri, and Gopala Krishna Murthy Nookala Bio-Molecular Event Extraction Using Classifier Ensemble-of-Ensemble Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445 Manish Bali and P. V. R. Murthy

About the Editors

Neha Sharma is working with Tata Consultancy Services and is a Founder Secretary, Society for Data Science, India. She has worked as Director of premier Institutes of Pune, that run post-graduation courses like MCA and MBA. She is an alumnus of a premier College of Engineering and Technology, Bhubaneshwar and completed her PhD from prestigious Indian Institute of Technology, Dhanbad. She is a Senior IEEE member and Executive Body member of IEEE Pune Section. She is an astute academician and has organized several national and international conferences and published several research papers. She has received a “Best PhD Thesis Award” and “Best Paper Presenter at International Conference Award” at the National Level from the Computer Society of India. Her research interests include data mining, database design, artificial intelligence, big data, cloud computing, blockchain and data science. Amlan Chakrabarti is a Full Professor at the School of IT, University of Calcutta. He was a Postdoctoral Fellow at Princeton University, USA, from 2011 to 2012. With nearly 20 years of experience in Engineering Education and Research, he is a recipient of the prestigious DST BOYSCAST Fellowship Award in Engineering Science (2011), JSPS Invitation Research Award (2016), Erasmus Mundus Leaders Award from the EU (2017), and a Hamied Visiting Professorship from the University of Cambridge (2018). He is an Associate Editor of the Journal of Computers and Electrical Engineering, senior member of the IEEE and ACM, IEEE Computer Society Distinguished Visitor, Distinguished Speaker of the ACM, Secretary of the IEEE CEDA India Chapter and Vice President of the Data Science Society. Prof. Valentina Emilia Balas is currently a Full Professor at the Department of Automatics and Applied Software, “Aurel Vlaicu” University of Arad, Romania. The author of more than 300 research papers, her research interests include intelligent systems, fuzzy control, soft computing, smart sensors, information fusion, modeling

xi

xii

About the Editors

and simulation. She is the Editor-in-Chief of the Inderscience journals IJAIP and IJCSysE. She is the Director of the Department of International Relations and Head of the Intelligent Systems Research Centre at Aurel Vlaicu University of Arad. Dr. Jan Martinovic is currently Head of the Advanced Data Analysis and Simulation Lab at IT4Innovations National Supercomputing Center, VSB-TUO, Czech Republic. His research activities focus on HCP, cloud and big data convergence, HPC-as-a-Service, traffic management and data analysis. He is the coordinator of the H2020 ICT EU-funded project LEXIS (www.lexis-project.eu). He has previously coordinated contracted research activities with international and national companies and been involved in the H2020 projects ANTAREX and ExCAPE. He has been a HiPEAC member since January 2020.

Data Management and Smart Informatics

Non-parametric Distance—A New Class Separability Measure Sayoni Roychowdhury, Aditya Basak, and Saptarsi Goswami

Abstract Feature Selection, one of the most important preprocessing steps in Machine Learning, is the process where we automatically or manually select those features which contribute most to our prediction variable or output in which we are interested. This subset of features has some very important benefits: it reduces the computational complexity of learning algorithms, saves time, improves accuracy, and the selected features can be insightful for the people involved in the problem domain. Among the different ways of performing feature selection such as filter, wrapper and hybrid, filter-based separability methods can be used as a feature ranking tool in binary classification problems, most popular being the Bhattacharya distance and the Jeffries–Matusita (JM) distance. However, these measures are parametric and their computation requires knowledge of the distribution from which the samples are drawn. In real life, we often come across instances where it is difficult to have an idea about the distribution of observations. In this paper, we have presented a new nonparametric approach for performing feature selection called the ‘Non-Parametric Distance Measure’. The experiment with the new measure is performed over nine datasets and the results are compared with other ranking-based methods for feature selection using those datasets. This experiment proves that the new box-plot-based method can provide greater accuracy and efficiency than the conventional rankingbased measures for feature selection such as the Chi-Square, Symmetric Uncertainty and Information Gain. Keywords Feature selection · Chi-Squared · Symmetric uncertainty · Information gain · Non-parametric distance · Spearman’s correlation · F1 score S. Roychowdhury Department of Statistics, University of Calcutta, Kolkata, India e-mail: [email protected] A. Basak (B) Keysight Technologies, Kolkata, India e-mail: [email protected] S. Goswami A.K.Choudhury School of IT, University of Calcutta, Kolkata, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_1

3

4

S. Roychowdhury et al.

1 Introduction The major objective of machine learning and data mining is to effectively deal with multidimensional data. The most important job is to estimate functional relationships between target and features which may be dependent or independent. Feature selection is an optimization problem which is the process of selecting only those features from a dataset that are relevant and nonredundant for performing a particular task keeping in mind that there is not much degradation of the performance of a system. As such, feature selection is different from dimensionality reduction, since the latter tries to reduce the number of attributes in the dataset by creating a new combination of attributes, whereas feature selection includes and excludes attributes in data without changing them. The size of the datasets has significantly increased in the last few years particularly in the last two decades, especially in the fields of biomedical research [1, 2]. The increased usage of the Internet, social networking, Internet of Things, sensors, etc., and the ease of storage of features via cloud computing has also contributed to this near exponential increase [3]. The high-dimensional data, consisting of multiple features, have higher space and time requirements for computation. Not all the captured features help the model. On the contrary, irrelevant and redundant features confuse the learning algorithm and affect the accuracy negatively [4, 5]. This is where feature selection plays a major role in identifying the important and significant features according to our requirements. The different kinds of algorithms developed for performing feature selection can be classified into four main broad categories, namely, the filter-based, wrapper-based, hybrid and embedded approaches [5, 6]. Using the filter-based approach, a feature is selected or discarded based on some predefined measure. This approach can work either as a search-based or a ranking-based approach. While a search-based method tries to search for an optimal subset with the help of some cost function, the rankingbased approach tries to assign a rank to all the features based on the measure. Some of these measures are mutual information and class separability measures. Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy. While the filter-based approach is independent of the induction algorithm, the wrapper-based model ‘wraps’ around the algorithm and searches for the best feature subset depending on how the algorithm performs [7]. Consequently, the wrapper-based approach requires a large amount of training and greatly increases the computational overhead. Hybrid approaches try to combine both the filter and wrapper approaches to gain a higher accuracy in feature selection than both looked at individually. Liu et al. [8] used a hybrid feature selection approach where in the first stage, they used the filter-based ranking approach to rank features by mutual information and in the second stage implemented a wrapper-based model using Shepley value to evaluate the contribution of the features to a classification task. The hybrid approach increases accuracy at the cost of being even more computationally complex.

Non-parametric Distance—A New Class …

5

Embedded methods learn which features best contribute to a model’s accuracy while the model is being created. If a dataset has ‘N’ features, for a large value of N, the task of finding an optimal subset is an NP-complete problem [9]. Even for a mid-size value of N, this can reach extremely high levels [10]. Hence in multivariate filter-based approaches, heuristic methods such as approximate search techniques become necessary in order to reduce computational expenses. For univariate datasets, the ranking-based feature selection methods apply a statistical measure to assign a scoring to each feature. The features are ranked by the score and are either selected to be kept or removed from the dataset. These methods consider the feature independently, or with regard to a dependent variable. Feature ranking has been observed to have been used already in various domains such as Alzheimer’s disease classification from structural MRI [11], software quality [12], textual data [13], gene expression data [14], etc. Ranking-based approaches of feature selection tend to be more generic and computationally less expensive. Several different measures are used for the ranking of features. By convention, a high score is indicative of a valuable and relevant feature. This is usually not optimal but often preferable to other complicated methods. In consistency-based methods, the determinacy and monotonicity of a feature are measured to provide it with a score to determine its relevance [15]. Certain widely popular and often used measures such as Information Gain, Symmetric Uncertainty, Chi-Square [16], etc. focus on the predictive power of features with respect to a class. Another different approach to measuring the effectiveness of a feature in a univariate filter-based approach is by its separability between the classes of a dataset. A widely used measure for parametric separability-based filter classification, in binary classification problems, is a measure called the Bhattacharya distance measure [17], which measures the similarity between two statistical samples. It, however, has a disadvantage that the measure of separability continues to grow, even after the classes have become so well separated that any classification procedure can distinguish them perfectly. The Jeffries–Matusita distance standardizes the Bhattacharya distance between 0 and 2, for easy comparison. Non-parametric separability measures, such as box-whisker plots, have an advantage, in that, they provide better visibility of the classification than normalization parametric measures and thus provide better results. In order to calculate the similarity between two statistical samples, Bhattacharyya distance, which forms the foundation on the basis of which JM distance is defined, requires probability distribution functions for continuous variables and probability mass functions in case of discrete variables for both the statistical samples. Depending on whether the distribution functions are discrete or continuous, we perform a summation or an integral over the square root of the product of the functions. Identifying the distribution functions and performing the summation or integral on their product is often a computationally complex task and requires a lot of resources. In this paper, we have defined a new measure called the Non-Parametric Distance Measure to provide an alternative approach for univariate feature selection, using the concept of box plot, over 9 publicly available datasets. 
The results have then been compared with three popular ranking-based measures including Information Gain,

6

S. Roychowdhury et al.

Chi-Square and Symmetric Uncertainty. An analysis of this value with the F1 score has been done and the results were ranked accordingly. The rest of the paper is arranged as follows. Section 2 focuses on related works on feature ranking and separability. In Sect. 3, some general measures for evaluating the relevance of a feature are discussed. In Sect. 4, the approach adopted has been explained. Section 5 presents the details of the experimental setup. In Sect. 6, the results have been presented with a critical analysis of the same. Section 7 contains the conclusion.

2 Related Works Feature selection is an intense topic of research and has been gaining ground in the last few decades. Multiple domains have implemented ranking-based feature selection. Geng et al. have applied feature selection for ranking to OHSUMED datasets to analyze ranking performance [18]. Hariz et al. have used ranking-based feature selection in dynamic belief clustering to cluster dynamic belief feature set [19]. Kumar et al. have used ranking-based feature selection to perform an automated evaluation of descriptive answers through sequential minimal optimization models. [20]. Winker et al. have used a feature ranking mechanism to identify type 1 diabetes susceptibility genes [21]. Hulse et al. did a study on the bioinformatics dataset using ranking based on Chi-Square, Information Gain and Relief, and a metric that the authors themselves proposed [22]. C.C. Reyes et al. have used the concepts of Bhattacharya distance to postulate Bhattacharya Space for problems related to texture segmentation [23]. One of the earlier works of feature selection using Bhattacharya distance was done as early as 1996. Here, a recursive algorithm to select the optimal feature subset was proposed for the L-class problem under normal multi-distribution assumption [24]. Xuan et al. proposed a recursive feature selection algorithm based on Bhattacharya distance. This algorithm used PCA-based preprocessing and attempts to find an m*n transformation matrix, which converts the original feature space from n dimensions to m dimensions and also minimizes the class error probability [25]. Dabboor et al. have used JM distance to measure separability for polarimetric SAR data [26]. JM distance has also been used in feature extraction [27] as well as in certain search-based methods of feature selection [28].

3 Feature Ranking Measures As mentioned in Sect. 1, a lot of feature-based ranking methods are in use extensively. Some of the most important ones, especially those that we have compared the results of our Non-Parametric Distance measure in Sect. 6 with, shall be discussed in this section. They are as follows.

Non-parametric Distance—A New Class …

7

3.1 Chi-Square (χ 2 ) In Chi-Squared method, it is first assumed that the feature and the class are independent of each other. The Chi-Square statistic then measures how expectations compare to actual observed data. The data used is drawn from independent variables in a large enough sample. χ 2 is given by the following equation: c χ2 =

k=1

j

n k (μk − μ j )2 σ j2

(1)

where k denotes the class and j denotes the feature.

3.2 Information Gain Information Gain (IG), widely used in Machine Learning, is an entropy-based feature evaluation method [6*]. Information Gain, when used in feature selection, can be defined as the amount of information provided by the feature items for the text category. It is calculated depending on how much of a term can be used for classification, in order to measure the importance of the different features for the purpose of classification. The formula used is G(D, t) = −

m 

P(Ci ) log P(Ci ) + P(t)

i=1

+ P(t¯)

m 

P(Ci |t ) log P(Ci |t )

i=1 m 

  P(Ci t¯ ) log P(Ci t¯ )

(2)

i=1

where C is a dataset where there is not the feature t. If the value of G(D, t) is greater, t is more useful for the classification of C.

3.3 Symmetric Uncertainty Symmetric Uncertainty is used to calculate the fitness of a feature for feature selection. In a binary distribution, for symmetric uncertainty, we first have to find the mutual information between two features. The feature which has a high value of Symmetric Uncertainty gets high importance. Symmetric Uncertainty (SU) may then be defined as

8

S. Roychowdhury et al.

SU (X, Y ) =

2 × M I (X, Y ) H X + HY

(3)

where MI(X, Y) is the Mutual Information between X and Y. H(X) and H(Y) denotes the entropies of X and Y. This measure lies between 0 and 1. SU(X, Y) =1 indicates strong dependence between X and Y, whereas SU(X, Y) = 0 indicates the independence of X and Y.

4 Proposed Approach In this section, a new class separability measure will be discussed. The motivation of this technique will be to select features having high class separability. Let us consider a binary classification problem. In a binary classification problem, the distribution of the classes is one of the following three types: I.

Two classes are overlapping and partially inclusive, i.e., one is partially contained within the other. II. Two classes are overlapping and completely inclusive, i.e., one is completely contained within the other. III. Two classes are nonoverlapping, i.e., they are completely distinct. Let ‘f ’ be a feature of interest in a binary classification problem. The values of ‘f ’ will correspond to either of the two classes. Suppose f 0 and f 1 are two subsets of ‘f ’ corresponding to two classes. Let f 0H and f 0L be the hth percentile and lth percentile of f 0 , and let f 1H and f 1L be the hth percentile and lth percentile of f 1 .

I.

II.

III.

Fig. 1 Box plots representing the types of distribution of classes in a binary dataset

Non-parametric Distance—A New Class …

9

As a measure of separability between the two classes, the Non-Parametric Distance (NPD) Measure is defined as follows: N on Parametric Distance (N P D) = 1 −

o s

(4)

where s = max(f 0H , f 0L , f 1H , f 1L ) – min(f 0H , f 0L , f 1H , f 1L ) and o = f 1H – f 0L , if the classes are overlapping and partially inclusive. = f 1H – f 1L , if the classes are overlapping and completely inclusive. = 0, if the classes are nonoverlapping. The higher the value of NPD, the more separable the two classes are. If the two classes are highly separable, the measure will be approximately equal to 1. If the classes are more or less similar, their separability will be low, and in that case, NPD will be approximately equal to 0. The evaluation of this measure does not take into account the underlying distributional assumption of the features. This non-parametric approach behind the development of NPD measure adds to its advantage. To demonstrate this, let us consider the Bupa dataset. The dataset consists of 7 attributes. The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. The dataset has 345 instances where each instance in the dataset constitutes the record of a single male individual. We compute the value of the proposed measure and the F1 score for each of the attributes. This has been repeated 100 times with different seeds, and the average value has been represented in the table below (Table 1). The scatterplot of ranked values of NPD and F1 score is plotted below (Fig. 2). Spearman’s correlation coefficient between NPD and F1 score is found to be 0.8285 which indicates a strong correlation between the ranked values of the said two. Thus, the features having high NPD values give a high value of F1 score. From the table above, we find that for ‘sgot’, the NPD measure is maximum, i.e., it has maximum separability among other features. It also has a maximum F1 score. It is followed by ‘gammagt’, ‘sgpt’ in terms of having high NPD value as well as Table 1 Comparison of NPD and F1 score Attribute name

Non-Parametric Distance (NPD)

F1 score

Rank with NPD

Rank with F1 score

mcv

0.1398

0.0723

5

6

alkphos

0.1612

0.1275

4

5

sgpt

0.1906

0.2294

3

3

sgot

0.3134

0.4520

1

1

gammagt

0.2712

0.3206

2

2

drinks

0.0801

0.1678

6

4

10

S. Roychowdhury et al.

Fig. 2 Scatterplot of ranked values of NPD and F1 score

F1 score. Thus, if we base our analysis on these features, they will help to achieve better model accuracy by contributing much to the target variable. It should be noted that this measure has been developed for binary classification problems.

5 Materials and Methodology In this section, we enlist the details of the study that we conducted and the environments we used for conducting the study on our own proposed method. The datasets are taken from the public UCI data repository. Dataset Description: The datasets used in the study are enlisted in Table 2 below. • ‘R’ has been used as the computational environment. • ‘R’ packages ‘spatialEco’, ‘e1071’, ‘caret’, ‘psych’, ‘rJava’, ‘FSelector’ have been used for the calculations of all the measures—Mutual Information, Chi-Square, Symmetric Uncertainty and Non-Parametric Distance Measure.

2

2

32

5

9

21

Wdbc (Breast Cancer Wisconsin Diagnostic Dataset)

Banknote authentication

Pima Indians Diabetes Database

Twonorm

Lsvt Voice Rehabilitation Dataset 309

No. of classes

Apndcts (Appendicitis Dataset)

2

2

2

2

No. of features

8

Datasets

Table 2 Description of datasets

126

7400

768

1372

569

106

No. of instances

2.000

1.002

1.866

1.249

1.684

3.516

Class Imbalance ratio

(continued)

This study uses 309 speech signal processing algorithms to characterize 126 signals from 14 individuals collected during voice rehabilitation. The aim is to replicate the experts’ assessment denoting whether these voice signals are considered ‘acceptable’ or ‘unacceptable’

This is an implementation of Leo Breiman’s twonorm example. It is a 20-dimensional, 2-class classification example. Each class is drawn from a multivariate normal distribution with unit variance

The objective of the dataset is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset

Data were extracted from images that were taken from genuine and forged banknote-like specimens

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image

The data represents 7 medical measures taken over 106 patients on which the class label represents if the patient has appendicitis (class label 1) or not (class label 0)

Description

Non-parametric Distance—A New Class … 11

2

2

7

Bupa (Liver Disorders DataSet)

Mgc (MAGIC Gamma Telescope 11 Dataset)

No. of classes

Ion (Ionosphere Dataset)

2

No. of features

33

Datasets

Table 2 (continued)

19020

345

351

No. of instances

1.829

1.379

1.786

Class Imbalance ratio

The data are MC generated (see below) to simulate registration of high energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique

The first 5 variables are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each line in the dataset constitutes the record of a single male individual

This radar data was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kw. The targets were free electrons in the ionosphere

Description

12 S. Roychowdhury et al.

Non-parametric Distance—A New Class …

13

• As all the datasets used have a Class-Imbalance Ratio greater than 1, F1 score has been preferred over Classification Accuracy for the calculations. • 80% of each of the datasets has been taken as training and 20% as testing. 100 iterations of this action have been performed and the average value has been tabulated. • Spearman’s Correlation Coefficient has been used to record the correlation between F1 score and the different ranking measures.

6 Results and Analysis The values obtained for the Spearman’s correlation coefficient between F1 score and Non-Parametric Distance Measure (NPD) along with the class imbalance of each dataset is shown in Sect. 6.1. In Sect. 6.2, we have provided the value obtained for Spearman’s correlation coefficient between F1 scores and each of the methods of Information Gain, Chi-Square and Symmetric Uncertainty. Finally, in Sect. 6.3, we have provided the values for average Spearman’s correlation coefficient and average rank for all the measures.

6.1 Spearman’s Correlation Coefficient for Non-parametric Distance Measure In Table 3, we provide the value for Spearman’s Correlation Coefficient obtained for each dataset between the F1 score and the proposed Non-Parametric Distance Measure (NPD). It can be observed from Table 3 that Table 3 Correlation for Non-Parametric Distance Measure

Dataset

Spearman’s Correlation Coefficient between F1 score and NPD

Apndcts

0.71429

Wdbc

0.94705

Banknote

0.40000

Pima

0.73810

Twonorm

0.65263

Lsvt

0.80113

Ion

0.77406

Bupa

0.82857

Mgc

0.73333

14

S. Roychowdhury et al.

• For all datasets, the values obtained with Non-Parametric Distance Measure are positive, indicating that the measure successfully identifies a positive correlation between the feature subsets. • The values obtained are on the higher side, close to 1, which indicates a high degree of positive correlation. • For all datasets, the values of Spearman’s rank correlation coefficient with NonParametric Distance Measure is high which indicates that the features having high class separability also correspond to high value of F1 score. Thus, a feature having high NPD value impacts the model positively. In this way, this method can identify the important and relevant features from the datasets.

6.2 Spearman’s Correlation Coefficient for Different Ranking Measures In Table 4, we have provided the values of Spearman’s Correlation Coefficient for each of the datasets obtained between the F1 score and Information Gain (IG), ChiSquared (CS) and Symmetric Uncertainty (SU) measures. We have also added the values obtained from Table 2 for Non-Parametric Distance (NPD) Measure. It can be observed from Table 4 and Fig. 3 that • On 6 out of 9 occasions, the value obtained from Non-Parametric Distance Measure is the highest. • No other method has outperformed the others so many times. • For the dataset ‘Ion’ where IG, CS and SU failed to find a high degree of correlation, NPD heavily outperforms all the other measures. • The values obtained from the other datasets are comparable. Table 4 Correlation coefficient comparison Dataset

Spearman’s Correlation Coefficient between F1 score and IG

Spearman’s Correlation Coefficient between F1 score and CS

Spearman’s Correlation Coefficient between F1 score and SU

Spearman’s Correlation Coefficient between F1 score and NPD

Apndcts

0.71429

0.64286

0.71429

0.71429

Wdbc

0.92169

0.92125

0.91902

0.94705

Banknote

0.40000

0.40000

0.40000

0.40000

Pima

0.71429

0.76190

0.80952

0.73810

Twonorm

0.83158

0.84662

0.76391

0.65263

Lsvt

0.70745

0.70716

0.70547

0.80113

Ion

0.03409

0.08222

0.39205

0.77406

Bupa

0.75897

0.75897

0.75897

0.82857

Mgc

0.80606

0.80606

0.76970

0.73333

Non-parametric Distance—A New Class …

15

1

0.8 0.6 0.4 0.2 0 Apndcts

Wdbc

Banknote

Pima

Twonorm

Lsvt

Ion

Bupa

Mgc

Spearman’s Correlaon Coefficient between F1 Score and I.G. Spearman’s Correlaon Coefficient between F1 Score and C.S. Spearman’s Correlaon Coefficient between F1 Score and S.U. Spearman’s Correlaon Coefficient between F1 Score and N.P.D.

Fig. 3 Line graph of the Spearman’s Correlation Coefficient values of different ranking measures versus F1 score

6.3 Average Execution Time for Different Ranking Measures In Table 5, we have tabulated the execution time for each ranking measure for one hundred iterations of the corresponding dataset. The execution time is measured over 100 iterations to ensure that aberrations due to miscellaneous factors in any particular iteration do not interfere with the actual overall results. The time obtained is measured in seconds. It can be observed from Table 5 that • The proposed Non-Parametric Distance Measure can be calculated in much less time than other ranking measures. • Chi-Square performs the second best among the other methods. Table 5 Execution time comparison Dataset

Apndcts

Execution time for Information Gain (in s) 6.956

Execution time for Chi-Square (in s)

5.965

Execution time for Symmetric Uncertainty (in s) 6.669

Execution time for Non-Parametric Distance Measure (in s) 1.253

Wdbc

5.787

5.583

6.217

1.064

Banknote

6.141

5.633

6.070

1.064

39.527

35.558

41.565

7.630

Pima Twonorm

394.404

369.857

384.174

126.196

Lsvt

92.353

83.991

88.731

13.599

Ion

11.860

9.245

9.748

1.655

Bupa

79.871

72.131

76.994

11.967

Mgc

37.912

36.126

38.662

6.116

16

S. Roychowdhury et al.

Table 6 Average values of all measures Non-Parametric Distance

Information Gain

Chi-Square

Symmetric Uncertainty

Average Spearman’s Correlation Coefficient

0.73213

0.65427

0.65856

0.69258

Average rank of measure

1.889

2.111

2.222

2.333

Average rank of execution time of measure

1.000

3.667

2.000

3.333

• IG and SU are comparable.

6.4 Average Rank and Average Spearman’s Correlation Coefficient We have taken an average of the values of Spearman’s Correlation Coefficient for all the measures including our own proposed measure over all datasets. We have ranked each of the measures according to their values for each dataset and have taken an average rank over all the datasets combined. The measure with the highest value for Spearman’s Correlation Coefficient is the best possible measure of the four; and the lowest rank is more accurate than the other measures. Finally, we have ranked the execution time of each measure for each dataset and averaged out the ranks to find out how they perform with respect to each other over all the considered datasets. The results have been tabulated in Table 6. It can be observed from Table 6 and Fig. 4 that • The value for the average Spearman’s Correlation Coefficient over all the nine datasets is the highest for NPD. Thus, it provided the highest accuracy in feature selection. • The average rank of NPD is the highest. Thus, over all the datasets it has outperformed the other measures the maximum number of times. • The average rank of the execution time for NPD is the highest at 1, that is, it has taken the least time to execute every time, against IG, CS and SU.

7 Conclusion Feature selection is widely used as one of the most important preprocessing steps in machine learning. Being able to identify the features that most effectively affect

Non-parametric Distance—A New Class …

17

4 3.5 3 2.5 2 1.5 1

0.5 0 Average Spearman’s Correlaon Coefficient Non-Parametric Distance

Average Rank of Measure

Informaon Gain

Chi Squared

Average Rank of Execuon Time of Measure Symmetric Uncertainty

Fig. 4 Bar graph of average values of different parameters

a model’s performance and being able to reject the redundant ones gives a great advantage in reducing the spatial as well as temporal complexity of machine learning algorithms. If on the other hand, the process of performing feature selection increases the complexity of a problem, then the benefits of the process are not well-realized. Class separability measures of univariate variables have long been used and studied, but their calculation is rather a complex task. In this paper, we have proposed a new classification method called the Non-Parametric Distance Measure and have compared the results obtained from this method with other popular ranking methods such as Information Gain, Chi-Square and Symmetric Uncertainty. The Spearman’s Correlation Coefficient obtained from our proposed method with the F1 score often produced higher values than the other methods, indicating it often produces better results than them. The computation time for our proposed method is remarkably lesser than any of the other methods compared for 100 iterations over 9 publicly available datasets, clearly indicating that it is temporally less complex. The above analyses reveal that Non-Parametric Distance Measure heavily reduces computation time while providing marginally better accuracy than some of the most popular ranking methods in cases of univariate feature selection problems.

References 1. A. Elisseeff, I. Guyon, An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003) 2. E. Xing, M. Jordan, R. Karp, Feature selection for high-dimensional genomic microarray dat, in Proceedings of the Eighteenth International Conference on Machine Learning (2001), pp. 601–608 3. W. Shu, W. Qian, Y. Xie, Incremental approaches for feature selection from dynamic data with the variation of multiple objects. Knowl.-Based Syst. 163, 320–331 (2019)

18

S. Roychowdhury et al.

4. H. Liu, L. Yu, Feature selection for high-dimensional data: a fast correlation-based filter solution, in International Conference on Machine Learning (2003) 5. S. Goswami, A. Chakrabarti, Feature selection: a practitioner view. Int. J. Inf. Technol. Comput. Sci. (IJITCS) 6(11), 66 (2014) 6. S. Goswami, A.K. Das, A. Chakrabarti, B. Chakraborty, A feature cluster taxonomy based feature selection technique. Expert Syst. Appl. 79, 76–89 (2017) 7. Sheng-Uei Guan, Yinan Qi, Chunyu Bao, An incremental approach to MSE-based feature selection. Int. J. Comput. Intell. Appl. 6, 451–471 (2006) 8. J. Liu, G. Wang, A hybrid feature selection method for data sets of thousands of variables, in 2010 2nd International Conference on Advanced Computer Control, Shenyang (2010), pp. 288– 291 9. A.L. Blum, R.L. Rivest, Training a 3-node neural network is NP complete, COLT (1988) 10. M. Dash, H. Liu, Feature selection for classifications. Intell. Data Anal. Int. J. 1, 131–156 (1997) 11. I. Beheshti, H. Demirel, Feature-ranking-based Alzheimer’s disease classification from structural MRI”. Magn. Reson. Imaging 34(3), 252–263 (2016) 12. T.M. Khoshgoftaar, K. Gao, A. Napoliano, An empirical study of feature ranking techniques for software quality prediction. Int. J. Softw. Eng. Knowl. Eng. 22(02), 161–183 (2012) 13. A. Rehman, K. Javed, H.A. Babri, M. Saeed, Expert systems with applications relative discrimination criterion – a novel feature ranking method for text data. Expert Syst. Appl. 42(7), 3670–3681 (2015) 14. Shaohong Zhang, Hau-San Wong, Ying Shen, Dongqing Xie, A new unsupervised feature ranking method for gene expression data based on consensus affinity. IEEE/ACM Trans. Comput. Biol. Bioinform. 9(4), 1257–1263 (2012) 15. K. Shin, X.M. Xu, Consistency-based feature selection, in Knowledge-based and intelligent information and engineering systems. KES 2009, ed. by J.D. Velásquez, S.A. Ríos, R.J. Howlett, L.C. Jain. Lecture notes in computer science, vol. 5711 (Springer, Berlin, 2009) 16. J. Novakovi´c, P. Strbac, D. Bulatovi´c, Toward optimal feature selection using ranking methods and classification algorithms. Yugosl. J. Oper. Res. 21(1), 119–135 (2011) 17. P. Drotár, J. Gazda, Z. Smékal, An experimental comparison of feature selection methods on two-class biomedical datasets. Comput. Biol. Med. 66, 1–10 (2015) 18. X. Geng, T.-Y. Liu, T. Qin, H. Li, Feature selection for ranking, in Proceedings of ACM SIGIR International Conference on Information Retrieval (2007) 19. S.B. Hariz, Z. Elouedi, Ranking-based feature selection method for dynamic belief clustering, in ICAIS (2011) 20. C.S. Kumar, R.J. Rama Sree, Application of ranking based attribute selection filters to perform automated evaluation of descriptive answers through sequential minimal optimization models, in SOCO 2014 (2014) 21. C. Winkler et al., Feature ranking of type 1 diabetes susceptibility genes improves prediction of type 1 diabetes. Diabetologia 57(12), 2521–2529 (2014) 22. J. Van Hulse, T.M. Khoshgoftaar, A. Napolitano, A comparative evaluation of feature ranking methods for high dimensional bioinformatics data, in Proceedings of 2011 IEEE International Conference on Information Reuse and Integration IRI 2011 (2011), pp. 315–320 23. C.C. Reyes-Aldasoro, A. Bhalerao, The Bhattacharyya space for feature selection and its application to texture segmentation. Pattern Recogn. 39(5), 812–826 (2006) 24. X. Guorong, C. Peiqi, W. Minhui, Bhattacharyya distance feature selection. Proc.-Int. Conf. Pattern Recogn. 2, 195–199 (1996) 25. G. 
Xuan, X. Zhu, P. Chai, Z. Zhang, Y.Q. Shi, D. Fu, Feature selection based on the Bhattacharyya distance, in Proceedings - International Conference on Pattern Recognition, vol. 3 (2006), pp. 1232–1235 26. S. Lei, A feature selection method based on information gain and genetic algorithm, in Proceedings—2012 International Conference on Computer Science and Electronics Engineering, ICCSEE (2012)

Non-parametric Distance—A New Class …

19

27. M.R.P. Homem, N.D.A. Mascarenhas, P.E. Cruvinel, The linear attenuation coefficients as features of multiple energy CT image classification, Nucl. Instrum. Methods Phys. Res. Sect. A Accel. Spectrom. Detect. Assoc. Equip. 452(1–2), 351–360 (2000) 28. S.B. Serpico, M. D’Inca, F. Melgani, G. Moser, Comparison of feature reduction techniques for classification of hyperspectral remote sensing data. Image Signal Process. Remote Sens. VIII 4885, 347–358 (2003)

Digital Transformation of Information Resource Center of an Enterprise Using Analytical Services Preeti Prasad Ramdasi, Aniket Kolee, and Neha Sharma

Abstract An information resource center at any of the organization—simply termed as “Library”, is evolving to get itself aligned with the organizational goals in its journey of digital transformation. Huge amount of data is generated through variety of library services and management operations. Therefore it is very much critical to make ethical utilization of this data and arrive at meaningful patterns through analytical tools and techniques. This paper initially discusses basics of data analytics, and then presents the application of descriptive, diagnostic and prescriptive analytical principles to information resource datasets. Experimental examples have been used to showcase the discovery of probable significant insights towards identifying improvements, designing appropriate action plans aligned to specific objectives and tracking and monitoring activities. Besides, visualizations tool is used to present outcome of the analytics in a way that explains future actions. Thus, preparing the library department for digital-future. Keywords Analytics · Business intelligence · Library analytics · Library queries · Analytical environment

1 Introduction Analytics is the buzz word in any given field today and is almost synonymous with the survival and growth of established business practices. Libraries are no exceptions!! Libraries as knowledge management centers or information centers has a prime P. P. Ramdasi (B) · A. Kolee TCS Data Office, Analytics & Insights, Tata Consultancy Services, Pune, India e-mail: [email protected] A. Kolee e-mail: [email protected] N. Sharma Society for Data Science, Pune, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_2

21

22

P. P. Ramdasi et al.

purpose to provide information materials and services to the members. Library or Information Resource Center (IRC) is a unit usually associated with academic setup, however, gradually it is finding its place in corporate as well. Modern enterprise environment is going through a tremendous transformation due to evolving information technology, global economy, diverse competition, complanation structure, and growing information resource [1]. The management of the enterprise undergoing transformation evolves through scientific management to humanistic management and to information resources management [2]. Finally, information resources serve as the center in the evolution process [3–5]. The IRC is not only catering to the employees of the enterprise but also steadily paving its way into online or remote access mode to reach global clientele. At the same time, for any library at the enterprise level, it is required to get aligned with organizational digital goals. Along with traditional book lending and basic services, libraries must build digital capabilities. Organizations are always keen on budget constraints. For the libraries associated with them, Return on Investment (RoI) is an important measure of performance that challenges sustainability. To understand the RoI on services and resources, libraries have to deal with various data, statistical reports, survey responses, number of hits/downloads on online services, feedback through internal portals, social networking sites, query logs, email, and chat services, to name a few. Thus, to start with, setting up a good data collation process and building data governance policies is of paramount importance. A matured data model, along with quality data, is a great deal for applying analytics to derive business intelligence. Awareness of the digital world, looking at data from a newer perspective, understanding of data analytics, creatively experimenting with the possibilities of analytics, and right visualization of the resultant insights play a vital role. A large corporate setup with lakhs of associates primarily depending on online resources leads to the creation of voluminous data pool in order to cater to the need of every member. However, the data size is still considered as relatively “small” rather than “Big Data”. In a dynamic business environment, an experiment with collective intelligence on all such feeds, using analytical tools help inbuild intelligent insights on current services, as well as improvise and align the services with changing needs and thus aid in powerful decision making. The following sections take the reader through a quick journey of few analytical applications to a library unit. In the beginning, it gives a brief overview on “how and why” of Library analytics. Further, it describes major analytical techniques and takes users into real life experiments with analytical environment in the library. The sample dashboards mentioned here are actual results developed by one of the librarians as a part of TCS’s Project Manager Offices and Librarians to Data scientist’s transformational program.

Digital Transformation of Information Resource …

23

2 Libraries and Analytics—An Overview Gorman and Lee, in their respective research articles, have mentioned that the traditional approach of the collection is challenged due to innovations in library management systems, latest licensing model, new subscription models, open access to webbased documents, and have suggested concentric-circle as a new conceptual model [6–9]. There were a couple of alternative terms coined for libraries in the digital age, like “Information Resource Center” by Savic [10] and “Content Center” by Budd and Harloe [11]. Historically, libraries have been practicing analytics in a traditional way, where customer surveys, user feedback, and collection usage are considered to study and identify usage patterns [12]. Traditional business intelligence methods primarily focus on the past. Beyond such basic analytics, with the emergence of newer services, there is huge scope to collect and analyze multiple datasets generated in day-to-day operations [12]. Modern analytics provides a far deeper insight and excellent pointers, which if implemented, holds the potential to radically revamp or accurately re-align the entire gamut of services offered [13]. It forms the backbone for value addition through a better understanding of customer needs and preferences. Further, more complex, advanced analytics in libraries facilitate intelligent perceptions, predictions, and prescriptive patterns [14]. Therefore, the role of a librarian is continually evolving to meet the organization’s business objectives, social and technological needs. This explicitly demands self-skill upgradation from traditional librarian to a modern librarian. This section presents the overview of analytics a. Analytics—Comprehending the Definition According to SAS Institute Inc., “Analytics is an encompassing and multidimensional field that uses mathematics, statistics, predictive modelling and machine-learning techniques to find meaningful patterns and knowledge in recorded data.” [15]. The ultimate intention of applying analytics to any function of the organization is to generate more business value. This is achievable through taking quicker and datadriven decisions, by emphasizing machine first approach where human dependency is minimized and at the same time, more correctness is assured due to the leaning possibility of human errors. Thus, analytics is trusted and has gained tremendous importance as one of the components that weave the Data science platform [16–18]. Mathematics and statistics form the basis while Data modeling, Artificial intelligence and Machine learning strengthens the knowledge, insights discovery [19–21]. b. Analytical Techniques Organizations have multiple objectives for analytics to help uncover the true value of data. Depending upon the nature of work and the requirement of analysis, business analyst plans the tools and techniques for exploration. There are four major types of data analytics [22]. They are Descriptive Analytics, Diagnostic Analytics, Predictive Analytics, and Prescriptive Analytics. Mapping of which type of analytics to be used

24

P. P. Ramdasi et al.

Fig. 1 Phases of analytics as suggested by Gartner [23]

as per the requirement of information is depicted by Gartner Business Intelligence and Analytics Summit March 2016 in Fig. 1 [23]. As the complexity of the analysis increases, the results bring more value. Through all these types, one can deep dive into data in search of a much-needed and fact-based insight. (i)

Descriptive analytics primarily provides insights about the past through histological trends, distribution, basic comparisons, etc. These findings indicate the correctness or success of any of the approaches or actions taken in the past. Examples are depicting the distribution of specific types of queries across multiple locations, the trend of a number of transactions at a particular location, etc. These can be derived from transactions log, queries log, and website hits log. (ii) Diagnostic analytics is a way to explore more about a particular analytical result. This is derived by using historical data that can be measured against other data to answer the question of “why” part of the observation. Further, referring to the descriptive analytical insights, diagnostic analytics makes it possible to drill down and identify pattern of dependencies. Examples are the relation of peak usage of reading room and books in a specific period, increasing usage of a specific source of information, etc. (iii) Predictive analytics is a valuable tool for forecasting. It suggests “what is likely to happen”. The results of descriptive and diagnostic analytics are utilized to detect tendencies, clusters, and exceptions, and to predict future trends. Examples are decreasing usage of print collection and increasing use of electronic collection, possible research area based on current usage of existing research papers and domains, and so on. (iv) Prescriptive analytics further helps in deriving relevant suggestions “what action to take” so as to bring expected change regarding certain observation. This is the most complex form of analytics, and demands not only historical

Digital Transformation of Information Resource …

25

data, but also external information to use in statistical algorithms. Prescriptive analytics helps in creating optimizers that are used for What-if-analysis. Libraries are yet to mature to this level. This is set as a target for most of the librarians. Involvement of multiple skilled resources such as subject matter experts in library function, data scientists/statisticians, business analysts as well as usage of Business Intelligence (BI) tools would speed up the work systematically. Thus maturity can be achieved step by step. The following sections take the reader in detail of our initial experiment about implementing analytics in libraries and the results of analytical study. The reusability is claimed across a small or large scale of library operations at single or multiple locations.

3 Material Methods: Analytical Environment at Library Recent observations at libraries pose a challenge to the improved and advanced working environment. Huge data created in the library is left unutilized to some extent to derive further insights. Other observations in various library activities probe additional challenges to deal collectively. Critical examples are • • • • •

Reduced physical space Reduced footfall in libraries Budget constraints and Increased cost of assets Lower stakeholders satisfaction Inefficiencies in manual operations.

It becomes necessary for every librarian and library manager to understand current challenges and probable issues thoroughly, form a relevant problem statement that is aligned to the organizational goal, imagine the expected actionable outcome, collate relevant data around the objective, and identify appropriate approaches to solutions and work on action plans aligned to library goals. This can be achieved by developing a deep understanding of the library data and collecting it over a sufficient period of time for analysis. Many foreseen benefits of developing and implementing a well thought analytical environment are globally being accepted. Specific to a library function, they could be • • • • • • • • •

Improved and insightful library reports Improved view of library operational data Improved vendor management Improved decision making Improved risk management Improved customer relationships Improved understanding of customers and their information needs More opportunities to create new products/services Reinventing library processes.

26

P. P. Ramdasi et al.

Within organizations like TCS, where research division and delivery units both avail library services very much specific to their needs, libraries can play an important role in making use of knowledge from both the departments. Delivery functional units like engineering and industrial services focus on client-specific deliveries, while scientists are focused on core research. Collaborating with these units and understanding the work, the library can track information on the latest technological developments within the organization and across the globe, futuristic developments, patentability of novel work, shop floor technological development, customer’s business needs, and see to seek some correlation among them. This can bring a tweak in the work and may delight customers for value add delivery. a. Challenges and Problem Statements in Library setup An exhaustive collation of observations and challenges helps in analyzing and forming an appropriate problem statement. It is also necessary to understand the nature and scope of the problem and decide if a solution is required to be implemented locally or globally. The following set of observations are worked on as problem areas with opportunities for analytical solution: • • • • • • •

Continual decrease in usage of books Inadequate usage of library services Inadequate usage of electronic resources Reduced footfall for library events Reduced interest in physical books Weeding out of books as per latest technology Rare opportunities for connecting with the customer.

b. Datasets for Library Analytics Various datasets with continuous data (time series) are generated at the library. It is extremely important to have a relevant set of data captured, and a thorough understanding is developed to address selected problem areas. The following data points can be captured systematically and are well organized to form a Metadata and a simplified data model: • • • • • • • • • • •

New Additions to the library collection Transaction’s log with user details Reference queries log Log of OPAC Searches Online sources usage report User Surveys Feedback/complaints received Stock Verification data Weeding out data Expenditure details (Subject wise/Document type wise) Membership data (Addition/Renewal/Drop out)

Digital Transformation of Information Resource …

27

• User profiles • Users footfall in library/different sections • Vender and contract details. These datasets are reorganized to form an optimized data model as a part of EDM—Enterprise Data Model. There is enough flexibility for further expansion or modification. This has initiated a structured approach toward analytics.

4 Experimental Details and Results This section takes the reader through real implementation of analytics and results in the form of observations. The scope is limited to the library function operating from multiple locations across the globe for one organization. The work has started through a preliminary analysis of available data, to identify Key Performance Indicators (KPIs) and Key Risk Indicators (KRIs) and for the specific objectives, improving the quality of data, developing a data model, creating charts for communicating the results onto dashboards. Visualization is an important and final phase to analytics—it is about representing analytical outcome in a meaningful way so as to understand actions to be planned. QlikSense software has been used for data visualization. At the outset, a sample dashboard developed for a high-level understanding of the big picture of IRC operations is presented in Fig. 2. In the following demonstration, one of the datasets of acceptable quality is considered for descriptive analytics. The name of the dataset is the library reference query dataset. It maintains details of each query received from, like, where the query is been raised? What is the priority mentioned? Which domain it is associated with?

Fig. 2 IRC overall operations dashboard

28

P. P. Ramdasi et al.

Phase (1) Descriptive Analytics The available dataset is preliminarily analyzed to identify some of the Key Performance Indicators (KPI), and visualization is created. The objective is to develop overall insights on total reference queries received in a quarter till date. The following problem statements are formed, and KPIs are developed for exploration. • • • •

How is the segregation of Reference queries by geography and domain wise? How is the distribution of Query type and what is the frequency per type? How is the Location wise query distribution? How is the priority distribution of queries?

Example (a) Query Categorization—Geography and Domain demonstrated in Fig. 3. Observation: As seen in Fig. 3, (a) Total count of 1464 tickets are raised for the duration of one quarter. (b) The visualization describes that the Mumbai branch is a top branch in terms of the number of tickets raised during the said period. (c) Types of Plagiarism tickets were raised more in the quarter followed by other queries. Example (b) Query Type Categorization—Frequency demonstrated in Fig. 4. Observation: As seen in Fig. 4, (a) A specific location, Pune, is explored to understand that the maximum number of article queries were asked, followed by plagiarism and market data queries. (b)Total queries received from Pune are 148. Example (c) Location wise priority distribution demonstrated in Fig. 5. Observation: Sample of five cities taken here for analyzing ticket priority. From Fig. 5, it is understood that (a) urgent queries were highest at Pune, followed by Bangalore and Chennai. (b) Maximum number of queries of Medium category are asked at Delhi.

Fig. 3 Query type categorization by geography and domain

Digital Transformation of Information Resource …

29

Fig. 4 Query categorization by type and frequency

Fig. 5 Library reference queries as per priorities per location wise

Phase (2) Diagnostic analytics Considering the organizational objectives, the outcome of descriptive analytics is further analyzed by asking rigorous questions. This has reaped better insight to create the required action plan. Basic awareness of correlation of data items and functional expertise help experimenting right diagnosis. The following are two examples from libraries reference query log: Example (a) The objective is planned to optimize resource usage. The problem statement is formed as “Analyze the reason for high or low utilization of a particular subject resource”. For this, multiple insights are derived like resource and content availability, skilled librarians available per location, analysis of survey report on awareness among the associates about the resource, as demonstrated in Fig. 6. To diagnose the reasons for low utilization, while exploring the correlation, the following diagnostic observations were revealed. • • • •

Lack of skilled librarians for HiTech domain Lack of availability of relevant content Lack of availability of adequate resources at the location Lack of awareness for resources on the Management domain.

Accordingly, the following action plan aligned to the defined objective was proposed.

30

P. P. Ramdasi et al.

Fig. 6 Library reference queries across domains

(a) Identify individual librarians for skill upgradation. (b) Arrange appropriate skill training sessions. (c) For poor content as a prominent reason, identify and discontinue resources with poor contents, replace them with new ones. Example (b) The objective is to appreciate performing librarians, keeping them engaged for a longer duration, and to motivate non-performers. The problem statement is formed as “Analyze librarian’s individual performance”. The top ten performers in resolving queries are depicted in Fig. 7. The performer’s distribution is further analyzed to diagnose the reasons and parameters contributing to improve performance. Similarly, it is required to dig out the causes for low performance. Diagnostic information that is revealed, to mention a few, are as below;

Fig. 7 Top ten performing librarians involved in resolving library queries

Digital Transformation of Information Resource …

• • • •

31

Lack of aspirations, Pending request for relocation, Appreciable skills of good people connect, Good technical understanding. Accordingly, the following action plan is proposed.

(a) Identify high performers and offer opportunities as per aspirations in order to take efforts to retain the top performer. (b) Utilize the top performer’s skills to develop other people. (c) Prepare individual learning plans for non-performers. Example (c) The objective is planned to create an effective dashboard of library services to increase more awareness. The problem statement is formed as “Analyze the reason for high or low utilization of a particular service”. When diagnosed with the outcome, the following reasons for low utilization of service came out prominently. • Lack of good visibility • Lack of relevant content available • Lack of technical skills. The proposed action plan includes (a) Develop librarian’s analytical/technical skills along with other basic skills. (b) Develop a deeper understanding of the product to make the product better utilized. (c) Improve the visibility of the service by spreading more awareness among users. Phase (3) Predictive analytics This is the advanced stage of analytics. Predictive analytics can be done further by developing more advanced statistical analytical models. It makes use of multiple analysis and modeling techniques and discovers patterns and relationships in datasets. Such insights forms the basis for required predictions. The simplest example could be to build book collections as per changing needs and demands in the organization. Book and course recommendations based on individual learning and reading history is another classical example. With such predictive recommendations, organizations gain enhanced user experience. The following dashboard in Fig. 8 gives details about vendors and the deals with the library. With a good amount of history maintained for individual vendor and related deals, a trend line is formed and a customized predictive model would tell about the likelihood of continuing the services of the vendor/by the vendor, probable change in the nature of the deal. Due to insufficient historical data, trend analysis could not give significant insight at the local branch level. However, work is going on at the corporate level to develop the predictive model. Given below are few more examples for reference that are aimed through different predictive models. • Training needs for library staff

32

P. P. Ramdasi et al.

Fig. 8 Library vendor and deals dashboard

• • • •

Suggestions on developing cross-functional teams Predict trend and demand for new services Predicting risks Predicting the need for latest titles and research papers.

Additionally, in the area of library management, predictive analytics can be used to help develop an overall financial plan, need for staffing, and other facilities.

5 Conclusion The ability to look at the data analytically will change the library’s role, re-imagine established practices and services. Organizations like TCS have noticed the skills gap and running training programs for librarians to become business partners rather than being just a service provider. Librarians need to adopt the mindset of data generation and analytical use of it aiming at overall improvement. If libraries succeed to maintain and grow library datasets, then designing newer library services, defining futuristic action plans can be strategically achieved. This helps to reduce the risk of failure and thus, overall efforts and cost can be saved, human resource consumption can be optimized. Libraries can identify areas of improvements and developmental challenges. Therefore, a systematic method of data capturing is to be devised to build sufficient time series data in the long run. The involvement of multiple stakeholders like data scientists, statisticians, six sigma experts, business analysts, and related computer professionals will help in creating an analytical environment for futuristic libraries. Thus the library’s impactful contribution in business can be recognized and increases the sustainability likelihood of the library as a valued service provider.

Digital Transformation of Information Resource …

33

Acknowledgments Through this work we acknowledge department of library of Tata Consultancy Services for contributing their functional expertise and sharing learning experience. Acknowledge Analytics & Insights unit’s head and subject matter experts in developing cross-functional analytical minds while developing relevant offerings for digital transformation across the organization. This has created the opportunity to work with both Library unit and pool of data science practitioners within the organization.

References 1. P. Wen, Management changes and information resource management. Inf. Sci. 11, 1008–1011 (2000) 2. Mao, Lingxiang. (2015). Information Resources Management Framework for Virtual Enterprise 3. F.W. Horton, Information Resources Management: Harnessing Information Assets for Productivity Gains in the Office, Factory, and Laboratory (Prentice Hall, Upper Saddle River, 1985) 4. D. Tapscott, A. Caston, P. Shift, The New Promise of Information Technology (McGraw-Hill Companies, 1992) 5. J. Huang, Research on Integration Management of Enterprise Information Resource (Wuhan University of Technology, Wuhan, 2005), pp. 1–168 6. M. Gorman, Our Enduring Values: Librarianship in the 21st Century (American Library Association, Chicago, 2000) 7. M. Gorman, Collection development in interesting times: A summary. Lib. Collect. Acquis. Tech. Serv. 27, 459–462 (2003) 8. H.-L. Lee, What is a collection? J. Am. Soc. Inf. Sci. 51, 1106–1113 (2000) 9. H.-L. Lee, Information spaces and collections: Implications for organization. Lib. Inf. Sci. Res. 25, 419–436 (2003) 10. D. Savic, Evolution of information resource management. J. Lib. Inf. Sci. 24, 127–138 (1992) 11. J.M. Budd, B.M. Harloe, Collection development and scholarly communication in the 21st century: From collection management to content management, in Collection Management for the 21st Century: A Handbook for Librarians, ed. by G.E. Gorman, R.H. Miller (Westport Greenwood Press, 1997), pp. 3–25 12. Christiane Lehrer, How big data analytics enables service innovation: materiality, affordance, and the individualization of service. J. Manag. Inf. Syst. 35(2), 424–460 (2018) 13. J. Townes, Library improvement through data analytics (ALA Neal-Schuman, 2016) 14. Jennifer Townes, Library analytics and metrics: using data to drive decisions and services. Lib. J. 140(13), 113 (2015) 15. C. Moyce, Intelligent thinking: using cognitive analytics in organizational decision-making and planning. Manag. Serv. 62(2), 30–33 (2018) 16. Visnia Istrat, Sania Stanisavljev, Branko Makoski, The role of business intelligence in decision process modeling. Eur. J. Appl. Econ. 12(2), 44–52 (2015) 17. H. Kriegel, K.M. Borgwardt, P. Kröger, A. Pryakhin, M. Schubert, A. Zimek, Future trends in data mining. Data Min. Knowl. Disc. 1(1), 87–97 (2007) 18. J. Skinner, R.J. Higbea, D. Buer, C.C. Horvath, Using predictive analytics to align ED staffing resources with patient demand. Healthc. Financ. Manag. 2018, 1–6 (2018) 19. M. Wedel, P.K. Kannan, Marketing analytics for data-rich environments. J. Mark. 80(6), 97–121 (2016) 20. Dale Farris, Keeping up with the quants: your guide to using and understanding analytics. Lib. J. 139(16), 52 (2014)

34

P. P. Ramdasi et al.

21. Changhyun Kim, Ben Lev, Enterprise analytics: optimize performance, process, and decisions through big data. Interfaces 43(5), 495–497 (2013) 22. T.H. Davenport, Analytics 3.0. Harv. Bus. Rev. 91(12), 64–72 (2013) 23. Gartner Business Intelligence & Analytics Summit, March 2016 (2016)

Prediction of Stock Market Performance Based on Financial News Articles and Their Classification Matthias Eck, Julian Germani, Neha Sharma, Jürgen Seitz, and Preeti Prasad Ramdasi

Abstract Hundreds of stock market news reach the financial markets every day. In order to benefit from these movements as an investor, this work aims at developing a system with which the direction of the price fluctuations can be predicted. This paper is about building an own dataset of financial information with thousands of financial articles, and historical stock data are used to build the foundation of the following actions. This foundational data is used to build a learning algorithm. Therefore, the articles are normalized by the methods of natural language processing and converted into a matrix based on the occurrences of the individual words in the news. This matrix then serves as an endogenous variable that predicts the likely direction of market impact. In order to make that statement, the actual impact on the markets was used as an exogenous variable to train different classification algorithms. To find the best algorithm to train, we created a confusion matrix. In the end, the best algorithm gets selected to perform the prediction and as a result, our trained algorithm achieved high accuracy. Keywords Data mining · Logistic regression · Statistical method · Sentiment analysis · Knowledge discovery in database · Confusion matrix · Prediction

M. Eck · J. Germani · J. Seitz Duale Hochschule Baden-Württemberg, Wirtschaftsinformatik, Heidenheim 89518, Germany e-mail: [email protected] J. Germani e-mail: [email protected] J. Seitz e-mail: [email protected] N. Sharma (B) Society for Data Science, Pune, India e-mail: [email protected] P. P. Ramdasi TCS Data Office, Analytics & Insights, Tata Consultancy Services, Pune, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_3

35

36

M. Eck et al.

algorithm · Support vector machine · Random forest · Decision tree · Multinomial naive Bayes · Stock market · Natural languages processing

1 Introduction Stock markets play an essential role in growing industries and individuals that ultimately affects the economic growth globally. As the stock is more liquid in nature, a huge amount of transactions is done by the investors [1]. To take the maximum benefit regarding investing in the stock market, the investor tends to keep oneself updated through stock market news [2]. However, there is so much of news around price fluctuations that float in the financial market every day that it becomes difficult for investors to decide [2]. It is supposed to be a good idea to analyze the news articles and to predict the effects of financial news on stock performance. The potential to find atypical revenues in the stock market has always appealed the investors, companies and researchers with the same vigor. With the advent of technological development, there are more opportunities and avenues to address such social needs. Data mining process or KDD is used to analyze the data and get some insights from the dataset. The entire process is depicted in Fig. 1. KDD stands for Knowledge

Fig. 1. Knowledge Discovery in Databases (KDD) process [3]

Prediction of Stock Market Performance …

37

Fig. 2. Sub-processes of Knowledge Discovery in Databases (KDD) process [3]

Discovery in Database and includes the aspects of data mining, additional preprocessing of the data, interpretation of the patterns and evaluating the results [3]. There is a huge amount of data being generated at multiple data points and takes any form like text, audio/video in a structured/unstructured format. Data mining techniques work with such a huge collection of data to extract valuable, intended, meaningful data out of it, termed as knowledge. This could be a summary of available data, or any pattern of data or even trends derived from the data. Knowledge discovery in the database can be a repetitive process for continuous enhancements of the outcome of mining. The sub-processes are depicted in Fig. 2 and described below: (a) The process of data cleaning and data integration are fundamental steps in preprocessing of databases. This helps in removing noisy data and developing a complete data warehouse. (b) Data selection process extracts relevant data from the entire collection by making use of algorithms like Neural network, Decision Trees, Naive Bayes. Clustering, Regression, etc. (c) This data is then transformed into an appropriate form to be used in the mining procedure. Data mapping and code generation are an integral part of the transformation process. (d) Actual work of data mining incorporates the usage of classification or characterization to derive patterns out of transformed data. (e) Summarization or visualization techniques are great tools for Pattern evaluation (f) Finally, data mining results are presented in terms of tables, reports, charts, discriminant rules, classification rules, etc. In this paper, the attempt is to achieve this prediction by analyzing financial news and the reaction of the stock performance. With this information, we teach an algorithm which now can predict future stock performance. This paper intends to present encouraging results of the work done on developing a predictive system that trace the direction of the price fluctuations. The work incorporates building dataset of financial information, building learning algorithm and applying NLP and data mining techniques. As a result, our trained algorithm achieved a high degree of accuracy in predicting the direction of price fluctuation. In the following part, it will be described how our dataset is composed, why we chose those data and where we get the data

38

M. Eck et al.

from. Special about our dataset is, that we did not choose an existing dataset, we created our own. We decided to proceed with the KDD Process and structured the documentation by the same pattern.

2 Literature Review The review of literature clearly indicates the correlation between news feed and stock price fluctuation to emphasize on the power of verbal information on the stock market [4–6]. In comparable works, such as Klein et al., statements are often made through a sentiment analysis [7]. Hereby, messages are searched for relevant sentences. These are then rated based on the words included. The sentiment of the sentence is correspondingly the sum of the sentiments of the used positive and negative words. In the present work, however, the evaluation of the entire message is based on the actual movement of the markets and not on the moods of individual words and sentences. Multiple similar works done independently by Li et al. and Schumaker et al. revealed that stock returns are assured when there is a sentiment element in financial news articles [4, 5, 8, 9]. However, it is challenging to capture the influence of media on stocks and explore such connections. Advanced analytics like artificial intelligence, machine learning, deep learning and natural language processing techniques have been adopted and applied to address these challenging situations [10–18]. Schumaker analyzed financial news articles using support vector machine which is a predictive machine learning approach [5]. The results were quite encouraging as the stock price can be calculated within 20 min of the release of financial news [5]. Jageshwer Shriwas et al. used complex nonlinear multidimensional time series data to propose a model using neural networks to accurately predict the closing price of the BSE. The accuracy of the model determines the benefits to the investors [19]. Karthik et al. used statistical techniques and data mining techniques to analyze stock market data and predict the market price [20]. Data mining techniques like clustering and weighted rule mining were used to analyze data. K-mean clustering algorithm was used to group transactions as per market flow. The authors further recommended the most effective technique for prediction [20]. Stephen Evans used Russell 1000 and Russell 2000 stock indexes to predict the direction of daily return using a classification algorithm which is a data mining technique [21]. Model testing was done using distributed data mining classifiers, which showed 60% accuracy [21]. Anil Rajput et al. used WEKA tool to apply decision tree and rule based classification on historical BSE stock data [22]. Prasanna et al. presented a literature review of various approaches made by researchers to predict stock price [23]. The authors studied the fundamental approach as well as data mining techniques that have been used for the evaluation of historical stock prices and estimated appropriate financial pointers [23]. Pinto et al. suggest a framework for making decision regarding trade and predict stock market direction. The authors

Prediction of Stock Market Performance …

39

have used the power of neural network to predict the closing price of stock market for a typical trading day [24]. Besides, we also reviewed the literature that presented the use of various data mining techniques and algorithms for getting glaring insights from different datasets. These papers also provided some direction on how to locate the data, pre-process it and prepare it for analysis [25–31].

3 Material Methods We decided to proceed with the KDD Process and structured the documentation by this pattern. a. Selection of the dataset The dataset is a combination of financial news and stock data. The financial news consists of various information which is described below: (i) Financial news: We are collecting the financial data from a German financial news provider called OnVista. They have a professional finance news portal where we get our verified financial articles from. Through the Python-based framework, BeautifulSoup, we have the ability to scan the website for the required information. It is possible to take three major information from those articles: 1. The article itself— Through the article, we get important information about what is going on with the company, which the article is about. This information is the base for our dataset. Through Natural Languages Processing, it is possible to reduce the article to a minimum, which includes important verbs and nouns but no filler words. This builds our base information for the dataset. 2. Date and title— It is important to account for date and title. That information must be attached to the article in the dataset. Those key figures are important for the analysis later on. 3. Stock name and notation id— The stock name is also part of the article. It must be recognized and saved separately to work with it. To look up for the stock performance, we also need to extract the internal notation id generated by OnVista. With the notation id, it is possible to get a list of the last year performance. This information gets as well attached to the dataset. (ii) Stock data: Like the financial news, we extracted the stock data from the German financial provider OnVista. This stock data is structured as follows:

40

M. Eck et al.

Table 1. The structure of the captured data Article Title

Stock Date

Text

Length

Name

Development Start value

End value

Development rate

Development rate compared with index

1. Date of the trading day— This is important to have the ability to search for a context between stock data and the articles. 2. Opening and closing price— This data represents the price with which the stock started and closed at the stock market. Those values can also be used to look for extraordinary stock performance. 3. Highest and lowest price— With the highest and lowest price of the day, the span of price performance can be calculated. 4. Volume of trades— Through this indicator, it can be determined how many stocks of this title were traded. With those two parts, the raw data of the dataset is complete. In the next step, those data get edited. In this process, the Python library Pandas and SpaCy are the tools of choice. b. Preprocessing and transformation The last step of creating the dataset is preparing the raw data for the learning algorithm. Therefore, we have to calculate additional fields out of the existing data, like the extraction of the raw article text from the HTML-structure given by OnVista. All prepared data got structured with the help of Python’s library pandas. The structure of the captured data is presented in Table 1 below. Through this tabular structured form, it was possible to go further with SpaCy. SpaCy is a Python library for natural language processing. There are several tools on the market, but SpaCy has a very good German implementation and a specific database for the German language. Within the fact that our data is written in the German language, a tool made for German is the best possible choice. With the help of SpaCy, the raw data got prepared with the following processing steps: 1. 2. 3. 4. 5. 6.

Tokenization Removing of stop-words and punctuations Lemmatization to unify the words Vectorization to turn the words into numerical features Count the occurrences of each feature Transform the occurrences into frequencies.

Prediction of Stock Market Performance …

41

First, a Tokenization was done, where the text got segmented into words. In the same step, we removed so-called stop-words. These are commonly used words like “the”, “a”, “in”, “an”, which we don’t need for the key message of a sentence. Next, the so-called Lemmatization was accomplished. Through this, it is possible to assign the base forms of words. For example, the lemma of “was” is “be” [18]. To train a model, we have to transform the text data into numerical data. Therefore, we processed the vectorization. The library scikit-learn supports the class CountVectorizer, which will build feature vectors of each word and count the occurrences of each feature. The additional class TfidfTransformer takes care of the different lengths of the whole texts of each news article. It will divide the number of occurrences of each word in the article by the total number of words in the whole article. The result is called the term frequency (tf). The “idf” stands for inverse document frequency which means the downscale of words that occur in many articles in the corpus and are therefore less informative. Both classes were used as a combination in the class TfidfVectorizer [16]. To be able to make a prediction if the markets will be stronger or weaker depending on the articles, we need to make a classification of the result of the articles. Those classifications assign categories or labels as parts of a document. Thereby it is possible to classify the text into the two made categories “positive” and “negative”. Each of them describes the performance of the market depending on an individual article. If the price will be higher after 14 days, the article will be rated as “positive” otherwise as “negative”. After those steps, the data is now prepared for training the algorithm. c. Data Mining In the following part, we compared different algorithm to succeed with the best possible result. (i) Selection process We decided to investigate the following commonly used methods [15]: Logistic Regression: Logistic regression is a predictive analysis used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. Multinomial Naive Bayes: The multinomial Naive Bayes classifier is suitable for classification with discrete features for word counts for text classification. The multinomial distribution requires integer feature counts. Decision Tree: Decision trees are a popular and powerful tool used for classification and prediction purposes. They provide a convenient alternative for viewing and managing large sets of business rules, allowing them to be translated in a way that allows humans to understand them and apply the rule constraints in a database so that records falling into a specific category are sure to be retrieved. Random Forest:

42

M. Eck et al.

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes or mean prediction of the individual trees. Support Vector Machine: Support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. They are very efficient in classification with a high precision and low error probability. All the algorithms of data mining mentioned in this section were applied to the dataset created and were implemented using Python library scikit-learn. The results got analyzed in the following captor.

4 Interpretation and Evaluation To apply these models to our database, we have divided them into training and test data. In this way, the trained model can be examined for its validity based on the test data. A common method for checking the classification results is a confusion matrix. Here, the actual classes are compared with the predicted classes by the respective model. This results in the following key features [32]: Accuracy: What percentage of the predictions of all classes are correct? Precision: What percentage of predictions in a particular class are correct? Recall: What percentage of the actual occurrences of a particular class was predicted? Table 2 lists the key figures of all models. It can be observed that the Support Vector Machines return the best result with an accuracy of 70%. Above all, the classification into the class “positive” is very reliable. 71% of all classifications are correct and 81% of all actual occurrences of the class are recognized by the model, which is a good result. Table 2. Comparing the results of various data mining algorithms Logistic regression (%) Accuracy Negative Positive

Multinomial Naive Bayes (%)

Decision Tree (%)

Random Forest (%)

Support Vector Machines (%)

66

58

59

67

70

Precision

67

84

52

73

69

Recall

41

3

52

35

55

Precision

66

58

64

65

71

Recall

85

100

64

90

81

Prediction of Stock Market Performance …

43

Compared with the work of Klein et al., it is clearly evident that the improvement in overall accuracy has been achieved. While these weaken in the precision of the positive values, our Vector Support Machine has been able to achieve largely constant results for positive and negative values.

5 Conclusion The paper showed how financial news can be used to predict the future performance of the stock market. Comparing different models showed that Support Vector Machines delivered the best results. If these results are linked to the theory of money management in the financial sector, this accuracy can make profitable trading on the stock exchange possible. With a conservatively chosen opportunity-risk ratio of 2–1 and the aforementioned accuracy, an expected value of 1.40 is already achieved.

References 1. E.F. Fama, Eficient capital markets: a review of theory and empirical work. J. Financ. 25, 383–417 (1970) 2. J.B. DeLong, A. Shleifer, L.H. Summers, R.J. Waldmann, Noise trader risk in financial markets. J. Polit. Econ. 98, 703–738 (1990) 3. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From data mining to knowledge discovery in databases. AI Mag. 17(3), 37 (1996) 4. F. Li, Do stock market investors understand the risk sentiment of corporate annual reports? (2006) 54. Working Paper 5. R.P. Schumaker, H. Chen, Textual analysis of stock market prediction using breaking financial news: the AZFin text system. ACM Trans. Inf. Syst. 27, 12:1–12:19 (2009) 6. J. Bollen, A. Pepe, H. Mao, Twitter mood predicts the stock market. J. Comput. Sci. 2, 1–8 (2011) 7. A. Klein, O. Altuntas, T. Hausser, W. Kessler, Extracting investor sentiment from weblog texts: a knowledge-based approach, in IEEE (2011), pp. 1–9 8. R.P. Schumaker, Y.L. Zhang, C.N. Huang, H. Chen, Evaluating sentiment in financial news articles. Decis. Support. Syst. 53, 458–464 (2012) 9. R.P. Schumaker, N. Maida, Analysis of stock price movement following financial news article release. Commun. IIMA 16(1) (2018). Article 1 10. A.S. Abrahams, J. Jiao, G.A. Wang, W.G. Fan, Vehicle defect discovery from social media. Decis. Support Syst. 54, 87–97 (2012) 11. V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, J. Allan, Language models for financial news recommendation, in Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM) (2000), pp. 389–396 12. T. Loughran, B. McDonald, When is a liability is not a liability? Textual analysis, dictionaries, and 10-ks. J. Financ. 66, 35–65 (2012) 13. M.A. Mittermayer, G.F. Knolmayer, Newscats: a news categorization and trading system, in Proceedings of the 6th International Conference on Data Mining (ICDM) (IEEE), pp. 1002– 1007 14. R.P. Schumaker, H. Chen, Evaluating a news-aware quantitative trader: the effect of momentum and contrarian stock selection strategies. J. Am. Soc. Inform. Sci. Technol. 59, 247–255 (2008)

44

M. Eck et al.

15. S. Bird, E. Klein, E. Loper, Natural language processing with python: analyzing text with the natural language toolkit (O’Reilly Media, Inc., 2009) 16. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011) 17. S. Dumais, J. Platt, D. Heckerman, M. Sahami, Inductive learning algorithms and representations for text categorization (1998) 18. C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, June 2014 (2014), pp. 55–60 19. J. Shriwas, S. D. Sharma, Stock price prediction using hybrid approach of rule based algorithm and financial news. Int. J. Comput. Technol. Appl. 5(1), 205–211 20. S. Karthik, K.K. Sureshkumar, Analysis of stock market trend using integrated clustering and weighted rule mining technique. Int. J. Comput. Sci. Manag. Res, 1(5) (2012). ISSN 2278–733X 21. S. Evans, Data mining in financial markets (2011) 22. A. Rajput, S.P. Saxena, R. Prasad Aharwal, R. Soni, Rule based classification of BSE stock data with data mining. Int. J. Inf. Sci. Appl. 4(1) (2012). ISSN 0974-2255 23. S.Prasanna, D. Ezhilmaran, An analysis on stock market prediction using data mining techniques. Int. J. Comput. Sci. Eng. Technol. (IJCSET) 24. M.V. Pinto, K. Asnani, Stock price prediction using quotes and financial news. Int. J. Comput. Sci. Eng. Technol. (IJCSET) (2011) 25. N. Sharma, H. Om, Significant patterns for oral cancer detection: association rule on clinical examination and history data. Netw. Model. Anal. Health Inf. Bioinform. 03 (2014). Springer, Wien. Article 50, December 2013 26. N. Sharma, H. Om, Early detection and prevention of oral cancer: association rule mining on investigations. WSEAS Trans. Comput. 13, 1–8 (2014) 27. N. Sharma, H. Om, Correlation neural network model for classification of oral cancer. WSEAS Trans. Biol. Biomed. 11, 45–51 (2014) 28. N. Sharma, H. Om, Using MLP and SVM for predicting survival rate of oral cancer patients. Netw. Model. Anal. Health Inf. Bioinform. 3 (2014). Springer, Wien. Article 58 29. N. Sharma, H. Om, Usage of probabilistic and general regression neural network for early detection-prevention of oral cancer. Sci. World J. (Hindawi Publishing Corporation) 2015, 1–11 (2015). 234191 30. N. Sharma, H. Om, GMDH polynomial and RBF neural network for oral cancer classification. Netw. Model. Anal. Health Inf. Bioinform. 04(1) (2015). Springer, Wien 31. N. Sharma, H. Om, Hybrid framework using data mining techniques for early detection and prevention of oral cancer. Int. J. Adv. Intell. Paradig. Indersci. 09(5/6), 604–622 (2017) 32. S. Dumais, J. Platt, D. Hecherman, S. Sahami, Inductive learning algorithms and representations for text categorization, in Proceedings of the Seventh International Conference on Information and Knowledge Management 148–155. https://doi.org/10.1145/288627.288651

Big Data Management

Automatic Standardization of Data Based on Machine Learning and Natural Language Processing Ananya Banerjee, Kavya SreeYella, and Joy Mustafi

Abstract The proposed system is capable to build an intelligent service using cloud data with the following features: The collaborative cloud data sets can be integrated from various ocular sources and standardized to have a uniformity; The integrated standard data set is then processed and transformed to generate features for machine learning models automatically; The predictive machine learning models can be trained with the stratified random sampled data and ranked features from the transformed datasets. Keywords Natural language processing · Neural networks · Machine learning · Sampled data · Standardization · Metrics · Deep learning

1 Introduction The standardization of data mainly focuses on reading the unclean data from the text file and analyze the structure, content, and quality of data. Based on the context of the column in the given test set of data, the attribute of data is found. In this project magnanimous data is dealt, which needs to be automated mostly to overcome the manual overhead to get the context of the data and replace the anomalies with a proper metric system. Also handling of missing data, cleanup of data, and visualization of data are required to get the data in a proper shape on which we can perform context analysis in NLP.

A. Banerjee (B) · K. SreeYella · J. Mustafi MUST Research Organisation, Hyderabad, India e-mail: [email protected] K. SreeYella e-mail: [email protected] J. Mustafi e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_4

47

48

A. Banerjee et al.

2 Predictive Models Collection of proper datasets: This is one of the most important initial parts where it is required to search through various machine learning repositories and collect different datasets from different domains so as to get the subject/context of data easily. A few datasets namely— Metro_Interstate_Traffic_Volume.csv, england_crime_dataset, sloan-digital-sky-survey, rainfall_india_in_mm, Heart Disease Data Set_files, winequality-red were collected and analyzed. Each dataset contains different parameters as columns and with different metrics and are totally unmatching ones that fitted quite well for unit-based standardizing. Datasets referenced: Mostly all the datasets were referenced from the UCI Repository, the link of which is given below— http://archive.ics.uci.edu/ml/datasets/. – Architecture Initially, after reading the dataset into a data frame, missing values were filled as null and the dataset was normalized so that the range of most of the values is quite similar to one another. Thereafter, the following steps were done— Entities are derived and extracted—for e.g., units, dollar amounts, key initiatives, etc. Content is categorized–Sentiment Analysis used (positive/negative), also categorized by function, intention or purpose, by industry or other categories for trending and analytics. Content is clustered—identifying main topics of discourse and/or to discover new topics. Data points are clustered based on the similarity of properties or features. Highly dissimilar data points belong to different clusters. Extraction of facts—From the dataset by analyzing the data. Structured information for visualization, trending, or alerts were extracted. Extraction of relationships—explore relationships among data by filling out graph databases. For achieving these all steps, a train test split of data was done with a separate validation set as well, so as to train and build the model with a high precision and accuracy score. Concepts and libraries used—POStagging, Stemming, Tokenizing, Lemmatizing, Sentiment Analysis, SciKitLearn, NumPy, Pandas, Neural Network (RNNs and LSTMs, preferably), etc.

Automatic Standardization of Data Based on Machine …

49

Few datasets were taken on which the methods mentioned above are applied. The detailed description is given below— 1. Metro_Interstate_Traffic_Volume.csv Dataset Description— Column names— holiday—Categorical US National holidays plus regional holiday. temp—Numeric Average temp in Kelvin. rain_1h—Numeric Amount in mm of rain that occurred in the hour. snow_1h—Numeric Amount in mm of snow that occurred in the hour. clouds_all—Numeric Percentage of cloud cover. weather_main—Categorical Short textual description of the current weather. weather_description—Categorical Longer textual description of the current weather. date_time—DateTime Hour of the data collected in local CST time. traffic_volume—Numeric Hourly I-94 ATR 301 reported westbound traffic volume. 2. Heart disease— https://archive.ics.uci.edu/ml/datasets/Heart+Disease Dataset Information– This dataset contains 76 attributes, though generally a subset of 14 is used from the 76 attributes. The final outcome is to detect the presence of heart disease in the patient. It varies from 0 (no presence) to 4. 3. Diabetes—https://archive.ics.uci.edu/ml/datasets/diabetes Column NamesFile Names and format: (1) (2) (3) (4) 4.

Date in MM-DD-YYYY format. Time in XX: YY format. Code. Value. Breast Cancer—https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wiscon sin+%28Diagnostic%29.

50

A. Banerjee et al.

3 Model Estimation 3.1 Model Estimation of a Single Dataset The Metro_Interstate_Traffic_Volume.csv dataset of first 5 column preview is as shown in Fig. 1. As observed, the columns like weather_main and weather_description contain some textual data which has some information about the traffic volume. So, the context of the word description which is impacting traffic volume rate was extracted. Then concatenation of the two mentioned columns was done, then the most commonly occurring as well as the less common words were identified. Next, the text preprocessing was done which involved Stemming and Lemmatization and punctuations, tags, stopwords were removed and a corpus was prepared after cleaning up. BOW (Bag of words) concept was used and for that CountVectoriser was applied to tokenize the text and build a vocabulary of known words. 1-gram model is visualized by a bar chart given in Fig. 2. Next, Tf-Idf scores were used which gave the scores for a particular corpus, and we may use it if we have more textual data. For example, it will give the output in this format [1]. Keywords: cloud 0.581. overcast cloud 0.407. overcast 0.407. cloud overcast cloud 0.407. cloud overcast 0.407. In the same way, if the column names are changed as given as in Fig. 3. After text preprocessing, keywords derived are

Fig. 1 First 5 column preview of Metro_Interstate_Traffic_Volume dataset

Automatic Standardization of Data Based on Machine …

51

Fig. 2 Visualization of 1-gram model of Metro_Interstate_Traffic_Volume dataset

Fig. 3 Metro_Interstate_Traffic_Volume dataset after column name changes

Based on the contextual keywords, the metrics are derived from the dataset. Therefore, from the training data, the type of dataset is understood. In the test data, if any of the keywords are present, it is automatically standardized based on the context analysis done previously using NLP which was used to break down phrases or sentences to n-grams.

3.2 Domain Identification from Multiple Datasets By text processing, the NLP identifies a series of disease-relevant keywords in medical terms based on the datasets of each disease. Datasets are collected from different healthcare domains. As of now, 3 datasets were used to analyze. Heart disease—https://archive.ics.uci.edu/ml/datasets/Heart+Disease.

52

A. Banerjee et al.

Diabetes—https://archive.ics.uci.edu/ml/datasets/diabetes. Breast Cancer—https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wiscon sin+%28Diagnostic%29. Following steps were done after collection of datasets— 1. The column names or the description of columns of individual datasets were grouped using NLP (mostly tokenization). Here 3 groups are formed for each dataset. 2. Labels (category) for each group were created That is, in this case, labels are Heart disease, Diabetes, Breast Cancer to their respective group of words. 3. Data cleaning—removing irrelevant data was done. For example, alphanumeric characters and code numbers were removed. Patients name information was not useful. 4. The Machine Learning techniques were applied next to train the model. 5. Once the text preprocessing methods were applied, the following keywords corresponding to each dataset were extracted as mentioned below— Diabetes dataset—

Breast cancer dataset—

Heart Disease dataset—

Automatic Standardization of Data Based on Machine …

53

6. The model was then applied on test dataset which had a combination of all columns from Heart disease, Diabetes, Breast Cancer datasets taken from other healthcare datasets. 7. Accordingly, predictions were made from the model. For making the predictions, a data dictionary was created which had the key as type or name of the dataset it belongs to. After applying the training model to the test dataset, the model was able to: identify which keyword is coming from which dataset. The output which identifies the type of disease is as given below— If the test dataset contains the following keywords:

Then the output file resembles like blocker –>heart_disease ramus –>heart_disease ingestion –>diabetes Adhesion –>breast_cancer Mitoses –>breast_cancer 8. The same model can be extended to more categories of diseases.

54

A. Banerjee et al.

4 Conclusions There are a few challenges to be resolved going forward: 1. NLP techniques to be applied after identifying the metric of the particular column based on the keywords, so that column names are automatically taken from the train data and applied on the test data. 2. Modifications in preprocessing techniques to remove random expressions. – Working on challenges and possibilities of any further scenarios.

References 1. R.K. Ando, T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. (JMLR) 6, 1817–1953, p. 14 (2005)

Analysis of GHI Forecasting Using Seasonal ARIMA Aditya Kumar Barik, Sourav Malakar, Saptarsi Goswami, Bhaswati Ganguli, Sugata Sen Roy, and Amlan Chakrabarti

Abstract A precise understanding of solar energy generation is important for many reasons like storage, delivery, and integration. Global Horizontal Irradiance (GHI) is the strongest predictor of actual generation. Hence, the solar energy prediction problem can be attempted by predicting GHI. Auto-Regressive Integrated Moving Average (ARIMA) is one of the fundamental models for time series prediction. India is a country with significant solar energy possibilities and with extremely high weather variability across climatic zones. However, rigorous study over different climatic zones seems to be lacking from the literature study. In this paper, 90 solar stations have been considered from the 5 different climatic zones of India and an ARIMA model has been used for prediction for the month of August, the month with most variability in GHI. The prediction of the models has also been analyzed in terms of Root Mean Square Error. The components of the AR models have also been investigated critically for all climatic zones. In this study, some issues were observed for the ARIMA model where the model is not being able to predict the seasonality that is present in the data. Hence, a Seasonal ARIMA (SARIMA) model has also been used as it is more capable in case of seasonal data and the GHI data exhibits a strong seasonality pattern due to its availability only in the day time. Lastly, a comparison has also been done between the two models in terms of RMSE and 7 days Ahead Prediction. Keywords Solar energy · GHI · Time series · ARIMA

A. Kumar Barik (B) · S. Malakar · A. Chakrabarti A.K. Choudhury School of Information Technology, University of Calcutta, Kolkata, India e-mail: [email protected] S. Goswami Bangabasi Morning College, Kolkata, India B. Ganguli · S. Sen Roy Department of Statistics, University of Calcutta, Kolkata, India © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_5

55

56

A. Kumar Barik et al.

1 Introduction The conventional energy sources are being used for a long period of time and hence their reserves are about to end. These sources of energy have a lot of advantages like availability, easy productivity. But there are also a number of drawbacks and one of them is it is extensively harmful to our nature which has more impact than its advantages. On the other hand, the unconventional sources of energy have no end. These renewable sources of energy leave no pollutant to the environment, which is directly related to global warming. Solar energy is one of the most important renewable energy sources. Solar energy is derived directly from the sun in the form of radiation [11]. To use solar energy with conventional energy, prediction of solar energy is important for various reasons like planning of operation, energy delivery, use of storage, grid integration, etc. Solar radiation prediction is necessary as it has been identified as one of the most important parameters in the design of solar energy conversion devices [11]. In the context of predicting solar energy, the variability of the data is the main challenge as the weather situation over different states of India, as well as different climatic zones, varies a lot. In India, demand for energy is very high and on the other hand, India is one of the top countries of the world in the sense of solar energy potential due to its large size and a position around the line of tropics. Hence thorough research on solar energy is very important in India. The goal of this work is to find a practical solution for the real-world problem. Countries in the equatorial and tropical zone receives a good amount of solar radiation and since India is one of these countries so this study is quite important in the context of solar energy prediction. Since the GHI data is a time-dependent sequential data, hence a time series model would be useful to predict the GHI for future times. For time series analysis, obtaining continuous solar data is very important to forecast for future days. In this paper, 90 solar stations have been considered from the five different climatic zones of India and a Seasonal ARIMA (SARIMA) model has been used to predict the last week of the month of August. The prediction of the models has also been analyzed in terms of Root Mean Square Error.

2 Literature Study In the literature, many studies have been done over solar radiation prediction and several models have been used. These include: Time Series, Auto-Regressive Integrated Moving Average (ARIMA) Analysis [6, 8, 12], regression analysis [12], Bayesian inference [10]. In this paper, [3], ARMA and ARIMA models are compared in terms of prediction accuracy for the multi-period prediction of GHI. From this analysis, it is clear that the ARIMA model has better accuracy in the multi-period prediction approach.

Analysis of GHI Forecasting Using Seasonal ARIMA

57

In this paper, [4], ARIMA model has been trained on GHI data of March 2016 collected from The Petroleum Institute, Abu Dhabi, UAE and a prediction has been done for 31 March 2016, based on the previous 30 days, and the hourly forecast has a R 2 value of 88.63%. In this paper, [9], weather and GHI data of 5 different station from 2008 to 2010 in India has been considered and four different models, algebraic CurveFit, Multiple Linear Regression, ANN, SVM regression, have been trained to predict the GHI for the year of 2011 and it is observed that SVM regression model has outperformed all the other three models in terms of accuracy. In this paper, [16], Malaysia crude oil production monthly data from January 2005 to May 2010 were analyzed using ARIMA model, where the stationarity of the data and the model parameters are identified using the auto-correlation (ACF) and the partial auto-correlation (PACF) plot. Then the validity of the model was tested using Box-Pierce statistic and Ljung–Box statistic. In this paper, [7], 56 Indian stocks were selected from different sectors and the effectiveness of the ARIMA model was studied. Later, the models were analyzed based on different sectors and the prediction accuracy was also being tested for different spans of previous period data. In this paper, [13], the maternal mortality ratios were studied from the year 2000– 2010 at the Okomfo Anokye Teaching Hospital in Kumasi. ARIMA model was used (according to Box–Jenkins approach [1]) to examine the MMR. In this paper, [15], 3 different models are used to predict GHI for two weather stations Miami and Orlando in the USA. In model 1, GHI data are used to predict next hour GHI through additive seasonal decomposition followed by an ARIMA model. In model 2, GHI has been separated into DHI and DNI and both of them are predicted separately to predict GHI by the same procedure as of model 1 and in model 3 cloud cover effects are considered to predict GHI. In this paper, [5], GHI data of Thailand has been considered, where there were a number of missing values and this paper suggests a process of imputing those missing values and also used a SARIMA model to predict GHI after the missing value imputation.

3 Time Series Analysis Time series analysis is a statistical technique that deals with time series data. Time series data means that data is in a series of particular time periods or intervals. A time series is a sequence of observations taken sequentially in time. Time series are of two types: discrete and continuous. The discrete-time series is the one in which the observations are made at fixed time intervals, whereas in a continuous time series, the observations are made at any time intervals [2]. Time series data also can be classified into two types: stationary and nonstationary. In stationary time series, the mean, variance, and covariance remain constant over the time whereas for the non-stationary time series they are not constant

58

A. Kumar Barik et al.

over time [1]. In time series analysis of non-stationary data, our first objective is to make the data stationary. Different time series forecasting models have been developed so far including Auto-Regressive (AR), Moving Average (MA), Auto-Regressive Moving Average (ARMA), Auto-Regressive Integrated Moving Average (ARIMA), Seasonal Auto-Regressive Integrated Moving Average (SARIMA) [1]. Among these, AutoRegressive (AR), Moving Average (MA), Auto-Regressive Moving Average (ARMA) are generally used for forecasting stationary trends while the others are used for forecasting non-stationary trends. Here, we use SARIMA models, since it can deal with both stationary and non-stationary time series data and also it is able to account the seasonality present in the data, where ARIMA models are not so good when it comes to seasonality though it can handle stationary and non-stationary time series data quite well.

3.1 ARIMA and SARIMA Model Here, we define a process called White Noise Process {t }, where {t } is a sequence of uncorrelated random variables, each with zero mean and variance σ 2 [2].

3.1.1

ARIMA

Now, let {X t } be a sequence of time series observation, where t = {0, 1, 2, . . .}. Now we define a Back-shift Operator B such that B X t = X t−1 , where X t and X t−1 are two consecutive time series observation, then B d X t = X t−d . A non-stationary time series can be made stationary by applying the Differencing Operator ∇ = (1 − B) repeatedly. If the original process {X t } is not stationary, then we can look for dth-order difference process, i.e., ∇ d X t = (1 − B)d X t . If we ever find that the differenced process is a stationary process, i.e., {∇ d X t } is stationary for some positive integer d, where {∇ d−1 X t } is non-stationary then we can say that {X t } is integrated of order d. Now, an auto-regressive process of order p i.e. A R( p) is defined by X t − θ1 X t−1 − θ2 X t−2 − · · · − θ p X t− p = θ (B) X t where θ (B) = 1 − θ1 B − θ2 B 2 − · · · − θ p B p and θ p = 0. Now, a moving average process of order q, i.e., MA(q) is defined by t + φ1 t−1 + φ2 t−2 + · · · + φq t−q = φ(B) t

Analysis of GHI Forecasting Using Seasonal ARIMA

59

where φ(B) = 1 + φ1 B + φ2 B 2 + · · · + φq B q and φq = 0. Hence, the process {X t } is said to be an ARIMA ( p, d, q) process if θ (B) ∇ d X t = c + φ(B) t where θ (B) = 1 − θ1 B − θ2 B 2 − · · · − θ p B p and φ(B) = 1 + φ1 B + φ2 B 2 + · · · + φq B q Here, c is an arbitrary constant, t is an independent random variable of time step t of the white noise process, θ (B) is an auto-regressive polynomial of order p and φ(B) is a moving average polynomial of order q. The model is written as A R I M A ( p, d, q) model, where p, d and q are model orders.

3.1.2

SARIMA

Now, let {X t , t = 0, 1, 2, . . .} be a sequence of time series observations where seasonality is present. Let st be the seasonal component present at time t in {X t } i∗m  such that if m be the period of seasonality then st = st−m and st = 0 for all t=i∗1

i = 0, 1, 2, . . . . Now, a non-stationary seasonal (of order m) time series can be made stationary by applying the Differencing Operator ∇m = (1 − B m ) repeatedly (where B is a Back-shift Operator such that B n X t = X t−n for all n = {0, 1, 2, . . .}), which means we will look for Dth-order difference process, i.e., ∇mD X t = (1 − B m ) D X t for which {∇mD X t } will be seasonal stationary. In other words, if we ever find that {∇mD X t } is seasonal stationary for some positive integer D, where {∇mD−1 X t } is not seasonal stationary then we can say that {X t } is seasonally integrated of order D. Now, a seasonal auto-regressive process of order P, i.e., A R(P)m is defined by X t − Θ1 X t−m − Θ2 X t−2m − · · · − Θ P X t−Pm = Θ(B m ) X t where Θ(B m ) = 1 − Θ1 B m − Θ2 B 2m − · · · − Θ P B Pm and Θ P = 0. Now, a seasonal moving average process of order Q, i.e., M A(Q)m is defined by

60

A. Kumar Barik et al.

t + Φ1 t−m + Φ2 t−2m + · · · + Φ Q t−Qm = Φ(B m ) t , where Φ(B m ) = 1 + Φ1 B m + Φ2 B 2m + · · · + Φ Q B Qm and Φ Q = 0. Hence, the process {X t } is said to be an S A R I M A ( p, d, q) (P, D, Q)m process if Θ(B m ) θ (B) ∇mD ∇ d X t = c + Φ(B m ) φ(B) t , where Θ(B m ) = 1 − Θ1 B m − Θ2 B 2m − · · · − Θ P B Pm and θ (B) = 1 − θ1 B − θ2 B 2 − · · · − θ p B p and Φ(B m ) = 1 + Φ1 B m + Φ2 B 2m + · · · + Φ Q B Qm and φ(B) = 1 + φ1 B + φ2 B 2 + · · · + φq B q Here, c is an arbitrary constant, t is an independent random variable of time step t of the white noise process, Θ(B m ) is a seasonal auto-regressive polynomial of order P, θ (B) is an auto-regressive polynomial of order p, Φ(B m ) is a seasonal moving average polynomial of order Q, φ(B) is a moving average polynomial of order q and m is the period of seasonality. The model is written as S A R I M A ( p, d, q) (P, D, Q)m model, where p, d, and q are model orders, P, D, and Q are seasonality orders and m is the period of seasonality.

4 Data Here we have hourly data of solar radiation of a several number of stations all over India. India lies to the north of the equator between 6◦ 44 and 35◦ 30 north latitude and 68◦ 7 and 97◦ 25 east longitude. We have randomly chosen a total of 90 stations from all the states of India and took the Global Horizontal Irradiance (GHI) data for all these stations of the year 2014 from the website of NSRDB [14]. Later, all the 90 stations are categorized under 5 subgroups according to the 5 major climatic zones of India, viz., “Cold & Cloudy”, “Cold & Sunny”, “Composite”, “Hot & Dry”, “Warm & Humid” according to standard convention. Here is the list of states under each climatic zone and the number of stations considered from each state is mentioned in parenthesis.

Analysis of GHI Forecasting Using Seasonal ARIMA

61

Table 1 An overview of GHI for different climatic zones of India Climatic Zone Minimum Maximum Mean Median Composite Cold & Cloudy Cold & Sunny Hot & Dry Warm & Humid

8 1 1 1 1

974 1044 1053 996 990

462.905 411.958 419.552 459.467 422.099

447 390 394 454 390

Standard deviation 277.629 266.006 267.798 274.974 277.874

Cold & Cloudy: Himachal Pradesh(3), Manipur(3), Meghalaya(3), Mizoram(3), Nagaland(3), Uttarakhand(3). Cold & Sunny: Arunachal Pradesh(3), Jammu & Kashmir(3), Sikkim(3). Composite: Chandigarh(2), Chattisgarh(3), Haryana(3), Madhya Pradesh(3), Maharashtra(2), Punjab(3), Telangana(3), Uttar Pradesh(3). Hot & Dry: Gujarat(3), Karnataka(2), Rajasthan(3). Warm & Humid: Andhra Pradesh(3), Assam(3), Bihar(3), Goa(3), Jharkhand(3), Karnataka(1), Kerala(3), Maharashtra(1), Odisha(3), Puducherry(1), Tamil Nadu(3), Tripura(3), West Bengal(3). Here, the Table 1 shows an overview of our data that how it is spread and how it differs from one climatic zone to another. From Table 1, we can clearly see that the climatic zone Cold & Sunny receives highest GHI among all the zones, whereas, the Warm & Humid, Composite and Hot & Dry climatic zones have higher deviation in data compared to Cold & Cloudy and Cold & Sunny climatic zones. Now, let us have a look at the boxplot of these regions (Fig. 1) which shows the Minimum, Maximum, Median, Quartile Deviation and outlier (if any) for the 5 climatic zones simultaneously. Here, from the boxplot, we can see that Cold & Cloudy and Cold & Sunny climatic zones experiences higher GHI with lower variability whereas, the other 3 climatic zones Composite, Warm & Humid and Hot & Dry have relatively lower GHI with significantly higher variation.

5 Methodology Here, in the analysis and forecasting process, GHI is used as the input of the model. Since GHI is not available for 24 h of a day, hence at first we extracted the GHI values only for the day time, i.e., from 6:30 to 17:30. Now, since our focus is to forecast GHI based on a specific month, hence we select the month of August as this month experiences a lot of variation in the data as compared to the other months all over the country. Then, we run a check if there are any missing values present in our data and

62

A. Kumar Barik et al.

Fig. 1 Boxplot showing the variation of GHI data among the 5 major climatic zones of India

if so then take the data up to the very previous time point of the very first missing value and fit a SARIMA model and then predict the next point and this prediction is used to replace the next missing value.By repeating this procedure, we are able to impute all the missing values present in our data. Then we split the data into train and test set such that the test set always contains the last 7 days of the corresponding month. Then, SARIMA model parameters are selected based on the train set and the best selection is done on the basis of AIC (Akaike Information Criterion). Then, we perform the forecast for the last 7 days of that month using those selected parameters. The detailed procedure for the simulation is shown in the following figure (Fig. 2).

6 Discussion So here, we use ARIMA model to forecast the Global Horizontal Irradiance (GHI) for the next 7 days. Using the process described earlier, GHI is used the model input to perform the forecast. Here, we chose the month of August for the forecasting process. We train our model on the first 3 weeks of the month of August of 2014 at different stations all over India. The ARIMA process parameters are selected based on AIC. A SARIMA model is also used as it is better capable to exhibit the seasonality that is present in our data.

Analysis of GHI Forecasting Using Seasonal ARIMA

63

Fig. 2 Steps of analysis Fig. 3 Boxplot showing the variation of RMSE for ARIMA model among the 5 major climatic zones of India

6.1 Analysis of Results Here, a boxplot is shown (Fig. 3) which is created using the RMSEs of the different climatic zones obtained for ARIMA model. From the boxplot we see that minimum RMSE is obtained for the “Composite” zone where maximum is for the “Warm & Humid” zone. The “Cold & Cloudy” zone has received the minimum variability of RMSE due to less variation in atmospheric situation whereas the “Warm & Humid” zone receives the maximum variability since

64

A. Kumar Barik et al.

Fig. 4 Boxplot showing the variation of RMSE for SARIMA model among the 5 major climatic zones of India

it spreads all over the coastal area of India and experiences a very unstable atmosphere all over the year. Now, another boxplot is shown (Fig. 4) which is created using the RMSEs of the different climatic zones obtained for SARIMA model. From this boxplot, we can see that the variation of RMSE is very low for the “Cold & Cloudy” and “Hot & Dry” zone, whereas it is quite high for the “Cold & Sunny” zone. Minimum RMSE is observed in the “Composite” zone.

6.2 Analysis of Components of ARIMA Now, we took the average of AR orders fitted by ARIMA for all the stations of each climatic zone and also the average of their corresponding root mean square error (RMSE). Hence, we plot a line diagram (Fig. 5) using the average RMSEs with respect to the corresponding average of AR orders for each climatic zones. Here, we see that there is a direct relationship between the order of auto-regressive (AR) process and the RMSE of the ARIMA process. With the increase of autoregressive order, the RMSE of the ARIMA process also increase. Again, we took the average of MA orders fitted by ARIMA for all the stations of each climatic zone and also the average of their corresponding root mean square error (RMSE). Hence, we plot a line diagram (Fig. 6) using the average RMSEs with respect to the corresponding average of MA orders for each climatic zone. Here, we also see that for the moving average process and the RMSE of the ARIMA process, there is also a direct relationship between them. With the increase of moving average order the RMSE of the ARIMA process also increase.

Analysis of GHI Forecasting Using Seasonal ARIMA

65

Fig. 5 Change of RMSE with respect to AR order

Fig. 6 Change of RMSE with respect to MA order

Next, we make two groups on the basis of the observed GHI data of a station is stationary or not and also record the RMSE of the ARIMA model for the corresponding station and took average of the RMSE for both groups for each climatic zone. Then, we plot a multiple bar diagram (Fig. 7) showing the relation between the stationarity of the observed data of each climatic zone and the RMSE of the corresponding climatic zones. We also see that RMSE of the stations for which obtained data is stationary, are lower than that of the stations for which obtained data is not stationary.

66

A. Kumar Barik et al.

Fig. 7 Change of RMSE for stationary and non-stationary data

Fig. 8 Comparison of ARIMA and SARIMA model prediction for Uttar Pradesh (26.65, 80.45) from Composite zone

6.3 Issues with ARIMA In this analysis, we have found that there are some stations in different climatic zones for which ARIMA model is not working good enough, actually the prediction does not reflect the seasonality that is present in the data. Hence, we go for SARIMA as it can deal with seasonality better than ARIMA and we observed that the problem with ARIMA can be recovered with SARIMA and the predictions are pretty good. Here, we are giving some cases, where the prediction of SARIMA is far better than ARIMA (Figs. 8, 9, 10 and 11).

Analysis of GHI Forecasting Using Seasonal ARIMA

67

Fig. 9 Comparison of ARIMA and SARIMA model prediction for Himachal Pradesh (33.05, 76.35) from Cold & Cloudy zone

Fig. 10 Comparison of ARIMA and SARIMA model prediction for Gujarat (22.35, 70.75) from Hot & Dry zone

Fig. 11 Comparison of ARIMA and SARIMA model prediction for Jharkhand (22.95, 86.05) from Cold & Cloudy zone

7 Conclusion So, in this paper, we took 90 stations from all over India and classify them into 5 Climatic Zones “Composite”, “Cold & Cloudy”, “Cold & Sunny”, “Hot & Dry” and “Warm & Humid” according to standard convention and use a Python program auto_arima from the pyramid package and choose the best model on the basis of AIC. The model has obtained a minimum RMSE for the “Composite” zone whereas maximum for “Warm & Humid” zone. Again the variability of RMSE is minimum for “Cold & Cloudy” and maximum for “Warm & Humid”. Then we observe that

68

A. Kumar Barik et al.

there is a direct relationship between the AR and MA orders with the RMSE of the ARIMA Model. We also observed that for non-stationary data, the ARIMA model have always more RMSE than for stationary data in each climatic zones. But we observed that the ARIMA model is not so good for the GHI data, as it has a strong seasonality pattern due to its availability only in the day time. Hence, a SARIMA model has been chosen by using the same auto_arima function by setting the seasonality parameter as True and the best model is chosen for which AIC is minimum. This model has also obtained the minimum RMSE for the “Composite” zone, whereas maximum for “Warm & Humid” zone. Again the variability of RMSE is very low for “Cold & Cloudy” and “Hot & Dry” whereas maximum for “Cold & Sunny”. Hence, we can say that the prediction of GHI in “Cold & Cloudy” zone will be more accurate whereas “Warm & Humid” will be the most challenging area in terms of GHI prediction. Acknowledgments This article results from the INDO-USA collaborated project, named, LISA 2020 on Renewable Energy and joint collaboration with the University of Colorado, which was funded by The United States Agency for International Development (USAID) and the research has been carried out in the Data Science Laboratory, University of Calcutta supported by TEQIP Phase 3.

References 1. G.E. Box, G.M. Jenkins, G.C. Reinsel, G.M. Ljung, Time Series Analysis: Forecasting and Control (Wiley, New York, 2015) 2. P.J. Brockwell, R.A. Davis, S.E. Fienberg, Time Series: Theory and Methods: Theory and Methods (Springer Science & Business Media, New York, 1991) 3. I. Colak, M. Yesilbudak, N. Genc, R. Bayindir, Multi-period prediction of solar radiation using ARMA and ARIMA models. In: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA) (IEEE, 2015), pp. 1045–1049 4. S. Hussain, A. Al Alili, Day ahead hourly forecast of solar irradiance for Abu Dhabi, UAE. In: 2016 IEEE Smart Energy Grid Engineering (SEGE) (IEEE, 2016), pp. 68–71 5. V. Layanun, S. Suksamosorn, J. Songsiri, Missing-data imputation for solar irradiance forecasting in thailand. In: 2017 56th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE) (IEEE, 2017), pp. 1234–1239 6. L. Martín, L.F. Zarzalejo, J. Polo, A. Navarro, R. Marchante, M. Cony, Prediction of global solar irradiance based on time series analysis: application to solar thermal power plants energy production planning. Sol. Energy 84(10), 1772–1781 (2010) 7. P. Mondal, L. Shit, S. Goswami, Study of effectiveness of time series modeling (arima) in forecasting stock prices. Int. J. Comput. Sci. Eng. Appl. 4(2), 13 (2014) 8. A. Moreno-Munoz, J.J.G. de la Rosa, R. Posadillo, V. Pallares, Short term forecasting of solar radiation. In: 2008 IEEE International Symposium on Industrial Electronics, June 2008 (2008), pp. 1537–1541. https://doi.org/10.1109/ISIE.2008.4676880 9. S. Pai, S. Soman, Forecasting global horizontal solar irradiance: a case study based on Indian geography. In: 2017 7th International Conference on Power Systems (ICPS) (IEEE, 2017), pp. 247–252 10. C. Paoli, C. Voyant, M. Muselli, M.L. Nivet, Forecasting of preprocessed daily solar radiation time series using neural networks. Sol. Energy 84(12), 2146–2160 (2010)

Analysis of GHI Forecasting Using Seasonal ARIMA

69

11. A. Qazi, H. Fayaz, A. Wadi, R.G. Raj, N. Rahim, W.A. Khan, The artificial neural network for solar radiation prediction and designing solar systems: a systematic literature review. J. Clean. Prod. 104, 1–12 (2015) 12. G. Reikard, Predicting solar radiation at high resolutions: a comparison of time series forecasts. Sol. Energy 83(3), 342–349 (2009). https://doi.org/10.1016/j.solener.2008.08.007, http://www. sciencedirect.com/science/article/pii/S0038092X08002107 13. S.A. Sarpong, Modeling and forecasting maternal mortality; an application of arima models. Int. J. Appl. 3(1), 19–28 (2013) 14. M. Sengupta, Y. Xie, A. Lopez, A. Habte, G. Maclaurin, J. Shelby, The national solar radiation data base (NSRDB). Renew. Sustain. Energy Rev. 89, 51–60 (2018). https://doi.org/10.1016/ j.rser.2018.03.003 15. D. Yang, P. Jirutitijaroen, W.M. Walsh, Hourly solar irradiance time series forecasting using cloud cover index. Sol. Energy 86(12), 3531–3543 (2012) 16. N.M. Yusof, R.S.A. Rashid, Z. Mohamed, Malaysia crude oil production estimation: an application of arima model. In: 2010 International Conference on Science and Social Research (CSSR 2010) (IEEE, 2010), pp. 1255–1259

Artificial Intelligence and Data Analysis

Scoring Algorithm Identifying Anomalous Behavior in Enterprise Network Sonam Sharma and Garima Makkar

Abstract Enterprise cyber-attacks play a critical role because detecting intrusions in an organization is a usual activity in today’s scenario. We need a security system using which we utilize a variety of algorithms and techniques to detect securityrelated anomalies and threats in an enterprise network environment. The system helps in providing scores for anomalous activities and produces alerts. The system receives data from multiple intrusion detection or prevention systems which are preprocessed and generates a graphical representation of entities. The proposed scoring algorithm provides a mechanism which aids in detecting anomalous behavior and security threats in an enterprise network environment. The algorithm employs User/Entity behavioral analytics (UEBA) to analyze network traffic logs and user activity data to learn from user behavior to indicate a malicious presence in your environment, whether the threat is previously known or not. Graph-based anomalous detection technique has been applied in this approach, the graph represents each entity’s behavior. Keywords Anomaly detection · Graph · Intrusion detection · Enterprise network · Scoring algorithm · Risk score

1 Introduction Cyber-threats can be described in many forms such as cyber-spying, which is commonly referred to as cyber-espionage, misuse of privileged and confidential information, physical loss and theft. Out of these, “Intrusion” is the most common type of attack. In this type of attack, a cyber-criminal obtains secrets and information using which it enters a network and gains access to the internal information. One of S. Sharma (B) · G. Makkar Tata Consultancy Services, New Delhi, India e-mail: [email protected] G. Makkar e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_6

73

74

S. Sharma and G. Makkar

the most common ways of detecting such activities is by using Intrusion Detection and Prevention Systems. We can identify anomalous behavior or any suspicious activity by inspecting log files. But by inspecting these log files, we cannot detect any unusual pattern. With the sophistication of the cyber-attacks, each intrusion pattern varies which thus makes the signature matching method inefficient since it works on known attacks. In an enterprise, an anomaly can be defined with multiple parameters. These parameters include generic rules, which can help us identify known attacks, specific rules which are internally defined by an enterprise and unknown attacks identification by understanding patterns in the data. Anomalous behavior in an enterprise network is considered among entities that include host machines, servers, network addresses or ports, domain names, cloud servers, etc. A behavior can be considered as anomalous if certain rules are violated, certain policies defined by enterprise, or certain threshold is crossed, change in behavior of a host by analyzing its past behavior (number of login attempts at an irregular time interval, increase in the number of data packets transfer within an irregular time interval). Analyzing specific entities and individually identifying their behavior to reach a specified conclusion is referred to as User and Entity Behavior Analysis (UEBA). Here our goal is to an identify anomalous host which is a malicious user or an infected device with malware. To understand the anomalies we need to understand the network and environment of entities first and understand their behavior. This helps us in understanding what makes the behavior of entity normal or anomalous. We need a system that can provide an in-depth understanding of the network, which can provide us key factors affecting different attacks and identify policy violations faster. This in turn can help us to prioritize the resources according to risks. As the underlying structure is of connected systems, we have used a graphical approach. We have loaded the data into a graph database as connected components. Using a graph-based clustering approach, we have identified anomalous system behaviors. All participating entities act as nodes in the graph, and the features of the entities act as edges. All participating machines in the network traffic define every connection seen on the computing network. Edges connecting these participating machines define application layer protocols used in the network, e.g. all HTTP sessions with their requested URIs, DNS requests, summaries of files transferred over the network, a record of SSL sessions. Additionally, this system helps in monitoring the risk of enterprise assets based on interaction among them backed by domain knowledge. This system will be able to provide a 360° view of the network activity and understand the characteristics of the network such as the weak links of the network. It will also be able to monitor network activities in real time. It will also be able to identify an attack on any system based on its interactions. This paper is organized in sections, where in Sect. 2 we discuss about literature review and critical research gaps highlighted in this paper. In Sect. 3, we the objective and problem statement. The methodology used to solve the problem is used in the next section, i.e. Section 4. Finally, we discuss results and conclusions in Sects. 5.

Scoring Algorithm Identifying Anomalous …

75

2 Literature Review Multiple approaches have been proposed for providing risk scores in a computer network. Most of the approaches have been dependent on graphs wherein abstractions and semantics vary based on the understanding and kind of analytics used to verify the proof of concept in place. Risk scoring for entities in [1] is integrally dependent on the reputation of the entity, where reputation of an entity is the probability that is calculated using belief propagation algorithm. The interconnection between entities is created by using a bipartite graph. It is explained here that if there is a natural correlation between reputation and risk, so risk score of an entity is dependent on it. The approach that we discuss here differs from the fact that we are using property graph model where nodes are IP address of the hosts and edges connecting each host are an interaction between them. We are using an internal graph algorithm to get the community of similar hosts which helps us in scoring the entities. References [2–4] have explained User Behavior Analytics based on statistical and machine learning approaches whereas here we differ from the fact that our approach is truly based on Graph Analytics and scoring over the build graph using graph algorithms Many traditional approaches focused on rule-based approaches for detection intrusion in the network which are too impractical to be used in practice because constructing normal boundaries is hard since they require a threshold to consider when the activity is abnormal and real data of mostly noisy. The distribution of behavior, which is considered normal and abnormal is dynamic when it comes to an enterprise network.

3 Problem Statement The techniques to detect intrusions in an enterprise network are either host or instance based. The goal of host-based intrusion detection system is to collect activities of the host machines and analyze their behavior. Essentially, these systems can point activities related to a particular entity. The methodology which is being proposed in this paper implements a host-based intrusion detection. We have used connected data, where each host is the connected node or the participating entity in the enterprise network. An entity is defined as the host machine in the network, e.g., it can be a server where an important client data is saved for which access is given to some employees in an enterprise. Important features are extracted from the dataset, which is further used to carry Host or “IP” based Intrusion Detection. The quantity of traffic data is generally large, and to analyze such dataset it takes a lot of time. Also, if we manually analyze such dataset we might lose some necessary information in the haystack of logs. Hence, we propose a system where we group similar instances together and labeling the groups. The proposed algorithm gives us information about

76

S. Sharma and G. Makkar

abnormal activity that took place in an enterprise during the time period for which the log was collected by assigning a score to each machine.

4 Methodology In this section, we explain the procedure which we followed to devise the scoring algorithm in place. (A) Architecture For building the graph we have used Neo4J graph database. It helps in creating the property graph which proves essential when it comes to the data where each Nodes and Edges of the graph have multiple individual defining attributes. The architectural flow is explained in Fig. 1. (B) Dataset For the proposed algorithm, we have used open-source data. Bro IDS is used to capture and prepare the following dataset. Bro is a Unix based open-source intrusion detection system. It gathers the information from the entire network itself as data travels across the network system. This system helps in capturing different log files related to each network protocol that is being used in the network. For our experiment, we have used only three log files, namely, Connection, HTTP, and DNS. All log files contain timestamp, which helps us in getting the information on activity timeline. And it also helps us in understanding that when a connection between two machines is established, what is the behavior of the machines in the duration, that is, what kind of activities took place in which the connection persists. CONN.LOG file helps us

Fig. 1. The above figure explains the flow of our architecture and technologies

Scoring Algorithm Identifying Anomalous …

77

in analyzing the connection establishment among the machine and HTTP and DNS log files help us in analyzing activities related to the captioned protocols. i.

CONNECTION LOG: This log file helps in collecting plenty of information and useful statistics in a summarized form about the details of each UDP and TCP connection in a single line. For example, to find all the IP addresses or host machines that send more than 1 KB back to a client, this file can help in getting the same information by analyzing resp_bytes column. ii. HTTP LOG: All hypertext transmission requests and responses summary over the network is getting accumulated in this file. It explains the activity that is occurring for each particular UID. For example, if we are interested in the Hist header of the HTTP request “Host” field helps us in getting the same. iii. DNS LOG: This log records all the DNS queries along with their responses. For example, whether that query uses TCP or whether it is referring to some spoofed source address, etc. All such questions can be answered using the content present in the DNS log. The dataset describes the information about each connection that is being made and what are the activities that are being performed in an enterprise. (C) Graph structure The structure of the graph we present here is defined so as to get the clear picture how the hosts are connected with each other and what are the protocols that are being used among then to get the clear understanding on the number of interactions among them and also how likely any new host has the possibility to connect based on the hop relationships of the interactions. Figure 2 provides a clear view of the structure of the graph used and Fig. 3 shows the example of the graph from Neo4j.

Fig. 2. Graph structure. Host node is the IP address of the connected machines. Edges are HTTP DNS and CONNECTION file which provide information on respective protocols whereas connection file contains all connection information which has happened among two hosts

78

S. Sharma and G. Makkar

Fig. 3. Screentshot view of an example graph from Neo4J database

(D) Preprocessing and loading the dataset The log files are cleaned and converted to csv files, which are then loaded into the graph database. This will help to visualize data in the form of nodes and edges making data user-friendly. The reason to use graph database is that a graph database can give a better representation of systems that are physically connected together. In the graph database, the connections are not restricted and hence can model realtime interactions in a more accurate manner. Graph database algorithms work on inherently higher order information which is found as the graph structure. Here the nodes represent the hosts/departments while edges represent the interactions among different hosts in the network of underlying enterprise, i.e., network traffic. (E) Model development Our objective is to create a scoring mechanism that provides risk scores to each participating entity in the enterprise network that whether it is malicious or not. The model developed is based on UEBA with the objective of detecting any kind of anomalous behavior going with an enterprise network––be it on user or system level. In order to achieve this objective, our proposed methodology aims to provide each entity with a score on the basis of which an anomaly is being detected. (F) How the score is calculated? The UEBA score calculated in our analysis is based on the following parameters: (1) A regular activity. (2) Activity which is not regular for the host but is regular within the cluster. (3) Activity which is neither regular for the host nor within the cluster, however it is the kind of activity regular in different kinds of cluster of an enterprise. (4) Unknown patterns.

Scoring Algorithm Identifying Anomalous …

79

For the calculation of score, the following steps have been taken to reach a single value denoting each host in our enterprise: (1) Import log files into the Graph Database engine. (2) Apply a Community Detection Algorithm. We used Louvain Community Detection algorithm. This method of detection helps in grouping nodes with similar activities happening in a large network of an enterprise using the following formula:    ki k j  1  Ai j − d ci c j , Q= 2m i j 2m where m is the sum of all weights of the edges in the graph, Ki and Kj represent the sum of the weights of edges connecting to nodes I and j, respectively, Aij represent the weight of edges between two nodes i and j respectively, d is the delta function, ci and cj represent the communities to which nodes i and j belongs, So the application of the algorithm results in the allocation of each node with some cluster number based on their activities, i.e., nodes with the same activities are matched together and given the same cluster number. (3) Now consider only three nodes/hosts say A, B, and C (just random names). (3.a) For each time frame “t”, we will check how a node “A” is connected to another node “B” and whether node “B” is in the same cluster of node “A”. This is termed as “Same_Grp_Simi_Freq”. (3.b) Also, for the same time frame, we will check how many nodes like “A” are connected to other nodes like “B” and whether node B is in a different cluster from that of node “B”.This is termed as “Diff_Grp_Simi_Freq”. (3.c) And if node “B” is in the same cluster of that of “A” then what’s the outdegree of node “A”.This is termed as “Same_Grp_Outdegree_Freq”. The “outdgree” here is defined as the number of connections outgoing a particular node––in our case it is a node. (3.d) However, if Node “B” is in a different cluster of that of ‘A’ then what’s the outdegree of node “A”. This is termed as “Diff_Grp_Outdegree_Freq”. (3.e) Count of number of times Node “A” is connected to some other node in given time frame “t”. This frequency is termed as “act_timeframet”. (3.f) Count of number of times Node “A” is connected to some other node in entire time frame “T”. This total frequency is called as “act_freq”. (4) Using all these six measures, a final score is calculated for each node based on the following formula:   S1 = AB$Diff_Grp_Simi_Freq/AB$Diff_Grp_outdeg_Freq   + AB$Same_Grp_Simi_Freq/AB$Same_Grp_out deg _Freq

80

S. Sharma and G. Makkar

  S2 = AB$act_timeframet/$act_freq FINAL_ SCORE = S1 ∗ S2 where AB = connection of node A to node B This is how the score for each node is calculated for each time he performs any activity on the system. And any fluctuation from his normal activity is then considered as an alert for an enterprise so that his activity for that particular time period can be checked.

5 Results and Conclusion We have applied the underline scoring algorithm in our graph which is created using MACCDC dataset. The dataset provides the log files which contain logs for a time period of 8 h. For each connected host, we detect the anomaly and find the hosts which show the most anomalous behavior for the defined time span. For our experimentation, we selected top risky users from the graph which we found by manually inspecting the dataset. Figure 4 displays the anomalous or unexpected behavior of the user in the given timeline. In this paper, we have discussed the methodology of scoring the enterprise entity hosts based on their activities. The defined scoring mechanism can help in grouping the host machines which have similar behavior and also at what time frame the most anomalous activities occur in the network. This scoring methodology is the part of the frame, which further helps in generating alerts when most anomalous activity occurs in the enterprise network. It is developed on the top of traditional IPS/IDS. The key advantage of this scoring methodology is that it helps in finding the most risky asset not only by considering the behavior of the host but by incorporating the

Fig. 4. Graph displays the anomalous or unexpected behavior of the user in the given timeline. The anomalous or risky hosts are shown as peeks in the graph

Scoring Algorithm Identifying Anomalous …

81

activities or behavior of neighbors it is connected to. This further helps cybersecurity analysts to act on time and protect the asset at risk from further intrusion at an early stage.

References 1. X. Hu, T. Wang, M.P. Stoecklin, D.L. Schales, J. Jang, R. Sailer, Asset Risk Scoring in Enterprise Network with Mutually Reinforced Reputation Propagation, in 2014 IEEE Security and Privacy Workshops (San Jose, CA, 2014), pp. 61–64 2. X. Xi, T. Zhang, D. Du, G. Zhao, Q. Gao, W. Zhao, S. Zhang, Method and system for detecting anomalous user behaviors: an ensemble approach, in SEKE (2018), pp. 263–262 3. M. Shashanka, M.Y. Shen, J. Wang, User and entity behavior analytics for enterprise security, in 2016 IEEE International Conference on Big Data (Big Data) (IEEE, 2016), pp. 1867–1874 4. O. Carlsson, D. Nabhani, User and Entity Behavior Anomaly Detection using Network Traffic (2017)

Application of Bayesian Automated Hyperparameter Tuning on Classifiers Predicting Customer Retention in Banking Industry Akash Sampurnanand Pandey and K. K. Shukla

Abstract The paper aims to demonstrate the comparison of accuracy metrics achieved on nine different fundamental Machine Learning (ML) classifiers. Bayesian Automated Hyperparameter Tuning, with Tree-structured Parzen Estimator, has been performed on all of nine ML classifiers predicting the customers likely to be retained by the bank. After visualizing the nature of dataset and its constraints of class imbalance and limited training examples, Feature Engineering has been performed to compensate for the constraints. The ML techniques comprise first using six classifiers (namely, K-Nearest Neighbors, Naive Bayes, Decision Tree, Random Forest, SVM, and Artificial Neural Network––ANN) individually on the dataset with their default hyperparameters with and without Feature Engineering. Second, three boosting classifiers (namely, AdaBoost, XGBoost, and GradientBoost) were used without changing their default hyperparameters. Thirdly, on each classifier, Bayesian Automated Hyperparameter tuning (AHT) with Tree-structured Parzen Estimator was performed to optimize the hyperparameters to obtain the best results on the training data. Next, AHT was performed on the three boosting classifiers as well. The crossvalidation mean training accuracy achieved is comparatively quite better than those achieved on this dataset so far on Kaggle and other research papers. Besides, such an extensive comparison of nine classifiers after Bayesian AHT on Banking Industry dataset has never been made before. Keywords Bayesian automated hyperparameter tuning · Boosting methods · Outlier detection · Tree-structured Parzen estimator

A. S. Pandey (B) · K. K. Shukla Department of Computer Science and Engineering, Indian Institute of Technology (BHU), Varanasi, India e-mail: [email protected] K. K. Shukla e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_7

83

84

A. S. Pandey and K. K. Shukla

1 Introduction In an industry like banking, where acquiring a new customer has come to a saturation because of intense competition and innovative business models, retention of existing customers looks like the lower hanging fruit. Most of the work in Customer Retention has been performed on telecom industry datasets [1–13]. However, the same in the Banking industry [14] is very limited. Considering the set of constraints is different on the banking dataset [15], particularly the imbalance ratio and less number of instances, it has been analyzed comprehensively in this paper. As per research on Customer Relationship Management [16, 17], the success rate of selling to a customer you already have is 60–0% while the success rate of selling to a new customer is around 5–20%. Besides, acquisition of a new customer costs around 5 times that of retention [16, 17]. Keeping this critical information in mind, this paper aims to present a comparative analysis of several Machine Learning models of predicting the customers likely to leave a bank. The bank can then focus on making special efforts to retain such customers. The dataset is available on Kaggle [15]. The goal is to find the best metric values. These ML models range from traditional popular classifiers to boosted and bagged classifiers with and without Feature engineering, Outlier Removal, and Bayesian Automated Hyperparameter Tuning [18–20]. The tuning of hyperparameters helps to claim with certainty if one model outperforms the other. Otherwise, it becomes a comparison of one model with specific hyperparameters with the other model having its own set of hyperparameters. The accuracy metrics used are 10-fold Cross-validation Training Accuracy, Standard Deviation of 10-fold Cross-validation Train Accuracy, Test Accuracy, Precision, Recall, and F-score. Related Works: In the past few years, ANN without hyperparameter tuning has been applied to check for customer retention for telecom dataset [1]. SVM method has been applied as well [1, 18, 19]. Methods of boosting the performance of Random Forest has also been discussed [12, 13]. A few data mining techniques for telecom datasets have been analyzed [5, 6, 10, 11]. There have been works on Customer Relation Management which, inter alia, analyzes the performance of some classifiers, but it is very particular to the telecom industry dataset and without any hyperparameter tuning [1] or Feature Engineering as discussed in this paper. Literature on Automated Hyperparameter tuning methods is relatively limited, particularly for Tree-structured Parzen Estimator [18–20]. Methodology: This is a brief outline of the steps followed, each of which has been elaborated further with citations in the respective sections. In the first phase of the analysis, data visualization was done using 1D and 2D scatter plots to check for any imbalance or anomaly. Covariance matrix was plotted to check for high weighted features.

Application of Bayesian Automated Hyperparameter Tuning …

85

Fig. 1. Header of the dataset describing its features

Outlier detection was performed using three methods, namely, SVM method, ZScore, and plotting Principal Component Analysis (PCA) with 2 dimensions to check for outliers. Details and conclusions of these three ways have been mentioned in the respective sections. In the second phase, six (Artificial Neural Network, SVM, Random Forest, Decision Trees, Naive Bayes, K-Nearest Neighbors) traditional classifiers were applied to the dataset using default hyperparameters and a comparison was made. The models were applied to the dataset with and without Sampling. Undersampling was done using Tomek Link and oversampling was done using SMOTE with varying minority to majority class ratio to make sure that the artificial data generated by oversampling does not adversely affect the training. In the third phase, three boosting techniques of AdaBoost, XGBoost and GradientBoost were applied using their default set of hyperparameters. In the fourth phase, Automated Hyperparameter tuning was done using Bayesian search algorithms for all the six (ANN, SVM, RF, DT, NB, KNN) traditional classifiers to find the best performance metric for all individual classifiers. The surrogate function in Bayesian Hyperparameter Optimization used is Tree-structured Parzen Estimator. In the fifth phase, Automated Hyperparameter tuning was done on the boosting techniques, namely, AdaBoost, XGBoost, and GradientBoost, and the accuracy metrics were compared with those of traditional default models. The 10-fold Cross-validation Training Accuracy achieved has been far better than that received on this dataset so far, as compared with the results available on Kaggle and the other research papers [14] (Fig. 1).

2 Data Visualization Here a distinction was made between the redundant columns and useful columns. Redundant columns are CustomerID and Surname. Useful ones as seen from the head are CreditScore, Geography, Gender, Age, Tenure, Balance, Number of

86

A. S. Pandey and K. K. Shukla

Products, Has Credit Card, Is Active Member, and Estimated Salary. Our Dataset has 7963 instances of not exited (Class 0) while 2037 instances of exited (Class 1). Here, a few variables are numerical, while some are non-numerical. For nonnumerical categorical features, Numerical and Categorical Encoding [21] was performed before training the models.

2.1 Imbalanced Dataset Firstly, since it is a customer retention dataset, it was expected to be skewed in favor of the majority class (Class 0). The ratio of label 1to that of label 0 was found to be 1:4. Class 0: 79.63% and Class 1: 20.37%.

2.2 Correlation Matrix Correlation Matrix was made and analyzed. It shows the linear dependence of one variable with the other. Hence, from the nature of the matrix, it can be seen which features are significant and which are not, using the coefficients. In the dataset, there neither was a significant dependence of any feature to explore a causal relationship with the label nor was dependence so insignificant for the feature to be called redundant. Only Age had a coefficient of 0.29 and was the highest with respect to Exited (or the label) (Fig. 2).

2.3 Box Plots for Individual Features 1D Box plots for individual features were drawn as the number of features was less. These plots help detect outliers using the interquartile range [22]. The box plot was plotted only for Age as it has the highest coefficient with respect to Exited (or the label) on the correlation matrix. However, the number of outliers was too less to be removed from an already small dataset (Fig. 3).

2.4 Scatter Plot for Critical Feature with the Target Variable Similarly, the label wise scatter plot was plotted between Age and Label. Here too the outliers were quite insignificant in number (Fig. 4).

Application of Bayesian Automated Hyperparameter Tuning …

87

Fig. 2. Correlation Matrix describing linear relationships between features

Fig. 3. Green points are the outliers for age feature, outside interquartile range

3 Evaluation of Performance or Accuracy Metrics CV-Train Accuracy: It is the mean of 10 accuracies received after the training data is randomly split into 10 parts with 9 parts being used for training and 1 part is used for testing at a time. The process of training on 9 parts and testing on 1 part is repeated for all 10 combinations of 9 train set and 1 test set thus giving 10 accuracies whose mean is what we count as a performance metric in our calculations [23].

88

A. S. Pandey and K. K. Shukla

Fig. 4. Outliers for age versus label (Class 0 and Class 1)

CV-Std Deviation: It is the standard deviation of the above mentioned 10 training accuracies. Lower the value implies that the classifier is more robust. Robustness here means that for any split of train and test data, the achieved accuracy does not vary much. This makes the classifier more reliable [23]. Confusion matrix: It is derived for the test set as defined below. The test set is split (20% of total data) from the dataset initially before performing any kind of sampling technique. Actual negative (Class 0)

Actual positive (Class1)

Predicted negative (Class0)

True negatives (TN)

False negatives (FN)

Predicted positive (Class1)

False positives (FP)

True positives (TP)

TestAccuracy = Precision = Recall =

True Negative (TN) + TruePositive (TP) TN + TP + FN + FP True Positive (TP) Total Predicted Positive (TP + FP) True Positive (TP) Total Actual Positive (FN + TP)

F-Score =

2(Precision ∗ Recall) (Precision + Recall)

Application of Bayesian Automated Hyperparameter Tuning …

89

4 ML Classifiers Using Default Hyperparameters 4.1 Artificial Neural Network (ANN) The ANN algorithm [24] creates a network of series of layers with a certain number of nodes in each layer. Starting from the input layer, a relationship is formed that connects each node to its previous layer using a set of parameters or weights as input and the activation as the output. Then the final predicted output layer is compared with the actual output and an error for the wrong prediction is calculated and the network is trained again. There is a penalty for wrong prediction. The aim is to find a set of parameters that minimize the error. This process is referred to as backpropagation.

4.2 SVM The SVM algorithm [25, 26] creates a line or a hyperplane, which separates the data into classes. The hyperplane is formed in such a way that the margin between the data points of the two classes is maximum. These data points which are nearest to the opposite class are called support vectors.

4.3 Naive Bayes In Naive Bayes [27], we deal with the probability distributions of the variables in the dataset and predict the probability of the response variable belonging to a particular value, given the attributes of a new instance. It utilizes Bayes Theorem.

4.4 Random Forest Random forest classifier [28] runs a series of decision tree classifiers on a randomly selected subset of the training set for each decision tree. It then aggregates the votes from different decision trees to decide the final class of the test set based on the majority. Thus a number of weak estimators when combined form strong estimator.

90

A. S. Pandey and K. K. Shukla

Table 1. Comparison table without feature engineering and without automated hyperparameter tuning Classifier

CV-train accuracy

CV-std deviation

Test accuracy

Precision

Recall

F-score

ANN

83.560

0.010

84.200

0.769

0.313

0.445

SVM

85.220

0.010

86.350

0.789

0.444

0.568

Naive Bayes

82.162

0.009

82.950

0.625

0.395

0.484

Random forest

85.030

0.010

86.150

0.731

0.498

0.593

KNN

82.587

0.005

82.700

0.606

0.414

0.492

Decision trees

79.212

0.012

80.200

0.509

0.572

0.539

4.5 KNN In the case of KNN classification [29], a majority voting is applied over the k-nearest data points or k neighbors with minimum distance. We have selected odd numbers as k. In KNN, the computations happen only on run time.

4.6 Decision Tree Decision tree [30] is derived from a continuous split of independent variables with each node or a conditional control statement having a condition over a feature. Based on this condition, nodes decide which node to navigate next to. Once the leaf node (the node at the end) is reached, the output is predicted (Table 1).

5 Feature Engineering Feature Engineering is one of the most fundamental and essential ways to improve the prediction accuracies of Machine Learning problems. In Feature Engineering, the user uses the domain knowledge from the training dataset to alter features essential for increasing efficiency of predictions. Since the bank customer dataset is unbalanced, Feature Engineering becomes even more salient.

5.1 Sampling of Imbalanced Dataset Sampling of Imbalanced Dataset: This is a data analysis technique to handle class imbalance. Our Dataset had 7963 instances of not exited (Class 0) while 2037 instances of exited (Class 1). We deployed two methods to handle the skewed Dataset.

Application of Bayesian Automated Hyperparameter Tuning …

5.1.1

91

Tomek Links

It is an undersampling technique to remove the overlap between classes. Under this technique [31], all majority class links are removed until minimally distanced nearest-neighbor pairs are of the same class. By this, we simultaneously reduce the imbalance of Dataset and also help in establishing well-defined clusters.

5.1.2

SMOTE

It is a statistical way of oversampling in a balanced way, and refers to the Synthetic Minority Oversampling Technique [32]. The technique generates synthetic examples that lie on the vector joining any two instances of the class to oversample (The minority class). After SMOTE, the number of Class 1 instances was equal to that of Class 0. Application of SMOTE almost invariably increased the performance of algorithms on our Dataset.

5.2 Outlier Detection Outliers are extreme observations that diverge from the overall distribution of a dataset. They are generally a result of experimental errors and hence need to be removed. Following techniques were tried to remove outliers in our Dataset.

5.2.1

SVM Method

This method uses SVM’s decision function [25, 26], which is the signed distance of an observation from the hyperplane generated by SVM. Those signed distances that lie beyond the interquartile range are considered outliers. Under this technique, instances having abnormal behavior in their decision function values in comparison to the remaining data are removed. But this does not seem to make any considerable difference for this dataset. Though it can be a very useful transformation in some other large dataset (Fig. 5).

5.2.2

Z-Score

Z-Score [33] of an instance indicates how many standard deviations the concerned observation is away from the mean. Mathematically, an element having a higher Z-score can be considered an outlier. This method was rejected due to its nature of detecting outliers (Table 2).

92

A. S. Pandey and K. K. Shukla

Fig. 5. Outlier points for decision function using the SVM method

Table 2. Comparison table with feature engineering and without automated hyperparameter tuning Classifier

CV-train accuracy

CV-std deviation

Test accuracy

Precision

Recall

F-score

ANN

84.72

0.010

79.95

0.503

0.775

0.610

SVM

86.58

0.067

85.5

0.681

0.566

0.618

Naive Bayes

74.69

0.021

73.05

0.404

0.698

0.512

Random forest

89.86

0.103

86.45

0.712

0.555

0.624

KNN

85.60

0.030

77.7

0.462

0.624

0.531

Decision trees

86.05

0.084

79.25

0.489

0.560

0.522

5.2.3

PCA Analysis and Dimensionality Reduction

PCA or Principal Component Analysis [34] is a method of Dimensionality Reduction by finding the most uncorrelated features (known as Principal Components) from a large set of data. By using PCA, we condensed our Dataset to two principal components, which becomes easy to interpret by using a 2D plot. Instances showing outlier behavior in these two principal components were removed. But this does not seem to make any considerable difference for this dataset as there are already a lesser number of training examples and removal may lead to further information loss. Though it can be a very useful transformation in large datasets with a large number of features (Fig. 6).

6 Boosting Algorithms It is a form of Ensemble learning. It gives low error and higher consistency and also reduces bias and variance error. Three types of boosting were performed.

Application of Bayesian Automated Hyperparameter Tuning …

93

Fig. 6. PCA plotted for 2 dimensions for visualization

6.1 AdaBoost AdaBoost or Adaptive Boosting [35], an ensemble learning technique, works by stacking multiple estimators trained with random subsets of data. The key in AdaBoost lies in random subsets being not so random. At each successive estimator, the weights of observations(the probability of being selected) increase for the wrongly classified examples.

6.2 XGBoost It is one of the fastest implementations of gradient boosted trees. XGBoost [36] is a highly effective algorithm. It has high predictive power and much faster than the other gradient boosting techniques. It also includes a variety of regularization which reduces over-fitting and improves overall performance.

6.3 GradientBoost It is a boosting technique for regression and classification problems [37]. It combines several weak learners to form a strong learner. Here regression trees are used as a base learner and each following tree in series is built on the errors calculated by the previous tree (Table 3).

94

A. S. Pandey and K. K. Shukla

Table 3. Comparison table of accuracy metric of the three boosting techniques without using automated hyperparameter tuning Classifier

CV-train accuracy

CV-std deviation

Test accuracy

Precision

Recall

F-score

AdaBoost

88.94

0.014

84.9

0.634

0.631

0.632

XGBoost

90.00

0.101

86.10

0.700

0.567

0.627

GradientBoost

89.70

0.097

86.35

0.690

0.590

0.636

7 Automated Hyperparameter Tuning (AHT) Hyperparameters, unlike parameters, are set before training a model and are not changed during training. The aim of AHT is to find the hyperparameters of a particular ML algorithm that return the best performance of that algorithm on a given dataset. For a complete list of hyperparameters in all the classifiers, (except ANN) the Python documentation of the library SciKit-Learn can be visited [38, 39]. For ANN [24], the hyperparameters considered are: 1. Number of hidden layers, Units in hidden layers and batch size. Default values taken prior to AHT are: Number of hidden layers: 2, Units in hidden layer one: 6 and units in hidden layer two: 6, batch size: 10. Bayesian Optimization Grid Search and Random Search are sometimes better than manual tuning of hyperparameters. However, these are not very efficient as their pick for a set of hyperparameters is completely uninformed by past evaluations. However, Bayesian optimization [18] keeps track of the past evaluation results and uses them to make a belief or a probabilistic model mapping hyperparameters to a probability of a score on the objective function. One of the types of Bayesian Hyperparameter optimization is Sequential Model Based Optimization(SMBO) [20]. It has five aspects. 1. A domain of hyperparameters over which to search. 2. An objective function whose inputs are hyperparameters and output is a score that we intend to minimize. In the given case, the score that we intend to minimize is (– 1*CV- Train Accuracy). In other words, the intent is to tune the hyperparameters such that CV-Train Accuracy is maximized. 3. The surrogate model of the objective function. It is basically a high-dimensional mapping of hyperparameters to the probability of a score on the objective function. The surrogate model used in this case is Tree-structured Parzen Estimator. 4. A criterion also called the selection function for suggesting the next set of hyperparameters to be chosen next from the surrogate model. The Selection Function used is Expected Improvement Function. 5. A history of (score, hyperparameters) pairs is used by the algorithm to update the surrogate model so as to improve it with each iteration. This is stored in the database provided by the python library.

Application of Bayesian Automated Hyperparameter Tuning …

95

Tree-structured Parzen Estimator: The methods of SMBO differ in how they construct the surrogate model p (y | x). The Tree-structured Parzen Estimator [19] builds a model by applying Bayes rule but with some changes as outlined here [19].

8 Results for Automated Hyperparameter Tuning (AHT) All the codes for AHT have been run for a maximum of 100 evaluations within a range that also comprises of default hyperparameters. The replication of results will depend on the randomness of the Train test split as well as the randomness of the split during 10-fold cross-validation.

8.1 AHT on ANN Optimum Hyperparameter Suggestion:{number of hidden layers:3, units in hidden layer one: 6, units in hidden layer two: 8, units in hidden layer three: 8, batch size: 10}

8.2 AHT on SVM Optimum Hyperparameter Suggestion: {‘C’: 2.7029833026257085, ‘gamma’: 1.090192166611637, ‘kernel’: ‘rbf’}

8.3 AHT on Random Forest Optimum Hyperparameter Suggestion:{‘max_depth’: 10, ‘min_samples_leaf’: 2, ‘min_samples_split’: 6, ‘n_estimators’: 96}

8.4 AHT on Naive Bayes Optimum Hyperparameter Suggestion: {‘var_smoothing’: 6.2125058966706e-08}

96

A. S. Pandey and K. K. Shukla

Table 4. Comparison table with feature engineering and with AHT using the optimum hyperparameter suggestion Classifier

CV-train accuracy

CV-std deviation

Test accuracy

Precision

Recall

F-score

ANN

86.39

0.009

84.85

0.645

0.558

0.598

SVM

90.95

0.031

80.25

0.534

0.375

0.441

Naive Bayes

74.69

0.021

73.05

0.404

0.698

0.512

Random forest

90.04

0.095

86.05

0.684

0.577

0.625

KNN

90.21

0.032

79.85

0.502

0.513

0.507

Decision trees

87.71

0.080

83.45

0.588

0.609

0.599

Here, the CV-Train Accuracy results are better than those without Automated Hyperparameter Tuning

8.5 AHT on Decision Trees Optimum Hyperparameter Suggestion: {‘min_samples_leaf’: 2, ‘min_samples_split’: 4, ‘min_weight_fraction_leaf’: 0.0008449296037830907}

8.6 AHT on KNN Optimum Hyperparameter Suggestion: {‘algorithm’: ‘auto’, ‘leaf_size’: 27, ‘n_neighbors’: 1, ‘p’: 1, ‘weights’: ‘distance’} (Table 4).

9 Results for AHT on Boosting Algorithm 9.1 AHT on AdaBoost Optimum Hyperparameter Suggestion: {‘learning_rate’: 1.498409598273652, ‘n_estimators’: 88}

9.2 AHT on XGBoost Optimum Hyperparameter Suggestion: {‘colsample_bytree’: 0.7749664094302052, ‘gamma’: 0.7527030392292601, ‘max_depth’: 4, ‘subsample’: 0.7665132392268867}

Application of Bayesian Automated Hyperparameter Tuning …

97

Table 5. Comparison table of accuracy metric of the three boosting techniques using optimum hyperparameter suggestion Classifier

CV-train accuracy

CV-std deviation

Test accuracy

Precision

Recall

F-score

AdaBoost

89.38

0.080

85.10

0.665

0.558

0.607

XGBoost

90.31

0.010

86.20

0.702

0.573

0.631

GradientBoost

90.34

0.097

85.85

0.682

0.562

0.617

Here, the CV-Train Accuracy results are better than those without Automated Hyperparameter Tuning(AHT)

9.3 AHT on GradientBoost Optimum Hyperparameter Suggestion: {‘learning_rate’: 0.150628410494514, ‘max_depth’: 8, ‘n_estimators’: 117, ‘subsample’: 0.8277284379389205} (Table 5)

10 Conclusion The accuracy metric pursued during the automated hyperparameter tuning is the mean of 10 accuracies reported after 10-fold cross-validation. One can pursue any other metric as deemed fit. The hyperparameter tuning definitely raises the pursued metric by a good margin. The increase in the training accuracy for the classifiers can be appreciated in Figs. 7 and 8. Depending on the computational means available, the search range of hyperparameters and the maximum number of evaluations by the

Comparison With and Without AHT 100 90

84.72 86.39

86.58

90.95

89.86 90.04

85.6

90.21

86.05 87.71

74.69 74.69

80 70 60 50 40 30 20 10 0 ANN

SVM

Naïve Bayes Without AHT

Random Forest

KNN

Decision Tree

With AHT

Fig. 7. Graphical comparison of classifiers before and after automated hyperparameter tuning

98

A. S. Pandey and K. K. Shukla

AHT On Boosters 90.5

90.34

90.31 90

90

89.7 89.38

89.5

89

88.94

88.5

88 AdaBoost

XGBoost without AHT

Gradient Boost

With AHT

Fig. 8. Graphical comparison of boosting classifiers before and after automated hyperparameter tuning

objective function in the Automated Hyperparameter tuning can be further increased (Here, the number of evaluations done was 100) to increase the pursued metric. This can help in optimizing the performance of any classifier. This comparison of the classifiers before and after automated hyperparameter tuning can help in deciding the best suitable classifier on a case-by-case basis to suit the requirements of the case. Besides, the accuracy metric to be pursued while tuning can also be decided as per the requirements of the case. The standard deviation of 10-fold cross-validation accuracy, Test Accuracy, Precision, Recall, and F-Score can be checked to decide the choice of a classifier based on the requirement of the case.

References 1. T. Vafeiadis, K.I. Diamantaras, G. Sarigiannidis, K.C. Chatzisavvas, A comparison of machine learning techniques for customer churn prediction. Simul. Model. Pract. Theory 55, 1–9 (2015) 2. S.A. Qureshi, A.S. Rehman, A.M. Qamar, A. Kamal, A. Rehman, Telecommunication subscribers’ churn prediction model using machine learning, in 2013 Eighth International Conference on Digital Information Management (ICDIM) (IEEE, 2013), pp. 131–136 3. K. Kim, C.-H. Jun, J. Lee, Improved churn prediction in telecommunication industry by analyzing a large network. Exp. Syst. Appl. 4. C. Kirui, L. Hong, W. Cheruiyot, H. Kirui, Predicting customer churn in mobile telephony industry using probabilistic classifiers in data mining. Int. J. Comput. Sci. Issues (IJCSI) 10(2) 5. G. Kraljevi´c, S. Gotovac, Modeling data mining applications for prediction of prepaid churn in telecommunication services. AUTOMATIKA: casopis za automatiku, mjerenje, elektroniku, raˇcunarstvo i komunikacije 51(3), 275–283 (2010)

Application of Bayesian Automated Hyperparameter Tuning …

99

6. R.J. Jadhav, U.T. Pawar, Churn prediction in telecommunication using data mining technology. IJACSA Editorial 7. D. Radosavljevik, P. van der Putten, K.K. Larsen, The impact of experimental setup in prepaid churn prediction for mobile telecommunications: What to predict, for whom and does the customer experience matter? Trans. MLDM 3(2), 80–99 (2010) 8. Y. Richter, E. Yom-Tov, N. Slonim, Predicting customer churn in mobile networks through analysis of social groups, in SDM, vol. 2010 (SIAM, 2010), pp. 732–741 9. S¸. G¨ursoy, U. Tu˘gba, Customer churn analysis in telecommunication sector. J. School Bus. Admin. Istanbul Univer. 39(1), 35–49 (2010) 10. K. Tsiptsis, A. Chorianopoulos, Data Mining Techniques in CRM: Inside Customer Segmentation (Wiley, New York, 2011) 11. F. Eichinger, D.D. Nauck, F. Klawonn, Sequence mining for customer behaviour predictions in telecommunications, in Proceedings of the Workshop on Practical Data Mining at ECML/PKDD (2006), pp. 3–10 12. A. Lemmens, C. Croux, Bagging and boosting classification trees to predict churn. J. Mark. Res. 43(2), 276–286 (2006) 13. Y. Xie, X. Li, Churn prediction with linear discriminant boosting algorithm, in 2008 International Conference on Machine Learning and Cybernetics, vol. 1 (IEEE, 2008), pp. 228–233 14. U.D. Prasad, S. Madhavi, Prediction of churn behaviour of bank customers using data mining tools. Indian J. Mark. 42(9), 25–30 (2011) 15. Dataset available on. https://www.kaggle.com/barelydedicated/bank-customer-churn-mod eling 16. Invesp Consulting. https://www.invespcro.com/blog/customer-acquisition-retention/ 17. The Chartered Institute of Marketing, Cost of customer acquisition versus customer retention (2010) 18. B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, N. de Freitas, Taking the human out of the loop: a review of bayesian optimization. Proc. IEEE 104(1), 148–175 (2016) 19. J.S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization. in Advances in Neural Information Processing Systems (2011), pp. 2546–2554 20. F. Hutter, H.H. Hoos, K. Leyton-Brown, Sequential model-based optimization for general algorithm configuration, in International Conference on Learning and Intelligent Optimization (Springer, Heidelberg), pp. 507–523 21. K. Potdar, T.S. Pardawala, C.D. Pai, A comparative study of categorical variable encoding techniques for neural network classifiers. Int. J. Comput. Appl. 175(4), 7–9 (2017) 22. Interquartile Range Upton, Graham; Cook, Ian Understanding Statistics (Oxford University Press, 1996) 23. T. Wong, N, Yang, Dependency analysis of accuracy estimates in k-fold cross validation. IEEE Trans. Knowl. Data Eng. 29(11), 2417–2427 (2017) 24. J. Schmidhuber, Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015) 25. C. Cortes, V.N. Vapnik, Support-vector networks. Mach. Learn. 20(3), 273–297 (1995) 26. A. Ben-Hur, D. Horn, H. Siegelmann, V.N. Vapnik, Support vector clustering. J. Mach. Learn. Res. 2, 125–137 (2001) 27. T.R. Patil, S.S. Sherekar, Performance analysis of Naive Bayes and J48 classification algorithm for data classification. Int. J. Comput. Sci. Appl. 6(2). ISSN: 0974-1011 28. T.K. Ho Random decision forests, in Proceedings of the 3rd International Conference on Document Analysis and Recognition (Montreal, QC, 1995), pp. 278–282 29. N.S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression. Am. Statist. 46(3), 175–185 (1992) 30. L. Breiman, J.H. Friedman, R.A. Olshen, C.J. 
Stone, Classification and Regression Trees (Wadsworth and Brooks/Cole Advanced Books and Software, Monterey, CA, 1984) 31. T. Elhassan, M. Aljurf, Classification of imbalance data using Tomek Link (T-Link) combined with Random Under-Sampling (RUS) as a data reduction method

100

A. S. Pandey and K. K. Shukla

32. S. Visa, A. Ralescu, Issues in mining imbalanced data sets-a review paper, in Proceedings of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference, vol. 2005 (2005), pp. 67–73). sn 33. M.R. Spiegel, L.J. Stephens, Schaum’s outlines statistics, 4th edn. (McGraw Hill, 2008) 34. I. Jolliffe, Principal component analysis, in International Encyclopedia of Statistical Science, ed. by M. Lovric (Springer, Heidelberg, 2011) 35. R. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions (1999) 36. https://github.com/dmlc/xgboost 37. J.H. Friedman, Greedy function approximation: a gradient boosting machine (1999) 38. Scikit learn documentation credits/link. https://scikit-learn.org/stable/documentation.html 39. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Perrot, É. Duchesnay, Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

Quantum Machine Learning: A Review and Current Status Nimish Mishra, Manik Kapil, Hemant Rakesh, Amit Anand, Nilima Mishra, Aakash Warke, Soumya Sarkar, Sanchayan Dutta, Sabhyata Gupta, Aditya Prasad Dash, Rakshit Gharat, Yagnik Chatterjee, Shuvarati Roy, Shivam Raj, Valay Kumar Jain, Shreeram Bagaria, Smit Chaudhary, Vishwanath Singh, Rituparna Maji, Priyanka Dalei, Bikash K. Behera, Sabyasachi Mukhopadhyay, and Prasanta K. Panigrahi Abstract Quantum machine learning is at the intersection of two of the most sought after research areas—quantum computing and classical machine learning. Quantum machine learning investigates how results from the quantum world can be used to solve problems from machine learning. The amount of data needed to reliably N. Mishra Department of Computer Science and Engineering, Indian Institute of Information Technology, Kalyani, West Bengal, India M. Kapil Department of Physics, Indian Institute of Technology Guwahati, Guwahati 781039, Assam, India H. Rakesh Department of Computer Science and Engineering, Nitte Meenakshi Institute of Technology, Yelahanka, Bangalore, India A. Anand Department of Mechanical Engineering, Indian Institute Of Engineering Science And Technology, Shibpur 711103, West Bengal, India N. Mishra Department of Electronics and Telecommunication, International Institute of Information Technology, Bhubaneswar 751003, Odisha, India A. Warke · V. Kumar Jain Department of Physics, Bennett University, Greater Noida 201310, Uttar Pradesh, India S. Sarkar · R. Gharat Department of Physics, National Institute of Technology Karnataka, Karnataka 575025, India S. Dutta Department of Electronics and Telecommunication, Jadavpur UniverGreater Noidasity, Kolkata, India S. Gupta Department of Physics, Panjab University, Chandigarh 160036, Chandigarh, India A. Prasad Dash · S. Roy Department of Physical Sciences, Indian Institute of Science Education and Research Berhampur, Berhampur 760010, Odisha, India © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_8

101

102

N. Mishra et al.

train a classical computation model is evergrowing and reaching the limits which normal computing devices can handle. In such a scenario, quantum computation can aid in continuing training with huge data. Quantum machine learning looks to devise learning algorithms faster than their classical counterparts. Classical machine learning is about trying to find patterns in data and using those patterns to predict further events. Quantum systems, on the other hand, produce atypical patterns which are not producible by classical systems, thereby postulating that quantum computers may overtake classical computers on machine learning tasks. Here, we review the previous literature on quantum machine learning and provide the current status of it. Keywords Quantum machine learning · Quantum renormalization procedure · Quantum hhl algorithm · Quantum support vector machine · Quantum classifier · Quantum artificial intelligence · Quantum entanglement · Quantum neural network · Quantum computer

Y. Chatterjee Department of Physics and Astronomy, National Institute of Technology Rourkela, Odisha 769008, India S. Raj School of Physical Sciences, National Institute of Science Education and Research, HBNI, Jatni 752050, Odisha, India S. Bagaria Department of Chemical Engineering, MBM Engineering College, Jodhpur, India S. Chaudhary Department of Physics, Indian Institute Of Technology, Kanpur, Kalyanpur, Kanpur, India V. Singh Indian Institute of Technology (Indian School of Mines), Dhanbad 826004, Jharkhand, India R. Maji Department of Physics, Central University of Karnataka, Karnataka 585367, India P. Dalei · B. K. Behera (B) Bikash’s Quantum (OPC) Private Limited, Balindi, Mohanpur, Nadia 741246, West Bengal, India e-mail: [email protected] B. K. Behera · P. K. Panigrahi Department of Physical Sciences, Indian Institute of Science Education and Research Kolkata, Mohanpur 741246, West Bengal, India e-mail: [email protected] S. Mukhopadhyay BIMS Kolkata, Kolkata 700 097, West Bengal, India e-mail: [email protected]

Quantum Machine Learning: A Review and Current Status

103

1 Introduction Everyday experience in our life makes up our classical understanding, however, it’s not the ultimate underlying mechanism of nature. Our surrounding is just the emergence of the underlying and more basic mechanics known as quantum mechanics. Quantum phenomenons don’t match with our everyday intuition. In fact, for a very long time in the history of science and human understanding, these underlying mechanics were hidden from us. It is only in the last century we came to observe this aspect of nature. As the research progressed, we developed theories and mathematical tools from our renowned scientists. Quantum theory being a probabilistic theory attracts a lot of philosophical debates with it. Many quantum phenomenona such as the collapse of the wave function, quantum tunnelling, quantum superposition, etc still fascinates us. The true quantum nature of reality is still a mystery to our understanding. Quantum technologies aim to use these physical laws to our technological advantage. It is in the last 10 to 20 years the applications based on quantum mechanical laws have improved leaps and bound with the aim to replace or go parallel to the classical machines. Today quantum technologies have three main specializations: quantum computing (using quantum phenomena to computational tasks), quantum information (using quantum phenomena to aid in transfer, storage and reception of information) and quantum cryptography (using quantum phenomena to devise superior cryptography techniques). The power of quantum computation comes from the expansive permutations which make quantum computers twice as memory-full with the addition of each qubit. To specify N bits classical bits system, we need to have N bits of binary numbers. Now, we know in quantum systems the two possible definite states that are |0 and |1. A general state of a bipartite quantum system can be represented as  = α |00 + β |01 + γ |10 + δ |11, we can easily see that from a two-qubit quantum system we get four classical bits of information (α, β, γ, δ). Similarly, from the N-qubit quantum system, we can get 2 N bits of classical information. A mathematical model of computation that conceptually defines the aspect of a machine: contains an automaton, a strip with some symbols, a read-write head and a set of rules (encapsulated by the transition function) which can manipulate symbols on the strip, can be defined as a Turing machine. Quantum computers are universal Turing machines. Quantum mechanics allows superposition of quantum states resulting in quantum parallelism which can be exploited to perform probabilistic tasks much faster than any classical means. Quantum computers are known to solve problems that cannot be solved using a classical computer. One such example includes the factorization of large numbers using Shor’s algorithm [1]. Moreover, if classical and quantum computers are simultaneously utilized for the same purpose, cases can exist in which quantum algorithms prove to be more efficient. These algorithms belong to a particular complexity class called BQP (Bounded-error Quantum Polynomial time) [2]. Another difficult problem in the classical computation model, that is, solving the Pell’s equation, is efficiently solvable in quantum computation model. Similarly, the Non-Abelian

104

N. Mishra et al.

hidden subgroup problem is shown to be efficiently solved using a quantum algorithm [2]. Quantum computers have shown remarkable improvements in the field of optimization and simulation. It includes computing the properties of partition functions, performing approximate optimization and simulating different quantum systems. Quantum simulations have applications in the field of quantum optics and condensed matter physics as well [3]. “Give machines the ability to learn without explicitly programming them”— Arthur Samuel, 1955. The idea of machine learning can be derived from this statement. Briefly, an algorithm that holds the capacity to analyze a huge amount of data, make a pattern out of it and predict the future outcomes can be termed as a machine learning algorithm. The entire concept of machine learning is to optimize a constrained multivariate function; the algorithms to achieve such optimizations are the core of data mining and data visualization in use today. The decision function works on mapping input points to output points. This is the result of optimization. There are several exceptions to this simplistic rule, however, such optimizations are central to learning theory. Applications of such algorithms lead to artificial intelligence. Classically, there are three types of methods in machine learning—1. Supervised Machine Learning [4] where we teach machines to work on the basis of data which are already labeled with some of these’ characteristics, 2. Unsupervised Machine Learning [5] where no labelled data is provided to machines, they analyze these data on the basis of similarities and dissimilarities of their classes, 3. Reinforcement Learning [6] where machines analyze our feedbacks and learn. In quantum machine learning, the most common supervised machine learning algorithm is Quantum Support Vector Machine [7] wherein higher dimension vector space optimisation boundary is used to classify the classes of labelled data, Principal Components Analysis [8] is one of the most common algorithms. It makes a pattern of huge not-labelled data and effectively reduces it to make it easier for further analysis, this is essentially the quantum counterpart of Unsupervised Machine Learning. Quantum computing also servers useful for financial modeling [9]. The randomness that is inherent to quantum computers is analogous to the stochastic nature of financial markets [10]. Today classical computers conduct high frequency stock exchange worth millions of dollars every second. The power of quantum computers can be used to solve such systems. Weather forecasting has been a long goal for scientists, however, predicting weather conditions require taking into account an enormously large number of variables causing classical simulations to be unreasonably lengthy. With their parallel computing power, quantum computers can be used to create better climate models. Various difficulties arise when we try to model complex molecules in classical computers. Chemical reactions lead to the formation of highly entangled quantum superposition states, and are thus quantum in nature [11]. Such states can be modeled accurately with the help of a quantum computer. Yet another fascinating specialization in the field of quantum technologies is quantum cryptography. Quantum cryptography can be defined as the utilization of quantum mechanical properties in order to carry out cryptographic tasks. Publication of Wiesner’s paper in 1983, led to the origination of this field. A well known example

Quantum Machine Learning: A Review and Current Status

105

of quantum cryptography is the quantum key distribution (QKD). In 1984, thanks to Bennett and Brassard, it was possible to obtain a complete protocol of extremely secure quantum key distribution. It is currently known to be the BB84 protocol [12]. This protocol drew a significant amount of attention from the cryptographic community as such security was unattainable by classical means [13]. In this paper, we aim to give a brief description of Quantum Machine Learning and its correlation with AI to unleash the future scope and application of these in human life. We will see how the quantum counterpart of machine learning is way faster and more efficient than the classical machine learning. In the following section, we describe the basic fundamentals of classical machine learning and its methods. Detailed discussion regarding ways by which a machine can learn has also been described in this section. In Sect. 3, aspects of classical machine learning which can be understood and applied to the quantum domain and its implementation have been discussed in detail. Furthermore, subsection on quantum neural networks covers a general introduction to neural networks and variants as they stand in deep learning, background work which has already been processed on quantum neural networks, quantum neuron and quantum convolutional neural network as a mark of deep learning’s transition to quantum computers. In Sect. 4, we discuss in detail about how learning and renormalization procedures invite the application of machine learning. Using previously established relationships between the inputs and the outputs, machine learning derives patterns which are then used to make sense of previously unknown inputs. Some features are far too complex for standard numerical modeling methods. This is where techniques of machine learning play a vital role in solving problems. In Sect. 5, we discuss in detail about quantum HHL algorithm’s much needed importance in solving linear systems. It is known to be a revolutionary algorithm that attempts to estimate solutions of linear systems of equations in logarithmic time. It is applicable as a subroutine in several quantum machine learning algorithms. As we know, data classification is one of the most important tools of machine learning today, we discuss in detail in Sect. 6 about quantum support vector machine. Quantum support vector machines are data classification algorithms that allow classification of data into one of two different categories. This can be done by drawing a hyperplane between our training data. This helps in the identification of data that belongs to a specific category. After this, we can enter our data to be classified and get our result based on its position relative to the hyperplane. Many quantum SVM algorithms exist today and quantum SVMs have been experimentally tested and turned out to be successful. In the next section, i.e. Sect. 7, we discuss in detail about quantum classifiers and its recent developments. A quantum computing algorithm which analyzes quantum states of existing data in order to determine or categorize new data to respective classes is known to be a quantum classifier. Approaches in quantum classifiers have been discussed in explicit detail in this section. In Sect. 8, an overview of applications of quantum machine learning to physics have been discussed. Machine learning methods can efficiently be used in vivid quantum information processing problems which include quantum signal processing, quantum metrology, quantum control, etc. 
The following section, i.e. Sect. 9,

106

N. Mishra et al.

we firstly cover basics of how machine learning can be applied to the idea of quantum artificial intelligence. This section covers two to three AI simulations and explores whether these simulations can be quantized. It explores the possibility of relating quantum computing to artificial intelligence. Lastly, it cites some new developments in the field of quantum artificial intelligence. Section 10 covers some examples of how entangled-state helps ML to be more accurate, efficient and sensitive. Likewise, machine learning also can be used to measure how entangled a state is, so both can be used to make each other better and more efficient than before. Section 11 describes the motivation of quantum neural networks through classical neural networks. Background work in quantum neural networks has also been discussed in detail. We then put forth the concept of quantum neuron which is then followed by the comprehensive discussion of quantum convolutional neural networks. Section 12 reports the use of artificial neural networks to solve many-body quantum systems. This happens to be one of the most challenging areas of quantum physics. Recent use of neural networks for the variational representation of quantum many-body states has initiated a huge attentiveness in this field. Since neural networks assist in the representation of quantum states efficiently, the question of whether or not a simulation of various quantum algorithms is possible has been addressed in Sect. 13. In the next section, i.e. Sect. 14, we report in detail about learning algorithms that are implemented on real quantum hardware. Recent developments in this area of research have been discussed as well. The last section narrates the use of machine learning frameworks, especially RBM (Restricted Boltzmann Machine) networks to represent the quantum many-body wave functions. The possibility of using neural networks to study quantum algorithms and the recent developments in this direction are discussed briefly, thus giving a proper conclusion and a futuristic vision.

2 Classical Machine Learning In this section, we discuss basic machine learning types and models to set the context for various methods by which machines learn. Broadly, a machine may learn either from data or by interaction. We discuss both the learning methods in detail in Sect. 2.1. After this, we discuss the most widely used machine learning models that implement the fore mentioned learning types in Sect. 2.2. Machine learning algorithms were built decades ago when fast computation was a difficult task. Nowadays, with increased computational capabilities, implementing these algorithms successfully is a fairly achievable task. A certain characterization on the basis of ease or difficulty in implementation and computational resources required for implementation can be done for ML algorithms. This is discussed under Sect. 2.3, i.e Computational learning theory.

Quantum Machine Learning: A Review and Current Status

107

2.1 How Does a Machine Learn?—Learning from Data and Learning from Interaction Machine learning has mainly three canonical categories of learning— supervised, unsupervised and reinforcement learning. Fundamentally, supervised and unsupervised learning are based on data analysis and data mining tasks. Whereas reinforcement learning is an interaction-based learning, where learning enhances sequentially at every step. We discuss each learning type in the following sections.

2.1.1

Supervised Learning

In supervised learning, we are provided with a training set D which contains a number of input–output pairs (x, t). The input x could be in general an n-dimensional vector. The primary aim is to infer relationship—linear or nonlinear—between the inputs and outputs, and predict the output for yet unobserved input values. We want to be able to predict the output tˆ(x) for any input x. A potent example of this is spam classifier. Based on a training set of emails classified as spam or not-spam, the classifier labels future emails as either spam or not-spam. To characterize the quality of the prediction made, a loss function is used. Depending on the context, a variety of loss functions can be used which quantify how far the prediction tˆ(x) has been from the actual output t (x). The goal is to minimize this loss function f (tˆ(x), t (x)) Predicting the probability distribution function p(x, t), is a three-step process: 1. Model Selection: We take the probability distribution function to be from a family of functions parameterized by some vector . This is also called inductive bias. We can specify the particular parametric family of distribution functions in two ways—in generative models, we specify the family of point distributions p(x, t|θ), while in discriminative models, we parameterize the predictive distribution p(t|x, θ 2. Learning: The second step is learning where given a training set D, we optimize a learning parameter (here, the loss function f (tˆ(x), t (x). Thus, we find out the parameter θ and by extension the distribution from the family of distribution parameterized by . 3. Inference: The third and the final step is inference. Here the trained model is employed to predict the output tˆ(x) in line with minimizing the loss parameter. Had we used the generative model in step 1, we would need to use marginalization to get to the actual predictive distribution p(t|x), while the discriminative model would directly yield the predictive distribution.

108

2.1.2

N. Mishra et al.

Unsupervised Learning

The main difference between unsupervised learning and supervised learning is the fact that in unsupervised learning, we do not have labelled data points. The training set D contains a set of inputs {x}. Thus, we only have the input data points. The general goal of the process is to extract useful properties from this data. We are interested in the following tasks: 1. Density estimation: Based on the training set, here we directly try to estimate the probability distribution p(x). 2. Clustering: We could want to segregate the data in various clusters based on their similarities and differences. Here, the notion of what similarity is and what difference is, depends on the case at hand, and the particular domain in which you are working. Thus, even though the input data set was not endowed with any labels or classifications, through clustering we readily partition the data points into groups. 3. Dimensionality Reduction: The process involves representing the data point xn in different spaces, thus being better able to visualize the correlations between various factors. This representation is generally done in a space of lower dimensions that better helps in extracting the relationship between the various components. 4. Generation of new samples: Given the data set D, we could want to generate new samples that would follow the probability distribution of the training data approximately. A relevant example is how sports scientists predict the actions of athletes using the same. Just as was the case for supervised learning, here too it is a three-step process: Model selection, learning and using the learned model for clustering or generation of new samples.

2.1.3

Reinforcement Learning

Reinforcement Learning includes making sequential decisions in order to optimize some parameters. It differs from supervised learning, in that, supervised learning included a training set D with input–output pairs where the output is the “correct” answer or the correct characterization of the input. In the case of reinforcement learning, suboptimal paths taken by the algorithms need not be corrected immediately. Reinforcement learning could be seen as the middle ground between supervised and unsupervised learning as there does not exist immediate correct output to the input but there exists some sort of supervision in terms of if the series of steps taken are in the right direction or not. The reinforcement algorithm receives feedback from the environment in place of a desired output for each input. And, this happens only after the algorithm has chosen an output for the given input. The feedback tells the algorithm about how well the chosen steps have helped or harmed in fulfilling the goals. The environment with which the algorithm interacts is formulated as Markov Decision Process.

Quantum Machine Learning: A Review and Current Status

109

2.2 Machine Learning Models 2.2.1

Artificial Neural Networks

Artificial Neural Networks belong to a class of computational models inspired by the biological neural network. Broadly, they mimic the workings of the natural neural network. Just as is required for a certain excitation for a natural neuron to fire, and the series of neurons firing determine the action that is to be taken, for artificial neural networks too, the system abides by the same set of rules at a broader level. Depending on the particular type of problem we are facing, different types of neural network models are used. A general architecture of a neural network could be understood in terms of layers of neurons which fire according to the firing of the neurons in the previous layer. Feed-Forward Neural Networks are used as classifiers. If one wanted to map any input x to an output category y, one could define a function y = f (x; θ) that computes and learns the value of the parameter θ that results in the best function approximation. The nomenclature is derived from the fact that the signal flows in only one direction. Given a particular set of input at the first layer of neurons, the neurons in the subsequent layers fire to provide the classification. Convolutional Neural Network is mostly used to classify images. Here, there is a vector of weights and biases which determines the output value given the inputs. There are multiple hidden layers between the input and the output layers. There is also an activation function (commonly taken to be RELU Layer) and is followed by subsequent pooling layers. The nomenclature derives from the fact that convolution operation is performed using a kernel. There could also be backpropagation to optimize how the weights are distributed. Recurrent Neural Network is a type of neural network where the input to the current step is the output of the previous step. Predictive text is one of the most user-known applications of these. To predict the next word, the algorithm needs to store the previous word. RNNs turn the independent activation into dependent activation by making the weights and biases uniform across different layers. Implementation of the Artificial Neural Network The neural network design consists of the input layer, output layer and the hidden layers. The input dataset is converted into an array of input to be fed into the network. Each input set comes with label value. Once it is fed to the network, the network is trained to determine the output label function of the fed dataset. The deviation from the true label value gives the error, the training parameter thus determines which corresponds to the minimum error. The basic neural network operates with the help of three processes—forward propagation, backward propagation and updating the weight associated with the neuron. A neuron in any m layer receives the input from the (m–1) layer and conveys the output to the (m+1) layer. The value obtained with multiplying the input parameter with the weighted vector is fed to the neurons. In the input layer, a bias is also given.

110

N. Mishra et al.

Forward Propagation employs the methods of preactivation and activation. Preactivation involves feeding the network with weighted inputs. It is the linear transformation of the inputs which decides on the further passing of the input through the network. Activation [14] is the nonlinear transformation of the weighted inputs. Backpropagation serves the most important step in the operation of a neural network. When the deviation of the obtained label value from the true label value is calculated, backpropagation helps in minimizing the error value. The training parameter is updated through each iteration until the error value is minimized. In backpropagation, we travel from the output layer to the input layer. It functions by employing the basic types of gradient descent algorithms to optimize the function. After forward propagation of any input through a neuron with respect to an assumed weight, the error is calculated. The weight with respect to the neuron corresponding to the error is then updated and the error function is changed. Similarly, the weight of every neuron is updated while backpropagating from output to input layer. Thus, the final training parameter is obtained. The analysis of the error value with the corresponding training parameter over each iteration presents the method in which the function of the neural network is implemented for performing different algorithms. Thus, the network is trained.

2.2.2

Support Vector Machines for Supervised Learning

The concept of Support Vector Machine (SVM), was developed by Boser, Guyon and Vapnik in COLT-92, in the year 1992. SVMs are one of the several models belonging to the family of generalized linear classifiers based on supervised learning, and are used for classification and regression. SVMs are systems based on hypothesis space of a linear function in a high dimensional feature space. They are trained with optimization learning algorithms that implement learning bias (based on statistical theory). A potent example of SVM is handwriting recognition. Pixel maps are fed as inputs to such classifiers, and their accuracy is comparable to complicated neural networks with specialized features for handwriting recognition. Many applications, especially for pattern classification and regression-based applications due to promising features like better empirical performance of SVMs have come forth. Some examples include handwriting analysis, face analysis and so on. SVMs perform better than neural networks in some scenarios because it uses the Structural Risk Minimization (SRM) principle, which is superior to traditional Empirical Risk Minimization (ERM) principle in use in conventional neural networks. The core difference between the two—minimization of the upper bound on the expected risk by SRM as opposed to the minimization of an error on training data by ERM—allows SVMs to generalize better than conventional neural networks, and are thus better suited to statistical learning. Due to this, modern times have seen increased applications of SVMs to regression problems, in addition to traditional classification problems.

Quantum Machine Learning: A Review and Current Status

111

Why SVM?: Neural networks show good results for several supervised and unsupervised learning problems. The most common—Multilayer perceptrons (MLPs)— include several properties like universal approximation of continuous nonlinear functions, learning with input–output patterns and advanced network architectures with multiple inputs and outputs. The problem with these is the existence of several local minima where the optimization algorithm may get stuck, as well as the problem of determining the optimal number of neurons, to be used in the neural network architecture. There is one other problem: convergence of a neural network solution does not guarantee a unique solution.

2.3 Computational Learning Theory Given the number of machine learning algorithms available, there arises a need to characterize the capabilities of these machine learning algorithms. It is natural to wonder if we can classify machine learning problems as inherently difficult or easy [15]. With this come the obvious questions about quantifying a ‘suitable’ or ‘successful’ machine learning algorithm for various classes of problems. The success of a machine learning algorithm can depend upon several parameters like sample complexity, computational complexity, mistake bound, etc. Computational learning theory or COLT is a subfield of artificial intelligence in which the theoretical limits on machine learning algorithms, encompassing several classes of learning scenarios, are analyzed [16]. The standard procedure in COLT is to mathematically specify an environment or problem, apply learning algorithms on the problem, and characterize the performance of an optimal learning algorithm [15]. To study the aspects of COLT, we consider the probably approximately correct learning model or more colloquially called the PAC model. The basic PAC learning model [17] introduced by Valiant can quantify learnability. We consider the case of learning Boolean valued concepts from training data. Consider the set of all possible instances over which target functions can be defined. Let this set be X. Let C be some set of target concepts our learner L has to learn. C is essentially a subset of X. The instances in X are generated at random according to probability distribution D. The learner L learns from target concepts in C and considers some set H of possible hypotheses describable by conjunctions of n attributes that define elements in X. After learner L learns from a sequence of training examples of target concept c in C, L outputs some hypothesis h from H which is L’s estimate of c. Output hypothesis h is only an approximation of actual target concept c. The error in approximation or true error of hypothesis h with respect to concept c, denoted by error D (h), is the probability that h will misclassify an instance drawn from D at random. We aim to identify classes of target concepts that can be reliably learned from a reasonable number of training examples. Ideally, for a sufficient number of training examples, we require error D (h) to be 0. This assumption is practically impossible owing because multiple consistent hypotheses may exist for c, and there is always a nonzero finite probability that a training example may be misleading for the learner.

112

N. Mishra et al.

Therefore, in a practical scenario, we focus on minimizing error D (h) and limiting the number of training examples required. For some class C of target concepts learned by learner L using hypothesis space H, let δ be an arbitrarily small constant bounding the probability of failure or the probability of misclassifying c. Also let  denote an arbitrarily small constant bounding the upper limit of error D (h) such that error D (h) is less than . C is said to be PAC learnable by L using H if for all c belonging to C, distributions D over X,  such that 0 <  < 21 and δ such that 0 < δ < 21 , learner L with probability at least (1 − δ) will output a hypothesis h belonging to H such that error D (h) is less than , in time that is polynomial in 1 , 1δ , n and size(c) [15]. Thus, for a PAC learnable concept, L must learn from a polynomial number of training examples. Here we defined PAC learnability of conjunctions of Boolean literals. PAC learnability can be defined for other concept classes too. PAC learnability greatly depends upon the number of training examples required by the learner. The sample complexity of the problem is growing in the number of training examples with problem size. Mathematical formulations for sample complexity for finite and infinite hypotheses spaces are available. Generally, the sample complexity of infinite hypotheses spaces is illustrated by the Vapnik-Chervoneniks dimension [18]. Collectively, PAC learnability and bounds of sample complexity relate to computational complexity. A number of training examples from which learner learns in polynomial time of PAC learnable bounds defines a finite time per training example. Here comes the need of handling enormous data in the least possible time.

3 Quantum Machine Learning Training the machine to learn from the algorithms implemented to handle data is the core of machine learning. This field of computer science and statistics employs artificial intelligence and computational statistics. The classical machine learning method, through its subsets of deep learning (supervised and unsupervised) helps to classify images, recognize pattern and speech, handle big data and many more. However, as of today, a huge amount of data is being generated. New approaches, therefore, are required to manage, organize and classify such data. Classical machine learning can usually recognize patterns in data, but the few problems requiring huge data cannot be efficiently solved by classical algorithms. Companies dealing in big database management are aware of these limitations, and are thus actively looking for alternatives: quantum machine learning being one of them. The use of quantum phenomena like superposition and entanglement to solve problems in classical machine learning paves the way to quantum machine learning [19]. algorithms in quantum systems, by Quantum computers, use the several superposition states |0 and |1 to allow any computation procedure at the same time, providing an edge over classical machine learning. In the quantum machine learning techniques, we develop quantum algorithms to operate the classical algorithms with the use of a quantum computer. Thus, data can be classified, sorted and analyzed

Quantum Machine Learning: A Review and Current Status

113

using the quantum algorithms of supervised and unsupervised learning methods. These methods are again implemented through models of a quantum neural network or support vector machine.

3.1 Implementation of Quantum Machine Learning Algorithms In the implementation of algorithms, we broadly discuss the process of the two major learning processes—supervised and unsupervised learning [20]. The pattern is learned observing the given set of training examples in case of supervised learning. While finding the structure from some clustered data set is done in unsupervised learning. Quantum clustering technique [20] uses the quantum Lloyd’s algorithm to solve the k-means clustering problem. It basically uses repetitive procedures to obtain the distance of the centroid of the cluster. The basic methods involve choosing randomly an initial centroid and assigning every vector to the cluster with the closest mean. Repetitive calculation and updating the centroid of the cluster should be done till the stationary value is obtained. The quantum algorithm speeds up the process in comparison to the classical algorithm. For an N-dimensional space, it demands O(Mlog(MN)) time to run the step for the quantum algorithm. Quantum Neural Network [21] model is the technique of deep supervised learning to train the machine to classify data, recognize patterns or images. It is a feed-forward network. The basic principle is to design the circuits with qubits (being the analogue of neurons) and rotation gates (being the analogous of weights used in classical neural networks). The network learns from a set of training examples. Every input string comes with a label value. The function of the network is to obtain the label value of the data set and minimize the deviation of the obtained label from the true label. The focus is to obtain the training parameter that gives the minimum error. The training parameter is updated through every iteration. Error minimization is done by the backpropagation technique, which is based on gradient descent principle. Quantum Decision Tree [22] employs quantum states to create classifiers. Decision trees are like normal tree structures in Computer Science: with one starting node named the root having no incoming edge and all outgoing edges leaving to other internal nodes or leaves. In these structures, the answer to a question is classified as we move down. The node contains a decision function that decides the direction of movement of the input vector along the branches and leaves. The quantum decision tree learns from training data: each node basically splits the training data set into subsets based on discrete function. Each leaf of the decision tree is assigned to an output class based on the target attributes desired. Thus, the quantum decision tree classifies data among its different components: leaves and root/internal nodes (which contain decisions based on which one of their child nodes are traversed for further classification).

114

N. Mishra et al.

Quantum machine learning provides a huge scope in computing the techniques done in classical machine learning on a quantum computer. The entanglement and superposition of the basic qubit states provide an edge over classical machine learning. Apart from neural networks, clustering methods, decision trees, quantum machine algorithms have been proposed for several other applications of image and pattern classification, and data handling. Further implications of the algorithms has been discussed in the paper.

4 Application of Machine Learning for Learning Procedure and Renormalization Procedure Machine learning derives patterns from data in order to make sense of previously unknown inputs. Some features are far too complex for standard numerical modelling methods. This is where techniques of machine learning play a vital role in solving problems. Recent progress in the field of machine learning has shown significant promise in applying ML tools like classification or pattern recognition to identify phases of matter or nonlinear approximation of arbitrary functions using neural networks. Machine learning has become the central aspect of modern society: in web searching, in recommendation systems, in content filtering, in cameras and smartphones, speech-to-text and text-to-speech converters, identify fake news, sentiment analysis and many more. More often, such applications require deep neural networks (having several hidden layers with many neurons) called deep learning. The renormalization group (RG) approach underlies the discovery of asymptotic freedom in quantum chromodynamics and of the Kosterlitz–Thouless phase transition as well as the seminal work on critical phenomena. The RG approach is a conceptual framework comprising techniques such as density matrix RG, functional RG, real-space RG, among others. The essence of RG lies in determining the ‘relevant’ degrees of freedom and integrating the ‘irrelevant’ ones iteratively. We thus arrive at a universal, low-energy effective theory. The RG procedure, which reveals the universal properties that, in turn, determine their physical characteristics, systematically retains the ‘slow’ degrees of freedom and integrates out the rest. The main problem, however, is to identify the important degrees of freedom. Consider a feature map which transforms any data X to a different, more coarse grain scale (1) x → φλ (x) The RG theory requires that the Free Energy F(x) is scaled, to reflect that the free Energy is both Size-Extensive and Scale- Invariant near a Critical Point. The Fundamental Renormalization Group Equation (RGE): F(x) = g(x) +

1 F(φλ (x)) λ

(2)

Quantum Machine Learning: A Review and Current Status

115

5 Quantum HHL Algorithm A quantum algorithm for solving linear systems of equations was put forward by Aram Harrow, Avinatan Hassidim and Seth Lloyd in 2009 [23]. More specifically, the algorithm can estimate the result of a scalar measurement on the solution vector b to a given linear system of equations Ax = b. In this context, A is a N × N Hermitian matrix with a spectral norm bounded by 1 and a finite condition number κ = |λmax /λmin |. The HHL algorithm can be efficiently implemented only when the matrix A is sparse (at most poly(log N ) entries per row) and well-conditioned (that is, its singular values lie between 1/κ and 1). We also emphasize the term “scalar measurement” here: the solution vector x produced by the HHL subroutine is actually (approximately) encoded in a quantum state |x ˜ of log2 N  qubits and it cannot be directly readout; in one run of the algorithm we can at most determine some statistical properties of |x by measuring it in some basis or rather sampling using some quantum mechanical operator M, i.e. x|M| ˜ x. ˜ Even determining a specific entry of the solution vector would take approximately N iterations of the algorithm. Furthermore, the HHL requires a quantum RAM (in theory): that is, a memory which can create the superposition state |b (encoded b) all at once, from the entries {bi } of b without using parallel processing elements for each individual entry. Only if all these conditions are satisfied, the HHL runs in the claimed O(log N s 2 κ2 /) time, where s is the sparsity parameter of the matrix A (i.e. the maximum number of nonzero elements in a row) and  is the desired error [24, 25]. Given all these restrictions, at first sight, the algorithm might not seem to be too useful; however, it is important to understand the context here. The HHL is primarily used as a subroutine in other algorithms and not meant as an independent algorithm for solving systems of linear equations in logarithmic time. In other words, the HHL is suitable for application in special circumstances where |b can be prepared efficiently, the unitary evolution e−i At can be applied in a reasonable time frame and when only some observables of the solution vector x are desired rather than all its elements. The 2013 paper by Clader et al. [26], is a concrete demonstration of such a use case of the HHL with a very real-world application, i.e., calculation of electromagnetic scattering cross-sections of any arbitrary target faster than any classical algorithm (Fig. 1). The HHL algorithm comprises of three steps: phase estimation, controlled rotation and uncomputation [27, 28].     For the first step of the algorithm, let A = j λ j u j u j . Considering the case    when the input state is one of the unitary  eigenvectors of A, |b = u j . Given a iφ operator U with eigenstates u j and corresponding complex eigenvalues e j , the technique of quantum phase estimation allows for the following mapping to be implemented:       |0 u j → φ˜ u j (3)

116

N. Mishra et al.

Fig. 1 HHL algorithm schematic: a Phase estimation b R(λ˜ −1 ) rotation c Uncomputation

Here, φ˜ is the binary representation of φ to a certain precision. In the case of a λ j , the Hermitian matrix A, with eigenstates u j and corresponding eigenvalues  matrix ei At is unitary, with eigenvalues eiλ j t and eigenstates u j . Therefore, the technique of phase estimation can be applied to the matrix ei At in order to implement the mapping       |0 u j → λ˜j u j (4) where λ j is the binary representation of λ j .   In the second step of the algorithm, a controlled rotation conditioned on λ j is implemented. For this purpose, a third ancilla register is added to the system in state |0, and performing the controlled Pauli-Y rotation produces a normalized state of the form    C  ˜    C 2  ˜    1 − (5) λ j u j |0 + λ j u j |1 λ˜j λ˜2 j

where C is the normalization constant. This can be done by applying the operator e

−iθσ y

cos θ − sin θ = sin θ cos θ

(6)

˜ where θ = tan−1 (C/λ).         A−1 = j (1/λ j ) u j u j . Since by definition A = j λ j u j u j , therefore,  We assume that we are given This state  a quantum state |b = i bi |i.    can be expressed in the eigen basis u j of operator A, such that |b = j β j u j . Using the above procedure on this superposition state, we get the state        C C2  |0 + |1 . β j λ˜j u j 1 − λ˜j λ˜2j j=1

N

We uncompute the first register, giving us

(7)

Quantum Machine Learning: A Review and Current Status



     C C2  |0 ⊗ |0 + |1 . βj u j 1− λ˜j λ˜2j j=1 N

117

(8)

β j   u j . Thus, a quantum state close to |x = λ˜j A−1 |b can be constructed in the second register by measuring the third register and post selecting on the outcome’1’, modulo the constant factor of normalization C. Amplitude amplification can be used at this step instead of simply measuring and postselecting to boost the success probability. Notably, Tang’s 2018, thesis titled A quantum-inspired classical algorithm for recommendation systems [29] essentially demonstrated that solutions based on the HHL for several linear algebra problems, which were earlier believed to have no equivalent to HHL in terms of time complexity, can be dequantized and solved with equally fast classical algorithms. Furthermore, the only caveat for Tang’s algorithm is the allowance of sample and query access, and that is far more reasonable than efficient state preparation as demanded by the HHL. However, this doesn’t imply the HHL has been rendered obsolete; we must be careful to note that Tang’s algorithm is specifically aimed at low- dimensional matrices, whereas the original HHL was meant for sparse matrices, albeit quantum machine learning for low- dimensional problems are the most practical algorithms in the literature as of now. Nevertheless, generation of arbitrary quantum evolutions for state preparation remains as hard as ever [30–34]. It is to be noted that, A−1 |b =

N

j=1

6 Quantum Support Vector Machine Data classification is one of the most important tools of machine learning today. It can used to identify, group and study new data. These machine learning classification tools have been used in computer vision problems [35], medical imaging [36, 37], drug discovery [38], handwriting recognition [39], geostatistics [40] and many other fields. Classification tools have machines to identify data, and therefore, know how to react to a particular data. In machine learning one of the most common methods of data classification using is using Support Vector Machines (SVM). The SVM is particularly useful because it allows us to classify data into one of two categories by giving in an input set of training data by drawing a hyperplane between the two categories. Quantum SVM machines have been recreated both theoretically [7] and experimentally [41]. These machines use qubits instead of classical bits to solve our problems. Many such quantum SVM [42–44] and quantum-inspired SVM [45, 46] algorithms have been developed. In a support vector machine, in general, we shall have our training data of n points x2 , y2 }…{ xn , yn } according to the form {{ xi , yi } : xi ∈ R N , yi = given as { x1 , y1 },{ ±1}i=1,2,...,n where xi indicated the location of the point in the space R N and yi

118

N. Mishra et al.

Fig. 2 Maximum margin hyperplane for a SVM

classifies the data as either +1 or –1 as to indicate the class to which it belongs. One of the simplest ways to divide such a data (if it is linearly separable) is by using any plane that satisfies the equation. w.  xi − b = 0

(9)

b is the offset from the origin. In Here w  is a vector normal to the hyperplane and |w|  general, it’s sometimes possible that many planes satisfy this equation so to draw a hard margin SVM where we try to construct two parallel hyperplanes with a maxi2 in between. The construction of these hyperplanes is mum possible distance of |w|  done so that w.  xi − b  1 for yi = 1 and w.  xi − b  1 for yi = −1. We can write this  xi − b)  1. This hyperplane clearly discriminates between out two types as yi (w. of data points (Fig. 2). While data often can be classified into two sets using the aforementioned method, often the data is nonlinear and method cannot be used. The common method to solve such problems is using the kernel trick where the problem is brought to a higher dimension where a hyperplane can be easily used to solve the issue. To do this, we need to use Lagrangian multipliers (α = (α1 , α2 , . . . αn )) and solve the dual formulation for optimization. Hence we can use the formula to get our solution

Quantum Machine Learning: A Review and Current Status

n

119

1 max αi − αi yi α j y j K i j 2 i=1 j=1 i=1 n

n

(10)

where K i j is the kernel matrix xj n and the dot product of the space given as K i j = xi . subject to the constraints i=1 αi yi = 0 and αi yi  0 hence the decision function of the hyperplane becomes f (x) = sgn

n

αi yi K (xi , x) + b

(11)

i=1

n αi yi .xi . Hence even nonlinear problem can be solved where we can write w = i=1 using a SVM. But sometimes the number of dimensions needed to solve a problem inadvertently turns extremely high. This leads what can be called as the Curse of Dimensionality where we have increased complexity and over-fitting due to an increasing sparse matrix defining the location of data points. This is where the quantum SVMs become important. While problems in higher dimensions are extremely tedious to solve using classical computers, the exponential speedup observed in quantum computers by using the quantum SVM algorithms very effectively sorts our data. But to solve it using a quantum computer, we need to bring our solution to a form where it will be easy for a quantum algorithm to solve it. One of the primary things to do at this point is to provide the algorithm with a certain scope of misclassification so that we do not have a problem with over-fitting, and have a variable ξi called a slack variable where ξi  0 using which we can measure the misclassification. We can now write out the following optimization problem

1 ξi min ||w|| + C 2 i=1 n

(12)

where C is the cost parameter. Let also take C = γ/2 where γ is also a form of cost parameter. Once we have set these values we can write our equation as w.  xi − b = 1 − ξi . Looking for the saddle points of the above equation using our given constraints we get the equation. F





b 0 b 0 1T = = 1 K + γ −1 I α  yi α 

(13)

Now to solve our classical algorithm in our quantum computer, we need to transform our algorithm into a quantum one. For this, firstly, we shall convert our training instances to quantum states in the form |xi . Now we shall convert our matrix F = J + K γ where

120

N. Mishra et al.

Fig. 3 Circuit of quantum SVM

Fig. 4 Matrix inversion

T 01 J= 1 0



0 0 Kγ = 0 K + γ −1 I

F Now we shall normalize F as Fˆ = tr(F) = Hausdorff formula we get our equation as

e

ˆ −i Ft

=e

−i J t Kγ

.e

F tr(K γ )

(14)

and now using Baker–Campbell-

−iγ −1 I t Kγ

.e

−i K γ t Kγ

(15)

This simplifies our equation to a form where we can find the eigenvalues and eigenbasis of our equation to find out desired values of b and α. Therefore, we can now find the hyperplane. One of the main advantages of using the quantum SVM is that the speed of execution is increased exponentially [47]. While this method can only be used for a dense training vector, other algorithms have been proved for sparse training vectors [44]. We can also create a circuit diagram [41] of this (Fig. 3). In this circuit, we use the matrix inversion to get the parameters of the hyperplane. Then we enter the training data. After this is done we enter the data x0 to get what classification our data belongs to. These can be drawn as (Figs. 4 and 5). Where F is the (M + 1) × (M + 1) matrix which contains the part of the Kernel K and th1 and th2 are the training data, and th0 is the data of the position of Ux0 . Hence, we can see that quantum SVMs are one of the most effective methods of classifying data. These equations are faster than all other methods to perform data classification. They can also be implemented with ease in most systems. There are some limitations of these systems though. Firstly, these systems can often massively overfit data. That could lead to very data point being a support vector. This is something that is not

Quantum Machine Learning: A Review and Current Status

121

Fig. 5 Training oracle data and Ux0

desirable and could lead to issues in large data sets. It can also make the hyperplane very rigid and would leave very little scope for error. We would have to increase the scope for a soft SVM. Secondly, these systems work well with linear and polynomial kernels but can cause issues in other kernels. But since most of these systems are either polynomial or linear this is usually not an issue, Non Symmetrical kernels can also cause issues. These also form one of the important problems in the future. Solving a general kernel will especially be an important problem to solve. These algorithms will allow us to solve more complicated and specific problems. These would increase the scope of Quantum SVM into a more general application.

7 Quantum Classifier A quantum classifier is a quantum computing algorithm which uses quantum states of the existing data to determine or categorize new data into their respective classes. In the following subsection we discuss about the background work on quantum classifiers, and how they have been implemented on a quantum computer.

7.1 Current Work on Quantum Classifiers In a recent paper by Microsoft [48] presented a quantum framework for supervised learning based on variational approaches (Fig. 6). Further actions are performed by the QPU(the quantum processing unit), consisting of a state preparation circuit Sx that encodes the input x into amplitudes, a model circuit Uθ , and single-qubit measurement. Such a QPU serves to perform inference with the model, or mathematically, determination of f (x, θ) = y by measuring the amplitudes prepared by the state preparation circuit and operated upon by the model circuit. Such measurements yield a 0 or a 1 from which binary prediction can be inferred. The classification circuit parameters θ are learnable and can be trained by

122

N. Mishra et al.

Fig. 6 Idea of the circuit-centric quantum classifier [48]

a variational scheme. Given a n-dimensional ket vector ψx representing an encoded

feature vector, the model circuit maps this to another ket ψ = Uθ ψ(x) by Uθ , where Uθ is necessarily unitary, parameterised by θ (Fig. 7). The above circuit has code blocks B1 and B3 with control r = 1 and r = 3, respectively. There are 17 trainable single-qubit gates G = G(α, β, γ) and 16 trainable controlled single-qubit gates C(G) which must be decomposed into elementary constant gate set for the underlying quantum hardware to use it. If optimization methods are employed such that all controlled gates are reduced to a single parameter, we get 3 × 33 + 1 = 100 learnable parameters in the model circuit, which, in turn, can classify between inputs belonging to 28 = 256 dimensions. Such flexibility such a model is much more compact than conventional feed-forward neural networks. Farhi and Neven’s [49] paper discussed about a quantum neural network (QNN), capable of representing labelled classical or labelled quantum data, and being trained by supervised learning techniques. For instance, a data set may consist of strings z = z 1 z 2 . . . z n such that each z i represents a bit taking the value +1 or −1 and a

Fig. 7 Generic model circuit architecture for 8 qubits [48]

Quantum Machine Learning: A Review and Current Status

123

Fig. 8 Schematic proposed of the quantum neural network on a quantum processor by Farhi and Neven [49]

binary label l(z) chosen as +1 or −1. Also, there exists a quantum processor acting on n + 1 qubits (ignoring the possibility of requiring ancilla qubits). The very last qubit shall be used as a readout. This processor applies unitary transforms on given input states; these are the unitaries that we have come from some toolbox of unitaries, maybe determined by experimental considerations [50] (Fig. 8). Now, preparation of the input state |, 1 occurs followed by transformation via a sequence of qubit unitary transformations Ui (i ) depending on parameters i . They are automatically adjusted during the operation, and the measurement of Yn+1 on the readout qubit produces the desired label for |. A paper by Grant et al. [51], discusses the application of hierarchy-structured quantum circuits to applications involving binary classification of classical data encoded in a quantum state. The authors come up with more expressive circuits successfully applied to the problem of classification of highly entangled quantum states. Such circuits are tree-like, parameterized by a very simple gate set which currently available quantum computers can handle. First of these is called a tree tensor network (TTN) [52]. We further consider more complex circuit layout: multiscale entanglement renormalisation ansatz (MERA) [53]. MERAs are different from TTNs in the sense that they use additional unitaries to capture quantum correlations more effectively. Both one-dimensional (1D) and two-dimensional (2D) versions of TTN, and MERA circuits have been proposed in the literature [54, 55] (Fig. 9). Earlier this year, Turkpençe et al.’s [56] paper on steady state quantum classifier demonstrates exploitation of additivity and divisibility properties of completely positive (CP) quantum dynamical maps. He also numerically demonstrates a quantum unit in its steady state, when subjected to different information environment behaves as a quantum data classifier. Dissipative environments influence the reduced system dynamics in a way that they affect the evolution of pure quantum states into mixed steady states [57]. Such mixed states are mixtures of classical probability distributions carrying no quantum signature. This model was used to demonstrate the usefulness of a small quantum system in classifying data contained in the quantum environments when the former is left in contact with these quantum environments. Figure 10b depicts a few commonly used activation functions. For instance, a step function yields f (y) = 1 if y = i xi wi  0 and yields f (y) = −1 else. After these

124

N. Mishra et al.

Fig. 9 TTN and MERA classifiers for eight qubits. a TNN Classifier, b MERA classifier [51]

results, if a line correctly separates the data instances, this corresponds to a properly functioning perceptron. To benchmark these calculations, the authors make contact between the single spin and the data reservoir in the ρπ =|↓↓| fixed quantum state. They then apply information as shown in Fig. 10d. The authors observe that the time evolution of spin magnetization converges to (σz (t)) = −1 as the spin density matrix approaches to the unit fidelity

 √ ρπσS(t) ρπ = 1 (16) (t) = Tr with the fixed reservoir state monotonically.

Quantum Machine Learning: A Review and Current Status

125

Fig. 10 A general view of the proposed model. a A classical perceptron with N inputs. b A few of the activation functions for the perceptron, c The scheme of the proposed quantum classifier. d Collision model to simulate quantum dynamic systems. e Time evolution of the single spin magnetization depending on the number of collisions. f The Bloch ball vector trajectory of the single spin during the evolution [56]

8 Application of Quantum Machine Learning to Physics Machine learning methods have been effectively used in various quantum information processing problems including quantum metrology, Hamiltonian estimation, quantum signal processing, problems of quantum control and many others. The construction of advanced quantum devices including quantum computers use the techniques of quantum machine learning and artificial intelligence. Machine learning and Reinforcement learning techniques are used to create entangled states. Automated machines can control complex processes, for example, the execution of a sequence of simple gates, as used in quantum computation. While performing the quantum computation, decoherence or noise can be dealt with, by using advanced techniques of machine learning. Optimization algorithms are also used in the optimization of QKD-type cryptographic protocols in noisy channels [58]. Online Machine learning optimization can be used for determining the optimal evaporation ramps for BoseEinstein condensates production [59]. The overlap between the theoretical foundations of machine learning and quantum theory is due to the underlying statistical nature of both. In the field of condensed matter physics, the identification of different phases and determining the order parameters can be done with the help of unsupervised learning. The problem of the Ising model configurations of thermal states can be solved using unsupervised learning techniques. Besides detecting the phases, the order parameters (for example magnetization in this case) can also be identified. Even without any prior knowledge about the system Hamiltonian, we can get information about the phases using these techniques.

126

N. Mishra et al.

9 Quantum Machine Learning and Quantum Artificial Intelligence Quantum Artificial Intelligence is still a much more debatable concept. However, in the following few subsections, we try to understand some basic AI and machine learning terminologies, and finally see how they can be modified using quantum information processing and quantum computing. At the end of this section, we cite some recent developments in this field.

9.1 Basic Terminologies Human intelligence allows us to accumulate knowledge, understand it and use it to make the best decisions. The field of AI aims to simulate such kind of process. The most important part of AI is machine learning (ML). ML tries to formalize algorithms which can learn and predict using some initial data. ML broadly can be classified into two fields viz. Supervised Intelligence and Unsupervised Intelligence. Supervised intelligence maps input to output using labels. Unsupervised learning, on the other hand, doesn’t use labels and rather uses samples based on some specified rules. Another class of ML is Responsive Learning (RL). This is important with a quantum information perspective. In RL the environment is interactive rather than being static. The agent interacts with the environment and gets rewarded if its behaviour is correct. The agent learns through its cumulative experiences. An intelligent agent may be defined as an autonomous entity which can store data and act to achieve some goals.

9.2 Quantum Artificial Intelligence The bigger question now is “Can quantum world offer something to the field of AI ?”. We will now try to relate quantum information processing to AI using some kind of simulations. Quantum Computing (QC) can simulate large quantum data and can enable faster search and optimization. This in particular is very helpful for AI. For example, Govern algorithms and its several variants have a quadratic speed-up in search problems, as well as recent advances in Quantum Machine learning, have caused exponential gains in machine learning problems. We now try to understand Projective Simulation (PS). The agent is placed in some environment, such that the agent can act on the environment, and the environment responds as certain physical inputs. Hence, the agent learns from experience. The main part of the PS model is Episodic Computational Memory (ECM). ECM helps the agent to project itself, and thus induces a random walk through episodic memory space. PS model can easily be quantized. A quantum-enhanced autonomous agent is any agent interacting with classical environments but having a memory with quantum degrees of freedom.

Quantum Machine Learning: A Review and Current Status

127

The agent now takes a quantum walk through its memory space. The transitions generated are now quantum superpositions and can interfere. Also, quantum jumps are generated between different clip states. Since the PS model can be quantized, the model can potentially reach high speed-ups in exploring memory. Hence, the extension of PS model to quantum regime defines for the first time the meaning of embodied quantum agents.

9.3 Recent Developments A team at the University of Pavia in Italy, conducted early research into artificial neural networks, experimented upon IBM’s 5-qubit quantum hardware. Furthermore, there has been a collaboration between IBM and Raytheon BBM, in 2017, to improve the efficiency of certain black-box machine learning tasks. Currently, superconducting electronics has received attention as being a viable candidate for the creation of quantum hardware, with Google’s Quantum AI Lab and UC Santa Barbara’s partnership in 2014, being the latest venture. According to a recent research paper on “Quantum Computing for AI Alignment”, as of now, we can’t expect QC to be relevant to current AI Alignment research due to safety reasons until some protocols are made as efficient as possible. We conclude this section by quoting Sankar Das Sarma and Dong-Ling Deng and Lu-Ming Duan, who wrote “It is hard to foresee what the quantum future will look like. Yet one thing is certain: The marriage of machine learning and quantum physics is a symbiotic relationship that could transform them both.”

10 Entanglement in Quantum Machine Learning Quantum Non-locality and Entanglement was recognized as a key feature of Quantum Physics. Entanglement can be described as correlations between distinct subsystems which cannot be created by local actions on each subsystem separately. In quantum Entanglement, two or more particles which are separated (space like separated) are correlated in such a way that local measurements in any one of the particles will affect the other particle(s) far away, i.e. ‘The spooky action at a distance’ stated by Albert Einstein. This basic nature of quantum particles is due to entanglement. Entanglement received a lot of attention since the advent of quantum mechanics (EPR paradox [60]), and still remains an area of active research. Let us consider a very simple example. Let’s take two shocks (qubits) and each of the shocks can be right one or left one or superposition of both right and left with some probabilities. The right shock is represented by |0 and left one by |1. Now if we want to represent a group of two shocks we will take a tensor product of the two. The composite system of two shock is represented by |ψ such as |ψ = (a |0 + b |1) ⊗ (c |0 + d |1) |ψ = (ac |0 |0 + ad |0 |1 + bc |1 |0 + bd |1 |1), where a,b,c,d are probabili-

128

N. Mishra et al.

ties coefficient. If we want to form a pair of shocks from these set then the coefficients must be selected such as to cancel the two terms of the sum that is |00 and |11 conserving the other two. Now we cannot obtain the quantum state as a tensor product of the individual shocks. This is because between them to form a pair they have some correlation. That when these qubits are separated (space like separated), their correlations remains, even without the existence of any interaction. It is said the shocks (qubits) are entangled; it is not possible to represent the composite system as a tensor product of individual qubit states. These qubits are interconnected, such that measurement on one qubit affects the other. This feature is extensively used in Machine learning as it reduces no of qubits required to perform the same task in classical machine learning. However, there are some demerits of using Quantum Machine Learning as well which is discussed in the work of Cristian [61] and later in our conclusion. In 2015, Cai [62] and his group did a work in which they demonstrate a very difficult task for conventional machine learning—manipulation of high dimensional vectors and the estimation of the distance and inner product between vectors—can be done with quantum computers. It is, for the first time, reported using a small-scale photonic quantum computer to experimentally implement classification of 2, 4 and 8 dimensional vectors into different clusters. These clusters are then used to implement supervised and unsupervised machine learning. Theoretically, this can be scaled to a larger number of qubits, thereby paving a way for accelerating machine learning. In 2018, another work is done by Liu [63] and his group. They implemented simple numerical experiments, related to pattern/image classification, in which they represent the classifiers by many-qubit quantum states written in the matrix product states (MPS). Classical machine learning algorithm is applied to these quantum states to learn the classical data. They explicitly show how quantum entanglement (i.e. single-site and bipartite entanglement) can emerge in such represented images. Entanglement characterizes here the importance of data, and such information is practically used to guide the architecture of MPS, and improve the efficiency. The number of needed qubits can be reduced to less than 1/10 of the original number, which is within the access of the state-of-the-art quantum computers. In recent work by Yoav Levine [64] and his group, it is established that deep convolutional neural networks and recurrent neural networks can represent highly entangled quantum systems. They constructed Tensor Network equivalents of such architectures and identified that the information in the network operation can be reused; this trait distinguishes these from other standard Tensor Network-based representations, thereby increasing their entanglement capacity. These results show that volume-law entanglement can be supported by such architectures, and these are polynomially more efficient than presently employed RBMs. Analysis of deep learning architectures lead to quantification of the entanglement capacity, as well as formally motivating a shift of trending neural network-based wave function representations, closer to the state-of-the-art in machine learning. Neural Network is one of the most significant sides of machine learning and artificial intelligence. 
To make machines learn from the data patterns, analyze the data on its own, scientists made algorithms to simulate our natural neural network. Warren

Quantum Machine Learning: A Review and Current Status

129

Fig. 11 Performance contrast between entangled-state SVM and separable-state SVM [69]

McCulloch of the University of Illinois and Walter Pitts of the University of Chicago, developed the theoretical basis of neural networks in 1943. In 1954, Belmont Farley and Wesley Clark of MIT, developed the first neural network for pattern-recognition skills [65]. In this context, we see that as the demand for machine learning is increasing day by day, understanding the physical aspects of neural network is to be increased certainly, and this is one of the sides where the study of entanglement properties has to be done. Focusing on the RBM [66] (Restricted Boltzmann Machine), Dong-Ling Deng, Xiaopeng Li and S. Das Sarma, in 2017, studied [67] the entanglement properties, and they found that for short RBM states entanglement entropy follows the area law which is also inspired by the holographic principle [68] that states all the informations reside on the surface of the black hole, hence the entropy depends on its surface not on volume. For any dimension and arbitrary bipartition geometric R-range RBM states, entanglement entropy becomes S ≤ 2a(A)R log 2

(17)

where a(A) is the surface area of subsystem A. In the limit, N → ∞ (N qubits), the entropy starts to vary linearly with the size of the system—entanglement volume law. Supervised Learning can be enhanced by Entangled Sensor Network as shown by Zhuang and Zhang, early this year [69]. So far existing quantum supervised learning schemes depend on quantum random access memories (qRAM) to store quantum-encoded data given a priori in a classical description. However, the data acquisition process has not been used while this process makes the maximum usage of input data for different supervised machine learning tasks, as constrained by the quantum Cramér-Rao bound. They introduced the Supervised Learning Enhanced by an Entangled sensor Network (SLEEN). The entanglement states become handy in quantum data classification and data compression. They used SLEEN to construct entanglement-enhanced SVM and entangled-enhanced PCA and for both the cases, they got genuine advantages of entangled states—data classification and data compression, respectively (Fig. 11). In the case of SVMs, while separable-state SVM becomes inaccurate due to measurement uncertainty making the data classification less contrasting when entangledstate SVM is not affected by the uncertainty keeping the output as expected (Fig. 12).

130

N. Mishra et al.

Fig. 12 Performance contrast between entangled-state PCA and separable-state PCA [69]

In the case of PCAs, the same uncertainties factor prevent the entangled-state PCA from making a perfect principal axis, while entangled-state PCA precisely finds the principal axis. Hence, this entanglement stuff makes the Supervised Machine Learning ultrasensitive to various fields in biological, thermal systems. On the other hand, machine learning is also used to determine the entanglement of systems—how much entangled they are. Jaffali and Oeding showed, mainly focusing on pure states in their paper how Artificial Neural Network can be trained to predict the entanglement type of a quantum state and distinguish them. This may help in processing quantum information, increasing the efficiency of the quantum algorithms, cryptographic schemes, etc. From the above works, we can see that by using the Quantum Entanglement we cannot only outperform the results of classical computers but it also requires less resources. The most motivating work is merging many-body quantum systems to machine learning using Tensor Network. Such an interdisciplinary field was recently strongly motivated, due to the exciting achievements in the so-called “quantum technologies”. Thanks to the successes in quantum simulations/computations, including the D Wave and the quantum computers by Google and others (“Quantum Supremacy”), it becomes unprecedentedly urgent and important to explore the utilization of quantum computations to solve machine learning tasks. The low demands on the bond dimensions, and particularly, on the size, permit to simulate machine learning tasks by quantum simulations or quantum computations in the near future.

11 Quantum Neural Network The following few subsections elaborate the merger of classical neural networks and quantum computing, producing a more powerful version of the former. In subsection A, we provide a brief introduction to classical neural networks. In subsection B, we discuss the previous works on quantum neural networks. Subsequent sections onwards, we describe the quantum neuron and its implementation to the quantum computer, and present the latest development on quantum convolutional neural networks.

Quantum Machine Learning: A Review and Current Status

131

11.1 Classical Neural Networks One of the most basic neural networks in classical deep learning is the deep feedforward networks, mathematically defined by a function y = f (x; θ), where x is the input n-dimensional vector, y is the output m-dimensional vector (m < n usually), and θ represents the parameters that guide the network to map x to y [70]. Neural networks are usually organized in layers (especially the hidden layers between the input layer from which propagation occurs to different hidden layers and the output layer to which propagation occurs from some hidden layer) to divide computation into steps. At each step (or hidden layer), some degree of nonlinearity is added, allowing the network to learn complicated functions. Training a network is choosing the combination of parameters θ that can map the input vectors x as closely as possible to the actual output vectors y [70]. Training involves the initial random choice of parameters, followed by gradual updates as the same vectors x are passed on to the network, and predictions by the network are compared to the actual, real outputs y externally provided to the network. This training happens through the classical procedures—gradient descent and backpropagation [70]. Several different types of architectures have been developed, for instance, Convolutional Neural Networks (CNNs), Long-Short Term Memory networks (LSTMs), Recurrent Neural Networks (RNNs), Generative Adversarial Neural networks (GANs), variational autoencoders and many more [70]. Together, they are able to drive the AI revolution, finding increasing applications to image and sound processing, self-driving cars, online recommender systems, reinforcement learning-based game playing bots, stock market price predictors, virtual assistants and several other applications in all walks of life. For further technical details on these, we refer to Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville [70].

11.2 Background Work in Quantum Neural Networks In a paper by Kak in 1995 [71], attempts were made to model a neural network in quantum computing, and discussions were presented on the versatility of the quantum neural computer with respect to the classical computers. In the same year, Menneer and Narayanan’s [72] work introduced a method based on a multi-universe interpretation of the quantum theory that made neural network training powerful than ever before—superposition of several single layered neural networks to form a bigger quantum neural network. Perus [73] used the quantum version of classical gradient descent, coupled with CNOT gates, to demonstrate the use of parallelism in quantum neural architectures. Menneer [74] did a comprehensive study of the contemporary NN architectures in his Ph.D. thesis. Faber et al. [75] addressed the question of implementation of an artificial neural network architecture on a quantum hardware. Schuld [76] gave guidelines for quantum neural network models: (1) ability of the

132

N. Mishra et al.

network to produce the binary output of length separated from the length of the binary input by some distance measure, (2) reflect some neural computing mechanisms and (3) utilize the quantum phenomenon and be fully consistent with the quantum theory. In the recent past, several advancements have been made to bridge the gap between classical and quantum deep learning. In 2014, Wiebe et al. [77] demonstrated the power of quantum computing over classical computing in deep learning and objective function optimization. Adachi et al. Recent research hasb demonstrated the superiority of quantum annealing (using D-Wave quantum annealing machine) to contrastive divergence based methods, and tested the same on preprocessed MNIST data set. Other notable works include quantum perception model [78], quantum neural networks based on Deutsch’s model of quantum computational network [79], quantum classification algorithm based on competitive learning neural network and entanglement measure [80], a novel autonomous perceptron model for pattern classification applications [81] and a quantum version of the generative adversarial networks [82] to name a few.

11.3 Quantum Neuron The current issue in quantum neural networks is the problem of introducing nonlinearity, as is the case in classical neural networks. Nonlinearity is central to learning complex functions, and thus efforts have been made to resolve this: use of quantum measurements Kak et al. [83] and Zak et al. [84], using dissipative quantum gates [83], and the idea that a quantum neural network based on the time evolution of the system is intrinsically nonlinear. A quantum neuron is strongly correlated to the actual neuron of the human system. The latter, based on the electrochemical signals received, either fires or not. Similar is the model of the classical neuron in deep learning. An input vector x (corresponding to the stimulus in humans) is combined with a set of weights θ (corresponding to the neurotransmitters), and the result of this combination determines whether the neuron fires or not. Mathematically, a n-dimensional input vector X = x1 , x2 , ..., xn is combined with weights θ = θ1 , θ2 , ..., θn to yield the combination: x1 θ1 + x2 θ2 + ... + xn θn + b where b is the bias added to the computation to incorporate functions not passing through the origin of the n−dimensional space considered here [85]. To introduce nonlinearity in the same, several activation functions are used, which have been shown to benefit neural network training [86]. Recent advances have explored learning activation functions for separate neurons using gradient descent [87], approximation of neural networks using ReLU as the activation function [88], and other conventional functions like the sigmoid function and the step function. The quantum equivalent of the classical neuron: the quantum neuron, is used to build the quantum neural networks, which benefit from the intrinsic property of quantum mechanics of storing co matrices and performing linear algebraic operations on those matrices conveniently Neukart et al. [89], Schuld et al. [90], Alvarez et al. [91], Wan et al. [92], Rebentrost et al. [93], Otterbach et al. [94], Lamata et al. [95].

Quantum Machine Learning: A Review and Current Status

133

To implement a quantum neuron, a set of n qubits is prepared and operated upon by some unitary transformation, and the result is prepared in a separate ancilla qubit that is then measured—the measurement being the decision whether the quantum neuron fires or not. Specific details follow as under. To encode a m dimensional classical input vector x, n qubits are used such that m = 2n , n < m, thereby exploiting the advantage of quantum information storage allowing the exponential reduction in the number of input nodes required. The following transformation is done on the input qubits: U |0⊗n = |ψ. Assuming the computational basis of the already defined n dimensional Hilbert space is |1, |2, |3, ..., |n, the input vector x and the weight vector θ can be defined in quantum terms as n 1 x j | j (18) |ψ = 1/2 n j=1 where x j is represents the usual jth component of the classical input vector x. Likewise, the weight vector θ can be encoded in the quantum realm as |φ =

n 1

n 1/2

θ j | j

(19)

j=1

where θ j represents the usual jth component of the classical weight vector θ. Tacchino et al. [96] defines a unitary operation that performs the inner product of the two terms defined above, and updates an ancilla qubit based on a multicontrolled NOT gate. The authors introduce a nonlinearity by performing a quantum measurement on the ancilla qubit.

11.4 Quantum Convolutional Neural Network Convolutional Neural Network is a type of deep neural network architecture motivated by the visual cortex of animals [97]. CNNs provide great power over a variety of tasks: object tracking [98], text detection and recognition [99], pose estimation [100], action recognition [101], scene labelling [102], saliency detection using multicontext deep learning [103]. Further review of deep convolutional deep learning is referred [104]. The power of CNNs arises from the several convolutional layers, pooling layers, followed by few densely, fully connected layers that help to reduce the huge size of various matrices of images to few hundred nodes which can then be used for the final output layer of a few nodes (for instance equal to the number of classes in a multi-classification problem). The weights are optimized by training on huge data sets fed into the network through multiple passes. CNNs also involve parameters that directly affect the parameters/weights, called the hyperparameters. Hyperparameters are fixed for specific networks based on experiments and comparisons across several models.

134

N. Mishra et al.

On the quantum side, neural networks have been used to study properties of quantum many-body systems, as in Carleo et al. [105], Nieuwenburg et al. [106], Maskara et al. [107], Zhang et al. [108], Carrasquilla et al. [109], Wang et al. [110], and Levine et. al. [111]. The use of quantum computers to enhance conventional machine learning tasks has gained traction in the modern world Biamonte et al. [112], Dunjko et al. [113], Farhi et al. [114], and Huggins et al. [115]. A QCNN circuit model has been proposed by Cong et al. [116]. The proposed model upper bounds the n input parameters by O(log(n)). Like conventional CNN, the authors continued the training on the quantum version of the mean squared error: MSE =

m 1 (yi − h(ψ))2 2M i=0

(20)

where yi is the actual output of the input state ψ, and h(ψ) is the computation done by the quantum network. The mean squared error tends to reduce the distance between the predicted value from the network. The authors discuss the efficient implementation of experimental platforms: regarding efficiently preparing quantum many-body input states, two-qubit gates application and projective measurements. With the success of the quantum convolutional neural networks, it is hoped that other conventional deep learning networks will be soon converted, thus increasing the range of quantum neural networks. To solve highly complex problems like quantum phase recognition (QPR), which probes if an arbitrary input quantum state ρin belongs to the particular quantum phase of matter, quantum error correction (QEC) optimization which probes for an optimal quantum error correcting code for a given, a priori unknown error model such as dephasing or potentially correlated depolarization, quantum convolution neural networks has stood out to be the best possible solution. These problems are intrinsically quantum in nature, and therefore, unsolvable by classical machine learning techniques that are in use today. The classical machine learning algorithms with large, deep neural networks can only solve classical problems such as image recognition; the quantum systems, however, involve exponentially large Hilbert spaces and are difficult to implement in a classical framework. Quantum algorithms are able to solve these problems, due to parallelism offered in the design of such algorithms; however, the limited coherence times of short-term quantum devices prevent the realization of large scale deep quantum neural networks.

12 Use of Artificial Neural Networks to Solve Many-Body Quantum Systems Studying the many-body quantum systems is one of the most challenging areas of research in physics. It is mainly due to the exponential complexity of the many-body wave function and the difficulty that arises in describing the non-trivial correlations

Quantum Machine Learning: A Review and Current Status

135

encoded in its wavefunction [105]. However, recently the use of neural networks for the variational representation of the quantum many-body states has generated a huge interest in this field [117–119]. This representation was first introduced by Carleo and Troyer in 2016, in which they had used the Restricted Boltzmann Machine (RBM) architecture with a variable number of hidden neurons. Using this procedure, they could find the ground state of the transverse-field Ising (TFI) and the antiferromagnetic Heisenberg (AFH) models with high accuracy [105]. Moerover, they could also describe the unitary time evolution of complex interacting quantum systems. Since then neural networks have been extensively used to study various physical systems. The representational power and the entanglement properties of the RBM states have been investigated, and the RBM representation of different systems such as the Ising Model, Toric code, graph states and stabilizer codes have been constructed [117]. Also, the representational power of the other neural network architectures such as the Deep Boltzmann Machine (DBM) [120] are under active investigation.

12.1 Variational Representation of Many-Body Systems in RBM Networks A neural network is an architecture that uses its parameters to represent the quantum state of a physical system [105]. The Restricted Boltzmann Machine architecture consists of a visible layer of N neurons and a hidden layer of M neurons. The neurons of the visible layer and hidden layers are connected but there are no intralayer connections. As the spin of the neurons in the RBM network can have the values ±1, the spins of the neurons of the visible layer can be mapped to the spins of the physical system they represent. Moreover a set of weights is assigned to the visible (ai for the i th visible neuron), hidden (bi for the i th hidden neuron) and to the couplers connecting them (Wi j for the coupler connecting the i th visible neuron with the jth hidden neuron) [121]. Then, wave function ansatz for the N-dimensional N would be given by [105, quantum state of spin variable configuration S = {si }i=1 121].    e i ai si + j b j h j + i j Wi j si h j (21) ψ M (S, W) = {h i }

where si and h i denote the spins of the visible and hidden neurons, respectively, and the whole state is given by the superposition of all the spin configuration states with ψ(S) as the amplitude of the |S state [121, 122]. |ψ =

S

ψ(S) |S

(22)

136

N. Mishra et al.

13 Classical Simulation of Quantum Computation Using Neural Networks Since the neural networks are able to represent various quantum states efficiently, a natural question to be posed is whether they can also simulate various quantum algorithms. Interestingly, the networks are also able to simulate the action of various quantum gates. This has been investigated in the DBM and the RBM architectures [120, 123]. As mentioned before, the representation of a quantum state by a neural network depends on its network parameters. Thus, the action of various gates can be simulated by appropriately changing the network parameters in a way that the new quantum state represented by the network with the new parameters is the same as would have been obtained by applying the quantum gate to the initial quantum state. Also in a recent work, the methods to prepare specific initial states in RBM analogous to those used as initial states while implementing a quantum algorithm in a quantum circuit model has been discussed [121]. The prepared states were shown to efficiently simulate the action of the Pauli X gate. These results have opened up a great possibility of solving various quantum mechanical problems using neural networks. Future investigations in this direction may include the implementations of quantum algorithms in various neural network architectures and the exploitation of the machine learning techniques to achieve higher accuracy in solving the quantum mechanical problems [121].

14 Implementation of Quantum Machine Learning Algorithms on Quantum Computers In this section, we discuss the implementation of some quantum machine learning algorithms with the help of quantum logic and quantum gates. Earlier this year H. Liu’s [124] paper first proposed a quantum algorithm to obtain the classical gradients. can be regarded as the inner product of two vecj j tors ( p(x1 ; w) − y1 , ..., p(x N ; w) − y N ) and (x1 , ..., x N ). To achieve this, their quantum algorithm consists of two steps: generate an intermediate quantum state √1  N |i| p(x i ; w) based mainly on amplitude estimation; (2)perform swap test N i=1 to obtain w j in the classical form. Then the parameters w is updated according to the iterative rules via simple calculations. The entire algorithm process is shown in Fig. 13. (1) Generate an intermediate quantum state (1.1) Preparing three quantum registers in the state |0 log N  |0 |0log M  and perform the Hadamard gates H ⊗ log N on the first register, then the system becomes 1 N |i|0|0log M  (23) √ i=1 N

Quantum Machine Learning: A Review and Current Status

137

Fig. 13 The structure of a restricted Boltzmann machine [121]. The spin configuration of the N N (s is the spin value of the i h neuron). Also, there are M visible neurons is represented by {si }i=1 i t hidden neurons in one more layer (called the hidden layer). The coupler connecting the i th visble neuron with the jth hidden neuron has the weight Wi j . However, there are no intra-layer connections. The wavefunction ansatz for the system represented by this network is given by Eq. 21

Fig. 14 The whole process of the entire algorithm for quantum logistic regression [124]

where i is represented in binary as i 1 , i 2 , . . . , i log N . (1.2) Perform H on the second register (Fig. 14) 1 1 √ |i √ (|0 + |1)|0log M  N 2

(24)

138

N. Mishra et al.

Fig. 15 Quantum circuit diagram to generate an immediate quantum state [124]

In the above circuit, where  is the precision, η is the step factor, w(0) is the initial value, t = 0, 1, 2, . . . is the iteration number, T is the data set. The brown rectangle represents the quantum algorithm, and the light purple rectangle represents the classical iterative update algorithm (Fig. 15). In the above circuit, A denotes the 2 sin 2(π) − 1 gate to estimate xi |w. s is the number qubits for estimating xi |w., B, U are the unitary gates to access ||w||,||xi ||, 1 gate to obtain | p(xi ; w). respectively. C represents the 1+exp(−kx) In addition to the above quantum machine learning algorithm implementation, there have been other algorithms that have been implemented on quantum computers as well, for example, S. Lloyd [125] and the team have work Quantum Hopfield neural network which uses qHop, encoding data in order log2 (d) qubits.

15 Conclusion and Future Works In this review, we tried to compile the effects that quantum computers are having and will have on machine learning. While only a few years ago, most of the research works in these fields were largely theoretical, we now have demonstrable quantum machine learning algorithms. And as expected these algorithms are significantly faster and more effective than their classical counterparts. This amalgamation of machine learning and quantum computers allows us to run classical algorithms significantly faster in many cases. The effect that quantum computers can have on machine learning is extremely vast. As quantum computers with larger number of qubits are realized, we will be able to test more quantum algorithms and then truly access the effect that quantum computers will end up having on machine learning. Many of realized on an actual quantum computer due to the large number of qubits they require. But as research in this field progresses, we shall have better quantum computers and better algorithms to solve our machine learning problems. It is always possible that a more effective algorithm to solve machine learning problems is yet to be discovered. This is still one of the biggest problems while working with quantum computers since

Quantum Machine Learning: A Review and Current Status

139

quantum algorithms are often unintuitive and it may take a lot of time to discover a better algorithm. Using quantum computers, we are able to implement classical machine learning classifiers for better, faster and accurate classification. While only time can tell the true effect that quantum computers will have on machine learning the possibilities seem endless and with every new algorithm machine learning seems to be something that can be definitely improved upon by quantum computers. In our society where huge amounts of data is collected and needs to be processed every minute, and where new and novel research methods can have huge impacts on both life and economy quantum machine learning definitely seems to be a methodology that will lead to a better future. Here, we discussed a number of different methods of quantum machine learning algorithms. The research in this field is still mostly on a theoretical level, with few dedicated experiments on the demonstration of the usefulness of quantum mechanics for machine learning and artificial intelligence. Quantum computation exhibits promising applications in machine learning and data analysis with much more advance in time and space complexity. Experimental verification of quantum algorithms requires dedicated quantum hardware, and that is not presently available. It is hoped that the research community shall soon have access to scale-scale quantum computers (500–1000 qubits) via quantum cloud computing ( ’Qloud’ ). However, the world shall still continue to see advancements, in size and complexity, of special-purpose quantum information processors: quantum annealers, nitrogen vacancy centres (NV)-diamond arrays, made to order superconducting circuits, integrated photonic chips and qRAM. Moreover, silicon-based integrated photonics have been used to construct programmable quantum optic arrays having around 100 tunable interferometers; however, as the circuits are scaled up, we experience the loss of quantum effects. There is yet not a general theory that can be used to analyze and engineer new quantum machine learning algorithms. Further, there are still some interesting questions yet to be unanswered, and problems yet to be solved. One of the main problems is that the proposed implementations are limited in the quantity of input data they can handle. Since the dimension of Hilbert space increases as the size does, if we permit to store a huge quantity of data, we must ensure accurate and efficient initialization of the quantum system. This is problematic because ensuring this is difficult and we need a large amount of data for reliable training. In addition to this, we have the problem of retaining memory. In machine learning, we take for granted the memory of the network, which is able to remember the weights of the network. But obtaining this memory in quantum dynamics is very difficult due to the unitary evolution of the system. Another problem pertains to quantum annealers: to improve connectivity and implement more general tunable couplings between qubits. Another problem is that irrespective of the ability of quantum algorithms to provide speed-ups in data processing, there is no advantage to be taken in reading data; on the other hand, the latter sometimes overshadows the speed-up offered by certain algorithms. There is yet another problem relating to obtaining a full solution from quantum algorithms. 
This is because obtaining a string of bits requires observing an exponential number of bits, rendering few applications of quantum machine learning infeasible. A poten-

140

N. Mishra et al.

tial solution for this problem is to only learn the summary statistics for the solution state. The main challenge is the costing. The theoretical bounds on the complexity suggest that for sufficiently large problems, huge advantages can be extracted; it is unclear, however, when that crossover point occurs. Another problem is the benchmarking problem. To properly assert the supremacy of a quantum algorithm, it must be shown better than known classical machine learning algorithms. This requires extensive benchmarking against modern heuristic methods. We can avoid a few of the above said problems by applying quantum machine learning for a system which follows the principle of quantum mechanics rather than applying on classical data. An aim can be to use quantum machine learning to control quantum computers, enabling an innovative cycle like that in classical computation, where each generation of processors is used to design processors of the next generation. For quantum algorithms for linear algebra, their practical performance is hindered due to issues for data access and restrictions on problem classes that can be solved. The true potential of such techniques can only be evaluated by near future advances in quantum hardware development. Given the great majority of QML literature developed within the quantum community, further advances can come after significant interactions between the two communities. This is the reason behind us structuring this review to be familiar to both quantum scientists and ML researchers. To achieve this goal, we put great emphasis on the computational aspects of ML. Acknowledgments A.P.D. acknowledges the support of KVPY fellowship. S.R. and S.C. acknowledge the support of DST Inspire fellowship. B.K.B acknowledges the support of IISER-K Institute fellowship.

References 1. P.W. Shor, Algorithms for quantum computation: discrete logarithms and factoring, Proceedings 35th Annual Symposium on Foundations of Computer Science (IEEE Comput. Soc. Press, 1994) 2. J. Bermejo-Vega, K.C. Zatloukal, Abelian Hypergroups and Quantum Computation (2015). arXiv:1509.05806 3. R.D. Somma, Quantum simulations of one dimensional quantum systems. Quantum. Inf. Comput. 16, 1125 (2016) 4. S.J. Russell, P. Norvig, Artificial Intelligence: A Modern Approach (Pearson Education Limited, Malaysia, 2016) 5. O. Bousquet, U.V. Luxburg, G. Ratsch, (eds.), Advanced Lectures on Machine Learning: ML Summer Schools 2003 (Canberra, Australia, 2003; Tubingen, Germany, 2003). Revised Lectures (Springer, 2011), p. 3176 6. L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285 (1996) 7. P. Rebentrost, M. Mohseni, S. Lloyd, Quantum support vector machine for big data classification. Phys. Rev. Lett. 113, 130503 (2014) 8. H. Abdi, L.J. Williams, Principal component analysis. Wil. Inter. Rev.: Comput. Stat. 2, 433 (2010) 9. R. Orus, S. Mugel, E. Lizaso, Quantum computing for finance: overview and prospects. Rev. Phys. 4, 100028 (2019)

Quantum Machine Learning: A Review and Current Status

141

10. M.S. Palsson, M. Gu, J. Ho, H.M. Wiseman, G.J. Pryde, Experimentally modeling stochastic processes with less memory by the use of a quantum processor. Sci. Adv. 3 (2017) 11. J. Li, S. Kais, Entanglement classifier in chemical reactions. Sci. Adv. 5, eaax5283 (2019) 12. C.H. Bennett, F. Bessette, G. Brassard, L. Salvail, J. Smolin, Experimental quantum cryptography. J. Crypt. 5, 3 (1992) 13. A. Pathak, Elements of Quantum Computation and Quantum Communication (CRC Press, 2018) 14. S. Shahane, S. Shendye, A. Shaikh, Implementation of artificial neural network learning methods on embedded platform. Int. J. Electric. Electron. Comput. Syst. 2, 2347 (2014) 15. T.M. Mitchell, Machine Learning (McGraw-Hill, 2006) 16. D. Angluin, Computational learning theory: Survey and selected bibliography, in Proceedings of 24th Annual ACM Symposium on Theory of Computing (1992), pp. 351–369 17. L. Valiant, Commun. ACM 27(11), 1134–1142 (1984) 18. V.N. Vapnik, A. Chervonenkis, On uniform convergence of relative frequencies of events to their probabilities. Theor. Prob. Appl. 16, 264 (1971) 19. M. Schuld, I. Sinayskiy, F. Petruccione, An introduction to quantum machine learning. arXiv:1409.3097v1 20. S. Lloyd, M. Mohseni, P. Rebentrost, Quantum algorithms for supervised and unsupervised machine learning. arXiv:1307.0411 21. E. Farhi, H. Neven, Classification with quantum neural networks on near term processors. arXiv:1802.06002v2 22. S. Lu, S.L. Braunstein, Quantum decision tree classifier. Quantum. Inf. Process (2013) 23. S. Lloyd, Quantum algorithm for solving linear systems of equations. APS March Meeting Abstracts (2010) 24. S. Aaronson, Read the fine print. Nat. Phys. 11(4), 291 (2015) 25. A.M. Childs, R. Kothari, R.D. Somma, SIAM J. Comput. 46, 1920 (2017) 26. B.D. Clader, B.C. Jacobs, C.R. Sprouse, Preconditioned quantum linear system algorithm. Phys. Rev. Lett. 110.25, 250504 (2013) 27. D. Dervovic et al., Quantum Linear Systems Algorithms: A Primer (2018). arXiv:1802.08227 28. S. Dutta et al., Demonstration of a Quantum Circuit Design Methodology for Multiple Regression (2018). arXiv:1811.01726 29. E. Tang, A quantum-inspired classical algorithm for recommendation systems, in Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (ACM, 2019) 30. E. Tang, Quantum-inspired classical algorithms for principal component analysis and supervised clustering (2018). arXiv:1811.00414 31. A. Gilyén, S. Lloyd, E. Tang, Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension (2018). arXiv:1811.04909 32. N.-H. Chia, H.-H. Lin, C. Wang C, Quantum-inspired sublinear classical algorithms for solving low-rank linear systems (2018). arXiv:1811.04852 33. E. Tang, An overview of quantum-inspired classical sampling (2019). https://ewintang.com/ blog/2019/01/28/an-overview-of-quantum-inspired-sampling/ 34. E. Tang, Some settings supporting efficient state preparation (2019). https://ewintang.com/ blog/2019/06/13/some-settings-supporting-efficient-state-preparation/ 35. S. Nowozin, C.H. Lampert, Structured learning and prediction in computer vision. Found. Trends Comput. Graph. Vis. 6, 185–365 (2011) 36. M.N. Wernick, Y. Yang, J.G. Brankov et al., Machine learning in medical imaging. IEEE 27, 25–38 (2010) 37. B.J. Erickson, P. Korfiatis, Z. Akkus, T.L. Kline, Machine learning for medical imaging. Rad. Graph. 37, 505–515 (2017) 38. A. Lavecchia, Machine-learning approaches in drug discovey: methods 5nd applications. Sci. Dir. 20, 318–331 (2015) 39. C. Bahlmann, B. 
Haasdonk, H. Burkhardt, Online handwriting recognition with support vector machines—a kernel approach, in Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition (2002)

142

N. Mishra et al.

40. L. Chen, C. Ren, L. Li et al., A comparative assessment of geostatistical, machine learning, and hybrid approaches for mapping topsoil organic carbon content. Int. J. Geo-Inf. 8(4), 174 (2019) 41. Z. Li, X. Lui, N. Xu, J. Du, Experimental realization of a quantum support vector machine. Phys. Rev. Lett. 114, 140504 (2015) 42. I. Kerenidis, A. Prakash, D. Szilágyi, Quantum algorithms for second-order cone programming and support vector machines (2019). arXiv:1908.06720 43. A.K. Bishwas, A. Mani, V. Palade, Big data quantum support vector clustering (2018). arXiv:1804.10905 44. Arodz, T., Saeedi, S. Quantum sparse support vector machines (2019). arXiv:1902.01879 45. C. Ding, T. Bao, H. Huang, Quantum-inspired support vector machine (2019). arXiv:1906.08902 46. D. Anguita, S. Ridella, F. Rivieccio, R. Zunino, Quantum optimization for training support vector machines. Neural Netw. 16, 763–770 (2003) 47. B.J. Chelliah, S. Shreyasi, A. Pandey, K. Singh, Experimental comparison of quantum and classical support vector machines. IJITEE 8, 208–211 (2019) 48. M. Schuld, A. Bacharov, K. Svore, N. Wiebe (2018). arXiv:1804.00633v1 49. E. Farhi, H. Neven (2018). arXiv:1802.06002v2 50. M. Schuld, I. Sinayskiy, F. Petruccione, An introduction to quantum machine learning. Cont. Phys. 56, 172–185 (2015) 51. E. Grant, M. Benedetti, S. Cao, A. Hallam, J. Lockhart, V. Stojevic, A. Green, S. Severini, npj Quantum Inf. 4, 65 (2018) 52. Y.-Y. Shi, Y.-Y., Duan, L.-M., and Vidal,G., Classical simulation of quantum many-body systems with a tree tensor network, Phys. Rev. A 74, 022320 (2006) 53. G. Vidal, Class of quantum many-body states that can be efficiently simulated. Phys. Rev. Lett. 101, 110501 (2008) 54. L. Cincio, J. Dziarmaga, M.M. Rams, Multiscale entanglement renormalization ansatz in two dimensions: quantum ising model. Phys. Rev. Lett. 100, 240603 (2008) 55. G. Evenbly, G. Vidal, Entanglement renormalization in noninteracting fermionic systems. Phys. Rev. B 81, 235102 (2010) 56. D. Turkpençe et al., A Steady state quantum classifier. Phys. Lett. A 383, 1410 (2019) 57. H.P. Breuer, F. Petruccione, The Theory of Open Quantum Systems (Oxford University Press, Oxford, 2007) 58. W.O. Krawec, M.G. Nelson, E.P. Geiss, Automatic generation of optimal quantum key distribution protocols, in Proceedings of Genetic and Evolutionary Computation (New York, ACM, 2017), p. 1153 59. P.B. Wigley, P.J. Everitt, A. van den Hengel, J.W. Bastian, M.A. Sooriyabandara, N.P. McDonald, G.D. Hardman, K.S. Quinlivan, C.D. Manju, P. Kuhn, C.C.N. Petersen, I.R. Luiten, A.N. Hope, J.J. Robins, M.R. Hush, Fast machine-learning online optimization of ultra-cold-atom experiments. Sci. Rep. 6, 25890 (2016) 60. A. Einstein, B. Podolsky, N. Rosen, Can quantum-mechanical description of physical reality be considered complete? Phys. Rev. 47, 777 (1935) 61. G. Cristian Romero. https://www.qutisgroup.com/wp-content/uploads/2014/10/TFGCristian-Romero.pdf 62. X.D. Cai, D. Wu, Z.-E. Su, M.C. Chen, X.L. Wang, L. Li, N.L. Liu, C.Y. Lu, J.W. Pan, Entanglement-based machine learning on a quantum computer. Phys. Rev. Lett. 114, 110504 (2015) 63. Y. Liu, X. Zhang, M. Lewenstein, S.J. Ran, Entanglement-guided architectures of machine learning by quantum tensor network (2018). arXiv:1803.09111 64. Y. Levine, O. Sharir, N. Cohen, A. Shashua, Quantum entanglement in deep learning architectures. Phys. Rev. Lett. 122, 065301 (2019) 65. B. Farley, W. Clark, Simulation of self-organizing systems by digital computer. Trans. IRE Profess. Gr. on Inf. Theor. 
4, 76 (1954)

Quantum Machine Learning: A Review and Current Status

143

66. P. Smolensky, Chapter 6: information processing in dynamical systems: foundations of harmony theory, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1 67. D.L. Deng, X. Li, S.D. Sarma, Quantum entanglement in neural network states. Phy. Rev. X 7, 021021 (2017) 68. L. Susskind, J. Lindesay, An introduction to black holes, information and the string theory revolution: the holographic universe. World Sci. 200 (2004) 69. Q. Zhuang, Z. Zhang, Supervised learning enhanced by an entangled sensor network (2019). arXiv:1901.09566 70. I. Goodfellow, Y. Bengio, A. Courville, Deep learning. Gen. Program. Evolv. Machin. 19, 305 (2018) 71. S.C. Kak, Quantum neural computing. Adv. Imag. Elect. Phys. 94, 259 (1995) 72. T. Menneer, A. Narayanan, Quantum-inspired neural networks. Tech. Rep. R329 (1995) 73. M. Perus, Neuro-quantum parallelism in brain-mind and computers. Informatica 20, 173 (1996) 74. T. Menneer, Quantum artificial neural networks, Ph.D. thesis, University of Exeter (1998) 75. Faber, J., and Giraldi, G. A., Quantum Models for Artificial Neural Networks, LNCC–National Laboratory for Scientific Computing 76. M. Schuld, I. Sinayskiy, F. Petruccione, The quest for a quantum neural network. Quantum Inf. Process. 13, 2567 (2014) 77. N. Wiebe, A. Kapoor, K.M. Svore, Quantum deep learning (2014). arXiv:1412.3489 78. M.V. Altaisky, Quantum neural network (2000) arxiv:quant-ph/0107012 79. S. Gupta, R.K.P. Zia, Quantum neural networks. J. Comput. Sys. Sci. 63, 355 (2001) 80. M. Zidan, A.-H. Abdel-Aty, M. El-shafei, M. Feraig, Y. Al-Sbou, H. Eleuch, M. AbdelAty, Quantum classification algorithm based on competitive learning neural network and entanglement measure. Appl. Sci. 9, 1277 (2019) 81. A. Sagheer, M. Zidan, M.M. Abdelsamea, A novel autonomous perceptron model for pattern classification applications. Entropy 21, 763 (2019) 82. S. Lloyd, C. Weedbrook, Quantum generative adversarial learning. Phys. Rev. Lett. 121, 040502 (2018) 83. S. Kak, On quantum neural computing. Inf. Sci. 83, 143 (1995) 84. M. Zak, C.P. Williams, Quantum neural nets. Int. J. Theor. Phys. 37, 651 (1998) 85. Y. Cao, G.G. Guerreschi, A. Aspuru-Guzik, Quantum neuron: an elementary building block for machine learning on quantum computers (2017). arXiv:1711.11240 86. S. Hayou, A. Doucet, J. Rousseau, On the impact of the activation function on deep neural networks training (2019). arXiv:1902.06853 87. F. Agostinelli, M. Hoffman, P. Sadowski, P. Baldi, Learning activation functions to improve deep neural networks (2015). arXiv:1412.6830 88. I. Daubechies, R. DeVore, S. Foucart, B. Hanin, G. Petrova, Nonlinear approximation and (Deep) ReLU networks (2019). arXiv:1905.02199 89. F. Neukart, S.A. Moraru, On quantum computers and artificial neural networks. Sig. Process. Res. 2, 1 (2013) 90. M. Schuld, I. Sinayskiy, F. Petruccione, Simulating a perceptron on a quantum computer. Phys. Lett. A 7, 660 (2015) 91. U. Alvarez-Rodriguez, L. Lamata, P.E. Montero, J.D. Martín-Guerrero, E. Solano, Supervised quantum learning without measurements. Sci. Rep. 7, 13645 (2017) 92. K.H. Wan, O. Dahlsten, H. Kristjánsson, R. Gardner, M.S. Kim, Quantum generalisation of feedforward neural networks. npj Quantum Inf. 3, 36 (2017) 93. P. Rebentrost, T.R. Bromley, C. Weedbrook, S. Lloyd, Quantum Hopfield neural network. Phys. Rev. A 98, 042308 (2018) 94. J.S. Otterbach, et al., Unsupervised machine learning on a hybrid quantum computer (2017). arXiv:1712.05771 95. L. 
Lamata, Basic protocols in quantum reinforcement learning with superconducting circuits. Sci. Rep. 7, 1609 (2017)

144

N. Mishra et al.

96. F. Tacchino, C. Macchiavello, D. Gerace, D. Bajoni, An artificial neuron implemented on an actual quantum processor. npj Quantum Inf. 5, 26 (2019) 97. D.H. Hubel, T.N. Wiesel, Receptive fields and functional architecture of monkey striate cortex. J. Physiol. (1968) 98. J. Fan, W. Xu, Y. Wu, Y. Gong, Human tracking using convolutional neural networks. IEEE Trans. Neural Netw. 21, 1610 (2010) 99. M. Jaderberg, A. Vedaldi, A. Zisserman, Deep features for text spotting, in European Conference on Computer Visions (2014) 100. A. Toshev, C. Szegedy, Deep-pose: human pose estimation via deepneural networks. CVPR (2014) 101. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, Decaf: a deep convolutional activation feature for generic (2014) 102. C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling. PAMI (2013) 103. R. Zhao, W. Ouyang, H. Li, X. Wang, Saliency detection by multicontext deep learning, in CVPR (2015) 104. N. Aloysius, M.A. Geetha, Review on deep convolutional neural networks, in International Conference on Communication and Signal Processing (India, 2017) 105. G. Carleo, M. Troyer, Solving the quantum many-body problem with artificial neural networks. Science 355, 602 (2017) 106. E.P. Van Nieuwenburg, Y.H. Liu, S.D. Huber, Learning phase transitions by confusion. Nat. Phys. 13, 435–439 (2017) 107. N. Maskara, A. Kubica, T. Jochym-O’Connor, Advantages of versatile neural-network decoding for topological codes. Phys. Rev. A 99, 052351 (2019) 108. Y. Zhang, E.A. Kim, Quantum loop topography for machine learning. Phys. Rev. Lett. 118, 216401 (2017) 109. J. Carrasquilla, R.G. Melko, Machine learning phases of matter. Nat. Phys. 13, 431–434 (2017) 110. L. Wang, Discovering phase transitions with supervised learning. Phys. Rev. B 94, 195105 (2016) 111. Y. Levine, N. Cohen, A. Shashua, Quantum entanglement in deep learning architectures. Phys. Rev. Lett. 122, 065301 (2019) 112. J. Biamonte et al., Quantum machine learning. Nature 549, 195–202 (2017) 113. V. Dunjko, J.M. Taylor, H.J. Briegel, Quantum-enhanced machine learning. Phys. Rev. Lett. 117, 130501 (2016) 114. E. Farhi, H. Neven, Classification with quantum neural networks on near term processors (2018). arXiv: 1802.06002 115. W. Huggins, P. Patil, B. Mitchell, K.B. Whaley, E.M. Stoudenmire, Towards quantum machine learning with tensor networks. Quantum Sci. Tech. 4, 024001 (2018) 116. I. Cong, S. Choi, M.D. Lukin, Quantum convolutional neural networks. Nat. Phys. 15, 1273 (2019) 117. Z.A. Jia, B. Yi, R. Zhai, Y.-C. Wu, G.-C. Guo, G.-P. Guo, Quantum neural network states: a brief review of methods and applications. Adv. Quantum Tech. 1800077 (2019) 118. C. Monterola, C. Saloma, Solving the nonlinear schrodinger equation with an unsupervised neural net-work. Opts. Exp. 9, 72 (2001) 119. C. Caetano, J. Reis Jr., J. Amorim, M.R. Lemes, A.D. Pino Jr., Using neural networks to solve nonlinear differential equations in atomic and molecular physics. Int. J. Quantum Chem. 111, 2732 (2011) 120. X. Gao, L.-M. Duan, Efficient representation of quantum many-body states with deep neural networks. Nat. Comm. 8, 662 (2017) 121. A.P. Dash, S. Sahu, S. Kar, B.K. Behera, P.K. Panigrahi, Explicit demonstration of initial state construction in artificial neural networks usingNetKet and IBM Q experience platform. ResearchGate- (2019). https://doi.org/10.13140/RG.2.2.30229.17129

Quantum Machine Learning: A Review and Current Status

145

122. B. Gardas, M.M. Rams, J. Dziarmaga, Quantum neural networks to simulate many-body quantum systems. Phys. Rev. B 98, 184304 (2018) 123. J. Bjarni, B. Bela, G. Carleo, Neural-network states for the classical simulation of quantum computing (2018). arXiv:1808.05232v1 124. H. Liu, C. Yu, S. Pan, S. Qin, F. Gao, Q. Wen (2019). arXiv:1906.03834v2 [quant-ph] 125. P. Rebetrost, T.R. Bromley, C. Weedbrook, S. Lloyd, Quantum hopfield neural network. Phy. Rev. A 98, 042308 (2018)

Survey of Transfer Learning and a Case Study of Emotion Recognition Using Inductive Approach Abhinand Poosarala and R Jayashree

Abstract In the era of rapid processing, there is a need for application developments that work with different datasets. The novel learning algorithms designed to handle different training and testing sets. Transfer learning allows domains, tasks, and distributions used in training and testing to be different. The concept of Transfer Learning [TL] is motivated by the detail that we can smear knowledge learned formerly to resolve new complications with ease, reusing the knowledge domain. In this paper, we would like to present inductive TL mechanisms to predict emotions from two different languages English and German, where knowledge of emotion identification exists in English language and is extended to learn emotions in German speech. We have attempted to uncover latent features of one language in another language. Keywords Emotion · Recognition · Speech · Transfer learning

1 Introduction The knowledge transfer is a viable solution in the classification process of predefined categories. Reinforcement learning in the machine learning literature leads to transfer learning as shown in Fig. 1. In TL, the main hypothesis is to achieve a classifier, which is scale-invariant [3], and use algorithms that improve the efficiency even in the presence of slight error in constraints with testing data [4]. In an unsupervised TL setting, self-taught clustering use irrelevant unlabeled data to achieve the clustering of target data to a minor extent with useful partitions which are produced by certain constraints [1]. In these clustering algorithms, the constraint set can be carefully pruned to achieve feasibility, given a target domain [2]. The major scenarios in mapping source and target are discussed below. Scenario 1: In unlabeled data, the intrinsic geometric structure can be identified. The separability factor between the classes from unlabeled data is difficult to A. Poosarala (B) · R. Jayashree PESU, Bangalore, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_9

147

148

A. Poosarala and R. Jayashree

Fig. 1. Machine learning to TL

maximize. In these situations, learning from similar unlabeled data whose separability factor is known will be less time consuming [5–7]. But finding the mapping between solution achieved unlabeled data (source training) to solution to be found unlabeled data (target training) is the gist of TL. The intrinsic structure of the English speech recognition process can be used to learn any natural language as shown in Fig. 1. Some of the applications of TL are classifying a given web document into several predefined categories, recalibration effort, training a classifier on the reviews, reducing the effort for annotating reviews for various products, and adapting a classification model that is trained on some products to help to learn classification models for some other products. In such cases, transfer learning can save a significant amount of labeling effort [8, 9]. Scenario 2: When the application requires a combination of supervised and semisupervised methods that is, to deal portions of labeled and unlabeled data, learning separability factor using clustering plays an important role. In many applications, the local learning parameters in supervised problems are successful and the aim of TL is to incorporate clustering to unlabeled data [10]. The iterative nature of the algorithm makes the relationship between subspace selection and clustering less obvious and challenging. Co-clustering based classification will propagate knowledge of indomain to out-of-domain [27]. Hard and fuzzy clustering methods examine overlapping features of labeled examples in the source set with unlabeled data in the target set [28]. Scenario 3: In many cases, unlabeled training data is from different distribution as related to labeled data. Here instead of separability, similarity between the distribution factors have to be dealt with and normalized. Risk bound should be studied before implementing transfer learning. The human hypothesis leads to learning similar tasks easily, which needs to be artifact in TL [11–13]. In many learning target domains, sampling bias can be dealt with kernel mean matching technique, which is matching

Survey of Transfer Learning and a Case Study …

149

the distribution of training and test sample space. While matching the source and target distribution type regarding data, risk bounds can be analyzed in kernel Hilbert space with a natural cross validation procedure [15, 20, 22, 24, 26]. The classification in covariate shift, where training instance distribution will differ arbitrarily from test distribution requires risk bound analysis [21, 25]. Scenario 4: When domains of source and target do not match, similar structure methods using high level abstraction to be decided. In natural language processing, frequently occurred problem is domain adaptation, which is the scarcity of labeled data in target domains [15]. Structural correspondence learning methods will automate similarities in different domains, for example, logistic regression [16, 17, 23]. Sparsely-connected neural networks are used for the structural representation of tasks which later can be used for mining frequent subgraphs [29]. The characteristics of good classifiers are listed after performing many auxiliary classifications with unlabeled data that lead to predictive structures. Training examples can be selected to accomplish entity tagging knowledge [30–32]. A worthy feature depiction is a vital aspect in the feat of domain adaptation [55]. It is unlikely to obtain synthetic data by direct measurement. The data is applicable in situations to filter information. Real data without delay contribute to streaming and online modeling. TL is effective when domains are synthetic and real data sets [34]. A well performing model is studied from source data distribution and applied on different target data distribution. Domain adaptation is the observing target domain which is slightly better than the source domain [35]. The different contexts are unsupervised, semi-supervised, and supervised. Efficient algorithms do not exist for intractable problems. The procedure that provides a solution is the brute-force search. The approach fails when the problem is truly random. Intractable search problem can be reduced to approximate search algorithm like greedy search [36]. Approximate search improves the basic procedure significantly. Scenario 5: The major concept of TL can be accomplished if the learning algorithm performs alternatively between supervised and unsupervised domains. Learning is common across task representations in unsupervised domain. The depiction is used to investigate cognitive tasks such as decision-making and categorization. In supervised learning, learning happens about task specific functions [37]. Precise relationships in input data effectively generate output data. The regression and classification algorithms include logistic regression, support vector machines, neural networks, and random forests. A generative model is used to capture variations in data due to concealed task variables. To identify and solve multiple related tasks, the probabilistic framework can be built using latent independent components which signify relatedness [38]. For instance, A deep rendering mixture model can be used as an image generator. The algorithm is effective if it performs across a range of inputs and applications. In TL, the knowledge of framework structure will lead to learning new tasks with generalization in performance [39]. The model does not encounter deprivation on unseen data from the same distribution. In the feature set, the relevance of meta features have to be predicted. The relevance to be calculated as a function of meta priors [41]. 
It is possible to simultaneously learn the meta priors represented as hyperparameters and feature weights from a collaborative prediction tasks that share

150

A. Poosarala and R. Jayashree

similar relevance structure. The feature selection algorithms consider all candidate features as equally likely, a priori. For example, meta features can be characterized as its value or price in a product distribution scenario. A good generalization is required for many applications with limited labeled data. In such conditions, Bayesian learning methods require informative prior, which will encode valuable domain knowledge, leading to discrete supervised learning [42]. The inference is important in dynamic analysis of the sequence of data. The posterior probability is proportional to its prior probability and recently attained likelihood. Scenario 6: When we consider various datasets, these datasets mutually reinforce a common representation with latent features for various classifiers used. Therefore, discrimination formalism with maximum entropy leads to a learning approach for multitask representation. This leads to integrate many algorithms to arrive at a unified model. Thus, the algorithm maintains global solution properties suitable for TL. Example support vector machines with optimal combination kernels [43]. The provision of formal task relatedness framework, with a sub-domain of issues will be helpful in multiple task learning approaches [44]. A manifold semi-supervised alignment can evolve between source and target domains by dimensionality reduction and not only by finding relatedness with training samples [45]. A powerful formalism for learning in relational domains is Markov logic networks, which revises the predicate mapping structure for accuracy improvement [56]. Long-range clauses matched with short-range clauses or a method of approximate transitions of state to be modeled [58, 59]. Structural regularities of source domain in the form of Markov predicate variables have to be mapped with the target domain. [60]. One of the powerful tools in TL is the inductive logic technique, to learn target declarative bias using auxiliary data or by finding a suitable kernel in the set of predefined kernels with convex optimization [61, 62, 64]. Scenario 7: Pseudo-alignment process can be applied for totally unmatched domains. This is an alternative for domain adaptation. When TL is applied to machine translation, the document alignment can be achieved from finding exact translations with pseudo alignment in multilingual corpora. Each language will be provided with a topic-based graph and alignment between different languages, start from lower dimensional space to obtain cross lingual information analysis where the source and target domains are completely different [46]. The parameters learnt by multiple tasks with the same Gaussian process priors lead to efficient TL nonparametric informative machine learning algorithm [47, 48]. The shared covariance function learnt should be in free form to allow good flexibility to model inter task dependencies by circumventing the need for huge training samples [49, 50]. Choice-based conjoint analysis framework with task coupling parameter, is a realistic form of multitask learning. This depicts learn from each other [51, 52]. The parallel tasks are modeled as Bayesian multiple outputs which can be classified as fixed effects and random effects with the level of parameter sharing through joint prior distribution in neural network machinery [53]. For TL, locally weighted predictive framework can be built which chains multiple models. 
Integration of many learning algorithms from various training domains creates a unified model which will have inherent factors to work on

Survey of Transfer Learning and a Case Study …

151

the different domain [54]. Instead of all features, only a subset of features is made available and the equivalent model is returned at each time step [67]. TL’s two major implementation issues are, knowledge leverage acquired in a source domain and speed of learning in the target domain. One of the special cases in ML is about only one entity availability in the target domain. TL is about utilizing background data to improve logistic regression classifier in the task of interest based on assumption that data of all tasks are mixtures of relevant and unlabeled samples or by sparse coding [65, 66]. Making incremental learning is termed as tagging. One of the instances is gradient descent technique, to minimize stage wise subset feature selection process. Let us assume SD be source domain, TS be learning task and concerned task as ST. Similarly, let us assume TD be the target domain and concerned task as TT. Categories of TL are: Inductive TL: In this way of knowledge transfer, the target task is different from the source task. The existing model bias is reused in a beneficial way on a related but different task. Here TS = TT but there is no requirement on SD and TD to be the same. Therefore, few labeled data in TD is required to make target predictive function. Transductive TL: In this way of learning, the source and target tasks are the same, while the source and target domains are different. Here TS = TT but SD and TD are not the same. Many a time in this situation learning happens in both supervised and unsupervised ways. Unsupervised TL, the target task is different but related to the source task. Focuses on solving unsupervised learning tasks in the target domain. Translated learning, which is made possible by learning a mapping function for bridging features in two entirely different domains. Heterogeneous TL, transferring knowledge across domains or tasks that have a different feature space and transfer from multiple such source domains.

2 State of Art Emotion recognition [ER] is achieved through various techniques with different accuracy rates on different datasets. The need for ER arises in the field of emotion speech synthesis, perception of humans in depression, cockpit discussions, medical applications, stress analysis, and virtual teacher [69, 87].

2.1 Modeling Features The type of emotion can be semi-natural, natural, simulated, and half recordings. In the taxonomy, the low-level descriptors in acoustic level to be considered are intonation, intensity, Cepstral coefficients, linear prediction, formants, spectrum,

152

A. Poosarala and R. Jayashree

transformation, harmonicity, and perturbations like jitters. The low-level descriptors in linguistic levels are phonemes, words, laughter, sighs, and pauses. The feature functionalities are listed as extremes, means, percentiles, higher moments, peaks, segments, regression, spectral, temporal, vector space modeling, and statistical [80]. Contractive neural network with reconstruction penalization in the semiconvolution neural network, candidate features are learnt by unlabeled samples, which are further used to learn to affect discriminative features with saliency [79]. This is a static modeling mechanism. Dynamic modeling mechanisms (Hidden Markov model) are applied for pitch, energy, etc., and static modeling (Data mining toolkit) is applied for other suprasegmental information to know joyful, angry, rest, bored, surprised, emphatic, helpless, touchy, irritated, reprimanding, neutral emotion states [72].

2.2 Challenges Which Motivate Research A region of a speech signal before the vowel onset point represents a consonant region and later represent the transition region. Spectral features are extracted from all three regions for performing ER [40]. The features are derived from a long term spectrotemporal representation of speech [57, 62, 63, 71]. There always exist a tradeoff between description accuracy and evaluation. Many emotion categories lead to low evaluation and limited categories will lead to a low emotional description of utterances [59]. Defining emotional partition without increasing the noise labeling is more challenging, that is algorithm needs to be sensitive for outliers while modeling the context [85, 86].

2.3 Method for Modeling Due to the complex nature of emotions, speech sentences cannot be categorized on a specific emotion. The states can be a result of multiple emotions [73]. The prosodic and spectral features can be fused at the data level for a lesser error rate, as compared with individually [74]. The features are selected from sequential forward with a general regression neural network which and are fed as input to modular neural network [75]. ER is achieved using labeled training audio data from large speech dataset to feed deep convolutional neural network. The spectrogram can be of nature 25 ms block with an overlap of 15 ms which log are transformed and fed to discriminative convolution neural networks [76]. Bag of words and macro class sub-divisional approach are also efficient ER techniques [77, 78]. The latest research works include the effect of white noise for datasets to calculate the direct effect of noise addition. Reduction aids in performance, but features vary in various noise levels [81]. The ER success rate is increased by evolutive algorithms in the speech feature subset selection process [82]. The utterance level

Survey of Transfer Learning and a Case Study …

153

features constructed using segment level probability distribution is sent as input to the extreme learning machine to identify emotions [83]. Classification efficiency for different sizes of the training dataset, different speakers, and various emotion states are discussed [84].

2.4 Datasets There are various speech datasets for ER, such as English, German, Japanese, Dutch, Spanish, Danish, Hebrew, Chinese, Sweden, Russian, Multilingual, Indian (IITKGPSESC) databases [33, 69, 87]. Detection of vocal activity, estimation of intonation curve, and vowel detection without manual corrections are puzzling processes [88–90].

3 Proposed System Transferability between source domains and target domains is to be analyzed. Selecting relevant source domains to extract the information. is a huge challenge. To define transferability between domains, the criteria to measure the similarity between domains need to be defined. One of the related issues arise is when an entire domain cannot be used for TL, whether still can we transfer part of the domain for useful TL. The relationship between TL and ML process exists in the following transferability measures: domain adaptation, multitask learning, sample selection bias, and covariate shift [15, 68, 70]. The novel approach of this paper is to subject the raw spectral features of speech to CNNs, LSTM networks, and then to the shared hidden layer. The stacking of network layers leads to the learning of efficient features in deep architecture. The system will learn to recognize. This provides in a speech signal a synthesis procedure for sequential dynamics. In the shared layer, independent parameters are shared in the reconstruction process, even though the same parameters are fed from the input layer to the hidden layer. The emotion recognition for two languages in one learning structure is achieved. The sequence of steps is presented in Algorithm 1.

154

A. Poosarala and R. Jayashree

Datasets used are IEMOCAP and FAU AEC. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted dataset with 12 h of audiovisual data. It comprises of dyadic sessions of improvisations to elicit emotional features. IEMOCAP database is annotated by multiple annotators into categorical labels of anger, happiness, sadness, and neutrality. Friedrich Alexander university Aibo emotion corpus consists of 9 h of German speech. This was obtained by 51 children interactions with robot Aibo. Using syntactic prosodic criteria, the recordings are divided into meaningful chunks. All transcribed words have corresponding German lexicon. The idea is to provide one emotion label for the complete chunk. The training is provided simultaneously loading of four sessions from IEMOCAP (XE) and one session called OHM from FAU AEC (XG) as shown in Fig. 2a.

Survey of Transfer Learning and a Case Study …

155

Fig. 2. a The hidden layer is hierarchical encapsulation of CNNs, LSTM networks and shared hidden layer b The sequential dynamics and independent reconstruction parameters are provided

4 Experiment Sequentially stacked LSTM layers learn CNNs features and the shared hidden layer will exploit commonalities between different learning tasks as shown in Fig. 2b. In this experiment, we like to show the efficiency of 3 concatenation hierarchy (proposed system) compared to 2 concatenation hierarchy (existing system), which is the padding of only CNNs and LSTM networks. 3 concatenation is obtained by 2 concatenation padded by the shared hidden layer. The robustness lies in statistical strength. In shared layer Kernel mean matching optimization is used as shown in Eq. 1. 2  n tr n te  1   tr   te  1    βi Φ xi − βi Φ xi     n tr n tr i=1 i=1

(1)

Categories are identified for both databases so that training and testing can be configured. Angry and Joyful are considered for the FAU AEC database as negative and positive characteristics. Similarly, Not happy and Happy are depicted for the IEMOCAP database. The sample domain differences are shown in Table 1. The values in the table illustrate the usage instances of both the databases in the experiment. Table 1. Two databases characteristics and number of instances Corpus

Characteristics Age

Language

Number of instances Speech

Train Negative

FAU AEC

Children

German

Variable

IEMOCAP

Adults

English

Fixed

Test Positive

Negative

Positive

400

680

250

430

1000

1200

500

740

156

A. Poosarala and R. Jayashree

5 Results The session and spectral instances of English and also German are fed to CNNs after preprocessing. In this work we want to depict, the classification accuracy is increased by embedding a shared hidden layer to already a classifier of two layers with CNN and LSTM. The first layer [CNN] learning process was established with 2 CONV layers. The convolution filter size was set up with a cardinality of 3 × 3 with 16 filters. The activation function selected was RELU with maximum pooling filter size 2 × 2. In an input array of attributes, the closer data values will be correlated with CNN. The local connectivity is the primary factor which will be covered under CNN. The classification accuracy of CNN can be justified by sequentially stacked LSTM layers. The second layer [LSTM] learning process was performed with 128 cells in a layer. The activation function selected was tanh. Three LSTM layers are stacked sequentially. Learning rate was streamlined to 0.01 with a decay measure of 1e6. Stochastic gradient descent algorithm controlled the optimization steps in the direction of reduction in the least mean squared error measure. The input and output frames were maintained at a uniform size of 1 × 128. The drop out factor was set with a value of 0.25. The number of epochs tried is 50 with a batch size of 64. The third layer [Shared hidden layer] configuration parameters were achieved by the number of hidden units to be 220 with the control parameter ranging from [1.0 to 5.0]. The decay values observed were in the range of [0.001 to 0.01]. Basically, the co-variant shift adaptation process was studied. At the final stage, the voting method was used with threshold criteria for maximum. The supervised learning happens sequentially. The weights are calculated. Three approaches followed were matched training, cross training, and KMM. In matched training, the individual dataset training samples are subjected more than 20 times to train the baseline SVM separately. In cross training, the cross-database training is done for baseline SVM, without using the shared hidden layer. In KMM, the cross training between the datasets is done using the shared hidden layer. In the matched training approach [MT] at the shared hidden layer, the unweighted average class recall rate for 3 concatenation processes (proposed system) for IEMOCAP and FAU AEC was 58% and 52%, respectively. The un-weighted average class recall rate for 2 concatenation (existing system) process for IEMOCAP and FAU AEC was 49% and 47%, respectively, as shown in Fig. 3 In cross training approach [CT] at the shared hidden layer, the un-weighted average class recall rate for 3 concatenation processes (proposed system) for IEMOCAP and FAU AEC was 59% and 55%, respectively. The un-weighted average class recall rate for 2 concatenation (existing system) processes for IEMOCAP and FAU AEC were 50% and 49%, respectively, as shown in Fig. 3. After KMM optimization at the shared hidden layer, the un-weighted average class recall rate for 3 concatenation process (proposed system) for IEMOCAP and FAU AEC was 62% and 57%, respectively. The un-weighted average class recall

Survey of Transfer Learning and a Case Study …

157

Fig. 3. UAR comparison between the existing system and proposed system for a IEMOCAP dataset b FAU AEC dataset

rate for 2 concatenation (existing system) process for IEMOCAP and FAU AEC was 52% and 53%, respectively, as shown in Fig. 3.

6 Conclusion The benefit of transfer learning methods is clearly depicted in the above experiment. Traditional methods are compared with two different language databases. At most care is taken to reduce negative transfer which lead to the reduced performance of learning in the target domain. Using a uniform learning structure, emotion recognition for English and German is achieved.

7 Future Work In the future, we like to reduce dimensionality and increase deep learning hierarchy in order to achieve more accuracy in emotion detection using speech. The results will be further validated using cross validation methods for a greater number of cycles. The approach of transfer learning can be continued for many languages to implement uniform classifier as a consortium.

References 1. W. Dai, W. Yang, G. Xue Y. Yu, Self-taught Clustering, in 25th ACM International Conference of Machine Learning (2008), pp. 200–207 2. I. Davidson, S.S. Ravi, Intractability and Clustering with Constraints, in 24th International Conference on Machine Learning (2007), pp. 201–208 3. G. Grifflin, A. Holub, P. Perona, Caltech-256 Object Category Dataset, Technical Report 7694 (California Institute of Technology, 2007), pp. 1–20

158

A. Poosarala and R. Jayashree

4. B. Nelson, I. Cohen, Revisiting probabilistic models for clustering with pair-wise constraints, in 24th International Conference on Machine Learning (2007), pp. 673–680 5. R. Raina, A. Battle, H. Lee, B. Packer, A.Y. Ng, Self-taught learning: transfer learning from unlabeled data, in 24th International Conference on Machine Learning (2007), pp. 759–766 6. Z. Wang, Y. Song, C. Zhang, Transferred dimensionality reduction, in European Conference on Machine Learning and Knowledge Discovery in Databases. Springer (2008), pp. 550–565 7. D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in IEEE International Conference on Computer Vision (2007), pp. 1–7 8. W. Dai, Q. Yang, G. Xue, Y. Yu, Boosting for transfer learning, in International Conference on Machine Learning (2007), pp. 193–200 9. G.B. Huang, M. Ramesh, T. Berg, E. Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments, in Workshop on Faces in Real-Life Images: Detection, Alignment and Recognition (2008), pp. 1–11 10. M. Wu, B. Scholkopf, A local learning approach for clustering, in Conference on Neural Information Processing Systems (2007), pp. 1529–1536 11. J. Ye, Z. Zhao, M. Wu, Discriminative K-means for clustering, in Conference on Neural Information Processing Systems (2007), pp. 1–8 12. W. Dai, G. Xue, Q. Yang, Y. Yu, Transferring Naïve Bayes classifiers for text classification, in 22nd AAAI Conference on Artificial Intelligence (2007), pp. 540–545 13. R. Raina, A.Y. Ng, D. Koller, Constructing informative priors using transfer learning, in 23rd International Conference on Machine Learning (2006), pp. 713–720 14. J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, N.D. Lawrence, Dataset Shift in Machine Learning (The MIT Press, 2009) 15. J. Jiang, C. Zhai, Instance weighting for domain adaptation in NLP, in 45th Annual Meeting of the Association of Computational Linguistics (2007), pp. 264–271 16. J. Blitzer, R. McDonald, F. Pereira, Domain adaptation with structural correspondence learning, in Conference on Empirical Methods in Natural Language Processing (2006), pp. 120–128 17. X. Liao, Y. Xue, L. Carin, Logistic regression with an auxiliary data source, in 21st International Conference on Machine Learning (2005), pp. 505–512 18. P. Wu, T.G. Dietterich, Improving SVM accuracy by training on auxiliary data sources, in 21st International Conference on Machine Learning (2004), pp. 1–8 19. J. Juang, A. Smola, A. Gretton, K.M. Borgwardt, B. Scholkopf, Correcting sample selection bias by unlabeled data, in 19th Annual Conference on Neural Informaion Processing Systems (2007), pp. 1–8 20. A. Gretton, K. Borgwardt, M. Rasch, B. Scholkof, A. Smola, A kernel method for the twosample problem, in Conference on Neural Information Processing Systems (MIT Press, 2006), pp. 1–8 21. Bickel, S., Bruckner, M. and Scheffer, T.: Discriminative Learning for Differing Training and Test Distributions, 24th ACM International Conference on Machine Learning, pp. 81–88, (2007) 22. S. Bickel, T. Scheffer, Dirichlet-enhanced spam fltering based on biased samples, in Conference on Neural Information Processing Systems (2006), pp. 1–8 23. Y. Xue, X. Liao, L. Carin, B. Krishnapuram, Multi-task learning for classification with dirichlet process priors. J. Mach. Learn. Res. 8, 35–63 (2007) 24. M. Sugiyama, S. Nikajima, H. Kashima, P.V. Buenau, M. 
Kawanabe, direct importance estimation with model selection and its application to covariate shift adaptation, in 20th Annual Conference on Neural Information Processing Systems (2007), pp. 1433–1440 25. M. Sugiyama, M. Krauledat, K.R. Muller, Covariate shift adaptation by importance weighted cross validation. J. Mach. Learn. Res. 8, 985–1005 (2007) 26. S. Rosset, J. Zhu, H. Zou, T.J. Hastie, A method for inferring label sampling mechanisms in semi-supervised learning, in Conference on Neural Information Processing Systems (2004), pp. 1–8 27. W. Dai, G. Xue, Q. Yang, Y. Yu, Co-clustering based classification for out-of-domain documents, in 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007), pp. 210–219

Survey of Transfer Learning and a Case Study …

159

28. J. Gao, P.N. Tan, H. Cheng, Semi-supervised clustering with partial background information, in 6th SIAM International Conference on Data Mining (2006), pp. 489–493 29. S. Swarup, S.R. Ray, Cross-domain knowledge transfer using structured representations, in 21st National Conference on Artificial Intelligence (2006), pp. 1–5 30. R.K. Ando, T. Zhang, A high-performance semi-supervised learning method for text Chunking, in 43rd Annual Meeting on Association for Computational Linguistics (2005), pp. 1–9 31. R.K. Ando, T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853 (2005) 32. S. Miller, J. Guinness, A. Zamanian, Name tagging with word clusters and discriminative training, in Conference on Human Language Technologies-North American Chapter of the Association for Computational Linguistics (2004), pp. 1–6 33. F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, B. Weiss, A database of german emotional speech, in Interspeech (2005), pp. 1–4 34. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passive-aggressive algorithms. J. Mach. Learn. Res. 7, 551–585 (2006) 35. H III Daume, Frustratingly easy domain adaptation, in 45th Annual Meeting of the Association of Computational Linguistics (2007), pp. 256–263 36. Search-based Structured Prediction, Daume, H III, J. Langford, and D. Marcu. Mach. Learn. J. 75, 297–325 (2009) 37. A. Argyriou, T. Evgeniou, M. Pontil, Multi-task feature learning, in 19th Annual Conference on Neural Information Processing Systems (2007), pp. 41–48 38. J. Zhang, Z. Ghahramani, Y. Yang, Learning multiple related tasks using latent independent component analysis, in Conference on Neural Information Processing Systems (2006), pp. 1–8 39. A. Argyriou, C.A. Micchelli, M. Pontil, Y. Ying, A spectral regularization framework for multitask structure learning, in 20th Annual Conference on Neural Information Processing Systems (MIT Press, 2008), pp. 25–32 40. S.G. Koolagudi, K.S. Rao, Real life emotion classification using VOP and pitch based spectral features, in IEEE India Conference, INDICON (2010), pp. 1–4 41. S.I. Lee, V. Chatalbashev, D. Vickrey, D. Kollar, Learning a meta-level prior for feature relevance from multiple related tasks, in 24th ACM International Conference on Machine Learning (2007), pp. 489–496 42. R. Raina, A.Y. Ng, D. Koller, Transfer learning by constructing informative priors, in 21st International Conference on Machine Learning (2006), pp. 1–4 43. T. Jebara, Multi-task feature and kernel selection for SVMs, in 21st ACM International Conference on Machine Learning (2004), pp. 1–8 44. S. Ben-David, R. Schuller, Exploiting task relatedness for multiple task learning, in Conference on Learning Theory and Kernel Machines. LNCS Springer, vol. 2777 (2003), pp. 567–580 45. C. Wang, S. Mahadevan, Manifold alignment using procrustes analysis, in 25th ACM International Conference on Machine Learning (2008), pp. 1120–1127 46. F. Diaz, D. Metzler, Pseudo-aligned multilingual corpora, in International Joint Conference on Artificial Intelligence (2007), pp. 2727–2732 47. N.D. Lawrence, J.C. Platt, Learning to learn with the informative vector machine, in 21st ACM International Conference on Machine Learning (2004), pp. 1–8 48. N.D. Lawrence, M. Seeger, R. Herbrich, Fast Sparse Gaussian process methods: the informative vector machine, in Conference on Neural Information Processing Systems (MIT Press, 2002), pp. 625–632 49. E. Bonilla, K.M. Chai, C. 
Williams, Multi-task Gaussian process prediction, in 20th Annual Conference on Neural Information Processing Systems (MIT Press, 2007), pp. 153–160 50. E.V. Bonilla, F.V. Agakov, C.K.I. Williams, Kernel Multi-task learning using task-specific features, in 11th International Conference on Artificial Intelligence and Statistics (2007), pp. 43–50 51. O. Chapelle, Z. Harchaoui, A machine learning approach to conjoint analysis, in International Conference on Neural Information Processing Systems (MIT Press, 2004), pp. 257–264

160

A. Poosarala and R. Jayashree

52. T. Evgeniou, M. Pontil, Regularized multi-task learning, in 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004), pp. 109–117 53. B. Bakker, T. Heskes, Task clustering and gating for bayesian multitask learning. J. Mach. Learn. Res. 4, 83–99 (2003) 54. J. Gao, W. Fan, J. Jiang, J. Han, Knowledge transfer via multiple model local structure mapping, in 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp. 283–291 55. S. Ben-David, J. Blitzer, K. Crammer, F. Pereira, Analysis of representations for domain adaptation, in 19th International Conference on Neural Information Processing Systems (2006), pp. 137–144 56. L. Mihalkova, T. Huynh, R.J. Mooney, Mapping and revising Markov logic networks for transfer learning, in 22nd AAAI Conference on Artificial Intelligence (2007), pp. 608–614 57. O. Kwon, K. Chan, J. Hao, T. Lee, Emotion recognition by speech signals, in 8th European Conference on Speech Communication and Technology (2003), pp. 125–128 58. L. Mihalkova, R.J. Mooney, Transfer learning by mapping with minimal target data, in AAAI-08 Workshop on Transfer Learning for Complex Tasks (2008), pp. 1–6 59. M.E. Taylor, G. Kuhlmann, P. Stone, Autonomous transfer for reinforcement learning, in The Autonomous Agents and Multi-Agent Systems Conference (2008), pp. 283–290 60. J. Davis, P. Domingos, Deep tansfer via second-order Markov logic, in 26th ACM Annual Conference on Machine Learning (2009), pp. 217–224 61. W. Bridewell, L. Todorovski, Learning declarative bias, in International Conference on Inductive Logic Programming. LNCS Springer, vol. 4894 (2007), pp. 63–77 62. M. Lugger, B. Yang, The relevance of voice quality features in speaker independent emotion recognition, in IEEE International Conference on Acoustics, Speech and Signal Processing (2007), pp. IV17–IV20 63. S. Wu, T.H. Falk, W.Y. Chan, Automatic recognition of speech emotion using long-term spectrotemporal features, in 16th International Conference on Digital Signal Processing (2009), pp. 1– 6 64. U. Ruckert, S. Kramer, Kernel-based Inductive Transfer, in European Conference on Machine Learning and Knowledge Discovery in Databases. LNCS Springer (2008), pp. 220–233 65. S. Kaski, J. Peltonen, Learning from relevant tasks only, in European Conference on Machine Learning. LNCS Springer, vol. 4701 (2007), pp. 608–615 66. H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in 19th Annual Conference on Neural Information Processing Systems (MIT Press, 2007), pp. 801–808 67. S. Perkins, J. Theiler, Online feature selection using grafting, in 12th International Conference on Machine Learning (2003), pp. 592–599 68. S.J. Pan, Q. Yang, A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009) 69. S.G. Koolagudi, K.S. Rao, Emotion recognition from speech: a review. Int. J. Speech Technol. Springer 15(2), 99–117 (2012) 70. N. Cummins, S. Amiriparian, G. Hagerer, A. Batliner, S. Steidl, B.W. Schuller, An Imagebased deep spectrum feature representation for the recognition of emotional speech, in ACM Conference on Multimedia (2017), pp. 478–484 71. W. Lim, D. Jang, T. Lee, Speech emotion recognition using convolutional and recurrent neural networks, in IEEE Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 72. B.W. Schuller, S. Steidl, A. Batliner, The interspeech 2009 emotion challenge, in Interspeech (2009), pp. 1–4 73. F. Yu, E. Chang, Y.Q. Xu, H.Y. 
Shum, Emotion detection from speech to enrich multimedia content, in Advances in Multimedia Information Processing-Pacific-Rim Conference on Multimedia. LNCS Springer, vol. 2195 (2001), pp. 550–557 74. Y. Zhou, Y Sun, J. Zhang, Y. Yan, Speech emotion recognition using both spectral and prosodic features, in IEEE International Conference on Information Engineering and Computer Science (2009), pp. 1–4

Survey of Transfer Learning and a Case Study …

161

75. A. Zhu, Q. Luo, study on speech emotion recognition system in E-learning, in International Conference on Human Computer Interaction, HCI Intelligent Multimodal Environments. Springer LNCS, vol. 4552 (2007), pp. 544–552 76. W.Q. Zheng, J.S. Yu, Y.X. Zou, An experimental study of speech emotion recognition based on deep convolutional neural networks, in IEEE International Conference on Affective Computing and Intelligent Interaction (2015), pp. 827–831 77. F. Pokorny, F. Graf, F. Pernkopf, B. Schuller, Detection of negative emotions in speech signals using bags-of-audio-words, in IEEE International Conference on Affective Computing and Intelligent Interaction (2015), pp. 879–884 78. L. Devillers, L. Vidrascu, Real-life emotion recognition in speech, in Speaker Classification II. LNCS Springer, vol. 4441 (2007), pp. 34–42 79. Z. Huang, M. Dong, Q. Mao, Y. Zhan, Speech emotion recognition using CNN, in 22nd ACM International Conference on Multimedia (2014) pp. 801–804 80. B. Schuller, A. Batliner, S. Steidl, D. Seppi, Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Commun. 53(9–10), 1062–1087 (2011) 81. B. Schuller, D. Arsic, F. Wallhoff, G. Rigoll, Emotion recognition in the noise applying large acoustic feature sets, in Proceedings Speech Prosody (2006), pp. 276–289 82. A. Alvarez, I. Cearreta, J.M. Lopez, A. Arruti, E. Lazkano, B. Sierra, N. Garay, Feature subset selection based on evolutionary algorithms for automatic emotion recognition in spoken spanish and standard basque language, in International Conference on Text, Speech and Dialogue. LNCS Springer, vol. 4188 (2006), pp. 565–572 83. K. Han, D. Yu, I. Tashev, Speech emotion recognition using deep neural network and extreme learning machine, in Interspeech (2014), pp. 223–227 84. X. Zhou, J. Guo, R. Bie, Deep learning based affective model for speech emotion recognition, in IEEE Conference on Ubiquitous Intelligence and Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People and Smart World Congress (2016), pp. 841–846 85. C. Busso, M. Bulut, S. Narayanan, Toward effective automatic recognition systems of emotion in speech, in Social Emotions in Nature and Artifact: Emotions in Human and Human-Computer Interaction (2013), pp. 110–127 86. M. Tahon, L. Devillers, Towards a small set of robust acoustic features for emotion recognition: challenges. IEEE/ACM Trans. Audio, Speech, Lang. Process. 24(1), 16–28 (2016) 87. A. Mencattini, E. Martinelli, F. Ringeval, B. Schuller, C.D. Natale, Continuous estimation of emotions in speech by dynamic cooperative speaker models. IEEE Trans. Affect. Comput. 8(3), 314–327 (2017) 88. S. Zhang, T. Huang, W. Gao, Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multimed. 20(6), 1576– 1590 (2018) 89. M. Chen, X. He, J. Yang, H. Zhang, IEEE Signal Process. Lett. 25(10), 1440–1444 (2018) 90. M. Egger, M. Ley, S. Hanke, Emotion recognition from physiological signal analysis: a review. Electron. Notes Theor. Comput. Sci. 343, 35–55 (2019)

An Efficient Algorithm for Complete Linkage Clustering with a Merging Threshold Payel Banerjee , Amlan Chakrabarti, and Tapas Kumar Ballabh

Abstract In recent years, one of the serious challenges envisaged by experts in the field of data science is dealing with the gigantic volume of data, piling up at a high speed. Apart from collecting this avalanche of data, another major problem is extracting useful information from it. Clustering is a highly powerful data mining tool capable of finding hidden information from a totally unlabelled dataset. Complete Linkage Clustering is a distance-based Hierarchical clustering algorithm, wellknown for providing highly compact clusters. The algorithm because of its high convergence time is unsuitable for large datasets, and hence our paper proposes a preclustering method that not only reduces the convergence time of the algorithm but also makes it suitable for partial clustering of streaming data. The proposed preclustering algorithm uses triangle inequality to take a clustering decision without comparing a pattern with all the members of a cluster, unlike the classical Complete Linkage algorithm. The preclusters are then subjected to an efficient Complete Linkage algorithm for getting the final set of compact clusters in a relatively shorter time in comparison to all those existing variants where the pairwise distance between all the patterns are required for the Clustering process. Keywords Complete linkage clustering · Incremental · Large datasets · Preclustering · Threshold · Triangle inequality

P. Banerjee (B) · T. K. Ballabh Department of Physics, Jadavpur University, Kolkata, India e-mail: [email protected] T. K. Ballabh e-mail: [email protected] A. Chakrabarti A.K. Choudhury School of Information Technology, University of Calcutta, Kolkata, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_10

163

164

P. Banerjee et al.

1 Introduction Cluster analysis is one of the key technologies of data mining that aggregates objects into clusters such that the members of the same cluster share high similarity or cohesion in comparison to the entities belonging to different clusters [1]. Clustering is an unsupervised machine learning technique that not only helps in extracting hidden information from a totally unlabelled dataset but also helps data processing a step easier as now instead of dealing with a large set of totally unlabeled data, only clusters are dealt with for further analysis. Thus, Clustering finds major applications in various fields like information technology, bioinformatics, pattern recognition, image processing, etc. [2–4]. There are various kinds of clustering algorithms like Partitional, Hierarchical, Density-based, Grid-based, etc., and they differ from each other in the notion of the proximity between the patterns. In other words, the term “similarity” used in clustering for grouping data differs from algorithm to algorithm, and hence a clustering algorithm suitable for one application may not be suitable for some other kinds of applications. Hierarchical Clustering [5], is a distance-based clustering technique that forms clusters in the form of a hierarchy or tree called Dendrogram. This method does not need any apriori specification of the number of clusters unlike the Partitional method, and hence is highly useful for those cases where determining the number of clusters in advance is difficult. The Hierarchical clustering depending on the definition of distance or linkage between two clusters can further be divided into various linkage methods out of which Complete Linkage Clustering [6], is a very well-known technique that finds applications in various fields of bioinformatics and biomedical engineering and is highly effective for providing small and compact clusters. However, the main disadvantage of this algorithm lies in its high Convergence time and Space requirements, which makes it unsuitable for clustering large datasets. The traditional method also requires the entire dataset in advance to start the clustering process, and hence is highly unsuitable for streaming data clustering. Triangle inequality is a very well-known technique useful for enhancing the speed of an algorithm by removing the need of doing various redundant distance computations. The method has been previously used for speeding up various Clustering algorithms like K-Means [7], DBSCAN [8], Average Linkage [9], Single Linkage [10], etc. Our paper proposes a preclustering technique based on triangle inequality that produces compact Complete Linkage preclusters in a very short time which will then be applied to a Complete Linkage algorithm to get the final set of clusters based on the user-defined threshold. The novelty of the proposed algorithm: 1. The algorithm unlike some of its well-known variants does not require the entire dataset in advance to initiate the process. 2. The preclustering stage clusters a data point even without comparing the distance of the pattern from all the members of a cluster. Thus, it saves a lot of redundant distance computations making the clustering process much faster.

An Efficient Algorithm for Complete Linkage Clustering …

165

3. The preclustering phase because of its speed and incremental nature can be proved beneficial for the partial clustering of streaming data along with its collection. 4. The Complete Linkage algorithm used at the second stage only works on these preclusters instead of dealing with the entire dataset making the process faster. 5. The final set of clusters satisfies the user-defined threshold and always follows the property of the Complete Linkage Clusters. 6. The Preclustering algorithm if used before any modified versions of the Complete Linkage algorithm can decrease the overall time to a large extent. Thus, using this preclustering algorithm at the initial stage, before subjecting the input patterns to a Complete Linkage algorithm makes the later to converge faster. Lowering the convergence time makes the Complete Linkage algorithm suitable for Clustering large datasets, thus removing one of its major disadvantages. The paper is organized as follows: Section 2 describes the background and related work of this study, Sect. 3 presents the preclustering algorithm, Sect. 4 gives the second stage of the algorithm, Sect. 5 shows the total time and space complexity of the algorithm, Sect. 6 describes the advantage and disadvantage of our algorithm over other well-known variants, Sect. 7 discusses the experimental results and Sect. 8 finally draws the conclusion and the future scopes of the research.

2 Related Work The Complete Linkage Clustering algorithm is also known as the “farthest neighbor clustering” as it defines the distance between two clusters as the maximum of the distances between the members of one cluster from that of the other cluster. In other words, the Complete Linkage algorithm merges two clusters if the maximum distance between them is found within the user-defined threshold. This definition of similarity separates the Complete Linkage algorithm from the other Linkage methods. This process makes the algorithm highly suitable for getting compact clusters. However, the major disadvantage of the algorithm is its high convergence time which makes it unsuitable for clustering large datasets. The steps of the naive implementation of the Complete Linkage algorithm [5, 11] are– (a) Start by assigning each item to each cluster, so that if there are n items, then there are n clusters, each of size unity. (b) Create an n × n proximity matrix by finding the distances between all the pairs of clusters. The distance between two clusters is the maximum of the distances between all the elements of one cluster to that of the other cluster. (c) Find the pair of clusters that are closest to each other and merge them into a single cluster, so that now we have one cluster lesser than the previous number of clusters.

166

P. Banerjee et al.

(d) Repeat steps “b” and “c” until all items are clustered into a single cluster of size “n” or the distance threshold is exceeded at any merging step in case a stoppage criterion in terms of a threshold is provided. The naive implementation of the Complete Linkage Hierarchical Clustering requires a repeated formation of an n × n distance matrix at every merging step, and thereby requires a huge time and space complexity of O (n3 ) and O (n2 ), respectively. This can easily be improved with O (n2 log n) time and O (n2 ) space using a Priority queue to store the distances between Clusters [11]. This high time and space complexity make the algorithm highly unsuitable for clustering large datasets. D. Defays [12] proposed an optimally efficient algorithm known as CLINK (published in 1977) being inspired by the similar algorithm ‘SLINK’ [13] for Single Linkage clustering. The algorithm has a runtime of O (n2 ) with a memory requirement of O (n). Althaus et al. [14] present a greedy algorithm for hierarchical Complete Linkage clusters which is able to cluster one million points in a day on current commodity hardware. The algorithm requires linear space of O (n) and a running time of roughly O (n2 ) by modifying the distance storing method of the priority queue. The algorithm can also be parallelized on shared-memory machines. All these algorithms require the entire dataset in advance to initiate the process. Davidson and Ravi [15] present a method of using constraints in Hierarchical clustering. The paper shows that the constraints can be used to not only improve the runtime performance of agglomerative hierarchical clustering but also the efficiency of the algorithm. The proposed method uses the Leader algorithm and the Triangular Inequality to create partial Complete Link clusters along with collecting the data. This makes it suitable for streaming data clustering or where the entire database is not available at once. The Leader algorithm performs clustering by comparing an incoming pattern with the Leader of the Cluster only. Thus, the preclustering stage without computing the pairwise distance between all the members of clusters takes a clustering decision, unlike the other methods where pairwise distance computations between all the input patterns are required to initiate the algorithm. The preclusters are then subjected to a Complete Linkage algorithm to get the final set of Complete Link Clusters.

3 The Proposed Preclustering Algorithm 3.1 Leader Algorithm The Leader algorithm [16] is a well-known incremental distance-based Partitional clustering technique that clusters the entire dataset by scanning it only once. It compares the distance of an incoming pattern with the existing leaders and assigns it as a follower to a leader if it comes within the threshold to that leader. If the pattern cannot be assigned as a follower to any of the existing leaders, then it will be assigned as a new leader. The entire method is depicted in algorithm 1.

An Efficient Algorithm for Complete Linkage Clustering …

167

3.2 Accelerated Leader Algorithm The Time Complexity of the Leader algorithm is O (mn) for “m” leaders and “n” data points. The algorithm requires comparing an input pattern to every existing leader unless and until it is assigned as a leader or as a follower. Hence, this algorithm is unsuitable for large data volume, as well as for high dimensional data. Patra et al. [9] designed an accelerated version of the Leader algorithm which removes the need of computing the distance between an input pattern with every leader by using triangle inequality. Triangle inequality is a highly efficient tool that provides an upper bound to the distance between a couple of data points “a” and “b” such that ∀a,b,c ∈ D : |d (a, b)|= |d(a, c) − d(b, c)|

(2)

d(a, b) >= d(a, c)

(3)

If d(b,c) >= 2d(a,c),

168

P. Banerjee et al.

Replacing a, b, c with x, lj , l1 we get, d(x, lj ) >= d(x, l1 ) , if d(lj , l1 ) >= 2d(x,l1 )

(4)

Thus, if d(x, l1 ) is > ζ, then Eq. 4, shows d(x, lj ) > ζ, which removes the necessity to compute d(x, lj ) directly. Thus, for any leader lj , by using d(x, l1 ) and d(lj , l1 ) we can conclude whether the distance d(x, lj ) is greater than the threshold or not. This requires storing the distance between all the leaders from the first leader which can be done hand in hand while assigning a data as a leader.

3.2.1

Determining the Threshold for the Preclustering Step

Complete Linkage (CL) Clusters follow the two most important properties – (a) The maximum intracluster (within-cluster) distance must be within threshold. (b) The maximum intercluster (cluster to cluster) distance must be more than the threshold. Let us consider two followers f1 , f2 of a precluster with leader “l” as shown in Fig. 1, d (f2 , f1 )10% along with state-of-the-art performance. Keywords BioNLP shared task · Ensemble learning · Event extraction · Machine learning · Natural language processing

M. Bali (B) · P. V. R. Murthy Department of Computer Science and Engineering, Ramaiah University of Applied Sciences, Bengaluru, India e-mail: [email protected] P. V. R. Murthy e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 N. Sharma et al. (eds.), Data Management, Analytics and Innovation, Advances in Intelligent Systems and Computing 1175, https://doi.org/10.1007/978-981-15-5619-7_32

445

446

M. Bali and P. V. R. Murthy

1 Introduction Since research publications and medical narratives are mostly written in textual format, natural language processing (NLP) takes center-stage in biomedical research, as it can greatly facilitate research productivity by extracting key information from unstructured text and converting it into timely structured knowledge for human understanding. Interdisciplinary collaboration between the NLP and biomedical communities has resulted in a new research area known as biomedical natural language processing (BioNLP) with the goal of developing NLP methods for various kinds of biomedical applications. As shown in Fig. 1, BioNLP researchers first use information retrieval (IE) techniques such as document classification and document/passage retrieval to select relevant documents. This article selection process majorly narrows down the search space from the entire document to the ones of interest. The researchers then incorporate information extraction (IE) technologies (e.g., event extraction which is part of this research or named entity-relation extraction) to identify the text segment that may represent a targeted information. The information may be an ‘entity–entity interaction’ (e.g., ‘drug–drug interaction’ and ‘protein– protein interaction’), an ‘entity–entity relation’ (e.g., ‘protein–residue association’, ‘gene relation’ or ‘temporal relation’), along with participating bio-entities (e.g. gene event extraction), etc. Thus, automatic NLP-enabled data channeling offers users in life sciences specific text snippets of interest without the significant effort of manual searching and researching. The extracted events or text-mined information from biomedical literature has many real-world applications. For e.g., it can be leveraged to assist in database curation, construct ontologies, facilitate semantic web search, and help in the development of interactive systems (e.g., computer-assisted curation tools).

Fig. 1. NLP based data-channeling and its applications

Bio-Molecular Event Extraction Using Classifier …

447

In this research, we focus on extracting valuable biomedical information from unstructured text. Literature confirms various techniques used for this purpose. Broadly, it was observed that these techniques can be categorized into “supervised approach” [1], “rule-based approach” [2], “MSTParser-based approach” [3], “coreference-based approach”, and of late “deep-learning-based techniques” [4]. Despite recent advances, there are many issues facing current methods for obtaining biomedical information. Traditional methods rely primarily on approaches of shallow analysis that are restricted in scope. For instance, in [5, 6] it has been observed that more focus is on the existence of communication between a couple of proteins thereby overlooking the forms of communication, whereas other more comprehensive strategies emphasize only on a small portion of the complete set of interactions [7, 8]. Secondly, the attention is only on single-sentence extraction, which renders them as exceptionally noise sensitive. Thirdly, most of the systems developed are based on any single algorithm and are not so effective individually. Individual systems have their own merits and demerits and in some cases, they perform well, whereas in few other cases poor performance is observed. Ultimately, due to the huge variety of research areas in biomedical research as well as the massive cost of data annotation, there are often substantial differences between training and testing corpora, which reflect badly on the efficiency and generalization capabilities of traditional methods [9]. We present a novel technique for extracting bio-molecular events from unstructured text that alleviates the above issues. It is based on the design of an ensemblelearning framework that consolidates many smaller models for event-extraction. Individual models for event-extraction rely on a supervised approach for the classification of multiple classes. However, the extraction model proposed here is less restricted compared to previous works [10] and covers a far more extensive corpus of biomedical literature. Such realistic environments offer solutions that were hitherto not addressed and provide a way forward for extracting bio-molecular events in an efficient and robust manner. Our contributions from this research can be summarized as below: • We propose a supervised machine learning approach using classification algorithms like Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), and their kernel variants for extracting bio-molecular interactions from the named entities identified using syntactic and semantic parsing techniques. • We combine the above algorithms as base-learners, which enable us to perform first-layer ensemble learning of events. There have been many attempts of using such a method [10, 11] but with poor performance. • We next propose a meta-learner or an Ensemble-of-Ensemble (EoE) model that combines the output of all the above base-learners, which enables us to perform second layer ensemble learning of events. As far as we could possibly know, this is the first attempt of utilizing such a technique for bio-atomic event extraction task.

448

M. Bali and P. V. R. Murthy

Table 1. Event extraction example. Type of entities and event triggers indicated

We carry out a detailed empirical evaluation of the proposed extraction methodology on the BioNLP-2013 shared task (ST), an open community-based event extraction challenge initiated in 2009 using the Genia corpus which is designed to target fine-grained bio-event extraction. The BioNLP-2013 dataset provided six shared tasks as listed in Table 1, however, we focus only on the Genia event (GE) function. The remaining sections of the manuscript are divided as follows: Sect. 2 begins by addressing bio-molecular events and explains the task of event extraction while Sect. 3 discusses the proposed approach for event extraction. Results obtained for the proposed model are outlined in Sect. 4 along with an analysis and comparison with existing models. Section 5 presents the conclusion and future research directions. References used in this research are given at the end of the manuscript.

2 Background To remove events from text, most frameworks essentially depend on a pipeline methodology, which typically comprises three cascading modules including “biomedical term identification”, “event trigger identification”, and “event argument detection”. In such typical approaches, finding the reliability is considered

Bio-Molecular Event Extraction Using Classifier …

449

as a crucial task, as errors that have crept-in move through the pipeline and affect subsequent stages. Preprocessing, therefore, plays an important role in any pipeline for text mining. It comprises reading and modifying the data from its original format into an internal form and eliminating features that normally include all the degree of content or language processing. Preprocessing can also include resolving coreferences or imposing some sort, for instance, in the particular case of event extraction of sentence simplification, to boost extraction performance. In the subsequent sections, we explain some of these tasks and related techniques adopted by us in this research.

2.1 Bio-Molecular Event Extraction Event extraction is primarily a process of extracting structured representations of biological events from text with related attributes and properties that are typically defined by complex and related argument structures that include multiple entities or recursively embedded relationships. It is considered one of the central practices in the programmatic recognition of associations among various entities and other information in textual documents. These researches are attempting to find some specific biomarkers that would assist doctors to better their patient’s customized treatment plans. Nevertheless, a list of changes is not sufficient on its own; results should be correctly addressed in the correct setting as shown in the medical literature. Event extraction systems provide various prospects of extracting the exact context of the observed disease and treatment changes, as well as the ability to generate new theories by discovering and testing new associations in literature and genomic evidence between different entities. Named Entity Recognition (NER) is among the most necessary steps in the processing of events. It includes identifying the textual boundaries that compose the name of an entity and categorizing the entity as per its related, predetermined classes. Biomedical bodies include “names for genes and proteins”, “biological structures”, “diseases”, and “names for medications”. Such concepts are generally treated through their one-of-a-kind and clear portrayals via a process called normalization. In language, it is possible to refer to a particular concept in different ways, which are called term variants. For instance, both “TIF2,” “TIF-2,” “intermediate transcription factor-2” and “intermediate transcription factor 2” represent the same individual. The method of normalization integrates the aggregation of term variants into a class of equivalence, derived from external knowledge bases. In BioNLP-ST, the entities are already given and the standardization step is used to improve the efficiency of event extraction. Relationships are defined after the entity recognition task called trigger detection, which captures the logical connections between the identified entities. An essential precondition for an analytical understanding of biological processes is the systematization of the process of identifying relationships. Earlier attempts to extract biomedical events centered on the extraction of easy, two-way relationships between

450

M. Bali and P. V. R. Murthy

biomedical entities, for e.g., “protein-protein interactions (PPI)”, “gene-disease relationships” and “drug-drug interactions (DDI)”. In any case, these binary relations alone are not sufficient to catch biomedical marvels deeply illustrating the need for better portrayals to enhance the extraction of complex relationships or events. To address the “afore mentioned problem”, [12] proposed the GENIA event annotation scheme to describe complex biological processes, which normally includes a shift in the properties, positions, or connections between biomedical entities, such as “proteins, cells and chemicals”, in order to address the “previous problem”. In addition, the schema typically describes event-related attributes such as “polarity”, “negation”, and “speculation” that reshape the contextual information anchored with an event. Such characteristics are fundamentally important, because negation reverses the value of a case, and with this claim, negation of a candidate statement prevents the construction. The issue of event extraction can be broadly classified into two sub-tasks that can be learned together or one after the other: “trigger detection” and “argument”. The first subtask, cause identification, involves acknowledging the presence of the event in a sentence of simple content-specific textual terms. Such textual units or trigger words are normally in the form of verbs (e.g., regulates, phosphorylates) or nominal verbs (e.g., control, phosphorylation). Trigger detection is an area-specific task where semantics and the amount of event-types can vary significantly from one domain to the next. Next, the BioNLP 2013 Genia (GE) ST comprises 13 different types of events (gene expression, catabolism, or alteration of proteins, control, etc. See Table 2). The second subtask is the argument identification function, which includes defining an event’s participants along with their associated semantic roles in order to completely construct an event. Such statements consist of various substances (e.g., proteins) or events involving clustered structures on a regular basis. Therefore, it is not shocking that the concept of events will form a similar substance in different events. Such arguments corresponding to semantic categories are derived from a defined ontology of semantic forms (e.g., theme and roles of cause). Post-processing steps will follow these two subtasks to turn the detected stimuli and arguments into valid structures. In order to explain the process of event extraction, consider the phrase: “TGFbeta mediates RUNX induction, and FOXP3 is efficiently up-regulated by RUNX1 and RUNX3” (Table 1). Originally, the named entities are: “TGF-beta”, “RUNX”, Table 2. Corpus statistics of the BioNLP-ST 2013 shared tasks Task

Documents

# types

# events

Genia Event (GE)

34 Full papers

2

13

Cancer Genetics (CG)

600 Abstracts

18

40

Pathway Curation (PC)

525 Abstracts

4

23

Gene Regulation Ontology (GRO)

300 Abstracts

174

126

Gene Regulation Network (GRN)

201 Abstracts

6

12

Bacteria Biotopes (BB)

124 Web pages

563

2

Bio-Molecular Event Extraction Using Classifier …

451

“FOXP3”, “RUNX1”, and “RUNX3”. The next step is to define and annotate terms that detect the presence of events called the “trigger words” as labels. Next is the identification of events. In the example above, “induction” is a verb with a theme “RUNX” that in the first case of ‘gene expression’ is a protein. The second event captured is “positive regulation” via the word “media” the cause for which is “TGF-beta”, a protein theme is the gene expression event. The “up-regulated” verb, activated by two “RUNX1” and “RUNX3” proteins, indicates the third event-positive regulation, and the result is another “FOXP3” protein. Token which is “up-regulated” generates a positive regulation event having overlapping edges. The structure so formed is considered separately and a valid event is constructed.

2.2 Bio-Molecular Event Extraction BioNLPST, where multiple biennial competitions are conducted, promotes the use of NLP and NLP applications focused on the biomedical domain. Multiple protocols are established by BioNLP-ST at each round, for the competition. This offers common resources including a rich array of knowledge bases and annotated training firms, and sets benchmarks for the efficiency of measuring and rating competing methods. Such a coordinated effort is key to creating a common framework that describes the needs of biologists and guides the creation of state-of-the-art, observable BioNLP applications for success. The BioNLP-ST series was introduced in 2009 with an event extraction challenge using the Genia corpus, which has seen significant growth and has added new subtasks; extending its reach to extract events from complete papers as well as abstracts; and the resource sophistication. In 2013, the BioNLP working group published six STs listed in Table 2. Furthermore, these challenge series require teams/participants to share their code and trained models in public domain; develop new methods that reduce reliance on deeply annotated data; and reinforce database curation. Most participants are continuing to improve their models even after the challenge submission and new researchers are taking up this task to better the result toward a state-of-the-art performance in this domain. In this study, we focus only on Genia Event (GE) Task 1––the core event extraction that, together with their primary arguments, addresses the extraction of typed events. The GE task at BioNLP 2013 is structured in order to make it a more “true” activity that is useful for the creation of knowledge base. The first choice of design is to create the data sets with only recent complete papers, so the collected pieces of information could be up-to-date domain knowledge. Second, the co-reference annotations are included in the event annotations in order to encourage the use of these co-reference features in the event extraction solution.

452

M. Bali and P. V. R. Murthy

2.3 Evaluation Metrics In general, it is possible to measure the performance of a BioNLP system based on its observations (either true or false), which are used to evaluate few parameters such as precision, recall, and F-score metrics. “Precision” can be sought, as considered by ground truth, with the aid of the fraction of documents collected or derived relevant events. The “F-score” calculation is the “Precision” and “Recall” harmonic standard. Accuracy and recall are in most cases inversely proportional: i.e., accuracy scores normally fall as the retrieved documents rise at the rate of recall, and vice versa. The “F-score” is nothing but a precision-to-recall trade-off feature.

3 Proposed Approach for Event Extraction The overall proposed model comprises the following phases: • • • • • • •

Data Acquisition Data Preprocessing Feature Identification and Extraction (Entity recognition) Ensemble Classification assisted Event Extraction First-layer Ensemble event expression Second-layer Ensemble-of -Ensemble event expression Performance assessment.

A high-level block diagram of the proposed model is illustrated in Fig. 2. Figure 3 depicts the detailed flowchart. The detailed discussion of the above stated methodological paradigm is given as follows.

Fig. 2. High-level block diagram of the proposed model

Bio-Molecular Event Extraction Using Classifier …

453

Fig. 3. Flow-chart of the proposed model

3.1 Data Acquisition Full text scientific papers (GE) task from the BioNLP2013 ST is used. It has 34 full papers with 2 types and overall 13 events targeted.

3.2 Data Preprocessing Pipeline procedure is used, which is made up of three modules including “biomedical term identification”, “event trigger identification”, and “event argument detection”. Event triggers should be identified reliably, since the existence of errors at the beginning stages will be propagated further, which affects the performance of the subsequent module. “Pre-processing plays an important role in any text mining system in order to avoid this problem. It includes most of the operations that read the data from their original format to an internal version, and extract features that usually involve some sort of text or language processing. Pre-processing may also entail resolving co-references in the specific case of event extraction or applying some form of sentence simplification, for example, by expanding conjunctions, in order to improve the extraction results.”

454

M. Bali and P. V. R. Murthy

“To extract a feature depiction from texts, MATLAB uses text processing that includes a set of common NLP tasks, ranging from data cleaning, sentence segmentation and tokenization, chunking, and linguistic parsing.”

3.3 Feature Extraction For a successful event extraction framework, feature extraction plays a critical role. Many current event extraction systems provide a complex set of characteristics derived from “tokens,” “sentences,” “dependence parsing trees”, and “external tools”. The following features are recognized and extracted. • Token-based features containing specific knowledge of each token, like syntactic or linguistic attributes, namely, the POS and the lemma of each token. • Contextual characteristics have a general sentence or community characteristics, where the target token is present. The number of tokens in the sentence is some of the features derived from the sentence number of named entities in the sentence, and bag-of-words counts of all words. • Dependency parsing provides information on grammatical relationships involving two words. • They also encode domain knowledge as features that use external resources, such as the most probable trigger word lexicons, and these are gene and protein names that signify the presence of a cause or entity. “The proposed model focuses more on Named Entity Recognition, which involves of detecting references (or mentions) to entities such as ‘genes’, ‘proteins’ and ‘chemical compounds’, and marking them with their position and type. Named-entity identification in the biomedical domain is typically considered more complicated than in other areas because firstly, millions of entity names are in use and new systems are constantly being included, suggesting that dictionaries cannot be adequately comprehensive; secondly, the biomedical field is evolving too rapidly to allow consensus to be reached on the name to be used in the biomedical domain. It evolves too quickly to allow the agreement to be reached on the name to be used for a given entity or even on the exact definition identified by the entity itself. Therefore, it is easy to use the same name or acronym for specific concepts. We extract entities in MATLAB and they are used in our first-layer ensembles as function vectors.”

3.4 Feature Extraction Detection of word trigger is nothing but a function of extraction of events that attracted most research interest. It is one of the critical tasks, since the information generated by task strongly defines the effectiveness of the following tasks. This assignment

Bio-Molecular Event Extraction Using Classifier …

455

Table 3. Event triggers and their types in the example sentence Trigger 1

Trigger 2

Trigger 3

‘induction’, which is a gene expression

‘mediates’, which is a positive regulation

‘up-regulated’ which again is a positive regulation

describes essentially the chunk of text that causes the event and acts as a reference. Even though the word trigger is not limited to a specific set of part-of-speech tags. Verbs (for example, “activates”) and nouns (for example, “expression”) are the most common tags. In addition, a stimulus can consist of several consecutive terms. Table 3 shows the expected results of the phase of trigger detection in the given example paragraph. As can be seen from Table 1, trigger detection is a mechanism that detects the recognition of causes of events and their type. The observed three causes are shown in Table 3. Various approaches have been suggested for trigger detection, which can be broadly segregated into three types: “rule-based, dictionary-based, and machine learning-based.” We approach it as a supervised machine learning problem and combine both the trigger detection and the argument extraction step. To design an ensemble-learning model, we have considered different learning models like SVM, KNN, and DT with different kernel functions (or learning methods) such as Linear, Gradient Descent, and KNN with 1 and 5-neighbors, respectively. • Support Vector Machine-LINEAR • Decision Tree, • K-nearest neighbors. The detailed discussion of the considered base learning models is given as follows: • Support Vector Machine (SVM) SVM is one of the most commonly used supervised machine learning algorithms having a proven record of performing accurate classification. SVM exploits the patterns of the data and functions as a non-probabilistic binary linear classifier. It reduces the generalization error on an unobserved example using Structural Risk Minimization concept. In this method, the support vectors state a subset of the training set that retrieves the value of a boundary also called the hyper-plane between the previously mentioned two classes. To predict SVM, it applies the function (1). Y  = w ∗ φ(x) + b

(1)

In (1), φ(x) states the nonlinear transform, where the prime motive is to obtain the suitable values of w, b. Here,Y  is retrieved by reducing the risk of regression. l     1  Rreg Y  = C ∗ γ Yi − Yi + ∗ w2 2 i=0

(2)

456

M. Bali and P. V. R. Murthy

In (2), γ and C state the cost function and penalties for error estimation. The value of w is obtained using (3). l      w= α j − α ∗j φ x j

(3)

j=1

In (3), the parameters α and α ∗ state the relaxation parameter called Lagrange multipliers (α, α ∗ ≥ 0). The output obtained is Y =

l  

   α j − α ∗j φ x j ∗ φ(x) + b

j=1

=

l      α j − α ∗j ∗ K x j , x + b

(4)

j=1

  In (4), K x j , x states the kernel function. In the proposed work, SVM with LINEAR kernel function has been applied to extract event expression. Linear SVM is an implementation of SVM using Liblinear. It is a library for large linear classification for data with large no. of instances and features.’ • Decision Tree (DT) Decision tree is a conventional classification algorithm that has extensively been used in data mining and classification purposes. Over time, decision tree has evolved with various names with distinct and better classification performance. Some of the dominant variants of decision tree algorithm are IDE, CART, DT C4.5 and DT C5.0 [51]. In this paper, we have applied C5.0 model of decision tree classifier that performs recursive partitioning over the data to extract event expressions. Originating at the root node, at each node of the tree it splits the feature vector or data into different branches based on association rule in between the split criteria. Our proposed C5.0 decision tree algorithm applies Information Gain Ratio (IGR) to perform multi-class sentiment classification. • K-nearest neighbors (KNN) K-nearest neighbor is used for both regression and classification problems and there is no training process for this algorithm, the entire data set is used for predicting/classifying new data. When a new data point is given, it finds the distance from the new data point, to all other points in our data set. Then depending on the K value, it identifies the nearest neighbor(s) in the data set, if K = 1, then it takes the minimum distance of all points and classifies as the same class of the minimum distance data point. if K > 1, then it takes a list of K minimum distances of all data points

Bio-Molecular Event Extraction Using Classifier …

457

Fig. 4. Detailed architecture of the proposed model

For classification, it classifies the new data point based on the majority of votes in the list. The proposed system for these algorithms and tasks follow an ensemble architecture in which various individual models are combined. To perform this operation, three individual event extraction systems are developed and the output of these care then combined. The same set of features are being used by these systems. The output from these individual systems are then combined in a meta-ensemble. Overall system architecture is shown in Fig. 4. • Ensemble-1 and Ensemble-3 (ens 1 or S1 and ens 3 or S3 ): ‘These two systems are based on multi-class Linear Support vector machine (SVM)’ [11], K-nearest neighbor and Decision Tree as base classifiers. Ensemble-1 comprises Linear SVM, KNN 1 (with 1 neighbor) and KNN 5 (with 5 neighbors) algorithms. Ensemble-3 comprises Linear SVM, KNN 1 and Decision Tree algorithms. The same set of features are used in both the systems and majority voting is used as the ensemble output technique. • Ensemble-2 (ens 2 or S2 ): ‘This system is entirely based on a stacking approach [10], where three classification algorithms are being used, namely Linear SVM, KNN 1 (with 1 neighbor) and Decision Tree as the base classifiers. This system also uses the same set of features but stacking [10] as the ensemble technique. In the Genia event extraction task, various categories of event extractions are found which includes gene_expression, binding, regulation etc. BioNLP-2013 evaluation scheme is used, which generates the statistics like recall, precision and F-score for each type of event expression. It is observed that, few classifiers shows


Table 4. Combining the output of each classifier within ens 1, ens 2 and ens 3

It is observed that some classifiers show good performance on one class while others perform well on other classes. Based on the per-class F-score, the outputs of each classifier in ens 1 (S1), ens 2 (S2) and ens 3 (S3) are combined. This is an iterative process, as shown in Table 4: in each iteration, the classifiers are combined by taking the best results from all three classifiers in each ensemble of the first layer (a sketch of this per-class selection appears after the descriptions of majority voting and stacking below).
• Ensemble-of-Ensemble (EoE or Meta-Learner): To explain how the outputs of the different first-layer ensembles are combined, consider the example sentence 'BMP-6 inhibits growth of mature human B cells; induction of Smad phosphorylation and upregulation of Id1'. Taking ens 1 and ens 2 as an example, their outputs are shown in Tables 5 and 6. Table 7 illustrates the detailed operations by which the outputs are combined, and the result of combining the outputs of ens 1 (S1) and ens 2 (S2) is shown in Table 8. Of the four ensembles, two use majority voting and two use the stacking technique for deciding on the output.

Table 5. Ensemble 1 output

T1: (Protein, BMP-6)
T2: (Protein, Id1)
T3: (Positive_regulation, upregulation)
E1: (Type/class: T3, theme: T1)

Table 6. Ensemble 2 output

T1: (Protein, BMP-6)
T2: (Protein, Id1)
T3: (Positive_regulation, upregulation)
T4: (Phosphorylation, phosphorylation)
E1: (Type/class: T3, theme: T2)
E2: (Type/class: T4, theme: T2)


Table 7. Detailed process for combining outputs from different ensembles

Table 8. Output by combining the ens 1 (S1) and ens 2 (S2) systems

T1: (Protein, BMP-6)
T2: (Protein, Id1)
T3: (Positive_regulation, upregulation)
E1: (Type/class: T3, theme: T2)

Majority Voting: All classifiers are run over the same dataset, and each classifies every event instance. For each instance, a prediction (vote) is made by every model, and the final output is the prediction that receives more than 50% of the votes.
Stacking: Stacking, also called stacked generalization, is an ensemble method in which multiple classifiers are combined using another machine-learning technique. The fundamental idea is to train the base machine-learning algorithms on the training dataset and use their predictions to produce a new dataset, which is then used as input to the aggregating machine-learning algorithm. In this research, the final meta-ensemble uses stacking.
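As a rough illustration of the per-class combination described above (cf. Table 4), the sketch below selects, for each event class, the first-layer ensemble with the best development-set F-score on that class. The function and variable names are hypothetical, not taken from the paper.

from sklearn.metrics import f1_score

def combine_by_class_fscore(ensemble_preds, y_dev, classes):
    """For each event class, pick the ensemble whose development-set
    F-score on that class is highest (cf. the iterations of Table 4)."""
    best = {}
    for cls in classes:
        scores = {
            name: f1_score(y_dev, preds, labels=[cls],
                           average="macro", zero_division=0)
            for name, preds in ensemble_preds.items()
        }
        best[cls] = max(scores, key=scores.get)  # winning ensemble per class
    return best

# Hypothetical usage with development-set predictions from S1, S2 and S3:
# winners = combine_by_class_fscore(
#     {"S1": s1_preds, "S2": s2_preds, "S3": s3_preds}, y_dev,
#     classes=["Gene_expression", "Binding", "Positive_regulation"])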

Table 9. Result obtained

Task                                                            Evaluation results
GE: Proposed Ensemble-of-Ensemble (EoE), core event extraction  F-score: 66.61

Table 10. Results comparison

Model name              Recall (%)   Precision (%)   F-measure (%)
TEES                    49.56        57.65           53.30
FAUST                   49.41        64.75           56.04
EventMine               51.25        64.92           57.28
Proposed System (EoE)   64.23        68.59           66.34
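As a quick consistency check on Table 10, the F-measure is the harmonic mean of precision and recall, and the proposed system's entry satisfies this relation:

F\text{-measure} = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \times 68.59 \times 64.23}{68.59 + 64.23} \approx 66.34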

4 Experimental Results and Analysis
The BioNLP-ST-2013 dataset and its evaluation framework are used for the proposed experiment. The system is tuned on the development dataset, and the resulting configuration is used for the final evaluation on the test dataset. Table 1 shows the complete statistics of the BioNLP-13 dataset for Genia event extraction. Trigger detection and classification are carried out at the same time. We use ensemble classification in a two-layer approach, where each ensemble uses either majority voting or stacking. After combining the outputs in the novel ensemble-of-ensemble approach, the results obtained are as shown in Table 9; a simplified sketch of the per-event-type evaluation follows.
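The official BioNLP-ST-2013 scorer matches events by trigger and arguments before computing its statistics; as a simplified illustration of the per-event-type recall, precision and F-score it reports, the following sketch assumes gold and predicted event types have already been aligned into flat label lists (the labels shown are illustrative).

from sklearn.metrics import classification_report

# Illustrative aligned gold/predicted event-type labels; the real
# scorer performs approximate span and argument matching first.
y_gold = ["Gene_expression", "Binding", "Positive_regulation",
          "Gene_expression", "Regulation"]
y_pred = ["Gene_expression", "Binding", "Negative_regulation",
          "Gene_expression", "Regulation"]

# Recall, precision and F-score per event type, as in the
# BioNLP-2013 evaluation statistics.
print(classification_report(y_gold, y_pred, zero_division=0))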

4.1 Comparison with Existing Systems
The University of Turku's TEES [13] submission attained a top rank in BioNLP-ST-2013. Our experimental results have been compared with the results achieved by TEES, FAUST [14] and EventMine [15] (Table 10). The comparison shows that performance has improved by more than 10% in F-measure. It is also observed that our model outperforms the models proposed by FAUST and EventMine, which gained good ranks in BioNLP-2011 [16]. The compared results are shown in Fig. 5.

5 Conclusion and Future Work
In this research, event extraction is first carried out on biomedical text using an efficient method in which multiple ensemble models are developed with the help of different algorithms and their kernel variants.


Fig. 5. Comparison with existing models

These models are then aggregated in a two-layer ensemble approach for bio-molecular event extraction. The BioNLP 2013 shared-task datasets are used to test and evaluate the proposed model. The results achieved demonstrate that the implemented model attains a higher F1 score and state-of-the-art performance compared with other existing approaches. As future work, the model will be extended to the other tasks in the BioNLP 2013 ST data. In addition, the model will be tested on the latest BioNLP shared-task challenge data to develop a robust model for BioNLP 2020 ST and other challenge entries.

References
1. A. Ozgur, D.R. Radev, Supervised classification for extracting biomedical events, in Proceedings of the Workshop on BioNLP, pp. 111–114
2. Q.C. Bui, D. Campos, E. van Mulligen, J.A. Kors, A fast rule-based approach for biomedical event extraction (2013), pp. 104–108
3. D. McClosky, M. Surdeanu, C.D. Manning, Event extraction as dependency parsing for BioNLP 2011, in Proceedings of the BioNLP Shared Task 2011 Workshop (2011), pp. 41–45
4. J. Wang, H. Li, Y. An, H. Lin, Z. Yang, Biomedical event trigger detection based on convolutional neural network. Int. J. Data Min. Bioinformat. 15(3), 195–213 (2016)
5. A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, T. Salakoski, All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC Bioinform. 9, S2
6. R.J. Mooney, R.C. Bunescu, Subsequence kernels for relation extraction, in Proceedings of NIPS (2005), pp. 171–178
7. L. Hunter, Z. Lu, J. Firby, W.A. Baumgartner, H.L. Johnson, P.V. Ogren, K.B. Cohen, OpenDMAP: an open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinform. 9, 78 (2008)
8. R. McDonald, F. Pereira, S. Kulick, S. Winters, Y. Jin, P. White, Simple algorithms for complex relation extraction with applications to biomedical IE, in Proceedings of ACL (2005)
9. D. Tikk, P. Thomas, P. Palaga, J. Hakenberg, U. Leser, A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput. Biol. (2010)
10. A. Majumder, A. Ekbal, S.K. Naskar, Biomolecular event extraction using a stacked generalization based classifier, in Proceedings of the 13th International Conference on Natural Language Processing, Varanasi, India, December 2016. NLP Association of India (2016), pp. 55–64
11. A. Majumder, A. Ekbal, S.K. Naskar, Feature selection and class-weight tuning using genetic algorithm for bio-molecular event extraction, in Proceedings of the NLDB (2017)
12. K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
13. J. Björne, T. Salakoski, TEES 2.1: automated annotation scheme learning in the BioNLP 2013 shared task, in Proceedings of the BioNLP Shared Task 2013 Workshop, Sofia, Bulgaria. Association for Computational Linguistics (2013), pp. 16–25
14. S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, C.D. Manning, Model combination for event extraction in BioNLP 2011 (2011), pp. 51–55
15. M. Miwa, S. Pyysalo, T. Ohta, S. Ananiadou, Wide coverage biomedical event extraction using multiple partially overlapping corpora. BMC Bioinform. 14(1), 175 (2013)
16. J.D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, J. Tsujii, Overview of BioNLP shared task (2011), pp. 1–6
17. N. Chinchor, Overview of MUC-7, in Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia, April 29–May 1 (1998)
18. E.M. Voorhees, L.P. Buckland (eds.), Proceedings of the Sixteenth Text REtrieval Conference, TREC 2007, Gaithersburg, Maryland, USA, November 5–9, Special Publication 500-274. National Institute of Standards and Technology (NIST) (2007)
19. C. Nedellec, Learning language in logic: genic interaction extraction challenge, in Proceedings of the 4th Learning Language in Logic Workshop (LLL05) (2005), pp. 31–37
20. L. Hirschman, M. Krallinger, A. Valencia, Proceedings of the Second BioCreative Challenge Evaluation Workshop, Centro Nacional de Investigaciones Oncologicas (CNIO)
21. J.D. Kim, T. Ohta, S. Pyysalo, Y. Kano, J. Tsujii, Overview of BioNLP'09 shared task on event extraction, in BioNLP '09: Proceedings of the Workshop on BioNLP (2009), pp. 1–9
22. H.G. Lee, H.C. Cho, M.J. Kim, J.Y. Lee, G. Hong, H.C. Rim, A multi-phase approach to biomedical event extraction, in Proceedings of the Workshop on BioNLP, pp. 107–110
23. L. Li, Y. Wang, D. Huang, Improving feature-based biomedical event extraction system by integrating argument information (2013), pp. 109–115
24. J.D. Kim, Y. Wang, N. Colic, S.H. Beak, Y.H. Kim, M. Song, Refactoring the Genia event extraction shared task toward a general framework for IE-driven KB development, in Proceedings of the Fourth BioNLP Shared Task Workshop (2016), pp. 23–31
25. K. Yoshikawa, S. Riedel, T. Hirao, M. Asahara, Y. Matsumoto, Coreference based event-argument relation extraction on biomedical text, in Proceedings of the Semantic Mining in Biomedicine Symposium (2010)
26. D. McClosky, Any domain parsing: automatic domain adaptation for natural language parsing. PhD thesis, Providence, RI, USA (2010). AAI3430199
27. M. Krallinger, F. Leitner, C. Rodriguez-Penagos, A. Valencia, Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol. 9, S4 (2008)
28. D. Tikk, P. Thomas, P. Palaga, J. Hakenberg, U. Leser, A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput. Biol. (2010)