Machine Learning: Theoretical Foundations and Practical Applications 981336517X, 9789813365179

This edited book is a collection of chapters invited and presented by experts at 10th industry symposium held during 9–1


English Pages 190 [178] Year 2021


Table of contents :
Preface
Contents
Editors and Contributors
What Do RDMs Capture in Brain Responses and Computational Models?
1 Introduction
2 Methods
2.1 Representational Similarity Analysis (RSA)
2.2 RDM Construction for Human Brain Responses and Deep Neural Nets (DNNs)
2.3 Description of Algonauts Challenge Datasets
2.4 Experiments with DNNs
2.5 Selecting the Best Models
2.6 Qualitative Analysis of High and Low Values in RDMs
3 Results
3.1 Analysis of the Human Brain Responses
3.2 Deep Networks Mimicking Human Brain Responses
4 Discussion
4.1 Qualitative Analysis of Human Responses and DNN Responses
4.2 The Way Forward
References
Challenges and Solutions in Developing Convolutional Neural Networks and Long Short-Term Memory Networks for Industry Problems
1 Introduction
1.1 Inception
1.2 Regional CNN
1.3 Recurrent Neural Networks
2 Application One: Image Recognition in a Document by Using Convolutional Neural Networks
2.1 Analyzing the Problem
2.2 Handling the Challenge of Skewness
2.3 Architecture of CNN
2.4 Prediction/Testing
2.5 Results
3 Application 2: Predicting Equated Monthly Instalment Payments
3.1 Analyzing the Problem
3.2 Data Representation
3.3 Design of the LSTM
3.4 Results
4 Review of the Two Applications
References
Speed, Cloth and Pose Invariant Gait Recognition-Based Person Identification
1 Introduction
2 Literature Review of Related Work
2.1 GEI
2.2 HOG
2.3 Radon Transform and Zernike Moments
2.4 Classifier Model
3 Result and Discussion
3.1 About the Database
3.2 Speed Invariance
3.3 Cloth Invariance
3.4 Pose Invariance
4 Comparison with Existing Approaches
5 Conclusion and Future Scope
6 Declarations
6.1 Funding
References
Application of Machine Learning in Industry 4.0
1 Industry 4.0
2 Evolution of Industry 1.0–4.0
2.1 Industry 1.0
2.2 Industry 2.0
2.3 Industry 3.0
2.4 Industry 4.0
3 Nine Pillars of Industry 4.0
3.1 Big Data and Analytics
3.2 Autonomous Robots
3.3 Simulation
3.4 Internet of Things
3.5 Cloud
3.6 System Integration
3.7 Additive Manufacturing
3.8 Augmented Reality
3.9 Cyber Security
4 Machine Learning
4.1 Regressors
4.2 Classifiers
4.3 Clustering
4.4 Reinforcement Learning
4.5 Natural Language Processing
4.6 Deep Learning
5 Conclusion
References
Web Semantics and Knowledge Graph
1 Introduction
1.1 Advantages of Using the Semantic Technologies
1.2 Challenges in Note
2 Basics of Taxonomies
2.1 Advantages of Using a Taxonomy
2.2 Components of the Taxonomy
3 RDF Data Model
4 Basics of Ontologies
4.1 Classes and Relationships
5 Taxonomy and Ontologies, the Difference and Similarity
6 Introduction to Text Mining
7 Semantic Text Mining
8 Related Thing to Keep in Mind While Text Mining (State of the Art)
9 Resolving the Diverse Database Conflict
10 Conclusion
References
Machine Learning-Based Wireless Sensor Networks
1 Introduction
1.1 Challenges
2 State-of-the-Art Applications of Machine Learning in WSNs
3 ML in WSNs
3.1 Detection
3.2 Target Tracking
3.3 Localization
3.4 Security Enhancement
3.5 Routing
3.6 Clustering
4 Conclusion
References
AI to Machine Learning: Lifeless Automation and Issues
1 Introduction
2 Current State of Art
2.1 Computer Vision
2.2 Generative Adversarial Networks
2.3 Training and Deployment
3 From AI to ML
4 Applications
4.1 COVID
4.2 Disease Detection
4.3 Wind Power Detection in Power Systems
4.4 Agriculture
4.5 Politics
4.6 Genomics
4.7 Networking
4.8 Energy Forecasting
5 Conclusion
References
Analysis of FDIs in Different Sectors of the Indian Economy
1 Introduction
1.1 Technologies Used
2 State of the Art
3 ARIMA Model for Forecasting
3.1 Model 1: AR MODEL
3.2 Model 2: MA MODEL
3.3 Hybrid Model 1: ARMA MODEL
3.4 Hybrid Model 2: ARIMA MODEL
4 Implementation and Results
4.1 Box–Jenkins Method
4.2 Predicting Future Trends
5 Conclusion
References
Customer Profiling and Retention Using Recommendation System and Factor Identification to Predict Customer Churn in Telecom Industry
1 Introduction
2 Literature Review
3 Problem Definition
4 Proposed Model for Customer Churn Prediction
4.1 Data Set Description
4.2 Data Pre-processing
4.3 Performance Evaluation Matrix
5 Factors Identifying Customer Churn
6 Customer Profiling and Retention
7 Customer Retention Using Recommendation System
8 Implementation and Results
9 Conclusion and Future Work
References

Studies in Big Data 87

Manjusha Pandey Siddharth Swarup Rautaray   Editors

Machine Learning: Theoretical Foundations and Practical Applications

Studies in Big Data Volume 87

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data quickly and with high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams, among others. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing and fuzzy systems, as well as artificial intelligence, data mining, modern statistics, operations research and self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are reviewed in a single blind peer review process. Indexed by zbMATH. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/11970


Editors Manjusha Pandey School of Computer Engineering KIIT (Deemed to be University) Bhubaneswar, Odisha, India

Siddharth Swarup Rautaray School of Computer Engineering KIIT (Deemed to be University) Bhubaneswar, Odisha, India

ISSN 2197-6503 ISSN 2197-6511 (electronic) Studies in Big Data ISBN 978-981-33-6517-9 ISBN 978-981-33-6518-6 (eBook) https://doi.org/10.1007/978-981-33-6518-6 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Data science, big data analytics and machine learning are the need of the hour. They have spread into every sphere of day-to-day life, from smart homes to smart agriculture, from the automobile sector to educational aids, and from automated workplaces to circumspect industries; everywhere, automation has created a requirement for machine learning. Machine learning is defined as the capability embedded in computer systems that gives them the ability to learn and adapt without requiring explicit instructions every time. This has been made possible by algorithms and statistical models that are used to analyze and draw inferences from patterns in data. The management of the huge amount of data continuously generated by automated systems has given rise to concerns regarding data collection efficiency, data processing, analytics and security, along with the mandate for machine learning to further automate these processes. The present edited book, titled Machine Learning: Theoretical Foundations and Practical Applications, consolidates the submitted chapters and the invited chapters presented by invited speakers at the 10th Industry Symposium held during 9–12 January 2020 in conjunction with the 16th edition of ICDCIT. As a subset of artificial intelligence, machine learning aims to give computers the ability to learn independently, without being explicitly programmed, and to take intelligent decisions without human intervention. Research is proceeding towards enabling machines to grow and improve with experience, referred to as learning by machines, which makes them more intelligent.

Machine learning offers numerous advantages: it is useful for large-scale data processing; large-scale deployments improve the speed and accuracy of processing; it captures nonlinearity in the data and generates functions mapping input to output, as in supervised learning, providing recommendations for solving classification and regression problems; and it enables better customer profiling and understanding of customer needs, among many others. The proposed title aimed to cover topics including, but not limited to, machine learning and its applications, statistical learning, neural network learning, knowledge acquisition and learning, knowledge-intensive learning, machine learning and information retrieval, machine learning for web navigation and mining, learning through mobile data mining, text and multimedia mining through machine learning, distributed and parallel learning algorithms and applications, feature extraction and classification, theories and models for plausible reasoning, computational learning theory, cognitive modelling, and hybrid learning algorithms. This edited book targets technical institutes, analytical industries and analytical research institutes as its primary audience, and we hope it will be helpful for future researchers in the field of machine learning.

Bhubaneswar, India

Dr. Manjusha Pandey Dr. Siddharth Swarup Rautaray

Contents

What Do RDMs Capture in Brain Responses and Computational Models?
Krutika Injamuri, Sai Somanath Komanduri, Chakravarthy Bhagvati, and Raju Surampudi Bapi

Challenges and Solutions in Developing Convolutional Neural Networks and Long Short-Term Memory Networks for Industry Problems
Arunkumar Balakrishnan

Speed, Cloth and Pose Invariant Gait Recognition-Based Person Identification
Vijay Bhaskar Semwal, Arghya Mazumdar, Ashish Jha, Neha Gaud, and Vishwanath Bijalwan

Application of Machine Learning in Industry 4.0
Mahendra Kumar Gourisaria, Rakshit Agrawal, Harshvardhan GM, Manjusha Pandey, and Siddharth Swarup Rautaray

Web Semantics and Knowledge Graph
Bharat Sharma

Machine Learning-Based Wireless Sensor Networks
Lipika Mohanty, Junali Jasmine Jena, Manjusha Pandey, Siddharth Swarup Rautaray, and Sushovan Jena

AI to Machine Learning: Lifeless Automation and Issues
Subhashree Darshana, Siddharth Swarup Rautaray, and Manjusha Pandey

Analysis of FDIs in Different Sectors of the Indian Economy
Parikshit Barua, Sandeepan Mahapatra, Siddharth Swarup Rautaray, and Manjusha Pandey

Customer Profiling and Retention Using Recommendation System and Factor Identification to Predict Customer Churn in Telecom Industry
Nooria Karimi, Adyasha Dash, Siddharth Swarup Rautaray, and Manjusha Pandey

Editors and Contributors

About the Editors

Dr. Manjusha Pandey is presently working as Associate Professor at the School of Computer Engineering, Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, Odisha, India. She has more than 9 years of teaching and research experience. She received her doctoral degree from the Indian Institute of Information Technology, Allahabad, U.P., India. Her research interests include big data analytics, computer networks, intelligent systems, machine learning and similar innovative areas. Her research contributions include four co-edited proceedings/books, including in the SIS Springer series, and more than 65 research publications in reputed conferences, book chapters and journals indexed in Scopus/SCI/ESCI, with a citation index of 600 to date. As an organizing chair, she has organized two international conferences and has been part of different core committees of other conferences and workshops. She has delivered invited talks in different workshops and conferences.

Dr. Siddharth Swarup Rautaray is presently working as Associate Professor at the School of Computer Engineering, Kalinga Institute of Industrial Technology, Deemed to be University, Bhubaneswar, Odisha, India. He has more than 9 years of teaching and research experience. He received his doctoral degree from the Indian Institute of Information Technology, Allahabad, U.P., India. His research interests include big data analytics, image processing, intelligent systems, human–computer interaction and similar innovative areas. His research contributions include five co-edited proceedings/books, including in the ASIC Springer series, and more than 60 research publications in reputed conferences, book chapters and journals indexed in Scopus/SCI/ESCI, with a citation index of 1800 to date. As an organizing chair, he has organized five international conferences (ICCAN 2017, ICCAN 2019, 16th ICDCIT 2020, FICTA 2016, FICTA 2017) and has been part of different core committees of other conferences and workshops. He has delivered invited talks in different workshops and conferences.


Contributors

Rakshit Agrawal: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India
Arunkumar Balakrishnan: ikval Softwares LLP, Coimbatore, India
Raju Surampudi Bapi: School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India; Cognitive Science Lab, International Institute of Information Technology Hyderabad, Hyderabad, India
Parikshit Barua: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, India
Chakravarthy Bhagvati: School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
Vishwanath Bijalwan: Institute of Technology Gopeshwar, Gopeshwar, India
Subhashree Darshana: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India
Adyasha Dash: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, India
Neha Gaud: SCSIT DAVV Indore, Indore, India
Harshvardhan GM: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India
Mahendra Kumar Gourisaria: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India
Krutika Injamuri: School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
Junali Jasmine Jena: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India
Sushovan Jena: Wipro Limited, WIPRO BHDC, Bhubaneswar, Odisha, India
Ashish Jha: NIT Rourkela, Rourkela, Odisha, India
Nooria Karimi: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, India
Sai Somanath Komanduri: School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India
Sandeepan Mahapatra: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, India
Arghya Mazumdar: NIT Rourkela, Rourkela, Odisha, India
Lipika Mohanty: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India
Manjusha Pandey: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India
Siddharth Swarup Rautaray: School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India
Vijay Bhaskar Semwal: MANIT Bhopal, Bhopal, India
Bharat Sharma: ZS Associates, Pune, India

What Do RDMs Capture in Brain Responses and Computational Models?

Krutika Injamuri, Sai Somanath Komanduri, Chakravarthy Bhagvati, and Raju Surampudi Bapi

Abstract The question of how the human brain represents information, and whether insights from this knowledge can help us formulate better deep learning algorithms, is still open. This paper attempts to address these questions by investigating brain responses recorded while participants viewed images of objects belonging to a variety of categories. We adopt the representational similarity analysis (RSA) framework, wherein the brain responses are projected into a representational dissimilarity (RD) space so that responses from multiple subjects can be assessed on a common basis. We performed RSA of the brain responses of 15 participants on a set of 92 images from the Algonauts challenge data set (Cichy et al., The Algonauts project: a platform for communication between the sciences of biological and artificial intelligence, 2019 [4]). The results reveal that human brain responses exhibit the qualitative similarities observed in the image space. We then evaluated various deep networks such as VGGnet, AlexNet, and Inception in a Siamese configuration with an RD-loss function and identified hidden layers that had maximal similarity to human RD matrices (RDMs). The best layers were then subjected to further analysis by looking at the image pairs corresponding to high and low similarity values in both the early and late visual areas. Our results indicate that while deep neural network (DNN) responses mimicking early visual areas seem to reflect both feature- and category-based similarities, those mimicking the late visual areas are mostly categorical and contextual. It seems that while deep networks are capable of mimicking the similarity distributions observed in human brain responses, it is still not clear what the networks are actually learning while trying to minimize the representational dissimilarity loss function. We suggest the use of alternative dissimilarity measures to possibly redress this problem. We also discuss the possibility of adopting a dissimilarity metric for training future deep networks.

Keywords fMRI · MEG · RSA · RDM · brain decoding · information representation in the brain

K. Injamuri and S. S. Komanduri: Equal Contribution.
K. Injamuri · S. S. Komanduri · C. Bhagvati · R. S. Bapi (B), School of Computer and Information Sciences, University of Hyderabad, Hyderabad, India, e-mail: [email protected]
R. S. Bapi, Cognitive Science Lab, International Institute of Information Technology Hyderabad, Hyderabad, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. M. Pandey and S. S. Rautaray (eds.), Machine Learning: Theoretical Foundations and Practical Applications, Studies in Big Data 87, https://doi.org/10.1007/978-981-33-6518-6_1

1 Introduction

The central focus of this paper is understanding the representation that the brain uses for storing information, in particular visual information. One way to analyse this is to use non-invasive methods such as functional magnetic resonance imaging (fMRI) or magnetoencephalography (MEG) to record brain responses while human subjects are either actively or passively processing visual information. The experiments described in this paper try to answer the following questions: what does the human mind store for visual processing, and can a deep neural network mimic the working of the human visual cortex?

Traditional methods such as multi-voxel pattern analysis (MVPA) have been used to identify a set of voxels that are selectively active during the processing of visual information, and these patterns have been compared with those learned by suitably configured deep neural networks [18]. A more recent approach is to perform a representational similarity analysis (RSA) [8]. RSA is performed by calculating the similarity of responses to pairs of images, thus projecting the original data into a representational similarity space. This similarity space can then be used to compare different participants’ responses or to compare the responses of a human participant with those of a deep neural network. In addition, RSA allows us to go beyond issues such as signal noise and individual variability in responses that are often encountered in fMRI experiments. We use the fMRI recordings for the 92-image set collected by Cichy et al. [3] in all our experiments.

In this study, we discuss the usefulness of RSA for addressing the questions posed above. We also present the results and insights obtained from our computational experiments on neural nets to see how well they mimic the human responses, and we conclude by offering suggestions regarding the usage of RSA.

2 Methods

2.1 Representational Similarity Analysis (RSA)

RSA allows us to compare representations obtained from different modalities (e.g. deep neural networks and fMRI patterns) by characterizing the dissimilarity between the response patterns on a set of stimuli. We use a representational dissimilarity matrix (RDM) to relate different modalities. An RDM is a square symmetric matrix in which the diagonal entries are the comparisons between identical stimuli and are 0 by definition (see Fig. 1). Each off-diagonal value indicates the dissimilarity between the response patterns associated with two different stimuli.

Fig. 1 For each pair of experimental conditions, the associated activity patterns (in a brain region or model) are compared. The dissimilarity between them is measured as one minus the correlation. Dissimilarity is 0 for perfect correlation, 1 for no correlation, 2 for perfect anti-correlation. The dissimilarities for all pairs of conditions are assembled in the RDM. Each cell of the RDM, thus, compares the response patterns of two images
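The dissimilarity measure described above (one minus the Pearson correlation) can be illustrated with a minimal sketch. The function name `correlation_distance` and the toy activity patterns are assumptions made for illustration; they are not taken from the chapter.

```python
import numpy as np

def correlation_distance(a, b):
    """Return 1 - Pearson correlation between two activity patterns."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # centre each pattern
    b_c = b - b.mean()
    r = (a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c))
    return 1.0 - r

# Hypothetical activity pattern (e.g. five voxel responses to one image)
pattern = np.array([0.2, 1.5, -0.7, 3.1, 0.9])

# Scaled and shifted copy: perfectly correlated, so the dissimilarity is 0
d_same = correlation_distance(pattern, 2.0 * pattern + 1.0)
# Negated copy: perfectly anti-correlated, so the dissimilarity is 2
d_anti = correlation_distance(pattern, -pattern)
# Two uncorrelated patterns: the dissimilarity is 1
d_orth = correlation_distance([1, -1, 1, -1], [1, 1, -1, -1])
```

Because the measure ignores overall scale and offset, the scaled and shifted copy is treated as identical to the original pattern.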

2.2 RDM Construction for Human Brain Responses and Deep Neural Nets (DNNs)

For a given brain region, we consider the voxel intensities and represent them as an activity vector. By comparing these activity vectors, we obtain the representational dissimilarity matrix (RDM) for human brain responses. Dissimilarity measures based on Pearson correlation, cosine similarity, or Gaussian kernels can be used. In the current paper, we have investigated Pearson correlation-based RDMs. The Pearson correlation distance (1 − correlation) normalizes for both the overall activation and the variability of activity across space and is thus the preferred choice. It detects distributed representations without sensitivity to the global activity level.

For the deep neural networks (DNNs), constructing the RDM for a single layer involves passing all 92 images through that layer and storing the activation for each image as the representative vector for that stimulus. We then calculate the dissimilarity between pairs of activation vectors and populate the RDM for that layer. This process is repeated for all the layers of a DNN.
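The per-layer procedure can be sketched as follows. This is only an illustration under stated assumptions: a two-layer random toy network with tanh activations stands in for a real DNN, random vectors stand in for the 92 images, and `rdm_from_activations` is a hypothetical helper name.

```python
import numpy as np

def rdm_from_activations(acts):
    """acts: (n_stimuli, n_features) -> (n_stimuli, n_stimuli) RDM of 1 - Pearson r."""
    rdm = 1.0 - np.corrcoef(acts)  # rows are treated as the variables to correlate
    np.fill_diagonal(rdm, 0.0)     # identical stimuli: 0 by definition
    return rdm

rng = np.random.default_rng(0)
n_images, n_pixels = 92, 64
images = rng.normal(size=(n_images, n_pixels))  # stand-in for the 92 stimuli

# Toy stand-in "network": two random linear layers followed by tanh
weights = [rng.normal(size=(n_pixels, 32)), rng.normal(size=(32, 16))]

layer_rdms = []
x = images
for w in weights:
    x = np.tanh(x @ w)                           # activations of this layer for all images
    layer_rdms.append(rdm_from_activations(x))   # one 92 x 92 RDM per layer
```

With a real network, the activations of each layer would typically be flattened to one vector per image before the same helper is applied.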

2.3 Description of Algonauts Challenge Datasets

Magnetic resonance imaging (MRI) scans were obtained on a 3 Tesla Trio scanner (Siemens, Erlangen, Germany) with a 32-channel head coil. The acquisition parameters of the T1-structural and EPI-functional MRI are described in Cichy et al. [3], and the data are made available as the Algonauts Challenge dataset. The experiment consisted of presenting 92 visual images of objects for passive viewing to each of the 15 participants. One hundred and ninety-two functional volumes were acquired for each trial (presentation of one visual image) using a partial brain coverage protocol covering only the ventral visual areas, namely the occipital and temporal cortices. In addition to the fMRI dataset, the challenge also includes data acquired with magnetoencephalography (MEG) using a similar experimental protocol (please see [3] for more details).

The Algonauts challenge [3] concentrates on two regions of interest (ROIs) in the ventral visual stream: the occipital cortex, denoted here as the early visual cortex (EVC), and the late visual activation in the inferior temporal cortex (IT). The neurons in EVC are said to respond to simple features such as edges, and the neurons in IT are said to be responsible for categorical perception of objects [5]. Visual information is processed in the human brain in a hierarchical fashion, starting from feature extraction in the EVC and continuing on to object recognition in the IT [5]. The aim of the Algonauts challenge is to construct suitable networks that can mimic the activation in the early and late visual areas from both fMRI and MEG modalities [8]. To accomplish this, RDMs from the fMRI responses of EVC and IT are provided. Similarly, RDMs from responses obtained from MEG corresponding to visual processing in early and late time intervals are also considered in this study.

The training dataset consists of RDMs that were constructed for the said signal space in response to two sets of images, one consisting of 92 images, mostly featuring silhouette objects, and the other consisting of 118 images featuring objects with a natural background [3]. RDMs were also constructed for another set of 78 images held back as the test dataset [8].


2.4 Experiments with DNNs

In line with our goal of establishing a deep neural network (DNN) that can mimic the human responses, we fine-tune DNNs with human RDMs. We consider four different DNN architectures for our trials: AlexNet [9], two variants of VGG [16], and Inception [17]. We obtain implementations of each of these architectures from the PyTorch model zoo [12], trained on the ILSVRC12 dataset [14]. Much of our experimental method was inspired by [1].

The Algonauts challenge provides two image sets, as mentioned earlier, one with 92 images and the other with 118 images, together with the RDMs associated with each of these image sets. All our experiments use only the 92-image set. The 92-image set has two RDMs, one corresponding to the EVC (early visual cortex) region and the other corresponding to the IT (inferior temporal) region. Because the image set contains 92 images, the RDM of a single participant for a particular region of interest (EVC or IT) is of shape 92 × 92. We therefore have a stack of RDMs of shape 15 × 92 × 92 for each of the EVC and IT regions. We normalize each of the RDMs using the z-score [function of sklearn toolkit], after which the RDMs are averaged across the first dimension. As a result, we have a subject-averaged RDM of shape 92 × 92 for each of the EVC and IT regions.

To mimic the human responses using a DNN, we take each of the pre-trained architectures and arrange it in a Siamese architecture [2]. A Siamese network consists of two identical sister networks that share the same weights and join at their outputs. The outputs of the two networks are fed to a loss function, which calculates the distance between them. Networks of this type are useful for learning similarities between inputs. In our case, we want the distance between the outputs for two images to be equal to the corresponding RDM value of the image pair in the empirical data.
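The normalization and subject-averaging step described above can be sketched with NumPy. The chapter does not state over which entries the z-score is computed; this sketch assumes it is taken over all entries of each participant's RDM. The function name and the random input stack are hypothetical.

```python
import numpy as np

def subject_averaged_rdm(rdms):
    """rdms: (n_subjects, n, n). z-score each subject's RDM, then average over subjects.

    Assumption: the z-score uses the mean and standard deviation of all
    n * n entries of one subject's RDM.
    """
    rdms = np.asarray(rdms, dtype=float)
    mean = rdms.mean(axis=(1, 2), keepdims=True)
    std = rdms.std(axis=(1, 2), keepdims=True)
    z = (rdms - mean) / std
    return z.mean(axis=0)  # average across the first (subject) dimension

# Hypothetical stand-in for 15 participants' 92 x 92 RDMs of one region
rng = np.random.default_rng(0)
subject_rdms = rng.uniform(0.0, 2.0, size=(15, 92, 92))
avg_rdm = subject_averaged_rdm(subject_rdms)
```

After z-scoring, the entries are no longer confined to the range 0 to 2, so the averaged matrix is best read as a relative dissimilarity pattern.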
Since we have 92 images, there are a total of C(92, 2) = 4186 image pairs that constitute the training set. Our training process therefore consists of arranging one of the above-mentioned architectures, pre-trained on ImageNet, in a Siamese fashion and passing it two inputs I1 and I2. If our network is represented by a function G, the corresponding outputs are G(I1) and G(I2), for which we compute [1 − P(G(I1), G(I2))], where P is the Pearson correlation. The loss is then calculated as the squared Euclidean distance between this estimated dissimilarity and the dissimilarity observed in the empirical data for the two input images. The observed value is taken from the subject-averaged RDM described earlier. This procedure is used to fine-tune all the architectures for both the EVC and IT averaged RDMs.

Furthermore, we fine-tuned the DNNs in two configurations. In the first configuration, the weights of all the layers were allowed to be updated during a backward pass. In the second configuration, for EVC fine-tuning, we froze the weights of all the layers after the 40th one in the Inception model and of all the layers after the 4th layer in AlexNet and the VGG variants; that is, the weights of those layers are not updated. For IT fine-tuning, we froze the weights of all the layers before the 40th one in the Inception model and of all the layers before the 4th layer in AlexNet and the VGG variants.
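A minimal sketch of the pairwise loss described above follows. It only evaluates the loss and performs no weight update; the projection `G`, the random "images" and the random target matrix are hypothetical stand-ins for the fine-tuned network, the stimuli and the subject-averaged RDM.

```python
from itertools import combinations
import numpy as np

def pearson(a, b):
    """Pearson correlation P between two 1-D output vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def rd_loss(out1, out2, target):
    """Squared difference between the estimated dissimilarity
    1 - P(out1, out2) and the observed RDM value for the pair."""
    estimated = 1.0 - pearson(out1, out2)
    return (estimated - target) ** 2

n_images = 92
pairs = list(combinations(range(n_images), 2))  # every unordered image pair

# Hypothetical stand-ins for stimuli, network and target RDM
rng = np.random.default_rng(0)
images = rng.normal(size=(n_images, 64))
W = rng.normal(size=(64, 16))
target_rdm = rng.uniform(0.0, 2.0, size=(n_images, n_images))
target_rdm = (target_rdm + target_rdm.T) / 2.0  # make the target symmetric

def G(image):
    """Stand-in for one branch of the Siamese network (weights shared via W)."""
    return np.tanh(image @ W)

# Mean loss over all pairs for the current (fixed) weights
losses = [rd_loss(G(images[i]), G(images[j]), target_rdm[i, j]) for i, j in pairs]
mean_loss = float(np.mean(losses))
```

Both inputs pass through the same function `G`, which is what weight sharing between the two sister networks amounts to.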


Since none of the 92 images has a background, we applied foveation before the images were presented to the models. Foveation is an image processing technique that varies the amount of detail depending on the fixation point [13]. The fixation point of the image has the highest resolution. In all our experiments, we used the centre of the image as the only fixation point and used a Gaussian blur that increases with the radius from the fixation point. The idea behind this was to emulate the working of the human retina. To understand to what extent foveation affects the model’s correlation with the human RDMs, we presented the models with both foveated and non-foveated images.

For an easy representation of results, the models carry their configuration as a subscript. We use ‘e’ for early and ‘l’ for late layers. A subscript of ‘f’ indicates that the images were foveated before being shown to the model, and the subscript ‘n’ indicates that no foveation was applied. Subscripts ‘p’ and ‘q’ represent frozen and unfrozen configurations, respectively. Therefore, a model named VGG16_lfp is the VGG16 architecture, fine-tuned for the late RDM values (IT region), with foveation applied when the image pairs were presented and the initial layers frozen during the fine-tuning process. This training setup gives us a total of 24 models to investigate.

Each of the above models was trained for 150 epochs except for Inception, which was trained for 100 epochs. All the models were trained with a learning rate of 0.05. The Adam optimiser [7] was used for the Inception-based models, while all others used the stochastic gradient descent (SGD) optimizer. We used a batch size of 32 for all the models.
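The foveation step can be approximated as in the sketch below, which uses NumPy only. It is not the authors' implementation: it assumes a grayscale image, a small set of hypothetical blur levels (`sigmas`), and linear blending between neighbouring levels so that the blur grows with distance from the image centre.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three standard deviations."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable Gaussian blur of a 2-D image with edge padding."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    padded = np.pad(image, pad, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def foveate(image, sigmas=(1.0, 2.0, 4.0)):
    """Blend progressively blurred copies so blur grows with distance from the centre."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - (h - 1) / 2, xx - (w - 1) / 2)
    r = r / r.max()                        # 0 at the centre, 1 at the farthest corner
    levels = np.stack([image] + [blur(image, s) for s in sigmas])
    idx = r * (len(levels) - 1)            # fractional blur level per pixel
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, len(levels) - 1)
    frac = idx - lo
    low = np.take_along_axis(levels, lo[None], axis=0)[0]
    high = np.take_along_axis(levels, hi[None], axis=0)[0]
    return (1 - frac) * low + frac * high  # interpolate between neighbouring levels

rng = np.random.default_rng(0)
toy_image = rng.uniform(size=(33, 33))     # hypothetical grayscale stimulus
foveated = foveate(toy_image)
```

The centre pixel is left untouched, while the corners receive the strongest blur level; a colour image could be handled by applying the same function to each channel.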

2.5 Selecting the Best Models

RSA plays an essential role in the selection process, as it allows model outputs at every layer to be compared with human fMRI data even though the two belong to different modalities. Once the RDMs are constructed for all the layers of the 24 model configurations we trained, they are directly comparable with the human RDMs for the EVC and IT regions. We use the Spearman correlation R to measure the similarity between RDMs. The result is then squared to give R², which depicts the amount of variance explained. The noise ceiling (NC) is the expected RDM correlation achieved by the ideal model, given the noise in the data (see [8] for details). The subject-averaged human RDM is assumed to be the best estimate of the ideal model. The noise ceiling is used to normalize the R² values to noise-normalized variance explained. Thus, a model can explain from 0 to 100% of the explainable variance, which illustrates the model’s explanatory power for the brain data. The model that has the highest correlation for the EVC or IT region is then said to have the best fit to the responses from the human brain. We ran this test for all the layers of the 24 model configurations to identify the best matching layers. Since we used the correlation score as the only criterion to select the best matching layers, the best matching layers for EVC and IT may come from two different model configurations.
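The scoring step can be illustrated with a short NumPy sketch. It assumes that only the upper triangle of each symmetric RDM is compared and that the noise ceiling is supplied as an R² value on the same scale; both choices, and the function names, are assumptions made for illustration rather than details confirmed by the text.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float).ravel()
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def upper_triangle(rdm):
    """Off-diagonal upper-triangular entries of a square RDM."""
    rdm = np.asarray(rdm, dtype=float)
    i, j = np.triu_indices(rdm.shape[0], k=1)
    return rdm[i, j]


def spearman(a, b):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def noise_normalized_r2(model_rdm, human_rdm, noise_ceiling_r2):
    """Percentage of the explainable variance captured by the model RDM."""
    r = spearman(upper_triangle(model_rdm), upper_triangle(human_rdm))
    return 100.0 * r ** 2 / noise_ceiling_r2
```

Applying `noise_normalized_r2` to every layer of every configuration and keeping the maximum per ROI mirrors the selection rule described in this section.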

2.6 Qualitative Analysis of High and Low Values in RDMs

By the definition of the RD measure, a low RDM value for a cell means that the responses to the corresponding image pair are highly correlated, and a high RDM value means that the image pair is highly dissimilar. To check whether this is indeed the case, we took the image pairs corresponding to high and low RDM values from both the model RDMs and the human response RDMs and performed a qualitative analysis on them. For all the 15 participants, we analysed the top 20 similar and dissimilar image pairs selected from the human brain response RDMs. Similarly, we considered the top 20 similar and dissimilar image pairs from the model responses. We looked at all the layers of the models from each architecture that showed the highest correlation on average for the EVC and IT regions to identify any consistent behaviour. For example, for VGG19, which has 44 layers, we identified the top 20 similar and dissimilar image pairs for qualitative analysis.

To understand the extent of categorical grouping by the human brain in the EVC and IT regions and also at different layers of the DNNs, we constructed new RDMs, hereafter referred to as categorical RDMs. To construct these categorical RDMs, we considered the eight categories used by the fMRI track winner of the Algonauts challenge [10]. The original RDMs provided by Algonauts are rearranged based on the categories in the following order: scenes-with-objects, fruits–vegetables, animals, animal-face, monkey-face, face, body-parts, and back-profile-view-of-humans. An 8 × 8 categorical RDM is constructed by averaging the values of the original 92 × 92 RDM over the images that belong to the same category. This approach of constructing categorical RDMs is inspired by [6], where each cell represents the within-category or between-category distance. Figures 4, 5 and 6 visualize these RDMs. Blue colour in these figures represents a low RD value, and red colour depicts a high RD value.
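The block-averaging that turns a 92 × 92 RDM into an 8 × 8 categorical RDM can be sketched as below. The function name and the integer label encoding are hypothetical, and excluding the zero self-comparison diagonal from the within-category averages is an assumption of this sketch; the text does not state how the diagonal was treated.

```python
import numpy as np


def categorical_rdm(rdm, labels, n_categories):
    """Average an image-level RDM into a category-level RDM.

    rdm          : square array of pairwise dissimilarities between images
    labels       : integer category index (0 .. n_categories - 1) per image
    n_categories : number of categories (8 in the study)
    """
    rdm = np.asarray(rdm, dtype=float)
    labels = np.asarray(labels)
    out = np.zeros((n_categories, n_categories))
    for a in range(n_categories):
        for b in range(n_categories):
            block = rdm[np.ix_(labels == a, labels == b)]
            if a == b:
                # Assumption: leave out each image's comparison with itself.
                block = block[~np.eye(block.shape[0], dtype=bool)]
            out[a, b] = block.mean()
    return out
```

Each diagonal cell of the result is then a within-category distance and each off-diagonal cell a between-category distance. A category containing a single image would give an undefined within-category value under this convention.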

3 Results

3.1 Analysis of the Human Brain Responses

To investigate what RDMs capture, we averaged the human EVC and IT RDMs of the 92 image set across all the participants. Then, we selected the top 20 similar image pairs (pairs that have low RD values) and the top 20 dissimilar image pairs (pairs that have high RD values); a few representative examples are displayed in the following figures. Figures 2 and 3 show good and bad examples of similar and dissimilar image pairs selected from the averaged human RDMs of the EVC and IT regions of interest (ROIs), respectively.

Fig. 2 Similar and dissimilar image pairs in EVC region of averaged human brain responses and their RD values. a Image pairs that have low RD values and are similar. b Image pairs that have high RD values and are dissimilar. c Image pairs that have low RD values but are dissimilar. d Image pairs that have high RD values but are similar

3.2 Deep Networks Mimicking Human Brain Responses

Table 1 presents the results of the best performing models in EVC and IT in each of the three experiments. The first experiment is the evaluation of pre-trained models (AlexNet, VGG16, VGG19, and Inception) against the ground-truth human RDMs provided by the Algonauts challenge. In Table 1, these results have the value “No” in the fine-tuned column. The second experiment is the evaluation of fine-tuned models against the ground truth, reporting the best results for the EVC and IT ROIs on the 92 image set. In Table 1, these results have the value “Yes” in the fine-tuned column and “92” in the image set column. The third experiment is the evaluation of these fine-tuned models on the 78 image challenge test set [11]. For this, the value in the fine-tuned column is “Yes” and the image set is “78”.

Fig. 3 Similar and dissimilar image pairs in IT region of averaged human brain responses and their RD values. a Image pairs that have low RD values and are similar. b Image pairs that have high RD values and are dissimilar. c Image pairs that have low RD values but are dissimilar. d Image pairs that have high RD values but are similar

Table 1 fMRI track results of ImageNet pre-trained models that performed the best when presented with 92 image set

Model          Fine-tuned   Image set   ROI   Layer name             NC %
Alexnet        No           92          EVC   6: MaxPool2d           30.82
VGG19          No           92          IT    31: Conv2d             13.52
VGG19efq       Yes          92          EVC   44: Linear             42.64
VGG19lnp       Yes          92          IT    40: Dropout            58.50
VGG16efp       Yes          78          EVC   29: Conv2d             20.97
Inceptionlnq   Yes          78          IT    6: Conv2d_4a_3 × 3     21.30

3.2.1 What Did the Fine-Tuned Models Learn?

To see what the fine-tuned models learned across the layers, we compare and contrast the RDMs at different layers of the ImageNet pre-trained VGG19 (Fig. 4) and the fine-tuned VGG19lnp model (Fig. 5), which performed the best in the IT region for the 92 image set. To do this, we consider the categorical RDMs (see Sect. 2.6 for more details).

Fig. 4 Layer-wise progress of categorical RDMs of pre-trained VGG19

Fig. 5 Layer-wise progress of categorical RDMs of fine-tuned VGG19lnp

3.2.2 Similar and Dissimilar Image Pairs

We investigated the model RDMs that are most similar to the human RDMs. For this purpose, we considered the 44th layer of VGG19efq, which correlates highly (42.64% NC) with the EVC ROI of the averaged human RDM, and the 40th layer of VGG19lnp, which correlates highly (58.50% NC) with the IT ROI of the averaged human RDM. Figures 7-I and 7-II depict good and bad image pairs of the best matching models for mimicking EVC and IT activation. Figure 6 makes a head-to-head comparison of the categorical RDMs of human EVC and IT and the corresponding highly correlated layers of the fine-tuned models.

Fig. 6 Comparing the EVC and IT RDMs of human brain response with the highly correlated early and late layers of the fine-tuned models. a Categorical RDMs of the EVC and IT regions, computed from the RDMs averaged across all subjects. b Categorical RDMs of the 44th layer of VGG19efq and the 40th layer of VGG19lnp, which correlate highly with the human EVC and IT RDMs

4 Discussion

4.1 Qualitative Analysis of Human Responses and DNN Responses

When we collected the image pairs that have low RDM values in the EVC region, we found examples of not only similar images but also dissimilar ones. In Fig. 2a, human faces and monkey faces are found to be similar, and several such appropriate pairs are shown in this figure. In Fig. 2c, we see dissimilar pairs that nevertheless have low RD values. The image pair of a dog and a chair seems to have shape similarities, while other pairs, such as a chef and a tomato or a snake and a tree, seem to indicate context-dependent grouping. We also identified pairs that are similar and yet have high RD values. In Fig. 2d, we notice examples from different categories such as animals, body parts, and fruits and vegetables, in which the two images of a pair seem to match in visual category although the RDM scores indicate otherwise. In contrast, the pairs in Fig. 2b, which have high RD values and are dissimilar, seem appropriate. This last result, where two distinctly different objects end up with very high RD values, is a consistent behaviour observed in all the 15 participants.

Fig. 7 (I) Similar and dissimilar image pairs for the VGG19efq model, which is highly correlated with the EVC region of the average human RDM. a Image pairs that have low RD values and are similar. b Image pairs that have high RD values and are dissimilar. c Image pairs that have low RD values but are dissimilar. (II) Similar and dissimilar image pairs for the VGG19lnp model, which is highly correlated with the IT region of the average RDM from human responses. a Image pairs that have low RD values and are similar. b Image pairs that have high RD values and are dissimilar. c Image pairs that have low RD values but are dissimilar

We observed the same trend when we analysed the image pairs for the IT region. Figure 3a shows that objects belonging to the same category are paired together. Figure 3b shows objects that are not related to each other and, in line with this, their RD values are high. Figure 3c shows image pairs that are visually dissimilar yet have low RD values. For example, a vegetable (aubergine/brinjal) and a building are matched with highly similar brain responses in the IT region. Figure 3d shows images that belong to the same apparent visual category yet have inappropriately high RD values. The overall observation from the qualitative analysis of RDMs from human responses is that the early responses seem to match based on feature and category similarities, and those in the IT region seem to indicate categorical similarity. When DNNs are configured to learn these similarity profiles using the RD-loss function, the networks seem to learn progressively more categorical representations, as observed in the categorical RDMs of the two network configurations in Figs. 4 and 5. This is expected from deep architectures that learn very well to categorize objects.

We did a similar analysis for DNNs. We qualitatively compared the most similar and dissimilar image pairs of the layers that best matched the human EVC and IT RDMs. We tried to infer the reasons behind their pairings and to understand what the networks seem to use for making these decisions. Figure 7-I a shows image pairs that are similar and have low RD values. We observe that human faces seem to be grouped quite often as similar. We also observe that shape and colour are used as deciding factors. Examples such as a monkey face and a lion face paired together support this claim. Figure 7-I b shows objects that are dissimilar and have high RD values. This behaviour, where dissimilar objects have high RD values, seems consistent with humans. Figure 7-I c shows some inconsistent examples. When we compared the image pairs of the layer that best matched the IT RDM (Fig. 7-II), we observed similar behaviour, and the same trends repeat. In both the best matching layers, we could not find many examples that have high RD values but are actually similar. Thus, the qualitative analysis of DNN responses indicates that the layers that match early visual responses (EVC in Fig. 7-I) seem to match both feature-wise and category-wise similarities, whereas the responses in the layers that match late visual responses (IT in Fig. 7-II) seem to be more categorical and highly contextual in their representation.

Categorical RDMs of Humans and DNNs: From the categorical RDMs across human and DNN responses depicted in Fig. 6, we observe that the categorical RDM of the 40th layer of VGG19lnp is quite close to the categorical RDM of human IT. We also observe the hierarchical grouping that the model learns across the animal category, as indicated by a blue square (compare the top and bottom-left panels in Fig. 6). It needs to be mentioned that the models seem to distinguish between animate and inanimate categories (the distinct blue squares on the diagonal in both the bottom-left and bottom-right panels of Fig. 6). The categorical RDMs of DNNs (lower panels of Fig. 6) are not as sharp as those of the human response RDMs (especially the EVC categorical RDM in the top-left panel of Fig. 6). This might be because of the similarity across low-level features, such as shape and colour, that are present across different images belonging to different categories.

4.2 The Way Forward

Overall, the representational dissimilarity (RD) measure seems to bring out similarities and dissimilarities in the human responses reasonably well. However, the results of the deep neural networks (DNNs) mimicking the human responses using the RD-loss function seem to be a mixed bag. The RDMs computed from DNNs and aggregated over eight categories, such as animals, faces and humans, seem to be noisy when compared to the RDMs from human responses. One possible future direction is to try other correlation measures such as Spearman or kernel similarity metrics, as proposed in Shahbazi, Raizada, and Edelman [15]. It needs to be pointed out that human information representation seems to be highly distributed and very contextual, in the sense that an item is stored along with its environmental and internal multi-modal context. It is not clear if the RD measure, being linear, can capture such contextual representation. Whether the off-diagonal elements in RDMs point to such contextual information needs to be investigated in future. One limitation of the current study is that responses from only two ROIs (EVC and IT) have been considered, and the distributed information in other ROIs is not considered. The final, tantalizing proposal is to consider using dissimilarity as a process for constructing deep networks for classification.

Conflict of interest The author(s) declare that they have no conflict of interest.

References

1. Agrawal, A. (2019). Dissimilarity learning via siamese network predicts brain imaging data. arXiv:1907.02591 [q-bio.NC].
2. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1994). Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems (pp. 737–744).
3. Cichy, R. M., Pantazis, D., & Oliva, A. (2016). Similarity-based fusion of MEG and fMRI reveals spatio-temporal dynamics in human cortex during visual object recognition. Cerebral Cortex, 26(8), 3563–3579.
4. Cichy, R. M., Roig, G., Andonian, A., Dwivedi, K., Lahner, B., Lascelles, A., et al. (2019). The Algonauts project: A platform for communication between the sciences of biological and artificial intelligence. CoRR. http://arxiv.org/abs/1905.05675.
5. Goodale, M. A., & Milner, A. D. (1992). Separate visual pathways for perception and action. Trends in Neurosciences, 15(1), 20–25.
6. King, M. L., Groen, I. I., Steel, A., Kravitz, D. J., & Baker, C. I. (2019). Similarity judgments and cortical visual responses reflect different properties of object and scene categories in naturalistic images. NeuroImage, 197, 368–382.
7. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
8. Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis: Connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4.
9. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097–1105).
10. Lage-Castellanos, A., & De Martino, F. (2019). Predicting stimulus representations in the visual cortex using computational principles. bioRxiv, p. 687731.
11. Mohsenzadeh, Y., Mullin, C., Lahner, B., Cichy, R. M., & Oliva, A. (2019). Reliability and generalizability of similarity-based fusion of MEG and fMRI data in human ventral and dorsal visual streams. Vision, 3(1), 8.
12. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 32, pp. 8024–8035). Curran Associates, Inc. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
13. Perry, J. S., & Geisler, W. S. (2002). Gaze-contingent real-time simulation of arbitrary visual fields. In Human Vision and Electronic Imaging VII (Vol. 4662, pp. 57–69). International Society for Optics and Photonics.
14. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
15. Shahbazi, R., Raizada, R., & Edelman, S. (2016). Similarity, kernels, and the fundamental constraints on cognition. Journal of Mathematical Psychology, 70, 21–34.
16. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
17. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).
18. Wang, X., Liang, X., Jiang, Z., Nguchu, B. A., Zhou, Y., & Wang, Y. (2020). Decoding and mapping task states of the human brain via deep learning. arXiv. https://arxiv.org/abs/1801.09858.

Challenges and Solutions in Developing Convolutional Neural Networks and Long Short-Term Memory Networks for Industry Problems

Arunkumar Balakrishnan

Abstract Developing a deep learning application requires unique approaches and methods. These are discussed through two industrial applications. The architectures used are two classic ones: the convolutional neural network (CNN) and long short-term memory (LSTM). The CNN application, which identifies hand-written choice options in a scanned image, faced the “pose” (oriental transformation) problem and the problem of “region segmentation.” The LSTM application, which predicts equated monthly instalment payments by all customers, had to overcome the challenges of “multi-variate” and “multiple time series” data. The evolution of a satisfactory solution for the customer in both these cases is described in this chapter. The importance of data analysis, data pre-processing and proper choice of hyper-parameters is seen through these experiments.

Keywords Convolutional neural network (CNN) · Long short-term memory (LSTM) · Oriental transformation · Region segmentation

A. Balakrishnan, ikval Softwares LLP, Coimbatore, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
M. Pandey and S. S. Rautaray (eds.), Machine Learning: Theoretical Foundations and Practical Applications, Studies in Big Data 87, https://doi.org/10.1007/978-981-33-6518-6_2

1 Introduction

Artificial neural networks try to emulate, through a software system, the behavior of the central nervous system of a human being. The human brain consists of tens of billions of highly interconnected neurons. Signals are passed from one neuron to many other neurons. Based on the total signal strength received by a neuron and its synapse level (internal threshold setting), the neuron generates an output signal. This simple architecture, replicated across billions of neurons, forms the basis for human intelligence. In a similar manner, a neuron in an artificial neural network takes input from a set of other neurons, multiplies each input by a weight representing the importance given to that input source and then sums the products. It then compares this sum with its threshold value. If the sum is greater than the threshold value, an output is sent to all artificial neurons connected to it. This activity can repeat across multiple layers of neurons. If the final output matches the expected output, the network is left unchanged. Otherwise, if there is an error between the expected output and the received output, a learning phase is enabled.

In supervised learning (where the correct output is made available), the error between the expected output and the received output is differentiated with respect to the weight of each input connection, and this derivative is used to change the weight of each connection between neurons. This change in weight values (the new weight equals the old weight minus a small step size, the learning rate, multiplied by the derivative of the error with respect to that weight) is made starting from the output layer and proceeding toward the input layer. This process of learning new weights is termed back-propagation. The process of deriving a new output, comparing it with the expected output and back-propagating the error is repeated until the error is within acceptable limits (Figs. 1 and 2).

Artificial neural networks demonstrated a good capability to recognize patterns that were too complex to be defined symbolically as a piece of knowledge. For example, there was initial success in recognizing hand-written digits. But researchers in artificial neural networks were not able to build solutions when confronted with problems that required more differentiation between attributes of the problem. Put in other words, artificial neural networks were limited to only a few layers of artificial neural nodes. The reason for this, especially in supervised learning, lies with the back-propagation process. As the number of layers increases, the derivative of the error with respect to the weights of the early layers becomes very small (the vanishing gradient problem). As such, the change in weight value is too small to bring about a significant change in performance over the next iteration(s). This problem has received more attention in the current versions of artificial neural networks termed “deep learning.” In deep learning, more layers of artificial neurons are made possible, which gives rise to the name “deep.”

Fig. 1 Artificial neural network [3]

Fig. 2 Back-propagation [3]

Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton designed a convolutional neural network called “AlexNet” that eased the vanishing gradient problem chiefly by using the non-saturating rectified linear unit (ReLU) activation, whose derivative does not shrink toward zero for positive inputs. This made it practical to train a network with many layers of nodes. Their approach won the ImageNet (ILSVRC) challenge of 2012 [6].

The attributes of a deep learning neural network are:
• Types of computational layers (of neural nodes): Convolutional, Auto-Encoder, Recurrent, LSTM, Belief, Adversarial, Dense
• Functions:
  – Activation: ReLU, sigmoid, tanh
  – Loss: Mean square error (L2 loss), Hinge loss, Cross-entropy
  – Optimization: Adam
• Regularization (L1, L2) and dropout layers

Deep learning is advertised as doing away with “feature engineering” (other machine learning and artificial intelligence approaches require significant feature engineering). The claim is that there is no need for a designer to analyze the domain and identify the various classes of objects, the attributes of each class and the dependencies between the attributes as well as between different objects. In reality, the designer of a deep learning application still has to analyze the domain in detail. The activities in the two deep learning application developments described in this chapter show what feature analysis is still required in the course of developing deep learning applications.
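The threshold neuron, the weight update and the shrinking gradient discussed in this introduction can be illustrated with a small NumPy sketch. All names and numbers here are hypothetical; the chain of single sigmoid units is a deliberately simplified stand-in for a deep network, chosen only to show how the derivative shrinks as layers are added.

```python
import numpy as np


def threshold_neuron(inputs, weights, threshold):
    """Output 1.0 when the weighted sum of the inputs exceeds the threshold."""
    return 1.0 if float(np.dot(inputs, weights)) > threshold else 0.0


def gradient_step(weight, grad, learning_rate=0.1):
    """Gradient-descent update: move the weight against the error gradient."""
    return weight - learning_rate * grad


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def chain_gradient(n_layers, w=1.0, x=0.5):
    """Derivative of the output of a chain of sigmoid units w.r.t. its input.

    Each layer contributes a factor w * s * (1 - s), and s * (1 - s) is at
    most 0.25, so with w = 1 the product shrinks rapidly with depth.
    """
    grad, a = 1.0, x
    for _ in range(n_layers):
        a = sigmoid(w * a)
        grad *= w * a * (1.0 - a)
    return grad
```

Comparing `chain_gradient(2)` with `chain_gradient(10)` shows the effect: the deeper chain passes back a far smaller derivative, so the weights of its early layers would barely change in one update.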

Fig. 3 Convolutional neural network [3]

The capabilities offered by deep learning technologies have provided significantly better results in the fields of image classification/computer vision, natural language processing, speech recognition and predictive analytics. Real-time face recognition and handwriting recognition by software have been made practically possible by the use of convolutional neural networks (Fig. 3). A convolutional neural network (CNN) contains

1. convolution layers, where the (mathematical) function defined in that layer works on the output of the previous layer,
2. max-pooling layers, where only the maximum value within each pooling window of the previous layer's output is allowed to pass through,
3. dropout layers, where some of the connections in the layer are probabilistically switched off, and
4. fully connected layers, where all the outputs are considered.

Dropout layers are used to avoid overfitting, while fully connected layers allow classification among outputs. Max-pooling layers reduce the number of intermediate outputs being manipulated, thus reducing the compute requirement. Surprisingly, it is also found that without max-pooling layers, the CNN does not recognize images with the same capacity as with max-pooling layers. This is one of the behaviors that needs to be studied more, because a max-pooling layer actually reduces the detail of the data being processed. Every convolutional layer has a multitude of inputs taken in matrix form. These are processed by a filter of a particular size that is moved across the complete set of inputs. At each position, the filter focuses on a set of data points. The filter is moved across the layer of nodes with a defined stride. Each output from a particular application of the filter forms an input to the next layer.
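The filter, stride and max-pooling operations described above can be sketched as follows. This is an illustrative, loop-based NumPy version for a single-channel input with no padding; the function names are hypothetical, and, as is usual in CNN libraries, the "convolution" is computed without flipping the filter.

```python
import numpy as np


def conv2d(image, kernel, stride=1):
    """Slide a filter over the input with the given stride (no padding)."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + kh, left:left + kw]
            out[i, j] = np.sum(patch * kernel)
    return out


def max_pool(feature_map, size=2):
    """Keep only the maximum of each non-overlapping size x size window."""
    feature_map = np.asarray(feature_map, dtype=float)
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fit
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))
```

For a 4 × 4 input, a 2 × 2 filter with stride 2 yields a 2 × 2 output, and a 2 × 2 max-pool halves each spatial dimension again, which is the reduction in intermediate outputs mentioned above.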

AlexNet was constructed with five convolutional layers, max-pooling layers, dropout layers and three fully connected layers. This architecture was sufficient to classify one thousand categories. AlexNet used the Rectified Linear Unit (ReLU) as the activation function and stochastic gradient descent as the optimizer, together with dropout layers and the data augmentation techniques of image translations, horizontal reflections and patch extractions [6]. In the case of AlexNet and most other successful CNNs, the filter size is kept larger in the initial layers and becomes smaller in the layers closer to the output. This approach goes against the assumed process of identification, where smaller components are identified first and used to build larger structures, finally resulting in an independent image identification. Additionally, the use of larger filters in the initial layers hides that data/information from the subsequent layers. ZFNet tried to implement this idea of constructing images from smaller sub-components by having the smallest filters in the first layer and increasing the filter size in subsequent layers. ZFNet was able to achieve identification results in fewer iterations than AlexNet [3].

1.1 Inception

InceptionNet [7] from Google set to rest the discussion on smaller versus larger filters with a new architecture in which the inputs are provided to all the different layers, irrespective of their position. Similarly, the outputs from all the layers are taken to a collator. An important point is that with this approach no data/information is lost to any level of analysis (all layers are open to all information). This is a non-sequential model and is called a graphical model. Filters of different sizes process the input and the intermediate outputs. All the outputs are collected together before a final output layer. Many optimizations have been performed on inception architectures to reduce computational complexity, such as adding 1 × 1 convolutions and adding residual connections that add the output of the convolution operation of the inception module to its input. Inception approaches achieve a higher accuracy at an earlier training iteration (Fig. 4).

Fig. 4 InceptionNet [1]
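The idea of letting filters of different sizes see the same input and collecting their outputs can be sketched as below. This is a toy illustration only: it assumes a single-channel input, fixed odd-sized filters rather than learned ones, and zero padding so that all branches keep the input size. It omits the 1 × 1 reductions and pooling branch of real inception modules, and the function names are hypothetical.

```python
import numpy as np


def conv2d_same(image, kernel):
    """Filter the image with zero padding so the output keeps the input size."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def inception_block(image, kernels):
    """Apply every filter to the same input and stack the results as channels."""
    image = np.asarray(image, dtype=float)
    return np.stack([conv2d_same(image, np.asarray(k, dtype=float)) for k in kernels])


# Three parallel branches: 1 x 1, 3 x 3 and 5 x 5 averaging filters.
example_kernels = [np.ones((1, 1)), np.ones((3, 3)) / 9.0, np.ones((5, 5)) / 25.0]
```

Calling `inception_block(image, example_kernels)` on a 6 × 6 image returns an array of shape (3, 6, 6), one channel per filter size, which a later layer can then combine.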

1.2 Regional CNN

An approach that has had good success in identifying images along with their location is the regional CNN. While a CNN is primarily used for image classification, regional CNNs (R-CNNs) are better at object detection. In object detection, the location of the identified object is also detected; a set of bounding boxes is returned along with the object class that is identified. R-CNN uses a hierarchical approach to segregate the overall image into parts [8]. This approach recognizes that multiple scales of segmentation are required and that individual features cannot always decide the segmentation to be used. The authors also note that there are many instances where the overall object needs to be recognized before the individual objects can be grouped together [8]. Along with handling these matters, they also consider the computational complexity of collectively processing a large number of pixels in order to ensure that no potential location of the object is missed. The approach uses data-driven segmentation and provides for increased diversity by using a set of complementary grouping criteria as well as complementary color spaces with different invariance properties. R-CNNs are constructed as a combination of a proposal network and a classifier network. The progress in CNNs has led to a good adoption of this technique in real-time object classification and location identification.

1.3 Recurrent Neural Networks

Recurrent neural networks have been used to develop natural language processing models and speech recognition systems, allowing a user to converse with the software in English or other natural languages. One of the advanced types of recurrent networks is the long short-term memory network, which has been successfully used to develop applications that can predict new values in a time series of observations. CNNs, like most other artificial neural networks, take a fixed-size vector as input and produce a fixed-size vector as output. The number of computational steps (the number of layers in the artificial neural network) is also fixed. Recurrent neural networks differ by processing a sequence of vectors [5]. This sequence represents a memory. In computer science terms, this memory feature combines the input vector with a state vector through a (learned) function to produce a new state vector. A simple view of recurrent networks is that they consider input from a previous slice of time along with the current input. The input at time “t” produces an output. At time “t + 1”, the previous input is remembered and combined with the current input (and the activation function) in order to arrive at a new output [5].

Recurrent neural networks are more prone to the “vanishing” (and “exploding”) gradient problem. In addition, recurrent networks are only able to handle dependencies between data that are close together; they are not able to identify relationships between data that are significantly separated. Long short-term memory (LSTM) neural networks are a type of recurrent neural network in which these problems are handled. Memory is maintained and provided to all neuron cells. Each cell can now decide to add to, forget or remember what is in memory. LSTM uses “gates,” through which the feedback of the previous input is controlled. The feedback from a previous input is allowed to affect the current computation only when it is relevant. More specifically, a gate is composed of a sigmoid neural net layer and a pointwise multiplication operator. The sigmoid value determines whether, and to what extent, the input is let through for this neural node. Primarily, a forget gate layer decides whether to omit or keep the previous input. For example, when processing of a new instance of an entity starts, the data of the previous entity are to be forgotten. An important variant of the LSTM is the gated recurrent unit (GRU), where the forget and input gates are combined into one. This keeps a dependency between what is forgotten and what is input. This simpler version of the LSTM has a good performance and is widely used [2]. Other versions of LSTM include attention LSTMs and grid LSTMs [2] (Fig. 5).

Fig. 5 Long short-term memory networks [2]
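The state update and the gating described above can be illustrated with a scalar toy cell. This is only a minimal sketch under stated assumptions: the function names, the dictionary layout of the parameters and the hand-picked biases are hypothetical, the weights are scalars rather than matrices, and nothing here is taken from the chapter's implementation.

```python
import math


def sigmoid(x):
    """Squash a value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a scalar recurrent unit: the new state mixes the
    current input with the state carried over from the previous step."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name to a (w_x, w_h, b) tuple (hypothetical layout).
    """
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)      # how much of the old memory to keep
    i = gate("input", sigmoid)       # how much new information to admit
    o = gate("output", sigmoid)      # how much of the memory to expose
    g = gate("candidate", math.tanh) # proposed new memory content
    c = f * c_prev + i * g           # pointwise multiplication by the gates
    h = o * math.tanh(c)
    return h, c


def make_params(forget_bias, input_bias):
    """Illustrative parameters: zero weights, so the biases alone set the gates."""
    return {
        "forget": (0.0, 0.0, forget_bias),
        "input": (0.0, 0.0, input_bias),
        "output": (0.0, 0.0, 0.0),
        "candidate": (0.0, 0.0, 0.0),
    }


# A simple recurrent unit still "remembers" the first input at the second step.
h1 = rnn_step(1.0, 0.0, 0.5, 0.5, 0.0)
h2 = rnn_step(0.0, h1, 0.5, 0.5, 0.0)

# An open forget gate keeps the memory; a closed one discards it.
_, c_kept = lstm_step(0.5, 0.0, 2.0, make_params(10.0, -10.0))
_, c_dropped = lstm_step(0.5, 0.0, 2.0, make_params(-10.0, -10.0))
```

With the forget gate driven close to 1 the cell state stays near its previous value, and with the gate driven close to 0 it is almost erased, which is the behaviour the text attributes to the forget gate.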


A. Balakrishnan

2 Application One: Image Recognition in a Document by Using Convolutional Neural Networks
The first deep learning task that is discussed in this chapter involves image recognition. The input was a document with multiple pages and containing multiple questions that applicants mark their choices on. The task was to identify choices by the applicants for each of these questions and to then do an analysis of the total result. The documents are given regularly to members of a particular governmental program. The questions are to check whether a particular attribute of the applicant has changed or not. Most of these attributes change only occasionally. So most (over seventy percent) of the filled in documents would finally be marked as not having any modification for all the multiple questions. All such documents need no further processing. Therefore, with the objective of effectively utilizing the manpower in the department conducting this program, the requirement was to develop a software application that could take the scanned images of the pages of each document and identify whether all the questions have been marked as “without modification” along with checking for the mandatory signature of the applicant at a specific location in the document. This application would thereby relieve the department of seventy percent of work that is not productive, allowing the manpower to be re-assigned to more critical activities. The application that was developed had to overcome challenges caused by (1) hand marks, (2) the format of the document and (3) scanning of the document.

Hand marks
(1) The choice of “modified” or “no modification” was marked by hand.
(2) The allowed hand marks were of four types.
(3) Hand marks could be of varying sizes and structures.
(4) Entries can be scratched out.
(5) The signature location is allowed at the bottom of a page.

Format of the document
(6) There was more than one choice question on a page.
(7) At one location, a barcode image overlapped the area of choice marking.
(8) The document is in two languages: English and Spanish.

Scanning issues (pose problem)
(9) The page would not always be placed in the same location in the scanner.
(10) The page could also be placed in a tilted manner.

2.1 Analyzing the Problem Approach I—classical (symbolic) machine learning: A classical machine learning approach where the features are identified and learnt through inductive generalization of the results of a digital image pre-processing algorithm was considered. The


complexity of the problem was a factor of the multiple key strokes that were possible for each hand mark, along with the possible x, y locations of the hand mark. This feature and image identification would then be contrasted with a free white space identification for the corresponding alternate choice to determine which of the two options had been chosen. Identifying the key strokes for each hand mark as a possible feature set was a complex task. The curves of a hand mark could differ from person to person. In addition, the location of these key strokes could differ even though it was bounded to be around the choice boxes. Third, the possible skewness of the scanned image led to an extrapolation of the locations and angles of these key strokes for each of the hand marks. Considering these three compounding factors, it was decided to use a deep learning approach to solve the task. Approach II—transfer learning: One of the practiced approaches in industry to reduce the turn around time is to use transfer learning. Transfer learning is the process where a suitable model that has identified a super set/sub set of the images to be recognized is identified and loaded into the application, followed by a relatively short training with specific images that will enable recognition of the images relevant to the problem at hand. Along with training of the model, the deep learning architecture of the training application is modified as required. More particularly, when there is a good intersection of the trained images and the images to be recognized, then a couple of deep learning layers are added to the training architecture. These additional layers are normally added just prior to the classification (dense) layer. Transfer learning was ruled out as an option, in this case as there were no available trained models that can recognize the hand marks that were used for this document. 
Approach III—Training deep learning architecture to recognize the images: The option here is to create data sets of the four different hand marks and see whether the application can then recognize the presence of each hand mark at the choice of modified or of no modification for each of the questions. Each hand mark would be created in the various manners that they can be drawn. For example, one type of hand mark could be small, long, curved at the end, curved at the beginning, could have an acute or an obtuse slope. Similarly, another type could be perfect, crooked, elliptical, small, big. A third type was more difficult as they could be elongated in any orientation and be of any thickness. The fourth type could also be of different dimensions. Another problem arose as part of this approach. There are multiple option boxes in a page. If the training is to identify a modified box chosen or a no modification box chosen, then the recognition stops in these pages with the identification of the choice for the first option box itself. The recognition does not proceed to check the other option boxes in the page. In other words, the deep learning application is trained to recognize a single option box for modified or for no modification. When the learned model is used for recognition, then the recognition stops with the very first option box with the result that it is a modified option chosen or a no modification option chosen. But what is required is for the deep learning application to recognize the choice for all the choice options in a page. To solve this, regional CNN was attempted. This required that the deep learning application has to be trained on the actual question for each option box. As this would require further training and subtle


disambiguation, this approach was dropped. More specifically, each choice box had to be differentiated based on the particular heading words for that box. The issue was that these words could also occur in other sentences, though not as a heading, and this created a confusion for the application that was difficult to resolve. Approach IV: considering the page as a whole. The next option was to treat each page as a single image. For this, all combinations of choices for the option boxes were considered. One combination was all modified in the five option boxes of the page, a second was all no modification, and these were followed by the combinations with one, two, three and four no modifications. The single no modification case had five possibilities (one for each option box), while the two no modifications case had ten possibilities (the number of ways of choosing two boxes out of five). The three and four no modifications cases were enumerated in the same way. A similar set of possibilities was prepared for the modified option. This was repeated for the other pages that had multiple options. The variations due to the different hand marks then had to be considered for each of the four types: for each hand mark in each possible combination, twenty samples had to be created. In total, two thousand two hundred and forty samples would have had to be created. Given the time constraint, the effort required to build the samples and the errors that could occur while creating them, it was decided to abandon this approach. Approach V: divide and conquer. The solution that was adopted was to decompose the problem. Instead of considering the complete page, the approach was to find the independent entities of the task and address them one at a time. OpenCV tools were used to segregate the page into sections, each pertaining to one of the choice boxes. To identify each section, OpenCV’s capability to identify straight lines was used.
As each box was in a rectangle, OpenCV was used to identify these rectangles and the associated straight line. The straight-line location was used to cut the page into different sections. This was done for both training phase as well as for the recognition phase. So during training, samples were created for each section for each hand mark option. To identify a section, the straight line that runs across the page, below a section was used. In addition, the identification of the choice box within a section helped in segregating the various sections from one another. By the use of OpenCV’s functions, the page was split into sections where each section had exactly one choice box. Now, the overall task was broken down from identifying the choice in each of five choice boxes to one of identifying the choice in a single choice box and then accumulating results across the sections.
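The idea of cutting a page at its ruled lines can be sketched without any imaging library. The chapter used OpenCV's straight-line detection; the snippet below is a dependency-free stand-in that treats any almost fully dark pixel row as a horizontal rule. The function names, the 0/1 list-of-lists page representation and the 0.9 darkness threshold are all assumptions made for illustration.

```python
def find_rule_rows(page, min_dark_fraction=0.9):
    """Return indices of rows that are almost entirely dark,
    i.e. candidate horizontal ruled lines. page is a list of rows of 0/1."""
    rows = []
    for y, row in enumerate(page):
        if sum(row) >= min_dark_fraction * len(row):
            rows.append(y)
    return rows


def split_into_sections(page, min_dark_fraction=0.9):
    """Cut the page at each ruled line.

    Returns (top, bottom) row ranges with bottom exclusive; the ruled
    lines themselves are not part of any section.
    """
    sections = []
    top = 0
    for y in find_rule_rows(page, min_dark_fraction):
        if y > top:  # skip empty ranges caused by thick (multi-row) lines
            sections.append((top, y))
        top = y + 1
    if top < len(page):
        sections.append((top, len(page)))
    return sections


# Toy page: 7 rows by 8 columns with ruled lines at rows 2 and 5
# and a single "hand mark" pixel in the first section.
toy_page = [[0] * 8 for _ in range(7)]
toy_page[2] = [1] * 8
toy_page[5] = [1] * 8
toy_page[0][3] = 1

toy_sections = split_into_sections(toy_page)
```

Each returned range can then be cropped and passed to the classifier on its own, which mirrors the decomposition into one choice box per section.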

2.2 Handling the Challenge of Skewness The second challenge was in handling skewness of the scanned image. When the page was placed in a scanner to generate an image, it could be misaligned. This misalignment caused problems in recognizing images within the document. The problem is classified as the pose (translational invariance) problem in the deep learning literature, and previous research has approached it using the Hough transform. The pose problem is a current research topic [4]. Human eyes seem to adjust their perception easily to accommodate tilts in the object being recognized; this accommodation is difficult for computer vision tasks. In this application, OpenCV’s line-finding functions were combined with a calculation of the angle of the lines to determine whether the page was tilted. First, a straight line at the top of the page was identified and used as a reference. Next, straight lines in different sections of the page were identified and their angles measured. Each section was then rotated so that its line became parallel to the top straight line of the page. This straightening process was repeated for every section of the page, as the angle of skewness was found to vary across the page. In this manner, the scanned image was straightened. This skewness correction was done as a pre-processing step prior to training and recognition.
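The angle calculation behind this straightening step can be sketched as follows. It assumes each detected line is available as a pair of end points (x1, y1, x2, y2); the function names and the sample coordinates are hypothetical, and the rotation is applied to a single point rather than to a whole image.

```python
import math


def line_angle_deg(x1, y1, x2, y2):
    """Angle of the segment (x1, y1)-(x2, y2) relative to the x axis, in degrees."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def correction_angle(reference_line, section_line):
    """Rotation (degrees) that makes section_line parallel to reference_line."""
    return line_angle_deg(*reference_line) - line_angle_deg(*section_line)


def rotate_point(x, y, angle_deg, cx=0.0, cy=0.0):
    """Rotate (x, y) about (cx, cy) by angle_deg, in the same frame
    in which the line angles were measured."""
    a = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a))


# Hypothetical lines: a level reference at the top of the page and a
# section line that drifts by 5 pixels over a width of 100 pixels.
top_line = (0, 0, 100, 0)
section_line = (0, 50, 100, 55)

angle = correction_angle(top_line, section_line)
# Rotating the far end of the section line about its near end levels it.
_, levelled_y = rotate_point(100, 55, angle, cx=0, cy=50)
```

Because the skew was found to vary across the page, the correction angle would be computed per section rather than once for the whole image.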

2.3 Architecture of CNN

Layer (type)                     Output shape      Param #
conv1d_5 (Conv1D)                (None, 12, 64)    29,312
max_pooling1d_5 (MaxPooling1)    (None, 12, 64)    0
dropout_10 (Dropout)             (None, 12, 64)    0
conv1d_6 (Conv1D)                (None, 10, 64)    12,352
max_pooling1d_6 (MaxPooling1)    (None, 10, 64)    0
dropout_11 (Dropout)             (None, 10, 64)    0
conv1d_7 (Conv1D)                (None, 8, 64)     12,352
max_pooling1d_7 (MaxPooling1)    (None, 8, 64)     0
dropout_12 (Dropout)             (None, 8, 64)     0
conv1d_8 (Conv1D)                (None, 8, 64)     4,160
max_pooling1d_8 (MaxPooling1)    (None, 8, 64)     0
dropout_13 (Dropout)             (None, 8, 64)     0
conv1d_9 (Conv1D)                (None, 8, 64)     4,160
max_pooling1d_9 (MaxPooling1)    (None, 8, 64)     0
dropout_14 (Dropout)             (None, 8, 64)     0
flatten_2 (Flatten)              (None, 512)       0
dropout_15 (Dropout)             (None, 512)       0
dense_5 (Dense)                  (None, 10)        5,130
dense_6 (Dense)                  (None, 1)         11

Total parameters: 67,477
Trainable parameters: 67,477
Non-trainable parameters: 0

Five convolutional layers were used to identify the key features of the image. Each convolutional layer was followed by a Max pooling layer and a dropout layer. Finally, a fully connected layer was used for classification of the image. The input layer took the image as a set of 64 × 64 pixels. The filter sizes used were
• Layer 1: 5 × 5
• Layer 2: 5 × 5
• Layer 3: 3 × 3
• Layer 4: 3 × 3
• Layer 5: 1 × 1.
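The parameter counts printed in the layer summary can be checked with the usual formulas for one-dimensional convolution and dense layers. In this sketch the kernel lengths (3, 3, 1, 1 for the second to fifth convolutions) are the ones implied by the printed counts, which is an inference and not something the chapter states; the helper names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels per filter) plus one bias per filter."""
    return (kernel_size * in_channels + 1) * filters


def dense_params(inputs, units):
    """One weight per input per unit plus one bias per unit."""
    return (inputs + 1) * units


FILTERS = 64  # channel count shown in every convolutional output shape

# The first layer's count is taken from the summary as printed; dividing it
# out gives the product kernel_size * input_channels of that layer.
first_layer = 29312
kernel_times_channels = first_layer // FILTERS - 1

layer_params = [
    first_layer,
    conv1d_params(3, FILTERS, FILTERS),  # conv1d_6 (kernel length assumed 3)
    conv1d_params(3, FILTERS, FILTERS),  # conv1d_7 (kernel length assumed 3)
    conv1d_params(1, FILTERS, FILTERS),  # conv1d_8 (kernel length assumed 1)
    conv1d_params(1, FILTERS, FILTERS),  # conv1d_9 (kernel length assumed 1)
    dense_params(8 * FILTERS, 10),       # flatten gives 8 * 64 = 512 inputs
    dense_params(10, 1),
]
total_params = sum(layer_params)
```

Pooling, dropout and flatten layers contribute no parameters, so the seven entries above account for the whole total.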

Stride was maintained as 1 × 1 for all layers. Learning rate was maintained at 0.1. The MaxPooling layers (after each convolutional layer) had a size of 64 × 64 pixels. A dropout layer was also introduced after each MaxPooling layer. At the output, two dense layers were used to produce the final value for the classification of the image. In the course of the project, different filter sizes and different numbers of layers were tried. The other architectures experimented with were:
Layers:
• Six convolutional layers with corresponding MaxPooling and dropout layers
• Four convolutional layers with corresponding MaxPooling and dropout layers
• Seven convolutional layers with corresponding MaxPooling and dropout layers.
Filter sizes:
• Seven by seven followed by five by five followed by three by three
• One by one followed by three by three followed by five by five (similar to ZFNet).
Different activation functions were also experimented with:
• Linear
• Sigmoid
• ReLU.
Training: Data set creation: Due to the multiple locations where a hand mark can be made within the option box, the training set was created with a distinct space between each mark: top left, top center, top right, mid left, mid center, mid right, down left, down center, down right, along with random marks between these nine locations. This process was repeated for all the option boxes. The marked pages were then scanned and converted into images. Twenty samples were created for each category, and each set of twenty images was stored in an appropriately labelled folder. The different architectures were run on separate computer systems in order to evaluate the results more quickly. Variations in the activation functions were tried in a second iteration. Different filter sizes were tried in a third iteration. A mix of architecture and activation function was evaluated in a final iteration:
• (5 × 5, 5 × 5, 3 × 3, 3 × 3, 1 × 1 filter sizes) with ReLU versus the same architecture with sigmoid.

2.4 Prediction/Testing The first level of predictions was done with a system flow that did not include page skewness correction. This gave good results initially, followed by poor results in later trials. After the problem of page skewness was identified and solved, the results were more consistent. For prediction, a different set of people was asked to make the hand marks at random locations.

2.5 Results Our application passed all tests with a 99.9% accuracy. Tests were conducted with different types of hand marks, done by different people. Mixed mode of hand marks were also tested, where some choice boxes had (for example) check marks and others had circle hand marks. The only test case where the results were not always correct was in the location where a barcode image covered part of the choice box and the hand mark used was overlapping this area of the choice box. The application was demonstrated using scanned images of documents that had external people recording their choices for each section of the documents. It provided satisfactory results for the end user.

3 Application 2: Predicting Equated Monthly Instalment Payments The second deep learning application described in this chapter was developed for a challenge faced by financial institutions. The financial institution lends money to its customers to allow them to procure vehicles. The customer pays back the loan amount in Equated Monthly Instalments (EMI). Unfortunately, some customers default on their payments. If the defaults continue over specified time periods, the account is termed a non-performing asset (NPA). All financial institutions would like to reduce their NPAs. There is therefore a need to predict possible NPAs and take appropriate action on them. The requirement given was to predict, with maximum accuracy, the amount of payment that each customer will make in an upcoming month. Initial discussions with the financial institution highlighted that the payments were related to income, business expenses, family expenses and other investments of the customer, as well as business trends, changes in government policies, natural calamities, and economic and financial changes in the world. The organization did not hold details of its clients’ demographics, family details, illness history, educational expenses or other investments, nor was there a means to determine the business that a client was engaged in. The primary data made available was the customer’s history of EMI payments. The task was then to predict the subsequent month’s EMI payment of a customer, based on the history of payments made by that customer. Another attribute seen to have an impact on payments was the type of vehicle that the customer had purchased. This data was also provided for use in determining the next month’s EMI. The data provided were payment data over thirty months for over seventeen thousand customers. Payment data were broken down into total loan amount, number of months of loan, amount due per month and amount actually paid. The amount due every month varied with the type of loan that the customer had taken. Therefore, this became a specific attribute for every instance.

3.1 Analyzing the Problem The problem was identified as a “multi-variate, multi-time” series problem. One customer’s monthly payment sequence was one time series. The payments would depend on previous payments. The capability to make a payment of a particular amount would decide the capability to make a similar payment in the next month. There could also be cases where the customer would opt to make a larger payment in one month and pay less or nil in a subsequent month. Similarly, different customer payments would result in different patterns of time series. The task was therefore not a simple time series but a multiple time series. The architecture best suited to model memory-based behavior is a long short-term memory (LSTM). In the area of LSTM multi-time series is an ongoing research topic. Currently, the approach is to add an identifier for each time series allowing the deep learning system to learn and then predict each time series accordingly.


3.2 Data Representation In the data that were shared, every customer had thirty-two attributes relating to location, customer, marketing personnel, status of account, amount receivable, arrears and payment dues over months. Of these fields, the ones selected as relevant for prediction were: Loan_Amount, Required_monthly_amount, arrear, description of item, due_amounts (over months) and paid_amounts (over months). For example, one customer could have data as
450,000, 21,000, 0, A - Used, 21,000 21,000 21,000 21,000 21,000 21,000
Another customer could have data as
700,000, 27,100, 0, E - Used, 27,100 27,100 27,100 27,100
A third customer could have
913,040, 31,390, 4517, E - Used, 31,390 62,900 31,390 32,000 31,390 0 31,390 32,000
(The above data have been changed in order to maintain the privacy of the organization, but they reflect the actual data.) Only three records are shown. The variation in these data across more than seventeen thousand customers created a challenge. The first approach to normalizing the data was to replace the numbers by percentages: the payment made in each month was expressed as a percentage of the amount due in that month. (The due amount was used for the calculation but was not stored in this new representation.) This converted the data to a format that could be analyzed better. For the above three cases, the conversion produced
450,000, 21,000, 0, A - Used, 100 100 100
700,000, 27,100, 0, E - Used, 100 100
913,040, 31,390, 4517, E - Used, 200 100 0 100
This converted the varied amounts of 21,000, 27,100 and 31,390 to a single percentage value of 100. A similar result was seen for other customer records. A few examples are given below.

Vehicle id  % in M1  % in M2  % in M3  % in M4  % in M5  % in M6  % in M7  % in M8  % in M9  % in M10  % in M11
222         100      100      0        0        0        0        310      0        2620     0         0
235         200      100      0        0        0        0        310      0        2620     0         0
547         100      100      0        0        0        0        300      0        2620     0         0
547         100      100      0        0        0        0        300      0        2620     0         0
547         100      0        100      210      0        110      100      0        100      110       100
547         100      100      100      100      100      100      100      100      100      100       2920
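The percentage conversion can be sketched in a few lines. Two points are assumptions inferred from the worked examples rather than stated in the chapter: that the monthly figures alternate between the amount due and the amount paid, and that percentages are rounded to the nearest ten. The function name is hypothetical, and the sketch does not handle a due amount of zero.

```python
def to_percentages(due_paid, step=10):
    """Convert an alternating [due, paid, due, paid, ...] list into the
    percentage of each due amount that was paid, rounded to the nearest
    `step` (rounding to tens is an assumption)."""
    percentages = []
    for due, paid in zip(due_paid[0::2], due_paid[1::2]):
        exact = 100.0 * paid / due
        percentages.append(int((exact + step / 2) // step) * step)
    return percentages


# The three sample records from the text (monthly figures only).
first = to_percentages([21000, 21000, 21000, 21000, 21000, 21000])
second = to_percentages([27100, 27100, 27100, 27100])
third = to_percentages([31390, 62900, 31390, 32000, 31390, 0, 31390, 32000])
```

Under these assumptions the three records reduce to the same short percentage sequences that the text lists, regardless of the size of the instalment.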

The first two rows show different payment details, while the next two share the same “payment pattern” and the last two form another payment pattern. Thus, converting the payments to percentages made the similarities between payments explicit. This was a major step toward identifying patterns and making predictions. The percentage calculation and subsequent pattern identification formed a major part of the “data pre-processing” in the developed application. The next challenge concerned the various types of attributes. One set of attributes was the payment in every month along with the due amount every month, another set concerned the initial loan sanction details, and a third set referred to the type and model of vehicle that was purchased with the sanctioned loan. All these attributes had to be converted to a common range and unit to ensure that the specific values did not bias the learning and subsequent prediction. It was also noted that the requirement was to classify the customer’s payment into one of the possible percentage values. Since this was a categorical classification, one-hot encoding was used to represent the attributes. One-hot encoding met both the need for categorical classification and the need to represent various types of attributes. While the solution was convenient in providing a common range of values for all the attributes, it caused other challenges. For one, the memory requirement increased significantly, as each attribute value was converted to a one-hot encoding. This meant that training with all the payment patterns for the seventeen thousand customers could not be done in one instance. The payment patterns were grouped and trained in batches, and each training run created a separate model. This meant that during prediction, too, the payment pattern of the customer had to be identified first, and then the appropriate model for this pattern was used to make the prediction. One-hot encoding also mandated a particular number of columns for every data set used for making a prediction. If the number of columns was not consistent with the columns used at training time, the application would throw an error. So, during prediction of payment, the input data had to be analyzed to ascertain the number of columns, and dummy data had to be added if the columns did not match. One-hot encoding was used for this task with these additional support modules implemented.
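The column-mismatch problem described above comes from deriving the one-hot columns from whatever values happen to be present. One way to keep the width fixed, sketched here under assumptions, is to record the category list at training time and reuse it at prediction time. The function names are hypothetical, and mapping an unseen value to an all-zero row is an illustrative choice, not the chapter's stated handling.

```python
def fit_categories(training_values):
    """Record the distinct values seen at training time, in a fixed order."""
    return sorted(set(training_values))


def one_hot(value, categories):
    """Encode one value against the training-time categories.
    A value not seen in training becomes an all-zero row (assumption)."""
    return [1 if value == c else 0 for c in categories]


def encode(values, categories):
    """Encode a sequence; every row has len(categories) columns."""
    return [one_hot(v, categories) for v in values]


# Percentages seen during training fix the column layout.
categories = fit_categories([0, 100, 200, 100])

# At prediction time only some categories occur, plus one unseen value,
# yet the encoded rows keep the training-time width.
encoded = encode([100, 310], categories)
```

The sketch also makes the memory cost visible: each single percentage becomes as many numbers as there are categories.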

3.3 Design of the LSTM The LSTM software development to predict the next month’s EMI payment by customers was divided into four primary stages:
(a) Feature exploration: Define relevant features
(b) Prototyping and training: Train a model prototype
(c) Model selection and validation: Perform model selection and fine tuning
(d) Productization: Take the selected model prototype to production.

Feature exploration involved four tasks:
(i) Data analysis: Analyzing format of data and related values provided by the financial institution
(ii) Development of a program to pre-process the data: Organizing data, converting data and re-arranging data. In this, the payment amounts were made a percentage of the due amount for each month
(iii) Program to identify data features: Categories of payment patterns were identified. First, it was decided to use a variation of ten percent in payment as an indicator of a different payment pattern. More specifically, if one customer had made a payment of 100% in a particular month, another customer who made a payment of less than 91% would be considered to be a different payment category. So, finding payment patterns meant that the payments made in every month by one customer were compared with the payments made by other customers. If in any month, there was a variation of 10 or more percentage, then this customer’s payments were considered to be a new payment pattern. Before arriving at “ten percentage” as the proper range to use, payments were analyzed with the following variations:
  • All thirty months as a payment pattern
  • Twelve months as a payment pattern
  • Three months as a payment pattern.
The percentage values were varied across:
  • Ten percent
  • Five percent
  • Two percent.
All these payment patterns were given for training to the LSTM architecture. The best and most consistent results were obtained for the twelve months ten percent combination. In this manner, thirteen thousand three hundred and forty-two payment patterns were identified over the thirty months of payments, of seventeen thousand two hundred and seventeen customers.
(iv) Identify and predict patterns: Build (deep learning) software that can identify the correlations between the categories.
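The pattern identification in task (iii) can be sketched as a comparison of monthly percentages against a tolerance. Several details below are assumptions: the function names, the use of the most recent twelve months as the window, and the greedy rule that a history joins the first earlier pattern it matches. The chapter does not say how ties or overlapping patterns were resolved.

```python
def same_pattern(a, b, tolerance=10):
    """Two histories share a pattern if no month differs by
    `tolerance` percentage points or more."""
    return all(abs(x - y) < tolerance for x, y in zip(a, b))


def assign_pattern_ids(histories, months=12, tolerance=10):
    """Give each history the id of the first earlier pattern it matches,
    or a new id if it matches none (greedy matching is an assumption)."""
    representatives = []
    ids = []
    for history in histories:
        window = history[-months:]  # most recent months (assumption)
        for pattern_id, rep in enumerate(representatives):
            if len(rep) == len(window) and same_pattern(window, rep, tolerance):
                ids.append(pattern_id)
                break
        else:
            representatives.append(window)
            ids.append(len(representatives) - 1)
    return ids


sample_histories = [
    [100] * 12,         # pattern 0
    [95] * 12,          # within 10 points of pattern 0 in every month
    [100] * 11 + [90],  # differs by 10 points in one month: new pattern
    [0] * 12,           # no payments at all: new pattern
]
sample_ids = assign_pattern_ids(sample_histories)
```

Changing `months` to 3 or 30 and `tolerance` to 5 or 2 reproduces the other window and threshold combinations that were tried.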

It was observed that customer payment patterns were changing over time. Keeping the impact of this on the required prediction, it was decided to focus on the customer behavior at the end of the time period of the data that were shared by the financial institution. Here, the number of customer records that had twelve or more months of data was eight thousand five hundred and sixty-eight. The LSTM architecture to learn and then predict customer payments had six thousand one hundred and eighty-one parameters. There were four LSTM layers, each followed by a dropout layer. Finally, two dense layers were used for the classification.

Layer (type)           Output shape    Param #
LSTM_1 (LSTM)          (64, 10, 8)     4448
DROPOUT_1 (Dropout)    (64, 10, 8)     0
LSTM_2 (LSTM)          (64, 10, 8)     544
DROPOUT_2 (Dropout)    (64, 10, 8)     0
LSTM_3 (LSTM)          (64, 10, 8)     544
DROPOUT_3 (Dropout)    (64, 10, 8)     0
LSTM_4 (LSTM)          (64, 10, 8)     544
DROPOUT_4 (Dropout)    (64, 8)         0
DENSE_1 (Dense)        (64, 10)        90
DENSE_2 (Dense)        (64, 1)         11

Total parameters: 6181
Trainable parameters: 6181
Non-trainable parameters: 0

Recognizing multi-time series: As there were many customers making payments, there were different patterns of payment. This meant that there were different time series that had to be learnt by the LSTM. How one architecture can learn different series is one of the research challenges of LSTMs. Based on the advice of research articles on multi-time series, an identifier was introduced for each time series. The pre-processing algorithm identified different time series based on their payment pattern ID. This payment pattern ID was added to the customer’s payment history as a time series identifier. There were five thousand and twenty-four patterns discovered through the data pre-processing. Training was done with the time series identifier and all the “percentage converted” customer payment data. Surprisingly, the results were not satisfactory: only around fifty percent validation accuracy was achieved. Variations in the data format were then explored. Vehicle ID was removed, customer data for shorter ranges were tried, dropout was decreased, and payment pattern ID was removed. The best results were obtained when the time series identifier was removed. A 99% accuracy was reached with this architecture and framework. An intuitive conclusion is that the statistically derived payment pattern classification was a misleading indicator that biased the LSTM in the wrong manner. Removing this pattern identifier seems to have allowed the LSTM architecture to arrive at its own distinction between the five thousand and twenty-four different payment patterns. This point probably deserves further study.
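The parameter counts in the LSTM layer summary can be reproduced with the standard count for an LSTM layer, which has four sets of input weights, recurrent weights and biases. In this sketch the input width of the first layer (130) is inferred from the printed count of 4448 and is not stated in the chapter; the helper names are hypothetical.

```python
def lstm_layer_params(input_dim, units):
    """Four weight sets (three gates plus the candidate), each with
    input weights, recurrent weights and a bias per unit."""
    return 4 * (units * (input_dim + units) + units)


def dense_layer_params(inputs, units):
    """One weight per input per unit plus one bias per unit."""
    return (inputs + 1) * units


UNITS = 8  # last dimension of every LSTM output shape in the summary

# Solve 4 * (UNITS * (d + UNITS) + UNITS) = 4448 for the input width d.
inferred_input_dim = (4448 // 4 - UNITS) // UNITS - UNITS

lstm_counts = [
    lstm_layer_params(inferred_input_dim, UNITS),  # LSTM_1
    lstm_layer_params(UNITS, UNITS),               # LSTM_2
    lstm_layer_params(UNITS, UNITS),               # LSTM_3
    lstm_layer_params(UNITS, UNITS),               # LSTM_4
]
dense_counts = [
    dense_layer_params(UNITS, 10),                 # DENSE_1
    dense_layer_params(10, 1),                     # DENSE_2
]
lstm_model_total = sum(lstm_counts) + sum(dense_counts)
```

An input width of 130 would be consistent with a wide one-hot encoded input at each time step, although the chapter does not give the exact layout.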

3.4 Results A 99% accuracy was achieved in predicting the next month’s EMI payment for all customers who had a payment history of over ten months. Prediction was done for all customers and validated using the following month’s actual payment. There were cases where the payment pattern changed in the following month. Leaving aside these cases, the application was able to predict the next month’s payment with 99% accuracy (within a range of ten percent). The multi-variate, multi-time series was handled successfully by the LSTM that was developed.
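An accuracy figure quoted "within a range of ten percent" can be computed as the share of predictions that fall inside a tolerance band around the actual value. The sketch below assumes the band is ten percentage points and inclusive at the boundary; both choices, like the function name and the sample values, are illustrative and not taken from the chapter.

```python
def accuracy_within(predicted, actual, tolerance=10):
    """Fraction of predictions within `tolerance` percentage points
    of the actual payment percentage (boundary counted as a hit)."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)


# Hypothetical predicted and actual payment percentages for four customers.
sample_predicted = [100, 100, 0, 200]
sample_actual = [100, 95, 100, 190]
sample_accuracy = accuracy_within(sample_predicted, sample_actual)
```

A stricter or looser band changes the reported accuracy, so the tolerance should be stated alongside any such figure.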

4 Review of the Two Applications
(1) The approach that worked for identifying the hand marked choice exhibited the good usage of divide and conquer. This has to be done with care to ensure that only independent components are addressed separately, without ignoring any underlying dependencies that may be present.
(2) The LSTM project improved its results to a satisfactory level after the time series identifier was removed from the data set. This seems to imply that the payment pattern ID that was statistically calculated and added to the data set was a wrong bias, encouraging the application to predict wrongly. Therefore, the topic of identifier for a multi-time series LSTM prediction needs to be researched better.

Developing a deep learning application involves a high amount of engineering:
• Analyze the problem: As the problem is one that needs intelligence to solve, the analysis phase is significantly more demanding than other software tasks.
• Possible solutions: A set of possible solutions emerge after the analysis with no clear distinction between them. All have to be attempted and improved upon.
• Experimental results: Results at each stage have to be recorded granularly and analyzed to decide the architecture/design for the next iteration toward deriving the required result.
• Improve architecture: Analyzing performance is used to decide how the architecture can be changed. Certain combinations may not be possible/not recommended, while others have to be tried and checked.
• Perseverance: Different data pre-processing, designs, parameter values, attribute changes and more, finally, give a good result.

References
1. Raj, B. (2018, May). A simple guide to the versions of the inception network. Medium. https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202.
2. Colah blog. (2019, December). Understanding LSTM Networks. http://colah.github.io/posts/2015-08-understanding-LSTMs/.
3. Deshpande, A. The 9 deep learning papers you need to know about. https://adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html.
4. Elton, D. (2017). http://moreisdifferent.com/2017/09/hinton-whats-wrong-with-CNNs.
5. Andrej Karpathy blog. (2015, May). The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
6. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386. ISSN 0001-0782.
7. Szegedy, C., et al. (2015). Going deeper with convolutions. CVPR, Computer Vision Foundation.
8. Uijlings, J. R. R., et al. (2012). Selective search for object recognition. Technical report, 2012 IJCV.

Speed, Cloth and Pose Invariant Gait Recognition-Based Person Identification Vijay Bhaskar Semwal, Arghya Mazumdar, Ashish Jha, Neha Gaud, and Vishwanath Bijalwan

Abstract Gait is an important cue for identifying a person from a distance, and it requires very little interaction with human participants. Gait is a widely known visual identification technique. The major challenges associated with gait-based person identification are high variability, gait occlusion, pose and speed variance, uniform gait cycle detection, etc. In this research work, the CASIA-A, B and C data sets are explored for view, cloth and speed invariant person identification to address the challenges associated with gait-based person identification. Important computer vision techniques for object identification are explored. These include the feature extraction techniques gait energy image (GEI) for cloth invariance, histogram of oriented gradients (HOG) for multiview invariance and Zernike moments with the Radon transform for crossview invariance. To classify the data, SVM, ANN and XGBoost-based machine learning algorithms are used on the CASIA gait data set, achieving 99, 96 and 67% identification accuracy, respectively, for the three scenarios of invariance, i.e. speed, cloth and pose.

Keywords Human locomotion · View invariance · Speed invariance · Gait identification · Machine learning

This paper is the outcome of work carried out under the project titled Development of Computational model for bipedal walking trajectories generation and analysis of different gait abnormality. The project is supported by SERB, DST (Government of India) under Early Career Award to Dr. Vijay Bhaskar Semwal, PI of project with DST No: ECR/2018/000203 ECR dated 04-June-2019.

V. B. Semwal (B) MANIT Bhopal, Bhopal, India e-mail: [email protected]
A. Mazumdar · A. Jha NIT Rourkela, Rourkela, Odisha, India
N. Gaud SCSIT DAVV Indore, Indore, India
V. Bijalwan Institute of Technology Gopeshwar, Gopeshwar, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 M. Pandey and S. S. Rautaray (eds.), Machine Learning: Theoretical Foundations and Practical Applications, Studies in Big Data 87, https://doi.org/10.1007/978-981-33-6518-6_3


V. B. Semwal et al.

1 Introduction
Gait is the process in which the upper and lower body act in unison. It can be loosely understood as a person's way of walking. The gait cycle has two distinct phases: stance and swing. The entire gait cycle can be divided into eight subphases, as shown in Fig. 1. In the stance phase, the gait cycle has initial contact, loading response, mid-stance and terminal stance. In the swing phase, it has preswing, initial swing, mid-swing and terminal swing [1]. During the gait cycle, the hip, knee and ankle move in distinct ways to produce gait. We aim to differentiate and identify people based on their gait [2]. Human gait can be used to identify a person from a distance, and it is an unobtrusive biometric trait for person identification. Gait analysis is very important in surveillance, identification and security infrastructure systems [3]. Fingerprint and face recognition are currently available for biometric recognition, but none of these techniques works when the subject to be identified is at a distance. Gait is the only biometric trait that can identify a subject at a distance [4]. Gait analysis is also done for medical purposes, where it can be used for early detection of gait abnormalities, including Parkinson's disease. Gait studies can further be utilized for the generation of robot walking trajectories [5]. Gait is also used for planning the path of a humanoid robot [6]. Gait is considered a behavioural biometric with the highest collectability, as shown in Tables 1 and 2. Gait suffers from low permanence at the early stage of learning. A learner's gait dynamics can change drastically within a short period of time as

Fig. 1 Different subphases of one complete human gait cycle


Table 1 Different physiological-based biometric characteristics
Biometric   | Universality | Distinctiveness | Permanence | Collectability | Circumvention
Face        | High         | Medium          | Medium     | High           | Low
Iris        | High         | High            | High       | Medium         | Low
Palm print  | High         | High            | Medium     | Medium         | Medium
Fingerprint | High         | High            | Medium     | Medium         | High
Retina      | High         | High            | High       | Low            | Low

Table 2 Different behaviour-based biometric characteristics
Biometric   | Universality | Distinctiveness | Permanence | Collectability | Circumvention
Gait        | Low          | Medium          | Low        | High           | Low
Speech      | High         | Medium          | High       | Medium         | Medium
Signature   | High         | Medium          | Low        | Medium         | Low
Keystroke   | High         | Medium          | Low        | Medium         | Low
Device Uses | Low          | Medium          | Low        | High           | Low

the learner gets accustomed to the environment. Once a gait is acquired and habitual, it is very unlikely to change with time [7]. Figure 1 shows the different subphases of one complete gait cycle. One human gait cycle consists of two broad phases: the single support phase (SSP), when one foot is in the air and the other is on the ground, and the double support phase (DSP), when both feet are placed on the ground [8]. The DSP is brief during normal walking; quantitatively, it is only 10–12% of one gait cycle. A healthy person completes one gait cycle in 0.52 to 1 s [9]. One complete gait cycle can be further divided into seven different linear subphases [10]. The rest of the paper is organized as follows. Section 2 presents the literature review of the gait recognition work done so far. Section 3 presents our proposed view, cloth and speed invariant method. Section 4 is the result and discussion section and presents the performance study of all the algorithms used for classification. Finally, Sect. 5 presents the conclusion and future research work.

2 Literature Review of Related Work
In the past, various spatial and temporal features have been used for gait-based identification. A number of models have been proposed, including vision, sound, pressure and accelerometry models [11, 12]. Gait signals can be complex, which makes gait-based identification difficult. Two different approaches have been developed for image-based gait recognition. One is the model-based method, which computes model features by fitting a model to the image, and the other is the appearance-based approach [13].


Many researchers have tried to improve the accuracy of machine vision-based gait identification [14]. Researchers have proposed a method in which the gait features are obtained by calculating the area of the head, arm swing and leg swing regions. This method works only for the speed invariant human identification problem. Several researchers have used speeded up robust features to depict the trajectories of the different parts of the body, but this method works only if there is a single moving object [15]. Hence, it is not suited to crowded scenes. Kusakunniran [21] proposed a technique in which space-time interest points are obtained from a raw video sequence. The advantage of this method is that the time cost incurred by pre-processing the video is removed. Wang et al. [16] developed a temporal template called the chrono gait image. This method preserves the gait cycle information at low computational cost. The approach (Fig. 2) is based upon the notion that the boundary of the silhouettes (Fig. 3) will contain the maximum information, and this would hold true for any kind of gait condition, including the person in different attire, with a bag or walking at

Fig. 2 Process flow diagram


Fig. 3 Gait silhouette

different speeds. However, since the silhouettes represent a walking sequence over a particular set of frames, we need to convert them into a single frame by using GEI [17].

2.1 GEI
Human walking is a repetitive motion which remains the same up to a frequency. The order of positions of the limbs is the same in each cycle, that is, the legs move forward and the arms backward. This order is not important in feature extraction for gait recognition, and so the frames can be combined into a single image, a spatio-temporal representation, by using GEI [18, 19]. The greater the pixel value in the GEI, the higher the frequency of occurrence of the human body at that pixel. Pre-processing steps include extracting the region of interest of the silhouette image and finding the centre of the human [20]. Although it might seem implicit, the region of interest is found using a novel approach termed the pixel approximation method (Algorithm 1), which is then used to calculate the GEI.
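The averaging idea behind a GEI can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's implementation: the function names, the bottom-edge alignment and the centring on the horizontal centre of mass are assumptions made for the example, height normalisation is omitted, and the two tiny frames are made-up test data.

```python
import numpy as np


def crop_to_silhouette(frame):
    """Crop a binary frame to the tight bounding box of its non-zero pixels."""
    mask = np.asarray(frame) > 0
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        raise ValueError("frame contains no silhouette pixels")
    return mask[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1].astype(float)


def place_on_canvas(crop, height, width):
    """Paste a cropped silhouette on a blank canvas.

    Assumption for this sketch: the feet are aligned to the bottom edge and
    the horizontal centre of mass is moved to the middle column.
    """
    h, w = crop.shape
    if h > height or w > width:
        raise ValueError("canvas is smaller than the cropped silhouette")
    column_mass = crop.sum(axis=0)
    centre = int((np.arange(w) * column_mass).sum() / column_mass.sum())
    left = min(max(width // 2 - centre, 0), width - w)
    canvas = np.zeros((height, width))
    canvas[height - h:, left:left + w] = crop
    return canvas


def gait_energy_image(frames, height, width):
    """Average the aligned silhouettes of one walking sequence."""
    aligned = [place_on_canvas(crop_to_silhouette(f), height, width) for f in frames]
    return np.mean(aligned, axis=0)


# Two made-up 8 x 8 silhouette frames (255 = body pixel).
frame_a = np.zeros((8, 8))
frame_a[2:6, 1:3] = 255
frame_b = np.zeros((8, 8))
frame_b[3:6, 4:6] = 255
gei = gait_energy_image([frame_a, frame_b], height=6, width=6)
```

Pixels covered by the body in every frame end up with the value 1.0, while pixels covered in only some frames receive proportionally lower values, which matches the frequency interpretation given above.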

2.2 HOG
HOG descriptors can be used to extract boundary information quickly from the GEI images (Fig. 4). HOG is a feature descriptor technique used in image processing, mainly for object detection. The distribution (histograms) of directions of gradients is used as the feature here. Gradients (x and y derivatives) of an image are pivotal because the magnitude of gradients is large around edges and corners (where intensity changes abruptly). Here, nine orientations are considered, with pixels per cell being (16, 16) and cells per block being (2, 2), for each image. To calculate the final feature vector for the entire image, the individual vectors of each patch are flattened

Fig. 4 GEI (left) used in obtaining HOG features (right)

Fig. 5 Radon transformed GEI image


into an array of 1420 elements. With an original image size of 210 × 70, this represents a 90% reduction in the size of the feature vector. The benefits of this feature are that it is compact, fast to calculate, scale and translation invariant, and works well for silhouette images.

Algorithm 1 ROI calculation with centre-of-image location for GEI building
Result: GEI is built by passing the silhouette images of a walking sequence
Input: Image (i) with width (w) and height (h)
Output: GEI
Procedure
– Set four pointers up ← 0, down ← 0, right ← 0, left ← 0 to capture the delimiters of the silhouette in four directions.
– Increment the pointers until a white pixel is found so that the rectangular boundary is obtained
– Crop the image now by making use of the pointers
pixelCount ← []
index ← right - left
pixelCount ← NonZeroPixelsInCroppedImage
countAll ← sum(pixelCount)
pixelPercent ← count / countAll for each count in pixelCount
countPercentSum ← 0, minTheta ← 1, bestIndex ← 0
for index, val in pixelPercent:
    tmp ← |0.5 - countPercentSum|
    if tmp

90%. Voxels influencing classification between PD and PSP patients included the midbrain, pons, corpus callosum and thalamus, four critical regions known to be strongly involved in the pathophysiological mechanisms of PSP. Comparison with existing methods: Classification accuracy of individual PSP patients was consistent with previous manual morphological metrics and with other supervised machine learning applications to MRI data, while accuracy in the detection of individual PD patients was significantly higher with our classification method. Conclusions: The algorithm provides excellent discrimination of PD patients from PSP patients at an individual level, thus encouraging the application of computer-based diagnosis in clinical practice.

4.3 Wind Power Detection in Power Systems
Forecasting of wind power, electric loads and energy price has become a major issue in power systems. Following the needs of the market, various techniques are used to forecast wind power, energy price and power demand. The problem faced by power system utilities is the variability and non-schedulable nature of wind farm power generation. These inherent characteristics of wind power have both technical and commercial implications for efficient planning and operation of power systems. To address the wind power issues, this paper presents the application of an adaptive neural fuzzy inference system (ANFIS) to very short-term wind forecasting, using a case study from Tasmania, Australia. The different forecasting problems and techniques in the areas of electric load forecasting, energy price forecasting and wind power prediction are described below.

AI to Machine Learning: Lifeless Automation and Issues

A. Load forecasting

Load forecasting is a process to predict the load for a future period. Applications of load forecasting fall into different time horizons: long-term forecasting (from one year to ten years), medium-term forecasting (from several months to one year), short-term forecasting (from one hour to one week) and real-time or very short-term forecasting (in minutes). Long-term forecasts affect the decisions on generation and transmission planning, which is used for determining the economic location, type and size of power plants. Medium-term load forecasts are necessary for generation and transmission maintenance, and also for fuel scheduling. Accurate short-term load forecasts are necessary for unit commitment and economic dispatch. Very short-term load forecasting is for minutes ahead and is used for automatic generation control (AGC) [22, 23].

B. Price forecasting results

The price forecasting result presented is based on past work on applying machine learning to forecast day-ahead prices using an NN model integrated with the similar days (SD) method [24]. In the developed price prediction method, the day-ahead electricity price is obtained from the neural network, which modifies the price curves obtained by averaging a selected number of similar price days corresponding to the forecast day. Two methods were analyzed: (1) prediction based on averaging the prices of similar days and (2) prediction based on averaging the prices of similar days plus neural network refinement.

C. Wind power prediction result

A persistence model was also developed for comparison. Persistence is currently an industry benchmark for very short-term wind forecasting and so is the most natural comparison. The ANFIS model was developed in several different configurations. The ANFIS model design is flexible and capable of handling rapidly fluctuating data patterns [25].
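A persistence baseline simply predicts that the next value equals the most recently observed one, so any candidate model should at least beat its error. The sketch below is illustrative only: the function names are hypothetical, the wind power readings are made-up numbers, and mean absolute error is just one possible comparison metric.

```python
import numpy as np


def persistence_forecast(series, horizon=1):
    """Forecast y[t + horizon] as y[t] for every t that has a known target."""
    y = np.asarray(series, dtype=float)
    if not 0 < horizon < len(y):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    return y[:-horizon]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and forecast values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))


# Made-up wind power readings at consecutive time steps (illustrative only).
wind_power = [10.0, 12.0, 11.0, 15.0]
predicted = persistence_forecast(wind_power, horizon=1)
actual = wind_power[1:]
baseline_mae = mean_absolute_error(actual, predicted)
```

The error of a learned model such as ANFIS, evaluated on the same targets, can then be compared directly with `baseline_mae`.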

4.4 Agriculture
Machine learning has emerged together with big data technologies and high-performance computing to create new opportunities for data intensive science in the multi-disciplinary agri-technologies domain. In this paper, we present a comprehensive review of research dedicated to applications of machine learning in agricultural production systems. The works analyzed were categorized in: crop management, including applications on yield prediction, disease detection, weed detection, crop quality, and species recognition; livestock management, including applications on animal welfare and livestock production; water management; and soil management. The filtering and classification of the presented articles demonstrate how agriculture will benefit from machine learning technologies. By applying machine learning to sensor data, farm management systems are

S. Darshana et al.

evolving into real-time artificial intelligence enabled programs that provide rich recommendations and insights for farmer decision support and action [26].

4.5 Politics
This section discusses issues related to opinion mining from microposts and the challenges they impose on an NLP system, along with an example application we have developed to determine political leanings from a set of pre-election tweets. Whilst there are a number of sentiment analysis tools available which summarise positive, negative and neutral tweets about a given keyword or topic, these tools generally produce poor results and work in a fairly simplistic way, using only the presence of certain positive and negative adjectives as indicators, or simple learning techniques which do not work well on short microposts [28]. On the other hand, intelligent tools which work well on movie and customer reviews cannot be used on microposts due to their brevity and lack of context. Our methods make use of a variety of sophisticated NLP techniques in order to extract more meaningful and higher quality opinions and to incorporate extra-linguistic contextual information.

4.6 Genomics
The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data [27]. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches [29]. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.

4.7 Networking
Machine learning (ML) has been enjoying an unprecedented surge in applications that solve problems and enable automation in diverse domains. Primarily, this is due to the explosion in the availability of data, significant improvements in


ML techniques, and advancement in computing capabilities. Undoubtedly, ML has been applied to various mundane and complex problems arising in network operation and management. There are various surveys on ML for specific areas in networking or for specific network technologies. This survey is original, since it jointly presents the application of diverse ML techniques in various key areas of networking across different network technologies. In this way, readers will benefit from a comprehensive discussion on the different learning paradigms and ML techniques applied to fundamental problems in networking, including traffic prediction, routing and classification, congestion control, resource and fault management, QoS and QoE management, and network security. Furthermore, this survey delineates the limitations, insights, research challenges and future opportunities to advance ML in networking. It is therefore a timely contribution on the implications of ML for networking, which is pushing the boundaries of autonomic network operation and management [30].

4.8 Energy Forecasting
Energy security and global warming are considered two of the most significant challenges facing the USA, whereas energy conservation is the fundamental solution to both [31]. Owing to the rapid depletion of non-renewable resources, energy efficiency is one of the important issues that the world is facing. Many countries are facing energy issues at all levels of their infrastructure, industry and economy. These global energy and environmental challenges have driven city governments to continuously adjust their policies, decisions and strategies toward greener and more energy-efficient approaches. In the USA, state governments have set ambitious goals for reducing their greenhouse gas (GHG) emissions, such as 80% by 2050 in New York City and Boston. Besides understanding current consumption patterns, forecasting urban building energy performance is vital to meeting such goals.

5 Conclusion
Machine learning can be deployed for a wide range of purposes, many of which have been mentioned. It gives richer recommendations and insights for subsequent decisions based on past data and actions, with the ultimate goal of production improvement. ML techniques have become deeply established in our lifestyle, from agriculture, disease detection and prediction, politics, power systems and energy systems to every corner of our life. These days, machine learning techniques are widely used to solve real-world problems by storing, manipulating, extracting and retrieving data from large sources.

Supervised machine learning techniques have been widely adopted; however, these techniques prove to be very expensive when the systems are implemented over a wide range of data. This is because a significant amount of effort and cost is involved in obtaining large labeled data sets. Thus, active learning provides a way to reduce the labeling costs by labeling only the most useful instances for learning. In the coming days, it is expected that the usage of ML models will be even more widespread, allowing for the possibility of integrated and applicable tools. At the moment, all of the approaches concern individual approaches and solutions and are not adequately connected with the decision-making algorithms, as seen in other application domains. This integration of automated data recording, data analysis, ML implementation and decision-making or support will provide practical tools that come in line with so-called knowledge-based agriculture for increasing production levels and bio-products quality. The performance of machine learning algorithms on a classification task is better when the algorithms are combined. For the prediction of the correct output class, the combined learner chooses the class to which the highest probability has been assigned among all the learners.
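The combination rule in the last sentence can be illustrated with a short NumPy sketch. It assumes that every learner outputs a class-probability vector per sample; the function name and the probability values are hypothetical and serve only as an example.

```python
import numpy as np


def combine_by_max_probability(probabilities):
    """Pick, per sample, the class that received the highest probability
    from any learner.

    probabilities: array-like of shape (n_learners, n_samples, n_classes).
    Returns an integer array of shape (n_samples,) with the chosen classes.
    """
    p = np.asarray(probabilities, dtype=float)
    best_per_class = p.max(axis=0)        # strongest vote each class received
    return best_per_class.argmax(axis=1)  # class with the strongest vote


# Made-up outputs of two learners for two samples and three classes.
learner_1 = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
learner_2 = [[0.1, 0.2, 0.7], [0.4, 0.35, 0.25]]
chosen = combine_by_max_probability([learner_1, learner_2])
```

For the first sample the second learner's 0.7 for class 2 is the largest probability, so class 2 is selected; for the second sample the first learner's 0.5 for class 1 wins.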

References
1. Pereira, F., Mitchell, T., & Botvinick, M. (2009). Machine learning classifiers and fMRI: A tutorial overview. Neuroimage, 45(1), S199–S209.
2. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
3. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
4. Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., et al. (2017). Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12), 2410–2423.
5. Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
6. Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.
7. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
8. https://towardsdatascience.com/the-state-of-ai-in-2020-1f95df336eb0.
9. Shortliffe, E. (Ed.). (2012). Computer-based medical consultations: MYCIN (Vol. 2). Elsevier.
10. Grottola, A., Marcacci, M., Tagliazucchi, S., Gennari, W., Di Gennaro, A., Orsini, M., et al. (2017). Usutu virus infections in humans: A retrospective analysis in the municipality of Modena, Italy. Clinical Microbiology and Infection, 23(1), 33–37.
11. Barbat, M. M., Wesche, C., Werhli, A. V., & Mata, M. M. (2019). An adaptive machine learning approach to improve automatic iceberg detection from SAR images. ISPRS Journal of Photogrammetry and Remote Sensing, 156, 247–259.
12. Mountrakis, G., Im, J., & Ogole, C. (2011). Support vector machines in remote sensing: A review. ISPRS Journal of Photogrammetry and Remote Sensing, 66(3), 247–259.
13. Shang, R., Qi, L., Jiao, L., Stolkin, R., & Li, Y. (2014). Change detection in SAR images by artificial immune multi-objective clustering. Engineering Applications of Artificial Intelligence, 31, 53–67.
14. Gao, F., You, J., Wang, J., Sun, J., Yang, E., & Zhou, H. (2017). A novel target detection method for SAR images based on shadow proposal and saliency analysis. Neurocomputing, 267, 220–231.
15. Colubri, A., Hartley, M. A., Siakor, M., Wolfman, V., Felix, A., Sesay, T., et al. (2019). Machine-learning prognostic models from the 2014–16 Ebola outbreak: Data-harmonization challenges, validation strategies, and mHealth applications. EClinicalMedicine, 11, 54–64.
16. Choi, S., Lee, J., Kang, M. G., Min, H., Chang, Y. S., & Yoon, S. (2017). Large-scale machine learning of media outlets for understanding public reactions to nation-wide viral infection outbreaks. Methods, 129, 50–59.
17. Nápoles, G., Grau, I., Bello, R., & Grau, R. (2014). Two-steps learning of Fuzzy Cognitive Maps for prediction and knowledge discovery on the HIV-1 drug resistance. Expert Systems with Applications, 41(3), 821–830.
18. Vickers, N. J. (2017). Animal communication: When I'm calling you, will you answer too? Current Biology, 27(14), R713–R715.
19. Lalmuanawma, S., Hussain, J., & Chhakchhuak, L. (2020). Applications of machine learning and artificial intelligence for Covid-19 (SARS-CoV-2) pandemic: A review. Chaos, Solitons & Fractals, 110059.
20. Kavadi, D. P., Patan, R., Ramachandran, M., & Gandomi, A. H. (2020). Partial derivative nonlinear global pandemic machine learning prediction of covid 19. Chaos, Solitons & Fractals, 139, 110056.
21. Salvatore, C., Cerasa, A., Castiglioni, I., Gallivanone, F., Augimeri, A., Lopez, M., et al. (2014). Machine learning on brain MRI data for differential diagnosis of Parkinson's disease and Progressive Supranuclear Palsy. Journal of Neuroscience Methods, 222, 230–237.
22. Senjyu, T., Mandal, P., Uezato, K., & Funabashi, T. (2005). Next day load curve forecasting using hybrid correction method. IEEE Transactions on Power Systems, 20(1), 102–109.
23. Hippert, H. S., Pedreira, C. E., & Souza, R. C. (2001). Neural networks for short-term load forecasting: A review and evaluation. IEEE Transactions on Power Systems, 16(1), 44–55.
24. Mandal, P., Senjyu, T., Urasaki, N., Funabashi, T., & Srivastava, A. K. (2007). A novel approach to forecast electricity price for PJM using neural network and similar days method. IEEE Transactions on Power Systems, 22(4), 2058–2065.
25. Osório, G. J., Matias, J. C. O., & Catalão, J. P. S. (2015). Short-term wind power forecasting using adaptive neuro-fuzzy inference system combined with evolutionary particle swarm optimization, wavelet transform and mutual information. Renewable Energy, 75, 301–307.
26. Liakos, K. G., Busato, P., Moshou, D., Pearson, S., & Bochtis, D. (2018). Machine learning in agriculture: A review. Sensors, 18(8), 2674.
27. Mackowiak, S. D., Zauber, H., Bielow, C., Thiel, D., Kutz, K., Calviello, L., et al. (2015). Extensive identification and analysis of conserved small ORFs in animals. Genome Biology, 16(1), 179.
28. Weichselbraun, A., Gindl, S., & Scharl, A. (2010). A context-dependent supervised learning approach to sentiment detection in large textual databases. Journal of Information and Data Management, 1(3), 329–329.
29. Libbrecht, M. W., & Noble, W. S. (2015). Machine learning applications in genetics and genomics. Nature Reviews Genetics, 16(6), 321–332.
30. Boutaba, R., Salahuddin, M. A., Limam, N., Ayoubi, S., Shahriar, N., Estrada-Solano, F., & Caicedo, O. M. (2018). A comprehensive survey on machine learning for networking: Evolution, applications and research opportunities. Journal of Internet Services and Applications, 9(1), 16.
31. Fathi, S., Srinivasan, R., Fenner, A., & Fathi, S. (2020). Machine learning applications in urban building energy performance forecasting: A systematic review. Renewable and Sustainable Energy Reviews, 133, 110287.

Analysis of FDIs in Different Sectors of the Indian Economy Parikshit Barua, Sandeepan Mahapatra, Siddharth Swarup Rautaray, and Manjusha Pandey

Abstract Foreign direct investment (FDI) is the investment made by a foreign person/company to establish a business in another country. It is an essential factor for evaluating the progress of any country and plays a huge role in its economy. An increase in FDI means an increase in technology, infrastructure, employment, which in turn results in higher tax generation for the host country. It also leads to an increase in foreign reserves. With India’s Prime Minister, Narendra Modi’s $5 trillion economy target for India, his government has initiated various campaigns such as “Make in India,” which would encourage joint ventures by foreign entities into India and further better the economy. This paper explores the trends of FDIs in various sectors of the Indian economy for the past 19 years and determines which sectors have seen a decrease in investments over the years, and hence, may need more funds from the government or ease of regulations to promote FDIs. Also, finding sectors that are being invested in the most will help us determine the strong sectors of the Indian economy. Using machine learning and forecasting, this project proposes a model to estimate the expected trends in FDIs which can possibly be seen in the coming few years. This will help the government make a decisive plan and form a budget for the upcoming financial years to cut the slowdown that the country is currently facing and make the economy grow further. Keywords AR · Autoregression · ARIMA · Economy · FDI · MA · Moving average · SARIMA

P. Barua (B) · S. Mahapatra · S. S. Rautaray · M. Pandey School of Computer Engineering, KIIT (Deemed to be University), Bhubaneswar, Odisha, India e-mail: [email protected] S. Mahapatra e-mail: [email protected] S. S. Rautaray e-mail: [email protected] M. Pandey e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 M. Pandey and S. S. Rautaray (eds.), Machine Learning: Theoretical Foundations and Practical Applications, Studies in Big Data 87, https://doi.org/10.1007/978-981-33-6518-6_8


P. Barua et al.

1 Introduction
A foreign direct investment (FDI) is an investment made by a firm or individual in one country into business interests located in another country. A person or firm establishing foreign business operations or acquiring business assets in a foreign company is an instance of an FDI. FDI is a major monetary source for the economic development of India, as it reduces unemployment, creates job opportunities, increases the use of technology and managerial skills, removes balance of payment constraints, promotes exports, generates a competitive environment and yields many other progressive outputs that differentiate this kind of investment from other funding sources. Because it lacks sufficient domestic capital for economic development, India needs foreign capital. India has become an investment hub over the last decade. The major areas of FDI are oil, mining, coal and gas, banking, insurance, transportation, finance, manufacturing, retailing, etc. FDI has been playing a major role in the growth of the Indian economy. Historically, India followed a very cautious and selective approach regarding foreign capital, but after the economic reforms of 1991, it liberalized its foreign direct investment policies. A number of measures were undertaken to promote FDI, and thus, the Government of India (GOI) has been successful in attracting more FDI to India. From 1991–1992 to 2011–12, India fetched 4,26,318 US $ million in FDI inflows (considering estimates made by RBI for the years 2010 to 2012). This research paper explores the trends of FDI in various sectors of India's economy over the past 19 years. The data (source: dipp.gov.in) consist of the records of FDI in various sectors such as agricultural machinery, electrical equipment, electronics, and hospitals and diagnostic centers, for the years from 2000 to 2019. Graphical analysis of the top 5 sectors shows that the service sector has reached its peak in recent years. Most sectors show an increasing trend, with a few exceptions. The paper also examines the impact of the demonetization carried out by the present ruling government in India, which shows that there has been significant growth in the fields of computer software and hardware, telecommunications, and trading. The overall interpretations have been made using exploratory data analysis and forecasting techniques in Python. This report also helps to monitor and analyze the potential of FDI inflow to India so that adequate measures can be taken for the improvement of the economy. This paper is presented in five sections. Section 1 is an introduction to the paper and a discussion of the technologies used. The state of the art is presented in Sect. 2. Section 3 discusses the model used and gives a detailed summary of it. Section 4 contains observations from the data and the forecasts made by the model. The paper is concluded in Sect. 5.


1.1 Technologies Used
This research work analyzes foreign direct investment (FDI) with the help of the Python programming language. Python is a fairly simple programming language to learn, and it is easily readable in the sense that one can often tell what a line of code is doing just by looking at it, without any prior knowledge of its syntax. It offers various packages and modules whose functions make the task of a data scientist much easier. Firstly, the data set has been imported using Pandas, a library for manipulating data frames in Python. Pandas is very efficient at reading, manipulating and operating on tabular data. It treats each column of the table as an individual "Pandas Series," and it is very easy to perform calculations or apply functions to the rows or columns of a table. Pandas has also been used for data pre-processing, that is, for transforming raw data into an understandable format. The autoregressive integrated moving-average (ARIMA) model has been used for forecasting. Forecasting is the process of predicting the future based on past and present trends, and it is commonly used for predictions on time-series data. ARIMA is actually a class of models that "explain" a given time series based on its own past values. It is a hybrid model formed by integrating two independent models, namely the autoregressive (AR) model and the moving-average (MA) model. The statsmodels library provides the capability to fit an ARIMA model. Finally, Python libraries such as Matplotlib and Seaborn, which are excellent libraries for graphing and plotting, have been used for data visualization, which helps us look into the trends in various sectors of the economy and how the government has performed. The NumPy library has also been used for faster computations.
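The kind of tabular manipulation described above can be sketched with Pandas as follows. The records are made-up illustrative values, not figures from the DIPP data set, and the column names `sector`, `year` and `fdi` are assumptions chosen for the example.

```python
import pandas as pd

# Made-up FDI inflows (US$ million), for illustration only.
records = [
    {"sector": "Services", "year": 2017, "fdi": 100.0},
    {"sector": "Services", "year": 2018, "fdi": 120.0},
    {"sector": "Services", "year": 2019, "fdi": 150.0},
    {"sector": "Telecommunications", "year": 2017, "fdi": 80.0},
    {"sector": "Telecommunications", "year": 2018, "fdi": 60.0},
    {"sector": "Telecommunications", "year": 2019, "fdi": 90.0},
    {"sector": "Trading", "year": 2017, "fdi": 40.0},
    {"sector": "Trading", "year": 2018, "fdi": 50.0},
    {"sector": "Trading", "year": 2019, "fdi": 45.0},
]
df = pd.DataFrame(records)

# One row per sector, one column per year; each column is a Pandas Series.
table = df.pivot(index="sector", columns="year", values="fdi")

# Rank sectors by total inflow over the whole period.
totals = table.sum(axis=1).sort_values(ascending=False)

# Year-on-year change per sector, and the sectors that declined in 2019.
change = table.diff(axis="columns")
declining_2019 = change.index[change[2019] < 0].tolist()
```

Sorting `totals` gives the strongest sectors, while `declining_2019` lists the sectors whose inflow fell in the last year, which is the kind of trend the paper sets out to identify.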

2 State of the Art

This section discusses the work done by other authors in this field. For each work, the author’s name, the title, the proposal of the paper, and its findings or reviews are given in tabular form in Table 1.

3 ARIMA Model for Forecasting

The ARIMA model can be thought of as a hybrid model formed by integrating two simple models. The first of these is the AR model.


P. Barua et al.

Table 1 Tabular view of current research work carried out in the field

Author’s name | Title | Proposal | Reviews
Noheed Khan 2017 [1, 9–16] | Impact of FDI and export on economic growth: Evidence from Pakistan and India | To find the relationship between FDI and economic growth | FDI and export have a positive relationship with economic growth
Wang Shaobin 2011 [2, 17] | Analysis on relationship between FDI and economic growth in ShaanXi based on OLS model | To find the relationship between FDI and economic growth | FDI and GDP have a positive correlation. Although foreign direct investment is not all factors to the economic growth of Shaanxi, its role is still very large
Jinsheng Zhu 2011 [3] | The impact of FDI on China’s economy security: An empirical analysis based on factor analysis and GRNN neural network | To find the impact of FDI on China’s economic security | FDI sources, FDI in regional structure, and FDI in the industrial structure have a great effect on China’s economic security
Yan Chen 2014 [4, 18–20] | The analysis of the technical progress effect of FDI in China | To find the relation between FDI and technological development | FDI has a promoting effect on China’s technological progress
Jun Guan 2010 [5] | The trade effect analysis of foreign direct investment in Hebei | To find the effect of FDI in trades | It has a direct relationship with trade, thus having an increasing trend
Cheng Li-wei 2007 [6, 21–23] | Empirical investigation of FDI impact on China’s employment | To find the impact of FDI over employability | It has been observed that FDI promotes employability
Yu Chao 2011 [7, 24, 25] | Effects of China’s outward FDI on domestic employment | To examine the effect of outward FDI on domestic employment | China’s outward FDI has a positive effect on domestic employment, the effect is not significant
V. Aruna 2012 [8] | Foreign direct investment (FDI) in the multi-brand retail sector in India - A boon or bane | To examine the negative and positive impacts on the entry of FDI in India | Experience of the last decade shows that small retailers have flourished in harmony with large outlets

3.1 Model 1: AR MODEL

The autoregressive (AR) model regresses the values of the time series against previous values of the same time series. The equation of a simple AR model, AR(p = 1), is given by Eq. (1):

$$y_t = a_1 \cdot y_{t-1} + e_t \qquad (1)$$

The value of the series at time $t$ depends linearly on its value at the previous step, with coefficient $a_1$. $e_t$ is the shock term, or white noise. Each shock is random and is not related to the other shocks in the series. The second simple model is the MA model.

3.2 Model 2: MA MODEL

The moving average (MA) model regresses the values of the time series against previous shock values of the same time series. The equation of a simple MA model, MA(q = 1), is given by Eq. (2):

$$y_t = m_1 \cdot e_{t-1} + e_t \qquad (2)$$

Combining both the AR and MA models gives us the ARMA model.

3.3 Hybrid Model 1: ARMA MODEL

The ARMA (AR + MA) model is a combination of the AR and MA models. It regresses the time series against both the previous values and the previous shock terms. A simple ARMA(p = 1, q = 1) model is given by Eq. (3):

$$y_t = a_1 \cdot y_{t-1} + m_1 \cdot e_{t-1} + e_t \qquad (3)$$
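As a minimal sketch of Eqs. (1)–(3), the snippet below simulates the three simple models with a single recursion. The function name `simulate_arma11`, the coefficient values, the Gaussian shocks, and the zero start-up values are illustrative assumptions and are not taken from the original work. Setting m1 = 0 reduces the recursion to the AR(1) model of Eq. (1), and setting a1 = 0 reduces it to the MA(1) model of Eq. (2).

```python
import random


def simulate_arma11(a1, m1, n, seed=0):
    """Simulate y_t = a1*y_{t-1} + m1*e_{t-1} + e_t for n steps.

    Returns the simulated series and the shocks that generated it.
    """
    rng = random.Random(seed)
    # White-noise shocks: independent draws (Gaussian is an assumption here).
    shocks = [rng.gauss(0.0, 1.0) for _ in range(n)]
    series = []
    prev_y, prev_e = 0.0, 0.0  # assumed start-up values before t = 0
    for e in shocks:
        y = a1 * prev_y + m1 * prev_e + e
        series.append(y)
        prev_y, prev_e = y, e
    return series, shocks


# Illustrative coefficients only.
ar1, _ = simulate_arma11(a1=0.7, m1=0.0, n=200)     # Eq. (1)
ma1, _ = simulate_arma11(a1=0.0, m1=0.4, n=200)     # Eq. (2)
arma11, _ = simulate_arma11(a1=0.7, m1=0.4, n=200)  # Eq. (3)
```

Keeping one recursion for all three cases makes it visible that AR(1) and MA(1) are special cases of ARMA(1, 1).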

The ARMA model can be extended to the ARIMA model by adding a parameter “d,” which defines the number of times the series must be differenced before the model is fitted to it.

3.4 Hybrid Model 2: ARIMA MODEL

The autoregressive integrated moving average (ARIMA) model is one of the most widely used models for forecasting. It uses both autoregressive (AR) and moving-average (MA) elements. It requires “stationary” time-series data for forecasting. A “stationary” time series is one that satisfies the following conditions:
Condition 1: It has a constant mean, that is, the trend in the data is 0.


Fig. 1 ARIMA Model

Condition 2: It has constant variance.
Condition 3: It has an autocovariance that does not depend on time.

There are various ways to transform a non-stationary series into a stationary one. One way is to difference the series. If $y$ denotes the $d$th difference of the series $Y$, then:
• If d = 0, $y_t = Y_t$
• If d = 1, $y_t = Y_t - Y_{t-1}$
• If d = 2, $y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2})$
and so on.

An ARIMA model can be classified as “ARIMA(p, d, q),” where:
p: number of autoregressive terms or autoregressive lags
d: number of differences needed for stationarity
q: number of moving-average terms

Figure 1 describes an ARIMA model. Although the ARIMA model is capable of capturing the trends in univariate data, it does not handle the seasonality of time-series data well. ARIMA expects the data to be non-seasonal, or does better with data from which the seasonal component has been removed.
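The differencing rule listed above can be sketched in a few lines of plain Python. The helper name `difference` and the sample values are hypothetical and chosen only to make the arithmetic easy to follow; they are not FDI data.

```python
def difference(series, d=1):
    """Return the d-th difference of a sequence (d = 0 returns a copy)."""
    result = list(series)
    for _ in range(d):
        # Each pass replaces the series by y_t = Y_t - Y_{t-1}.
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


# Illustrative series whose increments grow by one each step.
Y = [3, 5, 8, 12, 17, 23]
first = difference(Y, d=1)   # still trending upward
second = difference(Y, d=2)  # constant, so no trend remains
```

Each pass shortens the series by one observation, so a series differenced d times has d fewer points than the original.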

4 Implementation and Results

This paper aims to analyze the trends of FDI in India over the last 20 years and to propose a model for forecasting future FDI trends in the country. The data set was prepared from data published on the Government of India’s site https://data.gov.in. (Source: https://data.gov.in/catalog/foreign-directinvestment-fdi-equity-inflows?filters%5Bfield_catalog_reference%5D=839421&format=json&offset=0&limit=6&sort%5Bcreated%5D=desc).


Fig. 2 Analytical life cycle for FDI

Figure 2 explains the workflow and the approach taken. Two data sets were imported for this work. The first contains the sector-wise FDI rates from 2000–01 up to 2016–17, and the second contains the total inflow in each sector from 2000–01 to the last financial quarter of the current financial year (2019–20). The two data sets were joined on the sector. A new column was then created containing the difference between the total inflow in each sector (up to the last quarter) and the sum of the inflows up to 2016–17. This gives the inflow from 2017 to the last quarter. It has been assumed that two-thirds of this difference was the inflow for 2017–18 and the remaining one-third was the inflow for 2018–19 (GDP growth in 2017–18 was 7.2% compared to 6.8% in 2018–19). This yields an organized data set containing the FDI inflow for each sector of the economy from financial year 2000–01 up to financial year 2018–19. The data was arranged so that it is easy to work with and plot. It has been used to plot various graphs and draw meaningful conclusions, as we shall see later on, and it is also in the form that a forecasting model expects. Figure 3 shows the last five rows and all the features of the resulting data set. Figure 4 is a visual representation of the FDIs in the different sectors/industries of India over the past two decades, showing the trend of FDI in all sectors of the economy. Some sectors have had huge fluctuations in FDI rates every year, whereas others have received a fairly stable inflow of FDI over the years. FDI rates in a few sectors have grown exponentially, while in some sectors they have been very low since the beginning of this century. Not much can be inferred from this plot, so it is more meaningful to look at the top five sectors that have brought in the most capital.
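The data preparation described above (join on sector, subtract the known yearly inflows from the cumulative total, then split the remainder two-thirds to one-third) can be sketched with Pandas as follows. All column names and numbers are hypothetical stand-ins, since the layout of the original files is not shown; only two of the yearly columns are included to keep the example short.

```python
import pandas as pd

# Hypothetical yearly inflows up to 2016-17 (illustrative numbers only).
yearly = pd.DataFrame({
    "Sector": ["Services", "Telecommunications"],
    "2015-16": [100.0, 40.0],
    "2016-17": [120.0, 50.0],
})

# Hypothetical cumulative inflow per sector up to the latest quarter.
totals = pd.DataFrame({
    "Sector": ["Services", "Telecommunications"],
    "Total": [310.0, 120.0],
})

# Join the two data sets on the sector name.
merged = yearly.merge(totals, on="Sector", how="inner")

# Inflow not yet assigned to a year: total minus the sum of known years.
year_cols = ["2015-16", "2016-17"]
merged["Remainder"] = merged["Total"] - merged[year_cols].sum(axis=1)

# Assumption stated in the text: two-thirds to 2017-18, one-third to 2018-19.
merged["2017-18"] = merged["Remainder"] * 2 / 3
merged["2018-19"] = merged["Remainder"] / 3

# Keep one row per sector and one column per financial year.
result = merged.drop(columns=["Total", "Remainder"])
```

The resulting frame has one row per sector and one column per financial year, which is a convenient shape both for plotting and for extracting a single sector as a time series.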
A plot of the top five sectors with the highest average investment over the past 19 years shows the strong sectors, or backbone, of India’s economy in terms of the rate of FDI. Figure 5 shows the top five sectors (services, computer software, construction development, telecommunications, and automobile), which the government should primarily focus on maintaining and developing, as these bring a large number of joint ventures and a large amount of capital into the economy. The services sector has grown considerably, whereas the construction development sector had its peak around 2009 and then started declining rapidly. Computer software and hardware, telecommunications, and construction development sectors have seen an uptrend since the start of this century.

Fig. 3 Tail of the data set

Fig. 4 FDIs in all sectors of Indian economy

Fig. 5 Top 5 sectors in terms of FDI

The impact of FDIs into a particular sector depends on the government. Different sectors flourish under different governments. A histogram of the total inflow of capital before and after 2014, when Narendra Modi took over as the Prime Minister of India (Fig. 6), gives a clear picture of the changes in FDIs in different sectors. The services sector has remained unaffected and has been the sector receiving the most inflow both before and after 2014. The construction development sector has seen a decline under the current government, while the computer software and hardware sector has flourished. The telecommunications sector too has been growing steadily under different governments. The drugs and pharmaceuticals sector has seen a decline since 2014, but the trading sector has grown. The automobile sector has flourished since 2014.
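The two summaries used in this discussion, ranking sectors by average inflow and comparing total inflow before and after 2014, can be sketched as below. The numbers are invented for illustration, and treating financial year 2014–15 as the first "after 2014" year is an assumption; the original text does not state where the boundary was drawn.

```python
import pandas as pd

# Illustrative numbers only; rows are sectors, columns are financial years.
fdi = pd.DataFrame(
    {
        "2012-13": [50.0, 30.0, 10.0],
        "2013-14": [40.0, 20.0, 15.0],
        "2014-15": [60.0, 10.0, 25.0],
        "2015-16": [70.0, 5.0, 35.0],
    },
    index=[
        "Services",
        "Construction development",
        "Computer software and hardware",
    ],
)

# Rank sectors by their average yearly inflow and keep the largest ones.
top = fdi.mean(axis=1).nlargest(2)

# Assumed boundary: 2014-15 is the first year counted as "after 2014".
before_cols = ["2012-13", "2013-14"]
after_cols = ["2014-15", "2015-16"]
periods = pd.DataFrame({
    "before_2014": fdi[before_cols].sum(axis=1),
    "after_2014": fdi[after_cols].sum(axis=1),
})
```

Calling `periods.plot(kind="bar")` would then give a side-by-side comparison in the spirit of Fig. 6, assuming Matplotlib is installed.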

4.1 Box–Jenkins Method

The Box–Jenkins method (Fig. 7) is used to find the best-fitting model for forecasting future trends of FDI in any of the sectors; only one sector (the “Services” sector) is used here to fit and tune the model. The Box–Jenkins method is a common and systematic way of fitting time-series models such as ARIMA and checking their performance. It is an iterative process consisting of four phases: “identification” of the time-series model and data, “estimation,” “model diagnostics,” and, provided the “model diagnostic” curves are satisfactory, “production.” Following these steps ensures that the best model and model parameters are chosen before the model is deployed.

4.1.1 Identification

Fig. 6 Top sectors in terms of FDI before (top) and after (bottom) 2014

This is the first of the steps followed to build the model. A plot of the FDI into the services sector against the year (Fig. 8) gives an idea of the stationarity of the data. Time-series forecasting requires the data to be stationary. The stationarity of this series can be checked by performing an augmented Dickey–Fuller test on the data, which gives a test statistic of −0.816 and a p-value of 0.814 (>0.05). The null hypothesis, that the data is non-stationary, therefore cannot be rejected. Hence, the data is treated as non-stationary, and a transformation is required. A non-stationary series can be converted into a stationary one by applying transformations such as differencing or taking the log of the series. Differencing the data and performing an augmented Dickey–Fuller test on the transformed data gives a test statistic of −3.866 and a p-value of 0.002 (