Artificial Intelligence in Data Mining
Artificial Intelligence in Data Mining Theories and Applications
Edited by
D. Binu Managing Director and cofounder, Resbee Info Technologies Pvt. Ltd, India
B. R. Rajakumar Director and cofounder, Resbee Info Technologies Pvt. Ltd, India
Academic Press is an imprint of Elsevier 125 London Wall, London EC2Y 5AS, United Kingdom 525 B Street, Suite 1650, San Diego, CA 92101, United States 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom Copyright © 2021 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. 
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress ISBN: 978-0-12-820601-0 For Information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Mara Conner Acquisitions Editor: Chris Katsaropoulos Editorial Project Manager: Rachel Pomery Production Project Manager: Sreejith Viswanathan Cover Designer: Hitchen Miles Typeset by MPS Limited, Chennai, India
Contents

List of contributors
Preface

1. Introduction
D. BINU AND B. R. RAJAKUMAR
1.1 Data mining
1.2 Description of data mining
1.3 Tools in data mining
1.4 Data mining terminologies
1.5 Merits of data mining
1.6 Disadvantages of data mining
1.7 Process of data mining
1.8 Data mining techniques
1.9 Data mining applications
1.10 Intelligent techniques of data mining
1.11 Expectations of data mining
References

2. Intelligence methods for data mining task
N. JAYALAKSHMI AND SRIDEVI MANTHA
2.1 Introduction
2.2 Procedure for intelligent data mining
2.3 Associate rule mining
2.4 Association rule mining: multiobjective optimization method
2.5 Intelligent methods for associate rule mining
2.6 Associate rule mining using genetic algorithm
2.7 Association rule mining using particle swarm optimization
2.8 Bees swarm optimization–association rule mining algorithm
2.9 Ant colony optimization algorithm
2.10 Penguins search optimization algorithm for association rules mining Pe-ARM
2.11 Deep learning in data mining
References

3. Unsupervised learning methods for data clustering
SATISH CHANDER AND P. VIJAYA
3.1 Data clustering
3.2 Mode seeking and mixture-resolving algorithms
3.3 Conclusion

4. Heuristic methods for data clustering
RAJASEKHAR BUTTA, M. KAMARAJU AND V. SUMALATHA
4.1 What is the heuristic method?
4.2 Summary

5. Deep learning methods for data classification
ARUL V. H.
5.1 Data classification
5.2 Data mining
5.3 Background and evolution of deep learning
5.4 Deep learning methods
References

6. Neural networks for data classification
BALACHANDRA KUMARASWAMY
6.1 Neural networks
6.2 Different types of neural networks
6.3 Training of neural network
6.4 Training algorithms in neural network for data classification
References

7. Application of artificial intelligence in the perspective of data mining
D. MENAGA AND S. SARAVANAN
7.1 Artificial intelligence
7.2 Artificial intelligence versus data mining
7.3 Modeling theory based on artificial intelligence and data mining
References

8. Biomedical data mining for improved clinical diagnosis
G. NALINIPRIYA, M. GEETHA, R. CRISTIN AND BALAJEE MARAM
8.1 Introduction
8.2 Descriptions and features of data mining
8.3 Revolution of data mining
8.4 Data mining for healthcare
8.5 Data mining for biological application
8.6 Data mining for disease diagnosis
8.7 Data mining of drug discovery
References

9. Satellite data: big data extraction and analysis
RAHUL KOTAWADEKAR
9.1 Remote-sensing data: properties and analysis
9.2 Summary
References

10. Advancement of data mining methods for improvement of agricultural methods and productivity
ANUSH PRABHAKARAN, CHITHRA LEKSHMI K. S. AND GANESH JANARTHANAN
10.1 Agriculture data: properties and analysis
10.2 Disease prediction using data mining
10.3 Pests monitoring using data mining
10.4 Summary
References

11. Advanced data mining for defense and security applications
PRAMOD PANDURANG JADHAV
11.1 Military data: properties and analysis
11.2 Applying data mining for military application
References

Index
List of contributors

D. Binu
Resbee Info Technologies, India

Rajasekhar Butta
ECE Department, Gudlavalleru Engineering College, Gudlalvalleru, India

Satish Chander
Department of Computer Science & Engineering, Birla Institute of Technology, Mesra, Ranchi, India

R. Cristin
Department of CSE, GMR Institute of Technology, Rajam, Andhra Pradesh, India

M. Geetha
SRM Institute of Science and Technology, Vadapalani Campus, Chennai, India

Pramod Pandurang Jadhav
Department of Computer Engineering, G H Raisoni Institute of Engineering and Technology (GHRIET), Pune, India

Ganesh Janarthanan
L&T Technology Services Ltd, Bangalore, India

N. Jayalakshmi
Raghu Institute of Technology, Visakhapatnam, India

M. Kamaraju
ECE Department, Gudlavalleru Engineering College, Gudlalvalleru, India

Rahul Kotawadekar
Master of Computer Applications (MCA), Finolex Academy of Management and Technology, University of Mumbai, Ratnagiri, India

Balachandra Kumaraswamy
Department of Electronics and Telecommunication, B.M.S College of Engineering, Bengaluru, India

Chithra Lekshmi K. S.
Department of Visual Communication (Electronic Media), PSG College of Arts & Science, Coimbatore, India

Sridevi Mantha
Big Data Analytics, Quadratyx, Hyderabad, Telangana, India

Balajee Maram
Department of CSE, GMR Institute of Technology, Rajam, Andhra Pradesh, India

D. Menaga
KCG College of Technology, Chennai, India

G. Nalinipriya
Department of Information Technology, Saveetha Engineering College, Chennai, India

Anush Prabhakaran
Department of Mechatronics Engineering, Kumaraguru College of Technology, Saravanapatti, Coimbatore, India

B. R. Rajakumar
Resbee Info Technologies, India

S. Saravanan
HCL Technologies Limited, Chennai, India

V. Sumalatha
ECE Department, Jntuacea, Anatapuramu, India

Arul V. H.
Department of ECE, Thejus Engineering College, Thrissur, India

P. Vijaya
Department of Computer Science & Mathematics, Modern College of Business and Science, Muscat, Oman
Preface
Artificial intelligence (AI) has reached a level of maturity at which several of its methods have proved successful. Its potential is evident in research projects ranging from decision making to systems that rival the cognitive processes of human experts. Other successful AI models have produced descriptive reasoning theories and have used formal languages to represent pattern discovery and the relations among data.
The automation of tools throughout society has considerably increased the capacity to produce and accumulate data from different sources. The growing quantity of data has flooded every aspect of our lives. This growth in stored data has created an urgent need for novel methods and automatic tools that can intelligently help to transform huge volumes of data into useful information and knowledge. That need has led to a promising and emerging frontier in information technology called data mining. Data mining has a huge capacity to improve business outcomes. The significance of AI in data mining is well known, and it has been termed the oil of the cyber world.
The book is designed to cover the key aspects of AI in data mining. It is divided into short chapters so that the topics can be arranged and understood properly, and the topics within each chapter are organized in sequence to ensure a smooth flow. The book uses accessible language to explain the fundamentals of the subject. It offers a logical method of explaining several complex concepts and stepwise techniques for elaborating the important topics. Each chapter is supported by essential illustrations, practical instances, and solved problems. The chapters are ordered so that each topic builds upon earlier ones. Every care has been taken to make learners comfortable in understanding the basic concepts of the subject.
The book not only covers the complete scope of the subject but also illustrates its philosophy, which makes the subject clearer and more interesting. It gives learners adequate information to attain mastery of data mining and its applications. It covers data mining, biomedical data mining, data clustering, heuristic methods for clustering data, deep learning methods, neural networks for data classification, and the application of data mining in defense and security, without compromising on detail. The aim of the book is to illustrate concepts with practical instances so that learners can grasp the content more easily. Another important feature is the elaboration of data mining algorithms with examples. Moreover, the book contains several educational features, such as chapter-wise abstracts, summaries, practical examples, and relevant references, to offer sound knowledge to beginners. It also offers students a guide to gaining knowledge of the technology. We hope that this book will motivate individuals of different backgrounds and experience to exchange their ideas concerning data mining so as to contribute toward the further promotion and shaping of this exciting and dynamic field.
We wish to convey our heartfelt thanks to all those who helped to make this book a reality. Any suggestions for improving the book will be acknowledged and appreciated.
D. Binu and B. R. Rajakumar
1 Introduction
D. Binu, B. R. Rajakumar
RESBEE INFO TECHNOLOGIES, INDIA
1.1 Data mining
Data mining is an active research domain that has attracted the interest of many industries. Because of the massive volume of data now collected, there is a pressing need to turn such data into useful information and knowledge. The knowledge acquired is used in applications such as production control, science exploration, engineering design, business management, and market analysis. Data mining can be regarded as the result of ever-growing datasets and of the evolution of information technology. An evolutionary path can be observed in the database industry in the development of successive capabilities: data collection and database creation, data management (including data storage and retrieval), and data analysis for better understanding. Since the 1960s, information technology and databases have evolved systematically from primitive file-processing models to sophisticated and powerful database systems. Research and development in database systems since the 1970s has led to relational databases, data organization methods, indexing, and data modeling tools. Moreover, users gained convenient data access through user interfaces, query processing, and query languages. Simply stated, data mining is a technique for extracting knowledge from massive datasets. Today's data mining products and functions are the result of influences from several disciplines, including information retrieval, databases, machine learning, and statistics. Other areas of computer science, such as multimedia and graphics, have also had a major impact on the Knowledge Discovery in Databases (KDD) process. KDD refers to the overall process of discovering useful knowledge from data. One aim of KDD is to present the outcomes of the process in a meaningful manner; because many results are often generated, this can be a nontrivial problem.
Visualization methods may involve sophisticated multimedia and graphics presentations, and data mining strategies can in turn be applied to multimedia applications. In contrast to earlier research in these separate areas, a major trend in the database community is to integrate the results from the different disciplines into a unified data or algorithmic approach. The goal is to develop a big-picture view of the area that enables the integration of different types of applications into real-world user domains.
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00005-7 © 2021 Elsevier Inc. All rights reserved.
Data mining is a multidisciplinary domain that supports knowledge workers who try to extract useful information from huge, data-rich datasets. The concept is rooted in the idea of extracting knowledge from massive data. Data mining tools help to discover pertinent information by applying several data analysis methods. Thus any method employed for extracting patterns from a huge data source is considered a data mining method.
1.2 Description of data mining
Data mining is considered a part of computer science; it refers to the process of discovering patterns in huge datasets. Data mining draws on several fields, including statistics, artificial intelligence, database systems, and machine learning. The aim of data mining is to extract essential information from a dataset and convert it into a comprehensible structure for later use. Beyond the raw analysis stage, effective data mining also involves database management, data processing, interestingness metrics, inference considerations, computational complexity, visualization, and online updates. Data mining plays an essential role in knowledge discovery, that is, in analyzing huge datasets and acquiring useful knowledge from them. Data mining is employed effectively in business, medicine, insurance, weather forecasting, transportation, healthcare, and government. These applications bring substantial benefits to the industries that use them.
1.2.1 Different databases adapted for data mining
Data mining can be carried out on the following kinds of data:
• relational databases
• advanced databases and data repositories
• transactional and spatial databases
• object-oriented and object-relational databases
• data warehouses
• diverse databases
• text databases
• multimedia databases
• text mining and web mining
1.2.2 Different steps in design process for mining data
Fig. 1–1 depicts the process of mining data.
FIGURE 1–1 Process of mining data.
• Understanding business
This phase establishes the goals of data mining. First, the client's objectives must be understood and the client's needs carefully examined. Next, take stock of the current data mining situation, considering factors such as resources, assumptions, and constraints. The data mining goals are then defined from the business objectives and the analysis of the current situation. Finally, a sound data mining plan must be drawn up in detail to accomplish both the data mining goals and the business goals.
• Understanding data
This phase checks the data to determine whether it is suitable for attaining the goals of data mining. First, the data are collected from the different data sources available to the business. The sources may include multiple databases, data cubes, or flat files. Problems such as schema integration and object matching can arise during data integration. The process is complicated and tricky because data from different sources are unlikely to match easily. It is therefore difficult to establish whether a given object in one source is the same as an object in another. Here, metadata should be used to reduce errors in the data integration process. The next step is to explore the properties of the collected data. A good way to explore the data is to answer the data mining questions using queries, reporting, and visualization tools. The query results also indicate the quality of the data. Any missing data found at this stage should be filled in, for example with placeholder values.
• Preparation of data
This phase makes the data ready for extracting the essential knowledge. The data is processed to prepare it for production use. Here, data from the various sources are selected, cleaned, transformed, anonymized, formatted, and constructed for data mining.
• Data cleaning
Data cleaning removes noisy data and fills in missing values. For instance, if the age field in a customer profile is empty, the record is incomplete and the value must be filled in. In some cases the data contain outliers; an age of 300, for example, cannot be valid. The cleaned data should be consistent.
• Transformation of data
Data transformation operations contribute to the success of the mining process. They alter the data to make it useful for mining. Some of the operations employed are as follows:
  • Smoothing: helps to remove noise from the data.
  • Aggregation: aggregation operations are applied to the data to produce a concise summary.
  • Generalization: low-level data is replaced with higher-level concepts.
  • Normalization: attribute values are scaled to fall within a specified range, for instance 0 to 1.
  • Attribute design: new attributes are constructed from the given attributes to assist data mining.
The transformed data can be used as the final dataset for modeling.
• Modeling
The modeling phase uses mathematical models to determine patterns in the data. Based on the business objectives, appropriate modeling methods are chosen for the prepared dataset. A scenario is constructed for testing the quality and validity of the model. The model is then run on the prepared dataset. The results must be assessed with the stakeholders to make sure that the model satisfies all the data mining objectives.
• Evaluation
In this stage, the identified patterns are evaluated against the business goals. The results produced by the data mining model are assessed against the set of business objectives. Gaining business understanding is an iterative process; new business requirements may be raised by the data mining itself. A final decision is then taken on whether to move the model into the deployment phase.
• Deployment
In this stage, the discoveries of data mining are put to use in everyday business operations. The knowledge discovered must be presented so that nontechnical stakeholders can easily understand it. A comprehensive deployment plan should cover the monitoring of the data and of the mining results. A final report is produced with the lessons learned, which can be used to improve the organization's business policies.
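The data cleaning and normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed procedure: the `ages` values, the `MAX_PLAUSIBLE_AGE` limit, and the choice of the median as the fill value are all hypothetical and do not come from the text, which only requires that missing values be filled, that impossible values such as an age of 300 be handled, and that attributes be scaled into a range such as 0 to 1.

```python
from statistics import median

# Hypothetical customer ages; None marks a missing value.
ages = [23, 35, None, 41, 300, 29]

MAX_PLAUSIBLE_AGE = 120  # assumed plausibility limit, not taken from the text


def clean_ages(values, upper=MAX_PLAUSIBLE_AGE):
    """Replace missing or implausible ages with the median of the valid ones."""
    valid = [v for v in values if v is not None and 0 <= v <= upper]
    fill = median(valid)
    return [v if v is not None and 0 <= v <= upper else fill for v in values]


def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: map every value to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


cleaned = clean_ages(ages)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```

Cleaning is done before normalization on purpose: if the age of 300 were left in place, it would become the maximum and compress all the valid ages into a narrow band near 0.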
1.3 Tools in data mining
Two data mining tools that are widely used in industry are as follows:
1. R-language
R is a free tool for statistical computing and graphics. It offers a wide assortment of classical statistical tests, graphical methods, and time-series analysis. Moreover, it provides effective data handling and storage facilities.
2. Oracle data mining (ODM)
ODM is a component of Oracle Advanced Analytics. It permits analysts to produce detailed insights and make more accurate predictions. Moreover, it helps to predict customer behavior, develop customer profiles, and identify cross-selling opportunities.
1.4 Data mining terminologies
A general data mining system consists of the following components:
1. Data warehouse, database, or other information repositories
This component consists of one or more databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration are carried out on the data.
2. Data warehouse server
The database or data warehouse server is responsible for fetching the relevant data in response to a data mining request.
3. Knowledge base
This is the domain knowledge used to guide the search or to evaluate the interestingness of the resulting patterns. Such knowledge can include concept hierarchies, which organize attribute values into different levels of abstraction. Knowledge such as user beliefs can be used to assess the interestingness of a pattern based on its unexpectedness. Other examples of domain knowledge include thresholds, interestingness constraints, and metadata.
4. Data mining engine
This is essential to the data mining system and comprises a set of functional modules for tasks such as characterization, association analysis, classification, evolution analysis, and deviation analysis.
5. Module for pattern evaluation
This module applies interestingness measures and interacts with the data mining modules to focus the search on useful patterns. It may access the thresholds stored in the knowledge base. Alternatively, pattern evaluation may be integrated with the mining module, depending on how the data mining method is implemented. For efficient data mining, it is recommended to push the evaluation of pattern interestingness into the mining process so as to confine the search to interesting patterns.
6. Graphical user interface (GUI)
This module is the interface between users and the data mining system. It allows the user to interact with the system by specifying a data mining query, providing information to help focus the search, and performing exploratory data mining based on intermediate results. Moreover, the GUI allows users to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
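The interplay between the knowledge base, the data mining engine, and the pattern evaluation module (items 3 to 5 above) can be made concrete with a small sketch. The text does not name specific interestingness measures, so the code below assumes the two commonly used for association patterns, support and confidence. The transactions, the threshold values, and every function name are hypothetical stand-ins rather than parts of any real system.

```python
from itertools import combinations

# Hypothetical transactions (sets of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Thresholds of the kind the knowledge base might hold (values assumed).
knowledge_base = {"min_support": 0.5, "min_confidence": 0.6}


def support(itemset, data):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in data if itemset <= t) / len(data)


def candidate_rules(data):
    """Mining-engine stand-in: every rule A -> B between two single items."""
    items = sorted(set().union(*data))
    for a, b in combinations(items, 2):
        yield (a, b)
        yield (b, a)


def evaluate(data, kb):
    """Pattern-evaluation stand-in: keep rules that pass both thresholds."""
    kept = []
    for a, b in candidate_rules(data):
        s = support({a, b}, data)
        s_a = support({a}, data)
        conf = s / s_a if s_a else 0.0
        if s >= kb["min_support"] and conf >= kb["min_confidence"]:
            kept.append((a, b, s, conf))
    return kept


interesting = evaluate(transactions, knowledge_base)
for a, b, s, conf in interesting:
    print(f"{a} -> {b}: support={s:.2f}, confidence={conf:.2f}")
```

Here evaluation runs as a separate pass over all candidates. Pushing the threshold test into `candidate_rules` itself would correspond to the recommendation in item 5 to confine the search to interesting patterns.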
1.5 Merits of data mining
Data mining brings benefits in several areas, some of which are listed as follows:
1. Marketing or retail industries for making campaigns
Data mining helps marketing companies build models from historical data to predict who will respond to new marketing campaigns, such as direct mail and online marketing campaigns. With these results, marketers have a suitable approach for selling profitable products to targeted customers.
Data mining brings similar benefits to retail companies. With market basket analysis, a store can arrange its products so that the items customers frequently buy together are easy to find, which makes shopping more pleasant. Moreover, the method helps retail companies offer discounts on specific products that attract the interest of many customers.
2. Finance or banking for determining fraudulent transactions
Data mining gives financial institutions information about loans. By building a model from the data of past customers, a bank can identify better loans. Moreover, data mining helps banks detect fraudulent transactions and so protect credit card owners.
3. Manufacturing
By applying data mining, manufacturers can detect faulty equipment and determine the optimal control parameters. In addition, data mining is applied to determine the control parameters that lead to high production. Manufacturers then use these parameters to improve product quality.
4. Governments
Data mining helps government agencies by analyzing records of financial transactions to build patterns that can reveal criminal activity or financial offenses.
1.6 Disadvantages of data mining
Some of the obstacles faced by data mining methods are as follows:
1. Human interaction
Because data mining problems are often not precisely stated, interfaces are needed with both domain experts and technical experts. Technical experts formulate the queries and help interpret the results. Users are needed to identify the training data and the desired results.
2. Overfitting
When a model is built from a given database state, it is desirable that the model also fit future states. Overfitting occurs when the model does not fit future states. It may be caused by assumptions made about the data or simply by the small size of the training dataset. Overfitting can arise in other situations as well, even when the data are not distorted.
3. Outliers
There are often many data entries that do not fit the derived model. This becomes a bigger issue with huge databases. If a model is built that includes these outliers, it may not perform well on data that are not outliers.
4. Massive datasets
The massive datasets associated with data mining create problems when techniques designed for small datasets are applied. Many modeling applications described in the literature are inefficient for huge datasets. Parallelization and sampling are tools for attacking this scalability issue.
5. High dimensionality
A conventional database schema may consist of many attributes. The problem is that not all of these attributes are needed to solve a given data mining problem. The use of some attributes may interfere with the correct completion of the data mining task, while the use of others may simply increase complexity and reduce the efficiency of the algorithm. This problem is known as the curse of dimensionality: when many attributes are involved, it is hard to determine which ones should be used. One solution is to reduce the number of attributes, which is termed dimensionality reduction. However, determining which attributes are important is itself a complex task.
6. Security issues
Security is a major issue when dealing with massive datasets. Businesses hold information about their customers. Protecting this information is a major challenge, because hackers can access and steal essential customer data.
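The overfitting problem in item 2 can be illustrated with a small, self-contained sketch. The data are invented for the purpose: the training points follow the trend y = 2x with alternating noise of +1 and -1, and the test points lie on the trend itself. Two hypothetical models are compared: a "memorizer" that returns the label of the nearest training point, and an ordinary least-squares line. Neither is drawn from the text; they are chosen only because the contrast is easy to see.

```python
# Hypothetical data: the trend is y = 2x; training labels carry noise of +/-1.
train = [(0, 1), (1, 1), (2, 5), (3, 5), (4, 9), (5, 9)]
test = [(0.4, 0.8), (1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def memorizer(points):
    """Model that memorizes the training set: return the y of the nearest x."""
    def predict(x):
        return min(points, key=lambda p: abs(p[0] - x))[1]
    return predict


def mse(predict, points):
    """Mean squared error of a prediction function on labeled points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(train)


def line(x):
    return slope * x + intercept


nearest = memorizer(train)

for name, model in [("memorizer", nearest), ("line", line)]:
    print(name, round(mse(model, train), 3), round(mse(model, test), 3))
```

The memorizer reproduces the training data perfectly, so its training error is zero, yet its error on the unseen points is far larger than that of the line. The line fits the training data less closely but captures the underlying trend, which is the behavior wanted for future database states.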
1.7 Process of data mining The heart of the KDD process is data mining techniques for refining patterns from the massive datasets. These techniques pose different performance goals on the basis of the intended outcome of the complete KDD process. It can be observed that numerous techniques with different aims can be utilized to attain the required result. The majority of goals in data mining domain fall in these steps: • Processing of data Based on the desires of KDD process, the analyst can aggregate, filter, clean, sample, and alter data for analysis. Mechanizing numerous tasks of data processing and combining them impeccably into the complete process may remove or minimize the program focused routines for data import/export to enhance the productivity of analysts. • Prediction For a data item or a predictive scheme, one can forecast the particular attribute value or a data item. For instance, a predictive scheme for the transactions done using the credit card can be utilized to predict the likelihood of a fraudulent transaction. The prediction can be utilized for validating the detected hypothesis. • Regression For group of data items the regression represents the evaluation of dependence with a number of attribute that values other items considering same item and a habitual invention of a model, which could foresee the values of attributes considering new record. Regression analysis can be utilized for modeling the relation between different dependent and independent variables. For sovereign variables the attributes are termed as response variables that are utilized to make a prediction. Various issues of the realworld are considered for enhancing the process of data mining.
Chapter 1 • Introduction
9
For example, the sales volumes, prices of stocks, and rates of product failures are complex to forecast as they are based on complicated interfaces of different variables or predictors. Thus additional methods such as decision trees, logistic regression, and neural networks (NNs) are essential to forecast the values of the future. Similar models are utilized for both classification and regression. For instance, the Classification and Regression Trees is the algorithm of a decision tree which is utilized for building the regression trees to forecast continuous response variables and classification trees for classifying categorical response variables. NNs can be constructed as a regression or classification models. Different types of regression techniques utilized for data mining are listed as follows: • nonlinear regression • multivariate nonlinear regression • linear regression • multivariate linear regression • Classification With a set of definite categorical classes the determination of class for a specific data item is a major requirement. Classification is a widely utilized data mining method that adapts a group of determined class to design a model, which categorizes data with respect to its class. Credit risk applications and fraud detection are broadly suited for these types of analysis. The method adapts NN-based classification algorithms and decision tree for classifying the huge data. The data classification process consists of classification and learning. In learning, the training data are evaluated by the classification method. In classification, the test data are utilized for eliminating the precision of classification rules. If the correctness is satisfactory, then the rules are adapted with the new data. For fraud detection the data could involve whole records of valid activities, and deceitful cases discovered by the technique are eliminated. 
The classifier training algorithm uses these preclassified instances to determine the set of parameters needed for correct discrimination. The algorithm then encodes these parameters into a model called a classifier. The classification techniques include:
• decision tree models
• NNs
• classification based on Bayesian rules
• classification based on associations
• support vector machines (SVM)
• Clustering
Given a group of data items, the task is to partition the data into classes such that items with similar properties are grouped together. Clustering is a technique for determining groups of related items. For instance, for a given dataset of customers, a typical task is to identify subgroups that have similar buying behavior.
Clustering can be described as the identification of classes of similar objects. By using clustering methods, one can further detect the sparse and dense regions in the data and discover the overall distribution pattern among the attributes of the data. Classification can also be used as an effective method for distinguishing the classes. The clustering methods include:
• partitioning techniques
• density-based methods
• hierarchical agglomerative techniques
• model-based techniques
• grid-based techniques
• constraint-based techniques
• Link analysis
Given a group of data items, the task is to identify relations among the items and their attributes. These relations can be associations between attributes of the same data item or associations between different data items. The analysis of relations between items over a period of time is known as sequential pattern analysis.
• Model visualization
Visualization plays an important part in making the extracted knowledge comprehensible and easily interpretable by humans. The model itself remains the principal device. Visualization methods range from simple histograms and scatter plots to parallel coordinates.
• Exploratory data analysis (EDA)
EDA is the interactive exploration of a dataset without heavy reliance on predefined models. EDA thus attempts to recognize interesting patterns. Graphical representations of the data are used to exploit human intuition. Numerous software packages are available to support data exploration, and it is desirable to integrate these methods into the KDD environment.
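The two-phase classification process described in this section (learning on preclassified data, then estimating accuracy on test data before the rules are applied to new data) can be sketched as follows. This is only a minimal illustration: the nearest-centroid classifier, the toy transaction records, and the 0.9 accuracy threshold are all assumptions chosen for brevity and do not come from the text.

```python
from collections import defaultdict


def train_centroids(samples):
    """Learning phase: compute one mean feature vector (centroid) per class."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((c - f) ** 2 for c, f in zip(center, features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, samples):
    """Classification phase: fraction of held-out samples labelled correctly."""
    hits = sum(1 for features, label in samples
               if classify(centroids, features) == label)
    return hits / len(samples)


# Hypothetical preclassified records: (amount, transactions per hour) -> label.
TRAIN = [((20.0, 1.0), "valid"), ((35.0, 2.0), "valid"),
         ((900.0, 9.0), "fraud"), ((1200.0, 12.0), "fraud")]
TEST = [((25.0, 1.0), "valid"), ((1000.0, 10.0), "fraud")]
ACCEPTABLE_ACCURACY = 0.9  # illustrative threshold, not taken from the text

model = train_centroids(TRAIN)
test_accuracy = accuracy(model, TEST)

prediction = None
if test_accuracy >= ACCEPTABLE_ACCURACY:
    # Only a model that passed the test phase is applied to new data.
    prediction = classify(model, (1100.0, 11.0))
```

Any of the classifier families listed above (decision trees, NNs, Bayesian rules, SVM) could take the place of the centroid model; the learn, test, then apply structure stays the same.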
1.8 Data mining techniques
Data mining is any method that helps to extract information from data. Various techniques are employed for data mining, each serving a different purpose and each with its own merits and demerits. The methods employed for data mining fall into the following categories:
• Statistical techniques
Statistical methods concentrate mostly on testing predetermined hypotheses and fitting models to the data. A statistical method relies on an underlying probability model. Moreover, these techniques are generally intended for use by statisticians when mining massive datasets, so human intervention is needed to generate candidate hypotheses and models.
• Case-based reasoning (CBR)
CBR is a strategy that attempts to solve a given problem by making direct use of previous experiences and solutions. A case is a specific problem that has previously been encountered and solved. Given a new problem, CBR examines the stored cases and finds similar ones. If similar cases exist, their solutions are adapted to the new problem, which is then kept for future reference.
• Association rule
Association rule mining [1] is employed to find frequent item sets in massive datasets. Such findings help businesses make decisions about, for example, cross-marketing, catalog design, and the analysis of customer shopping behavior. Association rule algorithms need to produce rules with an associated confidence. However, the number of possible association rules for a given dataset is usually very large, and a high proportion of the rules are typically of little value. The kinds of association rule are:
• quantitative association rule
• multilevel association rule
• multidimensional association rule
• NNs
An NN [2] is a system formed from a large number of simulated neurons interconnected by synapses. The strength of the interconnections may change in response to a presented stimulus or an obtained output, which enables the network to learn. The NN contains three types of layers: input, hidden, and output. The input layer represents the features of the input obtained by analyzing the data. The obtained features are then normalized to improve the precision of the network. The hidden layers contain the neurons that extract features by performing internal operations within the network. Finally, the output layer generates the output, and successive layers are connected to each other to form the network.
In addition, the NN is a set of connected input/output units in which each connection has an associated weight. In the learning phase the network learns by adjusting the weights so that it can predict the correct class labels of the input tuples. NNs have a remarkable ability to derive meaning from complex or imprecise data, and can be used to extract patterns and detect trends that are too complicated to be found by other computational methods. These methods are well suited to continuous-valued inputs and outputs. Type of NN:
• Back propagation NN
1. Decision trees
A decision tree is a tree in which each nonterminal node denotes a test or decision on the data item under consideration. Based on the result of the test, a particular branch is selected. To classify a specific data item, one starts at the root node and follows the outcomes of the tests until a terminal node is reached. When the terminal node is reached, then the
decisions are made. Decision trees can also be interpreted as a special form of rule set, characterized by the hierarchical organization of the rules.
2. Rule induction
A rule states a statistical correlation between the occurrence of certain attributes in a data item or between certain data items in a dataset.
3. Bayesian belief networks (BBNs)
A BBN [3] is a graphical representation of probability distributions derived from co-occurrence counts in a group of data items. Specifically, a BBN is a directed acyclic graph in which the nodes represent attribute variables and the edges denote probabilistic dependencies. The conditional probability distribution associated with each node describes the relation between the node and its parents.
4. Evolutionary programming
Evolutionary programming comprises optimization methods inspired by behaviors observed in natural evolution. The best solutions are selected and combined with each other so that the overall goodness of the solution set increases. In data mining, evolutionary programming is used to formulate hypotheses about the dependencies between variables, expressed as association rules or in some other internal formalism.
5. Fuzzy sets
Fuzzy sets are a key method for representing uncertainty. Uncertainty arises in many forms, such as inconsistency, nonspecificity, vagueness, and imprecision. Fuzzy sets exploit uncertainty to make system complexity manageable. They are a powerful means of dealing with imperfect or vague data and can also be used to build uncertain models of data that behave more smoothly than conventional methods. Because fuzzy systems tolerate uncertainty and can even use language-like vagueness to smooth data lags, they may provide robust, noise-tolerant models.
6. Rough sets
A rough set is described by a lower and an upper bound of a set. Every element of the lower bound is certainly a member of the set.
The upper bound of a rough set is the union of the lower bound and the boundary region. A rough set can be viewed as a fuzzy set with a three-valued membership function. Like fuzzy sets, rough sets are a mathematical concept for dealing with uncertainty in data. Rough sets are seldom used as a stand-alone solution; they are usually combined with other techniques such as clustering, classification, and rule induction methods.
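The lower bound, upper bound, and boundary region of a rough set can be made concrete with a small sketch. It assumes the usual construction in which objects that share the same attribute values are indistinguishable; the patient records, the attribute names, and the 1.0 / 0.5 / 0.0 encoding of the three-valued membership are hypothetical choices made only for illustration.

```python
def approximations(universe, key, target):
    """Return (lower, upper, boundary) approximations of `target`.

    Objects with the same `key` value are indistinguishable and form one block.
    """
    blocks = {}
    for obj in universe:
        blocks.setdefault(key(obj), set()).add(obj)

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # block lies entirely inside the target set
            lower |= block
        if block & target:       # block overlaps the target set
            upper |= block
    return lower, upper, upper - lower


def membership(obj, lower, upper):
    """Three-valued membership: certainly in, possibly in, certainly out."""
    if obj in lower:
        return 1.0
    if obj in upper:
        return 0.5
    return 0.0


# Hypothetical records: patient id -> (temperature, cough).
ATTRIBUTES = {
    "p1": ("high", "yes"), "p2": ("high", "yes"),
    "p3": ("normal", "no"), "p4": ("high", "no"),
    "p5": ("high", "no"), "p6": ("normal", "no"),
}
FLU = {"p1", "p2", "p4"}  # the set to be approximated

lower, upper, boundary = approximations(ATTRIBUTES, ATTRIBUTES.get, FLU)
```

Here patients p4 and p5 have identical attributes but only p4 belongs to the target set, so both fall into the boundary region, and the upper bound is exactly the union of the lower bound and that boundary.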
1.9 Data mining applications
Data mining is a relatively new technology that has not fully matured; nevertheless, numerous industries already use it on a regular basis. These organizations include
hospitals, insurance companies, and retail stores. Many of these organizations combine data mining with other important tools such as statistics and pattern recognition. Data mining can be used to find patterns and connections that would otherwise be difficult to detect. The technology is popular with many businesses because it allows them to learn more about their customers and make effective marketing decisions.
• Future healthcare
Data mining has great potential to improve health systems. It uses analytics to identify best practices that improve health systems and reduce costs. Researchers use data mining approaches such as machine learning, statistics, data visualization, multidimensional databases, and soft computing. Mining can be used to predict diseases in patients, and the processes are designed to improve patient care. Thus data mining assists the detection of diseases in advanced healthcare applications.
• Market basket analysis
Market basket analysis is a method based on the theory that a customer who buys a certain item is more likely to buy certain other items. This method allows the seller to understand the purchase behavior of the buyer. This information can help the retailer recognize the buyer's needs and alter the store layout accordingly.
• Education
Educational data mining (EDM) is an emerging field concerned with developing techniques that discover knowledge from data originating in educational applications. The goals of EDM include predicting students' learning behavior, advancing scientific knowledge, and analyzing the effects of educational support. An institution can use data mining to make accurate decisions and to predict students' results. Based on the outcomes, the institution can focus on what to teach and how to teach it.
The learning patterns of students are captured for developing intelligent techniques.
• Treatment effectiveness
Data mining applications are designed to evaluate the effectiveness of medical treatments. Data mining can help to predict treatment effectiveness by comparing causes, symptoms, and courses of treatment.
• Pharmaceutical industry
Data mining is used to help pharmaceutical firms manage their business and develop novel products and services. A deeper understanding of the knowledge hidden in pharmaceutical data supports effective decision-making.
• Hospital management
Contemporary hospitals generate and collect huge amounts of data. The mined data are stored in the hospital information system, in which the temporal behavior of global hospital activities needs to be visualized.
The three layers employed for managing the hospital are:
• services for medical staff
• services for hospital management
• services for patients
• Industrialized engineering
Knowledge acquisition is a vital capability in industrial applications of data mining. Data mining tools are essential for discovering patterns in complicated manufacturing processes. Data mining is applied in system-level design to extract the relations among customer needs data, product portfolio, and product architecture. The method can also be used to predict cost, development time, and dependencies among tasks.
• Customer relationship management (CRM)
CRM is the acquisition and retention of customers, together with the improvement of customer loyalty, through customer-focused strategies. To maintain a proper relationship with customers, a business has to collect data and analyze the information. With data mining methods, the collected data can be used to focus on techniques for retaining customers.
• Fraud detection
Conventional fraud detection techniques are lengthy and complicated. Data mining helps by providing meaningful patterns and turning data into information. Information that is valid and useful is termed knowledge. An ideal fraud detection model must protect the information of each user. The supervised technique involves collecting sample records, which are labeled as fraudulent or nonfraudulent. A model is constructed from these data and can then be used to detect whether a new record is fraudulent.
• Intrusion detection
Any act that compromises the confidentiality and integrity of a resource is termed an intrusion. Protective measures for avoiding an intrusion include information protection, the avoidance of programming errors, and user authentication. Data mining can enhance intrusion detection by adding a focus on anomaly detection.
The method helps distinguish an unusual activity from the day-to-day activities of the network. Moreover, data mining can help extract the data that are most relevant to the problem.
• Deception detection
Capturing illegitimate data is comparatively easy, whereas obtaining the truth from a criminal is difficult. Law enforcement uses different mining methods to investigate crimes and to monitor suspected terrorists. The data mining process can help to find meaningful patterns in data that are in an unstructured format. Data samples collected from earlier investigations are compared, and models for lie detection are built from them. With these models, processes can be tested as the need arises.
• Customer segmentation
Conventional market research can help to segment customers, but data mining goes further and increases market effectiveness. Data mining helps to group customers into distinct segments so that services can be tailored to each segment's needs. Marketing is ultimately about retaining customers. Data mining helps to identify customer segments on the basis of their vulnerability, so that the business can provide them with special offers and improve their satisfaction.
• Financial banking
With computerized banking, massive amounts of data are generated by new transactions. Data mining contributes to solving business problems in banking and finance by discovering new patterns in business information. Managers can use this information for better segmenting, targeting, acquiring, retaining, and maintaining of profitable customers.
• Corporate surveillance
Corporate surveillance is the monitoring of a person's or group's behavior by a corporation. The collected data are used for marketing and can be used by the business to tailor its products to its customers. The data can be used for marketing such as targeted advertisements, for example those based on a user's search-engine history.
• Research analysis
Data mining is useful for data cleaning, data preprocessing, and the integration of multiple datasets. Researchers can find similar data in a dataset that might bring a change in the research. Visual data mining and data visualization provide a clear picture of the data.
• Criminal investigation
Criminology is the process that aims to identify the characteristics of crime. Crime analysis involves detecting and exploring crimes and their relationships with criminals. The large volume of crime datasets and the complexity of the relationships among different types of data have made criminology a suitable domain for applying data mining methods.
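The market basket analysis application described in this section rests on two standard association rule measures, support and confidence. The sketch below computes both for a single rule; the basket contents are hypothetical, and the helper names are illustrative only and do not belong to any particular library.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for basket in transactions if items <= basket)


def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    base = count(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both = count(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical purchase histories, one set of items per basket.
BASKETS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

rule_support = support(BASKETS, {"bread", "butter"})
rule_confidence = confidence(BASKETS, {"bread"}, {"butter"})
```

In this toy data, bread and butter appear together in three of five baskets, and three of the four baskets with bread also contain butter, which is the kind of evidence a retailer might use when rearranging a store layout.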
1.10 Intelligent techniques of data mining
Intelligent data mining uses intelligent search to discover information within data warehouses that queries and reports alone do not reveal, determining patterns in the data and inferring rules from them. The patterns and rules determined from the data can guide forecasting and decision-making. The major tools used in intelligent data mining are intelligent agents, neural computing, CBR, and other tools such as data visualization, rule induction, and decision trees. Designing intelligent models that can extract high-level representations from huge volumes of data is central to addressing several artificial intelligence-related tasks, such as pattern recognition, language understanding, and visual object and speech perception. Numerous
learning methods employ shallow architectures such as NNs, kernel logistic regression, and SVM for mining essential information from huge datasets. The integration of decision support systems and artificial intelligence offers computational support that extends human capabilities in complex, stressful environments. Decision support systems allow intelligence and domain expertise to be combined with comprehensive analysis tools, inducing intelligence in the systems. Intelligent methods are used to evaluate data and offer forecasts by quantifying uncertainty, providing information, and suggesting a course of action. This section provides background information on decision-making and identifies the key methods that contribute to the intelligent decision-making process. Techniques employed in these systems include fuzzy logic, data mining, artificial NNs (ANNs), CBR, evolutionary computing, machine learning, and intelligent agents. In addition, intelligent decision-making techniques focus on designing hybrid methods that can be applied to make effective decisions.
1. Deep learning methods
Deep learning is one of the significant machine learning methods and has achieved great success in applications such as text understanding, speech recognition, and image analysis. It uses supervised and unsupervised strategies to learn multilevel representations for pattern recognition and classification. Deep learning plays an important role in solving big data problems because it can mine valuable knowledge from complex systems, and it is an active research area within machine learning. The conventional training methods, however, employed multilayer NNs, which reach the best solution by guaranteeing convergence.
Thus the multilayer NNs attained improved performance for representation learning.
• Types of deep learning models
Numerous deep learning models have been developed for mining significant information from huge datasets. The most widely used deep learning models include the stacked autoencoder (SAE), deep belief network (DBN), deep convolutional NN (DCNN), and recurrent NN (RNN). These models, together with optimization methods, expert systems, ANN, fuzzy systems, evolutionary computing, rough set theory, and intelligent agents, are detailed in this section.
• SAE
The SAE model is built by stacking several autoencoders, which are mostly feed-forward NNs. The autoencoder plays an important part in NNs. It extracts the relevant features of the input, in a manner comparable to principal component analysis (PCA). The single-layer autoencoder does not contain directed loops. It comprises X input visible units, Y hidden units, and Z output visible units. The autoencoder works by encoding the input vector into a higher-level hidden representation. The basic autoencoder has two stages for processing data, namely the encoding stage and the decoding stage.
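The encoding and decoding stages of a single autoencoder can be sketched as two feed-forward layers. The sketch below assumes sigmoid units and fixed, hand-picked weights for X = 3 visible inputs and Y = 2 hidden units (with Z = X outputs); the weight values are purely hypothetical, and training, which would adjust them to reduce the reconstruction error, is omitted.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def layer(weights, biases, inputs):
    """One feed-forward layer: a weighted sum per unit followed by a sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def encode(x, w_enc, b_enc):
    """Encoding stage: map the X visible inputs to the Y hidden units."""
    return layer(w_enc, b_enc, x)


def decode(h, w_dec, b_dec):
    """Decoding stage: map the Y hidden units back to Z output units."""
    return layer(w_dec, b_dec, h)


def reconstruction_error(x, x_hat):
    """Mean squared difference between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


# Hypothetical, untrained weights for X = 3 inputs and Y = 2 hidden units.
W_ENC = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]    # Y rows, X columns
B_ENC = [0.0, 0.1]
W_DEC = [[0.7, -0.1], [0.2, 0.4], [-0.3, 0.9]]  # X rows, Y columns
B_DEC = [0.0, 0.0, 0.0]

x = [1.0, 0.0, 1.0]
hidden = encode(x, W_ENC, B_ENC)
x_hat = decode(hidden, W_DEC, B_DEC)
error = reconstruction_error(x, x_hat)
```

In a stacked arrangement, the hidden vector of one autoencoder would serve as the input of the next.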
• DBN
The DBN is the first deep learning framework to be trained effectively. It is a type of NN built from layers of restricted Boltzmann machines (RBMs) and multilayer perceptrons (MLPs). The DBN is formed by stacking several RBMs, each of which contains two layers: hidden and visible. The visible and hidden units of an RBM are linked by weighted connections. MLPs are feed-forward networks that contain input, hidden, and output layers. A network with multiple layers is able to address complicated tasks, which makes the classification more effective.
• DCNN
The DCNN is a widely used deep learning technique for dealing with massive data. It comprises three types of layers: convolutional (conv), pooling (POOL), and fully connected (FC), each performing a specific task. The main aim of the conv layers is to build feature maps from the feature vector obtained from the input data. The feature maps are then subsampled by the POOL layers, the second type of layer in the DCNN. The third type, the FC layer, carries out the classification. The output maps of a conv layer result from convolving the input maps with the conv kernels, such that the generated output maps are determined by the number of kernels and the size of the kernel matrix. The conv layers are thus organized as a multilayer loop of input maps, kernel weights, and output maps. The size of the input and output data reduces in successive conv layers compared with the first conv layer, and the classification accuracy depends on the number of layers in the DCNN.
• RNN
The RNN [4] is used in data mining alongside the conventional deep learning models such as SAE, DBN, and DCNN.
An RNN is a type of supervised machine learning model composed of artificial neurons with a recurrent loop known as a feedback loop. The feedback loops are recurrent cycles that run over time in a sequential manner. An RNN is trained in a supervised fashion, which requires a training dataset of input-target pairs. The goal is to minimize the difference between the outputs and the targets by optimizing the weights of the network. The RNN can process sequential inputs by maintaining a recurrent hidden state whose activation at each step depends on that of the previous step. In this way, the network exhibits dynamic temporal behavior. The recurrent layer feeds back information from previous time steps and combines it with the input of the current time step, which means that the order of the input plays a vital role in processing. The algorithm permits continuous improvement of the required output and thus makes the network more intelligent with each update. RNNs are applicable in many fields in which the data are represented as sequences.
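The recurrent hidden state described above, whose activation at each step depends on the previous step, can be sketched with a deliberately simplified single-unit RNN. The scalar weights, the tanh activation, and the input sequences are illustrative assumptions; a practical network would use weight matrices and would learn them from input-target pairs.

```python
import math


def rnn_forward(inputs, w_in, w_rec, bias, h0=0.0):
    """Run a one-unit recurrent layer over a sequence and return every hidden state.

    Each state combines the current input with the previous state:
        h_t = tanh(w_in * x_t + w_rec * h_(t-1) + bias)
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states


# Hypothetical weights; the two sequences hold the same values in a different order.
W_IN, W_REC, BIAS = 1.0, 0.5, 0.0
states_a = rnn_forward([1.0, 0.0, 0.0], W_IN, W_REC, BIAS)
states_b = rnn_forward([0.0, 0.0, 1.0], W_IN, W_REC, BIAS)
```

Although both sequences contain the same values, their final hidden states differ, which illustrates why the order of the input matters to a recurrent layer.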
• Optimization methods
Optimization methods determine the most cost-effective or best-performing feasible solution under the given constraints by maximizing the desired factors. Optimization is limited in practice by incomplete information and by the lack of time to evaluate the available information. Optimization methods are applied in many research areas to find an optimum solution, and can be used to locate global and local optima. Optimization contributes to data mining in several ways: optimization over huge data can itself be a kind of data mining process, and novel data mining methods can be designed with an optimization-based method, also known as the optimization-based approach to data mining.
• Expert system
An expert system is able to collect, communicate, and evaluate data in order to perform detailed analysis. Many expert systems, such as PROSPECTOR and MYCIN, have been described in the literature for mining data. This approach can handle the complexity of biomedicine and can exchange information among the different systems found in clinical settings.
• ANN
The ANN is inspired by the human nervous system. Like a human being, the ANN [5] is trained to perform complex functions such as classification, pattern recognition, and speaker identification. Learning is the basis of the ANN, and it is this feature that enables intelligent functioning. The success or failure of an ANN depends on the type and rate of learning. The network generates an output from a given set of inputs. The ANN uses intermediate outputs and connection weights to produce an output that converges toward the desired output. The benefit of ANNs over other deep learning models is that they can solve problems that have no algorithmic solution.
• Fuzzy system
Numerous methods are employed for data mining, but the fuzzy system is an important method for handling uncertain information.
Several intelligent methods have proved their worth in expert systems and decision support systems. However, a problem arises with linguistic or subjective information, which cannot be handled by an expert system. Here, fuzzy logic transforms the qualitative information into quantitative information using a set of “IF-THEN” rules.
• Evolutionary computing
Evolutionary computation is used to solve complicated problems involving multimodal functions and complex nonlinear dynamics, which are beyond the capacity of conventional techniques. Evolutionary computing methods are inspired by population genetics and use mutation and crossover operators to produce novel solutions. Finally, a fitness measure is used to choose the best individuals.
• Rough set theory (RST)
RST is an approach for dealing with vague and imperfect knowledge in an information system. RST is regarded as a tool for analyzing uncertain data, as it does not
need prior or additional information about the data. RST is applied in many fields to support proper decision-making.
• Intelligent agent
Artificial intelligence draws on control, cognitive science, and programming to capture human behavior. An agent is an entity equipped with sensors and actuators. An agent comprises hardware and software that map percept sequences into actions. An agent that exhibits a degree of intelligence in its actions is termed an intelligent agent.
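The evolutionary computing loop outlined in this section (crossover and mutation to produce novel solutions, fitness to choose the best individuals) can be sketched as a small genetic algorithm. Everything specific here is an assumption made for illustration: the bit-string encoding, the count-the-ones fitness function, the population size, the mutation rate, and the choice to carry the best individual over unchanged.

```python
import random


def evolve(fitness, length=12, pop_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Evolve bit strings toward higher fitness; return (best, best_history)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Selection: rank by fitness and keep the better half as parents.
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        parents = population[: pop_size // 2]
        children = [population[0][:]]  # carry the best individual over unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)       # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]            # bit-flip mutation
            children.append(child)
        population = children
    return max(population, key=fitness), best_history


# Toy fitness: the number of ones in the bit string.
best, history = evolve(fitness=sum)
```

Because the best individual is copied into each new generation, the recorded best fitness cannot decrease from one generation to the next.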
1.11 Expectations of data mining
Industrial applications need faster learning and faster development of data mining models. Many issues arise from the continuously changing customer and technical requirements of manufacturing industries. For data mining methods to be used effectively, data mining software with advanced capabilities needs to be designed. This in turn requires more robust data mining methods for dealing with quality issues. Advanced data mining software should help users select the most suitable method and interpret the results generated by the applications. This study points to the effort needed to improve the tools so that they offer critical information along with the preferred data mining functions and results.
References
[1] Jiang N, Gruenwald L. Research issues in data stream association rule mining. ACM SIGMOD Record 2006;35(1):14–19.
[2] Singh Y, Chauhan AS. Neural networks in data mining. J Theor Appl Inf Technol 2009;5(1).
[3] Heckerman D. Bayesian networks for data mining. Data Min Knowl Discovery 1997;1(1):79–119.
[4] Salehinejad H, Sankar S, Barfett J, Colak E, Valaee S. Recent advances in recurrent neural networks; 2017;1–21.
[5] Agatonovic-Kustrin S, Beresford R. Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research. J Pharm Biomed Anal 2000;22(5):717–27.
2 Intelligence methods for data mining task
N. Jayalakshmi¹, Sridevi Mantha²
¹RAGHU INSTITUTE OF TECHNOLOGY, VISAKHAPATNAM, INDIA
²BIG DATA ANALYTICS, QUADRATYX, HYDERABAD, TELANGANA, INDIA
2.1 Introduction
Advances in artificial intelligence (AI) have drawn significant attention to the renewal of knowledge base design for highly complicated tasks, which involve knowledge extraction, perception of the knowledge structure, learning, and reasoning. Intelligent technologies make it possible for service providers to expand their services automatically and to continually make service delivery more effective. In an Information Technology (IT) setting, a great deal of valuable human domain knowledge is captured in logs that record the step-by-step resolution of challenging incidents. Modeling, gathering, and using this domain knowledge are increasingly crucial, because domain knowledge is essential for fully automating service management. Driven by rapid change in the economic environment, business enterprises repeatedly assess their competitive position in the market and try to come up with innovative activities to gain a competitive advantage. Such value-creating activities cannot be achieved without solid and continuous delivery of IT services. The great complexity of the IT environment dictates the use of cognitive incident management, which is one of the most crucial processes in managing IT services. Intelligent methods address the incidents and restore the provision of service while relying on human intervention. More recently, online learning management systems (LMSs) have made significant use of intelligent methods for better IT-enabled service delivery. One recent LMS platform strengthened with intelligent methods is e-khool LMS [1], which applies data mining methods for effective data storage, data and content retrieval, and management of learning services. Progress in data gathering has produced an inevitable growth in data, which needs to be managed.
The diversity, complexity, and volume of the data have reached a level that obstructs analysis and knowledge extraction. Big data classification is one of the valuable tasks in applications of social media, biomedicine, and marketing. The model most often used to address these challenges is the single traditional classification model. The intelligent models employed for data classification include k-nearest neighbors (KNN), Naive Bayes (NB), and support vector machine (SVM). NB classifiers are the most widely used for gaining information. Data fusion machine analytics is considered for classifying huge datasets, whereas NB is used for classification and target tracking in cloud computing and robotic controls. Images and text are the data most often analyzed, including for cyber analysis, using elastic and scalable learning methods. The KNN classifier, moreover, is time-consuming. SVM was designed as a binary classifier; more recently, multiclass SVMs, which classify vectors into different sets with largely trained oracles, have come into broad use. In SVM, training is mostly carried out by finding optimal hyperplanes. The volume of data expands every year with the growth of technologies and applications. Emerging science and technology cause data size to increase rapidly, with the goal of enhancing profitable activities. The result is a huge dataset that is generally difficult to capture, store, and filter using conventional techniques. Internet-of-things systems accumulate large volumes of data, usually termed big data, in the sensing layer and then transfer the data to the data processing layer. Data analytics examines the features of these huge datasets by filtering out the valuable patterns. The generated data are inherently streaming data, as they consist of measurements, representations of data actions, and interactions generated on the internet at different time intervals. 
In the streaming framework, high-speed data are processed by algorithms that must satisfy time and space constraints. Streaming algorithms use specialized data structures to improve speed and provide optimal answers. Thus data classification becomes an important task in social media, marketing, and biomedicine.
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00007-0 © 2021 Elsevier Inc. All rights reserved.
Data intelligence is the analysis of, and interaction with, diverse forms of data so as to transform them into forms that offer insight for the decision-making of companies or institutions. Once a huge amount of data has been obtained, it must be analyzed to draw out relevant and perceptive information, and it must be categorized into meaningful groups for further analysis. Data mining is a procedure in which huge datasets are evaluated with the goal of determining patterns, which are then grouped for future data analysis. The data is categorized into groups and stored so that it can be analyzed efficiently using intelligent techniques. Data mining allows organizations to predict and follow future trends in their industries. Its merits are clearest in situations where consumer behavior must be predicted: the uncovered patterns allow organizations to make predictions that benefit their business.
2.2 Procedure for intelligent data mining
Intelligent data mining requires close cooperation between the data mining experts and the domain experts, here the quality managers, to combine data-driven and interest-driven analyses. The whole process is handled by the Knowledge Discovery Assistant (KDA) and the data mining tool. Fig. 2–1 illustrates the schematic view of intelligent data mining methods.
Chapter 2 • Intelligence methods for data mining task
FIGURE 2–1 Types of intelligent data mining methods.
2.2.1 Interest-driven data mining
The interest-driven process is partitioned into seven steps:
1. Domain knowledge acquisition
Domain knowledge is essential for intelligent data mining and is gathered through prearranged expert interviews throughout the process. It helps to form the interesting groups and concepts that arise in the task of managing medical quality. This knowledge is used to define the search space and to search, filter, and sort the results. The user is thus given quick access to the most crucial results so that the gained knowledge can be deployed.
2. Devise business questions
Using the nomenclature of the different tasks, the medical quality manager devises questions that appear pertinent to solving the tasks and attaining the goals. The Knowledge Discovery Query Language (KDQL) is used to formulate these questions, to express data mining queries, and to structure the data mining results so that the business questions can be refined. Most KDQL questions formulated by business users cannot be translated directly into data mining queries, because they contain concepts that are not part of the data mining world. To make them processable, the questions are refined by the KDA, which uses different concept taxonomies to resolve the components of the questions.
3. Conversion of business questions into data mining queries
After refinement by the KDA, the questions are transformed into data mining queries. The KDA holds knowledge on the selection and configuration of data mining techniques and on suitable parameter settings. The answers to the questions are derived using data mining methods chosen according to numerous criteria. This conversion produces fully specified, implementation-independent data mining queries, which allows the KDA to serve as a front end for different implementations of data mining methods. Finally, the implementation-independent queries are adapted to the requirements of the concrete implementation of an algorithm. In short, the conversion maps the objects of a question to several data mining techniques or statistical tests.
4. Implementation of data mining queries
The KDA executes the data mining queries automatically, passes them to the data mining toolbox, and obtains findings structured with respect to the questions of the quality manager.
5. Acquisition of findings in data mining
Most data mining techniques overwhelm the user with a flood of results. The KDA therefore annotates each result with an interestingness value. It supports the user by producing visualizations of the findings and by allowing navigation among them: the user can search and sort the findings, filter out uninteresting ones, and choose the interesting ones that help to address the tasks in managing medical quality.
6. Conversion of data mining results into answers
The abstract results of data mining are converted into natural language answers to the question, expressed in the language of the business user. However, this conversion was not supported by the present implementation of the KDA.
7. Generate new questions
The visualization of answers prompts the quality managers to come up with different questions, which leads to the modeling of new questions.
2.2.2 Data-driven data mining
Interest-driven analyses alone tend to miss unexpected patterns in the data. To avoid this limitation, data-driven mining methods are used alongside interest-driven analyses. Efficient techniques for discovering association rules, based on intelligent caching methods, are deployed for this analysis. In the data-driven analysis the interests of the business users are used to structure the findings. This hybrid approach ensures that, on the one hand, users are not overwhelmed by floods of findings beyond their interests and, on the other hand, unexpected patterns do not escape their notice. The defining feature of conventional data mining is Knowledge Discovery from Data (KDD). The major goal of KDD is to discover knowledge that serves legitimate business requirements and user preferences, yet KDD is usually treated as a preset process. Its target is the construction of automatic techniques and tools. Accordingly, the resulting tools and algorithms cannot adapt to external environment constraints. Traditional KDD is a data-centered, theory-dominated process that targets automated mining of hidden patterns. The goal of the classical data mining method is to let the data verify research innovations and to pursue high algorithm performance in order to design new algorithms. The mining process has thus helped to discover knowledge that is important to academic and industrial entities. In general, discovering knowledge and transforming it into actionable answers to business issues is the basic nature of KDD. However, conventional
data mining is mainly data-centered and theory-dominated, which prevents hidden pattern mining from meeting high-level business expectations alongside technical concerns. In addition, many factors surrounding the business issues are not considered in a balanced or exhaustive way. These are major issues for future KDD methods. Domain-driven data mining is defined as a repository of techniques, models, and tools that deliver actionable knowledge to organizations, humans, and networks for developing innovative products. Actionable knowledge refers to findings that reflect the business needs and user preferences and that can be readily adopted by business people for making decisions and taking actions. The fundamentals of domain-driven data mining methods are listed as follows:
1. Analysis of constraint-based context
a. Domain constraints
b. Data constraints
c. Rule constraints
d. Interestingness constraints
2. Extraction of in-depth patterns
a. In-depth, actionable patterns are essential for making effective decisions.
b. In-depth patterns are crucial for both data miners and decision-makers. Actionable trading methods are determined through model refinement by tuning parameters.
3. Discovery of cooperated interactive knowledge
The in-depth discovery of patterns relies on the cooperation of data analysts and business analysts.
4. Visualizing data mining as a process of loop-closed iterative refinement
The data mining process is a closed loop of iterative refinement, in which feedback on features, hypotheses, models, explanations, and evaluation is returned within the given context.
2.3 Association rule mining
The Apriori algorithm is the most widely used algorithm for association rule mining (ARM), a task first formulated in 1993. The Apriori ARM algorithm divides the task into two subtasks. The first extracts the itemsets that satisfy the minimum support threshold. The second extracts high-confidence rules from the frequent itemsets generated by the first subtask. The goal of ARM is to uncover interesting and useful patterns in a transactional database. Association rules are conditional statements of the form P → Q, in which P and Q are disjoint itemsets. Association rules are used for evaluating and predicting customer behavior and are essential for making decisions. Confidence and support are the major measures used for evaluating the quality of the generated rules. The major issue in ARM is the production of huge volumes of rules. The
extraction of useful knowledge from such a volume of rules is complicated, because the information pertinent to the mining task is hidden among them. Fixing the confidence and support thresholds therefore plays a major role in ARM. Setting the thresholds too low produces a large number of rules of little significance, whereas setting them too high may yield only rules that are common knowledge. The intelligent data mining methods are ARM, cluster analysis, time-series mining, and classification analysis. Data mining methods are built on statistics and machine learning, with patterns inferred from different data types. Machine learning, one of the methods used in data mining, belongs to the area of AI, and AI frameworks learn by adopting different methods. AI can be described on the basis of the Turing test, in which Alan Turing proposed a way to assess the ability of machines to exhibit intelligence. The test involves a human, a machine, and a human judge. The judge converses with both the human and the machine in natural language, and both try to appear human. The goal of the judge is to tell the human from the machine on the basis of the conversation. If the judge fails to do so, the machine is considered intelligent. Recent quantitative trading tools have been developed on the basis of data mining methods. Data mining can be used to determine the associations between assets and to create forecasting models from data, with which interest rates, exchange rates, and stock prices are predicted. Numerous trading methods evaluate confidence and support on the data. Association rules are also useful for intelligent data mining in the form of classification.
Numerous data mining methods have been devised in the community for data description, classification based on a target attribute, deviation detection, and other forms of data interpretation and characterization. The most widely used pattern extraction and summarization method is the association rule algorithm, which identifies correlations among items in transactional databases. ARM is an essential task in data mining, in which the relations between different attributes in a transaction or database are determined. ARM is not a single-objective problem but a multiobjective optimization problem. Multiobjective optimization [2] helps to determine a solution that fulfills all the optimization objectives within an uncertain point set. Evaluation metrics such as support, confidence, and lift [3] serve as the objectives of the multiobjective optimization techniques. An association rule usually has the form X → Y, where X and Y are itemsets: X is the antecedent and Y is the consequent. Each rule has an associated support and confidence. Conventional ARM adopts the support-confidence model [4], mining the rules whose support and confidence are greater than the minimum support and minimum confidence specified by the user. These rules are defined as strong association rules. However, strong association rules are often not interesting to the users, so it is essential to evaluate the results and maximize suitable evaluation metrics, which may let ARM work
more effectively by satisfying the needs of the user. Researchers have therefore put forth many evaluation metrics for association rules [5] and used them in ARM to limit and constrain the rules produced by the algorithm. Given the numerous evaluation measures devised for ARM, it can be viewed as a multiobjective rather than a single-objective optimization problem. Two numbers are attached to each rule, indicating its support and its confidence.
• Support is the fraction of transactions that satisfy both the antecedent and the consequent of the rule. The support of the rule X → Y is the percentage of transactions in the original database that contain both X and Y.
• Confidence measures how often the consequent is true when the antecedent is true. The confidence of the rule X → Y is the percentage of transactions containing X that also contain Y.
The ARM is applicable in several domains, which include:
1. attached mailing
2. catalog design
3. cross marketing
4. customer segmentation
• Procedure of ARM
ARM is a two-step procedure: the frequent itemsets are extracted first, and strong association rules are then extracted from the generated frequent itemsets. Fig. 2–2 elaborates the architecture of optimization-driven ARM.
FIGURE 2–2 Architecture of optimization-driven ARM. ARM, Association rule mining.
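The support and confidence measures defined in this section, together with the lift metric mentioned earlier, can be computed directly from a transactional database. The following is a minimal sketch on a hypothetical toy database; the names `TRANSACTIONS`, `support`, `confidence`, and `lift` are illustrative assumptions and do not come from the chapter.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy database: each transaction is a set of purchased items.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset: Iterable[str], transactions=TRANSACTIONS) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str], transactions=TRANSACTIONS) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    return support(frozenset(x) | frozenset(y), transactions) / support(x, transactions)


def lift(x: Iterable[str], y: Iterable[str], transactions=TRANSACTIONS) -> float:
    """Confidence of X -> Y divided by the support of Y (1.0 means independence)."""
    return confidence(x, y, transactions) / support(y, transactions)


if __name__ == "__main__":
    print(support({"bread", "milk"}))       # support of the rule bread -> milk
    print(confidence({"bread"}, {"milk"}))  # confidence of bread -> milk
    print(lift({"bread"}, {"milk"}))        # lift of bread -> milk
```

On this toy data the rule bread → milk has a lift below 1, which illustrates why a rule with acceptable support and confidence is not necessarily interesting.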
Association rules gained importance because the support and confidence factors help to mine the frequent itemsets from massive data. Itemsets that are considered potentially frequent are termed candidate itemsets.
• ARM Algorithm
Step 1: Set k = 1 and produce the frequent itemsets of length 1.
Step 2: Repeat until no new frequent itemsets are found:
• Generate candidate itemsets of length (k + 1) from the frequent itemsets of length k.
• Prune the candidate itemsets that contain an infrequent subset of length k.
• Compute the support of each candidate itemset by scanning the complete database.
• Eliminate the candidates that do not have minimum support; the remaining ones are the frequent (k + 1)-itemsets.
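The level-wise procedure above can be sketched in a few lines. This is an illustrative sketch of Apriori-style frequent itemset generation rather than a tuned implementation; the function name `apriori_frequent_itemsets` and the toy data are assumptions made for the example.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def sup(itemset):
        # One full pass over the database per candidate (kept simple on purpose).
        return sum(itemset <= t for t in transactions) / n

    # Step 1: frequent itemsets of length 1.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = sup(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)
    k = 1

    # Step 2: repeat until no new frequent itemsets are found.
    while current:
        # Join: combine frequent k-itemsets into (k + 1)-candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune: drop candidates that have an infrequent subset of length k.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # Count support and keep the candidates that meet the threshold.
        current = {}
        for c in candidates:
            s = sup(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    db = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
    for itemset, s in sorted(apriori_frequent_itemsets(db, 0.5).items(), key=lambda kv: len(kv[0])):
        print(sorted(itemset), s)
```

The prune step is what distinguishes Apriori from brute-force enumeration: a candidate is counted against the database only if all of its k-subsets are already known to be frequent.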
In 1994 the Apriori algorithm [2] was devised by Agrawal for extracting frequent itemsets. It is the most significant ARM algorithm, but it has several drawbacks, and researchers have tried to devise new algorithms to overcome them.
• First, it is time-consuming, as the algorithm takes a long time to process the candidate itemsets.
• Second, it scans the database repeatedly, which takes more input and output time for execution.
• Third, it considers only the minimum support and minimum confidence fixed by the user.
In 2000 the FP (frequent pattern)-growth algorithm [6] was introduced by Jiawei Han for efficient ARM. Unlike Apriori, it requires only two scans of the database. In the first scan, FP-growth builds the list of frequent items sorted in descending order of frequency (F-List), which takes considerable computation time. In the second scan the database is compressed into an FP-tree, which requires considerable storage space. FP-growth then mines the FP-tree for each item whose support is greater than the minimum support, recursively building conditional FP-trees subject to the minimum support and minimum confidence. ARM research has turned to swarm intelligence methods to address these issues. According to these analyses, swarm intelligence methods are more efficient than conventional evolutionary methods such as the genetic algorithm [7]. Table 2–1 describes different kinds of ARM algorithms with their explanations and advantages.
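The two database scans of FP-growth can be illustrated as follows. This sketch covers only the construction of the F-List and the FP-tree, not the recursive mining of the tree; the nested-dictionary tree layout and the name `build_fp_tree` are assumptions chosen for brevity, and ties in frequency are broken alphabetically as a further assumption.

```python
from collections import Counter


def build_fp_tree(transactions, min_count):
    """Return (f_list, root) after two passes over `transactions`."""
    # First scan: count items and build the F-List in descending frequency order.
    counts = Counter(item for t in transactions for item in set(t))
    f_list = sorted(
        (item for item, c in counts.items() if c >= min_count),
        key=lambda item: (-counts[item], item),
    )
    rank = {item: r for r, item in enumerate(f_list)}

    # Second scan: insert the frequent items of each transaction, in F-List
    # order, so that transactions with common prefixes share tree nodes.
    root = {"count": 0, "children": {}}
    for t in transactions:
        node = root
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        for item in ordered:
            child = node["children"].setdefault(item, {"count": 0, "children": {}})
            child["count"] += 1
            node = child
    return f_list, root


if __name__ == "__main__":
    db = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"], ["milk"]]
    f_list, tree = build_fp_tree(db, min_count=2)
    print(f_list)
    print(tree["children"]["bread"]["count"])
```

Because three of the four toy transactions start with the same most frequent item, they share a single branch of the tree, which is the compression effect the second scan relies on.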
2.3.1 Different kinds of association rules
The dimensionality of an association rule is defined by the number of data dimensions contained in the rule.
1. Single-dimensional association rule
An association rule is single-dimensional if its items or attributes refer to a single dimension. For instance, if X denotes a customer, a single-dimensional rule can be written as buys(X, “bread”) → buys(X, “milk”).
Table 2–1 Different types of association rule mining (ARM) algorithms.

| Sr no. | Algorithm | Explanation | Advantage |
| --- | --- | --- | --- |
| 1 | AIS | This method tries to enhance the database quality for processing queries and to produce association rules by minimizing the count of database scans. | More competent in producing frequent itemsets |
| 2 | Apriori | This method consists of two steps which help to determine large itemsets and then scans the database for checking the support value of matching itemsets. | More competent in producing frequent itemsets |
| 3 | AprioriTiD | A hybrid algorithm is devised which utilizes Apriori initially and later on utilizes AprioriTiD. This method devises the scalability concept for moving itemsets using massive databases. | Able to generate frequent itemsets using massive databases |
| 4 | FP-tree | This method examines the complete database twice for producing frequent itemsets without any iteration process. Here, the first scan designs an FP-tree and the other scan produces frequent itemsets with the FP-tree using the FP-growth algorithm. | Can generate frequent itemsets using two database scans |
| 5 | Continuous ARM | This method computes many itemsets online, which consists of two scans with sequential transactions for generating huge itemsets. In the first scan the algorithm designs a lattice with all large itemsets and then, in the second scan, incessantly eliminates all trivially small itemsets with parameter values less than the thresholds. | The method needs fewer database scans in processing a sequential transactional database |
| 6 | Rapid ARM | This method utilizes a tree structure for representing the original database without producing candidate itemsets. It facilitates the second scan of the database by producing 1-itemsets and 2-itemsets quickly by adapting the structure of the SOTieIT. | This method is proficient due to the tree structure and reduces the database scans |
| 7 | Extended ARM | This method is devised on the basis of interval and ratio variables which are nominal and ordinal variables. The rules contain quantitative variables that consist of related metrics and statistics. | This method supports rule mining using statistical and numeric data |
| 8 | Generalized disjunctive association rule (d-rules) mining | This method permits the disjunction of different conjuncts to mine contextual interrelationships from data items. | This method tries to reduce harmful impacts and increase probable benefits in the process of mining |

AIS, Automatic identification system; FP, frequent pattern; SOTieIT, support-oriented tie itemset.
2. Multidimensional association rule
If the rule refers to more than one dimension, such as buys, study-level, and income, then it is a multidimensional association rule.
3. Boolean association rule
A rule is a Boolean association rule if it involves associations between the presence or absence of items. For instance, if user X buys a computer, there is a probability that user X also buys a scanner: buys(X, “computer”) → buys(X, “scanner”).
4. Quantitative association rule
A quantitative association rule describes associations between quantitative items or attributes. In these rules the quantitative values of each item or attribute are divided into intervals.
5. Correlation rule
ARM generates a huge number of rules, many of which are superfluous and do not represent a real relation among itemsets. The discovered associations are therefore further evaluated to expose statistical correlations, thereby yielding correlation rules.
Other kinds of association rules are as follows:
6. Multi-Relation Association Rules (MRAR)
MRAR are rules in which each item may involve several relations. These relations represent indirect relationships between the entities. Consider the following MRAR, in which the first item involves three relations, live in, nearby, and humid: “Those who live in a place which is nearby to a city with humid climate type are younger with good health.” These types of association rules are extracted from semantic web data.
7. Context-Based Association Rules
Context-Based Association Rules claim more precision in ARM by considering a hidden variable, the context variable, which alters the final set of association rules depending on its value.
8. Contrast set learning
Contrast set learners use rules that differ significantly in their distribution across subsets.
9. Weighted class learning
Weighted class learning is another type of associative learning in which weights are assigned to classes, so as to focus on a specific issue of concern when predicting the results of data mining.
10. High-order pattern discovery
High-order pattern discovery captures high-order patterns that are intrinsic to real-world data.
11. K-optimal pattern discovery
It is an alternative to the standard approach to association rule learning, which requires that each pattern appear frequently in the data.
2.4 Association rule mining: multiobjective optimization method
ARM can be treated as a multiobjective optimization problem in which confidence and support are the objectives. A Multiobjective Evolutionary Algorithm (MOEA) optimizes the confidence and support values to generate interesting rules. In the conventional KDD method, advanced interestingness measures are considered only in a postprocessing step after rule mining. Conventional data mining techniques first determine the frequent sets and then mine the rules from them, and adding interestingness measures to frequent set mining is complicated because the measures are defined only for association rules. With an MOEA the association rules are mined directly, which makes it possible to integrate different interestingness measures into the mining process simply by altering the objective function. Each interestingness measure becomes one optimization goal, and the solutions of the optimization problem are the association rules. Thus many association rules are obtained using a set of interestingness measures. The major goal of the MOEA is rule mining using interestingness measures. Optimization methods satisfy the need to obtain rules under different interestingness measures, as no single measure outperforms the others in all domains. However, certain issues remain in mining rules using MOEAs.
2.5 Intelligent methods for association rule mining
Providing security for a database requires a combination of several methods. For data with different sensitivity levels, one option is to categorize the data into levels and make each level accessible only to subjects with the appropriate clearance. However, restricting access to sensitive data does not guarantee that it is fully protected. For instance, sensitive (high) data items can be inferred from nonsensitive (low) data through some inference process based on knowledge of the application and its semantics. This issue is known as the Inference Problem. Association rules can be among the inference rules involved. A solution must address how to stop the disclosure of confidential data through the combination of known inference rules with nonsensitive data.
2.5.1 Association rule mining based on the optimization method
Most real-world problems involve several goals, and these goals conflict with one another in high-dimensional spaces. Multiobjective optimization therefore offers a set of solutions from which the decision-maker chooses the acceptable ones. A solution is Pareto optimal if no other solution is at least as good in every objective and strictly better in at least one; the set of such solutions is called the Pareto-optimal set, and its members are also called nondominated solutions.
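As an illustration of nondominated solutions, the sketch below filters a set of candidate rules scored by (support, confidence), both to be maximized. The rule names and scores are hypothetical, and the quadratic pairwise comparison is chosen for clarity rather than efficiency.

```python
def dominates(a, b):
    """True if objective vector `a` dominates `b` (all objectives maximized)."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates.

    `candidates` maps a rule name to its tuple of objective values.
    """
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives) for other in candidates.values())
    }


if __name__ == "__main__":
    # Hypothetical rules with (support, confidence) scores.
    rules = {
        "r1": (0.5, 0.9),
        "r2": (0.7, 0.6),
        "r3": (0.4, 0.8),  # dominated by r1
        "r4": (0.6, 0.6),  # dominated by r2
    }
    print(sorted(pareto_front(rules)))
```

Neither r1 nor r2 dominates the other, since each is better in one objective; choosing between them is left to the decision-maker.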
FIGURE 2–3 Optimization-based association rule mining.
The simultaneous optimization of several objective functions differs from single-function optimization, which seeks one precise solution. Single-objective optimization has one objective function, for which a global optimum is sought. Multiobjective optimization problems instead yield a set of alternatives that are comparable given the relevance of each objective. Evolutionary techniques are employed for multiobjective optimization because they have good search abilities in complex spaces and can approximate the complete set of Pareto-optimal solutions in a single run. Fig. 2–3 shows the optimization-based association rule mining. The optimization of rules is a simple process considering all the existing techniques. The components of the optimization-based association rule mining system are as follows:
1. Input dataset
The transactional dataset is fed as input to the optimization-enabled ARM system.
2. Itemsets
The complete dataset is scanned in this phase to recover the list of symbols that occur in the dataset. This set of symbols is the itemset for the algorithm.
3. Transaction set
The items obtained in the previous phase are used to produce the list of transactions, which the algorithm then processes.
4. Apriori algorithm
In this phase the conventional Apriori algorithm processes the itemsets and transaction sets to produce the list of association rules.
5. FP-tree
In this phase the conventional FP-tree algorithm is employed to generate the FP-trees. Moreover, these trees are reformed using the set of association rules.
There exist numerous optimization methods that are applicable to ARM, and some of them are described in the next sections.
2.6 Association rule mining using genetic algorithm
A genetic algorithm works on a population of prospective solutions. Each solution is an individual in the population and is encoded as a string, which is the representation of that individual. The genetic algorithm manipulates the strings in the search for improved solutions through the following steps:
1. formation of the string population,
2. evaluation of each string,
3. selection of the best strings, and
4. genetic manipulation to construct a new population of strings.
The benchmark genetic algorithm applies genetic operators such as selection, crossover, and mutation to produce each new generation of strings [3]. Genetic algorithms thus produce solutions over successive generations. The probability that an individual reproduces is directly proportional to the fitness of its solution. Hence the solution quality improves from generation to generation, and the process terminates when the optimal solution is found. The genetic algorithm is applied to the rules obtained from Apriori ARM. The method to produce association rules using a genetic algorithm is given as follows:
1. Begin
2. Take a sample of records from the dataset that fits in memory.
3. Apply the Apriori algorithm to determine the frequent itemsets using the minimum support. Let A denote the set of frequent itemsets produced by the Apriori algorithm.
4. Set B = ∅, where B is the output set that will contain the association rules.
5. Set the stopping criterion of the genetic algorithm.
6. Represent each frequent itemset of A as a binary string, using the combination of representations described in method 2 above.
7. Choose two members from the frequent itemsets using the Roulette Wheel sampling technique.
8. Apply crossover and mutation to the chosen members to produce association rules.
9. Compute the fitness function for each rule x → y and check the following condition.
10. If (fitness function > minimum confidence)
11. Set B = B ∪ {x → y}
12. If the desired number of generations is not completed, then go to Step 3.
13. End
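The genetic operators in the procedure above can be sketched as follows. Several simplifications are assumptions of this sketch and not part of the chapter: the population is initialized randomly instead of from Apriori itemsets, a chromosome is a bit string whose first half marks the antecedent items and whose second half marks the consequent items, and the fitness of a rule is its confidence. All names and the toy data are hypothetical.

```python
import random

ITEMS = ["bread", "butter", "milk"]  # hypothetical item universe
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset):
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)


def decode(chrom):
    """Assumed encoding: first half = antecedent mask, second half = consequent mask."""
    n = len(ITEMS)
    x = frozenset(i for i, bit in zip(ITEMS, chrom[:n]) if bit)
    y = frozenset(i for i, bit in zip(ITEMS, chrom[n:]) if bit)
    return x, y


def fitness(chrom):
    """Confidence of the decoded rule; 0 for malformed rules (assumption)."""
    x, y = decode(chrom)
    if not x or not y or x & y or support(x) == 0:
        return 0.0
    return support(x | y) / support(x)


def roulette(population, rng):
    # Selection probability proportional to fitness; the small offset lets
    # zero-fitness individuals be drawn when the whole population is poor.
    weights = [fitness(c) + 1e-6 for c in population]
    return rng.choices(population, weights=weights, k=2)


def crossover(a, b, rng):
    point = rng.randrange(1, len(a))  # single-point crossover
    return a[:point] + b[point:]


def mutate(chrom, rng, rate=0.1):
    return [bit ^ 1 if rng.random() < rate else bit for bit in chrom]


def mine_rules(min_conf=0.6, generations=50, pop_size=8, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(2 * len(ITEMS))]
                  for _ in range(pop_size)]
    rules = set()  # the output set B
    for _ in range(generations):
        next_population = []
        for _ in range(pop_size):
            a, b = roulette(population, rng)
            child = mutate(crossover(a, b, rng), rng)
            if fitness(child) > min_conf:  # steps 10 and 11
                rules.add(decode(child))
            next_population.append(child)
        population = next_population
    return rules


if __name__ == "__main__":
    for x, y in mine_rules():
        print(sorted(x), "->", sorted(y))
```

Whatever the random seed, every rule added to the output set has a nonempty antecedent and consequent that are disjoint and a confidence above the chosen threshold, because malformed chromosomes receive zero fitness.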
2.7 Association rule mining using particle swarm optimization
Particle swarm optimization (PSO) is a global optimization strategy introduced by Eberhart and Kennedy. PSO is a computational method derived from simulating collective social behavior such as fish schooling and bird flocking. In PSO, the particles represent candidate solutions in the solution space, and the optimum is found by moving the particles through that space. Each particle flies through an S-dimensional search space with a velocity that is dynamically adjusted according to its flying experience. The application of PSO to association mining is a significant part of this study. Here, PSO is used as a module for mining the best fitness value. The algorithmic process is similar to that of the genetic algorithm, but the procedures differ; the steps are encoding, computation of fitness values, generation of the population, finding the best particle, and the termination condition. Each step of the PSO algorithm and the process of producing association rules are detailed as follows:
1. Encoding
By the definition of ARM, for an association rule from itemset X to itemset Y (X → Y), the intersection of X and Y must be empty. The items before the front partition point form “itemset X,” whereas the items between the front partition point and the back partition point form “itemset Y.”
2. Computation of fitness value
The fitness value is used to evaluate the significance of each particle and is obtained from the fitness function. The fitness function is derived from the confidence and support of association rule type k.
3. Population generation
To apply the evolution process of the PSO algorithm, an initial population must first be produced. The population consists of the particles with the highest fitness values, which are termed the initial particles.
4. Search the best particle
The particle with the highest fitness value in the population is chosen as the best solution.
5. Stopping criterion
A stopping criterion is needed to terminate the evolution of the particles. Here, the evolution finishes when the fitness values of all particles are the same, that is, when the position of each particle is fixed. After the best particle is determined, its support and confidence are used as the minimal support and minimal confidence. These parameters are then used in ARM for extracting sensitive information.
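The partition-point encoding and the selection of the best particle can be sketched as follows. The text states only that the fitness is derived from confidence and support, so the product used here is an assumption, as are the function names and the toy data; the velocity and position updates of PSO are omitted.

```python
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset):
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)


def decode_particle(order, front, back):
    """Split an item ordering into (X, Y) using two partition points.

    Items before `front` form itemset X; items from `front` up to `back`
    form itemset Y. Since each item occurs once in `order`, X and Y are disjoint.
    """
    if not 0 < front < back <= len(order):
        raise ValueError("need 0 < front < back <= len(order)")
    return frozenset(order[:front]), frozenset(order[front:back])


def fitness(order, front, back):
    """Assumed fitness: confidence(X -> Y) * support(X and Y)."""
    x, y = decode_particle(order, front, back)
    sup_x, sup_xy = support(x), support(x | y)
    conf = sup_xy / sup_x if sup_x else 0.0
    return conf * sup_xy


def best_particle(particles):
    """Return the (order, front, back) triple with the highest fitness."""
    return max(particles, key=lambda p: fitness(*p))


if __name__ == "__main__":
    order = ["bread", "milk", "butter"]
    particles = [(order, 1, 2), (order, 2, 3)]
    print(decode_particle(*best_particle(particles)))
```

In this toy swarm the particle encoding bread → milk wins over the one encoding {bread, milk} → butter, because the latter has both lower support and lower confidence.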
2.8 Bees swarm optimization-association rule mining algorithm
Here, the bees swarm optimization (BSO) algorithm is designed for mining association rules. To adapt BSO to association rule mining (BSO-ARM), the following components are determined: the solution encoding, the search area strategy, and the fitness function.
1. Encoding of solutions
In ARM, two well-known representations can be quoted, namely, integer encoding and binary encoding.
a. Binary encoding
Each rule is given by a vector S of n elements, where n is the number of items. If item i is in the rule, then S[i] = 1; otherwise S[i] = 0.
b. Integer encoding
The rule is given by a vector S of k + 1 elements, where k is the rule size. The first element is the separator index, which separates the antecedent and consequent parts of the solution. For all items i in S, if S[i] = j, then item j appears in the ith position of the rule.
BSO-ARM integrates binary and integer encoding to support the BSO operations that determine the neighborhood search and the search area. Each solution S is assumed to be a vector of n items where:
i. S[i] is the index separating the antecedent and consequent parts if i = 1.
ii. S[i] = j, where j > 0, if item j appears in the ith position of S.
iii. S[i] = 0 if there is no item in the ith position of the solution S.
ARM is employed to determine all rules that satisfy MinSup and MinConf. Let α and β be two empirical parameters; the fitness function of solution S is then:
$$F = \alpha \times \text{confidence} + \beta \times \text{support}$$
2. The determination of search area
This step produces a set of solutions, among which the best solution is determined using a set of strategies.
3. The neighborhood search
The neighborhood of each solution is computed by the addition or subtraction of s. This may generate nonadmissible solutions, which are handled using the delete and decompose strategies.
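The mixed encoding and the fitness function F = α × confidence + β × support can be illustrated with a short sketch. The exact meaning of the separator index is not spelled out in the text, so the code assumes it counts how many item slots belong to the antecedent. The function names, the toy transactions, and the values α = β = 0.5 are likewise assumptions made only for this example.

```python
def decode(solution):
    """Split a mixed-encoded solution into (antecedent, consequent).

    Assumption: solution[0] is the number of item slots that belong to the
    antecedent; each remaining slot holds an item id (> 0) or 0 if empty.
    """
    sep, slots = solution[0], solution[1:]
    antecedent = frozenset(j for j in slots[:sep] if j > 0)
    consequent = frozenset(j for j in slots[sep:] if j > 0)
    return antecedent, consequent


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def fitness(solution, transactions, alpha=0.5, beta=0.5):
    """F = alpha * confidence + beta * support for the encoded rule X -> Y."""
    x, y = decode(solution)
    if not x or not y:
        return 0.0  # rules with an empty side are treated here as nonadmissible
    sup_xy = support(x | y, transactions)
    sup_x = support(x, transactions)
    confidence = sup_xy / sup_x if sup_x > 0 else 0.0
    return alpha * confidence + beta * sup_xy


# Hypothetical toy database of five transactions over items 1..3.
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}, {1, 2, 3}]
rule = [1, 1, 2, 0]            # separator = 1, so the rule is {1} -> {2}
score = fitness(rule, transactions)
```

For this rule the support of {1, 2} is 3/5 and the confidence is 0.6/0.8 = 0.75, so F = 0.5 × 0.75 + 0.5 × 0.6 = 0.675.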
2.9 Ant colony optimization algorithm
The ant colony optimization (ACO) algorithm is inspired by experiments on the collective behavior of ants in real environments. The algorithm is devised to study and scrutinize
the behavior of real ants in choosing the shortest path between their nest and food sources. This behavior is formulated as an ant system and is modeled with the ACO algorithm. In the ACO algorithm, the optimization problem is represented as a graph G = (C, L), where C is the set of components and L is the set of possible connections among the elements of C.
2.10 Penguins search optimization algorithm for association rules mining Pe-ARM
The individual rule representation is adapted so that a given individual can be encoded with binary and integer encoding. In binary encoding, each solution is given by a vector S of n elements, where n is the number of items. In integer encoding, the solution is expressed as a vector S of k + 1 elements, where k is the size of the rule. In Pe-ARM, binary and integer encoding are integrated to make the application of the penguins search operations and the fitness computation easier. Three values (0, 1, 2) are used to indicate the role of a given item in the rule: 0 indicates that the item is absent from the rule, 1 indicates that the item is present in the antecedent part of the rule, and 2 indicates that the item is present in the consequent part of the rule. Thus
1. S[i] = 0 if the item i is not in the solution S.
2. S[i] = 1 if the item i belongs to the antecedent part of the solution S.
3. S[i] = 2 if the item i belongs to the consequent part of the solution S.
This representation separates the antecedent part from the consequent part, and every single position of a solution carries completely interpretable information. In addition, the representation is flexible and is used in the calculation of the overlap measure. The goal of ARM optimization is to increase the average confidence and support so as to generate effective rules. Optimization techniques provide a set of pertinent rules with high confidence and support values. However, the generated rules may be redundant or similar. To deal with this issue, a new measure is devised for evaluating the correlation between the generated rules, in order to increase the coverage of the target data. This measure yields a set of consistent rules with minimum overlap.
The aim of the penguins is to optimize their energy expenditure, which here corresponds to enhancing the quality of the generated rules. The generated rules have both an individual and a collective quality. The first is a statistical measure based on support and confidence, computed for each rule from the transactional database, whereas the second represents the correlation between the rules. The fitness computation focuses on the first aspect by favoring rules that increase the average of support (Supp) and confidence (Conf). The fitness value is evaluated over the rules that satisfy the maximum accepted overlap (Max-Overlap), a predefined value that represents the maximum accepted distance between each pair of rules.
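The three-valued encoding described above (0 = absent, 1 = antecedent, 2 = consequent) is simple to decode. The sketch below is illustrative only: the function names are hypothetical, items are numbered from 1 as in the text, and the overlap measure is not implemented because its formula is not given here.

```python
def decode_rule(s):
    """Return (antecedent, consequent) item sets for a 0/1/2-encoded vector.

    Position i (counted from 1) holds 0 if item i is absent, 1 if it is in
    the antecedent, and 2 if it is in the consequent.
    """
    antecedent = {i for i, v in enumerate(s, start=1) if v == 1}
    consequent = {i for i, v in enumerate(s, start=1) if v == 2}
    return antecedent, consequent


def encode_rule(antecedent, consequent, n_items):
    """Inverse of decode_rule for a database with n_items items."""
    if antecedent & consequent:
        raise ValueError("an item cannot appear on both sides of a rule")
    return [1 if i in antecedent else 2 if i in consequent else 0
            for i in range(1, n_items + 1)]


# Example with five items: the vector below encodes the rule {1, 4} -> {3}.
solution = [1, 0, 2, 1, 0]
lhs, rhs = decode_rule(solution)
```

Because every position is read on its own, two solutions can be compared item by item, which is what makes this representation convenient for a pairwise measure between rules.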
2.11 Deep learning in data mining
Deep learning is a set of techniques in the machine learning domain that attempt to model high-level abstractions in data using architectures composed of multiple nonlinear transformations. Deep learning models have attained improved results in computer vision and speech recognition applications. A convolutional neural network (CNN) is a conventional deep learning model employed for intelligent data mining. A CNN is a neural network built from numerous layers with convolution filters, in which each computation unit responds to a local region of the input data. CNNs are applicable in various domains such as document classification, sentence behavior, product feature mining, and so on. Numerous machine learning methods use shallow architectures, such as shallow neural networks, kernel logistic regression, support vector machines (SVM), and many others. However, these methods are sometimes unable to mine meaningful patterns from high-dimensional input such as video files or images. Backpropagation and stochastic gradient descent are the two techniques employed for learning the weights in a neural network. Backpropagation is a simple method for adjusting the weights of an artificial neural network (ANN) during training. It alters the weights on the basis of the final result by computing the gradient of the loss function with respect to the weights; when this gradient is estimated from randomly selected samples, the procedure is termed stochastic. Like the activation function, which defines a node's output for a given set of inputs, the loss function must be differentiable. A common choice of loss function is the negative log-likelihood. The optimization technique that reduces the loss function is termed gradient descent. Here, the gradient of the loss function is used to adjust the function parameters iteratively until a minimum is reached.
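The iterative parameter adjustment of gradient descent can be sketched in a few lines. The quadratic loss, the starting point, the learning rate of 0.1, and the step count are assumptions chosen only to keep the example small; a real network would use a loss such as the negative log-likelihood and gradients obtained by backpropagation.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=200):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(n_steps):
        # The learning rate scales how far each step moves.
        w -= learning_rate * grad(w)
    return w


def loss(w):
    """Toy differentiable loss (assumption) with its minimum at w = 3."""
    return (w - 3.0) ** 2


def loss_gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)


w_min = gradient_descent(loss_gradient, w0=0.0)
```

With this loss, each step shrinks the distance to the minimum by the factor 1 - 2 × 0.1 = 0.8, so the iterate approaches w = 3 geometrically; a learning rate above 1.0 would make the same iteration diverge.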
The learning rate determines the size of each step taken toward a local minimum. Gradient descent is an optimization process that uses gradient information to adjust the parameters of a function. It is a general-purpose optimization method that can be applied to minimize a given objective function. A deep ANN is modeled using several layers of nonlinear processing units. Here, the deep learning model consists of a stack of simpler modules, each of which is subject to learning and computes an input-output mapping. Each layer presents a set of features at a higher level, forming a hierarchical representation. Here, the layer structure of the ANN is specified by engineers, while the weighted connections are trained using a learning scheme. Other machine learning models can be extended to perform deep learning; however, the majority of the methods used are ANNs. For example, when a deep learning method is fed with an image, the first layer extracts low-level features such as gradients, corners, and edges. The second layer uses that information to identify shapes. Finally, the classifier learns the arrangement of the shapes to identify the objects present in the image. Deep learning is considered a time-sensitive task that captures fine details such as facial traits, the background of a picture, lighting that depends on the time of day, and other objects in the picture. Deep learning addresses a fundamental credit assignment problem. Learning, or credit assignment, means altering the weights that prepare the NN for
exhibiting the desired behavior. In contrast to shallow algorithms, deep algorithms are characterized by the number of parameterized transformations that a signal encounters between the input layer and the output layer. Also, in a recurrent ANN, a signal might pass through a layer more than once. Usually, deep learning is meant to assign credit across numerous such stages.
1. CNN
There exist numerous types of deep learners, among which the CNN is much simpler to train and generalizes better than networks that are fully connected between adjacent layers. The capacity of a CNN can be controlled by altering its breadth and depth, and CNNs make strong and mostly accurate assumptions regarding the nature of images, namely, stationarity of statistics and locality of dependencies between pixels. A CNN is a backpropagation network that consists of several stages. The convolutional stage detects local conjunctions of features, and the pooling stage merges semantically similar features into one unit. The pooling stages are frequently followed by fully connected layers. As the nonlinearity, sigmoid functions or, alternatively, rectified linear units are used in the CNN. These networks exploit spatially local correlations by imposing a local connectivity pattern between adjoining layers. Each unit in the first hidden layer is connected to a subset of the input layer, and the CNN tiles such subsets over the complete visual field. The units of a feature map share the same weights and bias. A feature map is a discrete representation of the image with low-dimensional size.
2. Recurrent neural networks (RNNs)
An RNN is a type of ANN that contains directed cycles between its units. The RNN is considered a leading machine learning method owing to its ability to learn and carry out complex transformations of data over extended periods of time. For tasks that involve sequential inputs, such as language and speech, it is often better to use RNNs. An RNN maintains an internal state that exhibits dynamic temporal behavior.
An RNN processes an input sequence one element at a time, maintaining in its hidden units a "state vector" that implicitly contains information about the history of all past elements of the sequence. An RNN is a type of supervised machine learning model composed of artificial neurons with a recurrent loop known as a feedback loop. The feedback loops are recurrent cycles that run over time in a sequential manner. An RNN is trained in a supervised fashion, which requires a training dataset of input-target pairs. The goal of training is to reduce the difference between the network output and the target. An RNN is able to process sequential inputs by means of a recurrent hidden state, whose activation at each step depends on that of the previous step. In this way the network exhibits dynamic temporal behavior. Moreover, the recurrent layer feeds in the information from previous time steps and combines the resulting output with the input of the current time step, which means that the order of the input information plays a vital role in processing. The algorithm permits continuous improvement of the required output and thus makes the network more
intelligent with each new update. In addition, the RNN is applicable in the many fields in which data is represented as a sequence.
Thus data mining is an important force in economic development. It extracts useful knowledge and rules for decision-making from large databases or data warehouses. ARM [4] is a very important research subject in data mining and is extensively used in fields such as business and industry. Association rules can test a knowledge model as well as discover new laws in a large data pool. Effectively finding, understanding, and using association rules is crucial to accomplishing the data mining task, so ARM has important theoretical and practical value.
References
[1] Ekhool Online learning management system from https://ekhool.com/, accessed on January 2020.
[2] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large databases. Morgan Kaufmann Publishers Inc; 1994. pp. 487–99.
[3] Qin Y. Novelty evaluation of association rules. Res Appl Comput 2004;1:17–19.
[4] Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD conference on management of data. ACM Press; 1993. pp. 207–16.
[5] Su Z-D, You FC, Yang BR. Comprehensive evaluation method of association rules and a practical example. Comput Appl 2004;24(10):17–20.
[6] Han J, Pei J, Yin Y, et al. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. In: Proceedings of ACM SIGMOD international conference on management of data. ACM Press; 2000. pp. 1–12.
[7] Yang XS, Deb S. Engineering optimization by cuckoo search. Int J Math Model Numer Optim 2010;1(4):330–43.
3 Unsupervised learning methods for data clustering
Satish Chander1, P. Vijaya2
1 Department of Computer Science & Engineering, Birla Institute of Technology, Mesra, Ranchi, India
2 Department of Computer Science & Mathematics, Modern College of Business and Science, Muscat, Oman
3.1 Data clustering
Categorizing objects according to their apparent similarity is a basic practice in data science. Clustering is a procedure that groups similar entities together. Organizing data into sensible groupings is an essential way of acquiring knowledge through learning.
3.1.1 Why clustering?
Grouping similar entities together helps to outline the attributes of the various groups. Moreover, clustering provides insight into the different patterns of different groups. There are several applications for categorizing unlabeled data; for instance, one can identify different segments or groups of customers and market to each of them so as to maximize revenue. Another instance is grouping documents that belong to the same topic. Furthermore, clustering is employed for reducing data dimensionality when handling a large number of variables. By definition, data clustering is an exploratory and descriptive data analysis method that has attracted the interest of many researchers in domains that include pattern recognition, statistics, and data mining. It is a descriptive way of examining datasets that consist of several types of data. These datasets differ from each other with respect to size, dimensions, and count of objects. Data clustering is considered an important method in data mining, in which the focus is on massive datasets with unknown structure. Clustering is an essential unsupervised learning problem with relevance in various domains. The study of data clustering is termed cluster analysis, which helps to determine the "natural" grouping(s) of a set of objects, points, or patterns. Cluster analysis is prevalent in any discipline that involves the processing of multivariate data. In pattern recognition problems of this kind, the training data consists of a set of input vectors without corresponding target values.
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00002-1 © 2021 Elsevier Inc. All rights reserved.
The purpose of unsupervised learning methods may be to detect groups of similar instances within the data, which is known as clustering, or to determine the distribution of the data in the input space, which is termed density estimation. For n samples x1 to xn, the true class labels are not provided for any sample, which is why this setting is termed learning without a teacher. The absence of category information distinguishes cluster analysis (unsupervised learning) from discriminant analysis (supervised learning). The goal of cluster analysis is to discover a convenient and valid organization of the data, not to devise rules for categorizing future data into different sets.
3.1.2 Fundamentals of cluster analysis
The intuitive idea behind cluster analysis is simple, but the successful completion of the task presupposes a large number of correct decisions and choices among several alternatives. There are nine fundamental elements in a complete cluster analysis study, through which the final result is obtained. Because real-world datasets contain missing values, this list is completed here with two further elements, data presentation and a missing data strategy, giving the following:
• Presentation of data.
• Selection of objects.
• Selection of variables.
• What to cluster: variables or data units?
• Normalization of variables.
• Selection of (dis)similarity measures.
• Selection of clustering criterion (objective function).
• Selection of missing data strategy.
• Techniques and computer execution (and its reliability, such as convergence).
• Count of clusters.
• Understanding of results.
3.1.3 Needs of unsupervised learning: Why?
Annotating massive datasets is expensive, and thus only a small number of instances can be labeled manually; speech recognition is an example. There are also several settings, such as data mining, in which it is not known into how many classes the data is partitioned. Clustering is used to gain insight into the structure of the data before devising a classifier. Unsupervised learning is categorized into two classes:
• Parametric unsupervised learning
In this case, a parametric distribution of the data is assumed: the sample data is taken to come from a population that follows a probability distribution governed by a fixed set of parameters. For the normal family of distributions, for example, all members have the same shape and are parameterized by the mean and standard deviation. It means that if you know the standard deviation
and mean, and that the distribution is normal, then you can obtain the probability of any future observation. Parametric unsupervised learning includes the expectation-maximization (EM) algorithm, which is employed with Gaussian mixture models to predict the class of each sample. This case is harder than standard supervised learning because no answer labels are available, and thus there is no direct measure with which to validate the result.
• Nonparametric unsupervised learning
In nonparametric unsupervised learning the data is grouped into clusters, and each cluster conveys information about the classes present in the data. This method is usually used for modeling and analyzing data with small sample sizes. A nonparametric model does not require the modeler to make any assumptions about the population distribution and is therefore known as a distribution-free method.
Several taxonomies of clustering methods have been proposed in the literature. However, because of the large number and strong diversity of existing clustering techniques, it is infeasible to obtain a categorization that is both complete and meaningful. The methods are therefore classified by concentrating on selected discriminating criteria, as represented in Fig. 3–1.
3.1.4 Partitional clustering
Partitioning-based clustering techniques are effective techniques based on the iterative relocation of data points between clusters. The quality of the solution is measured by a clustering criterion. At each iteration, iterative relocation techniques reduce the value of the criterion function until convergence. By changing the clustering criterion, it becomes feasible to devise robust clustering techniques that are less sensitive to erroneous and missing data values than conventional techniques. The goal of partitional clustering techniques is to generate a single partition of a collection of items into clusters. All of these techniques
FIGURE 3–1 Nomenclature of clustering methods.
are devised on the basis of iterative optimization of a criterion function that reflects the agreement between the data and the partition. Squared-error techniques represent each cluster by a prototype and try to minimize a cost function that is the sum, over all data items, of the squared distance between the item and the prototype of its cluster. Generally, the prototypes are the cluster centroids, which gives the popular k-means algorithm.
a. k-means algorithm
k-means is an effective unsupervised learning algorithm that addresses the clustering problem. The purpose of the algorithm is to find groups in the data, with the number of groups denoted by k. The technique works iteratively to assign each data point to one of the k groups on the basis of the features provided. The data is clustered on the basis of feature similarity. The outcomes of the k-means clustering algorithm are as follows:
(1) The centroids of the k clusters, which can be used for labeling new data.
(2) Labels for the data, with each data point allocated to a single cluster.
Each cluster centroid is a collection of feature values that describes the resulting group. Examining the centroid feature weights can be used to interpret qualitatively the type of group that each cluster represents. The procedure follows a simple way of classifying a given dataset through a certain number of clusters fixed a priori. The goal is to define k centers, one for each cluster. The centers should be placed carefully, because different locations produce different outcomes. Thus the best choice is to place them as far from each other as possible.
• Algorithmic steps for k-means clustering:
Step 1: Select k cluster centers to coincide with k randomly selected patterns or k randomly defined points inside the hypervolume containing the pattern set.
Step 2: Allocate each pattern to the nearest cluster center.
Step 3: Recompute the cluster centers using the current cluster memberships.
Step 4: If the convergence criterion is not satisfied, go to step 2. Typical convergence criteria are no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.
• Step-by-step process
The step-by-step process for finding cluster centroids is depicted in Fig. 3–2. Several variants of the k-means algorithm have been reported in the literature. Some of them try to choose a good initial partition so that the technique is more likely to find the global minimum value. Another variation is to allow splitting and merging of the resulting clusters. A cluster is split when its variance is above a prespecified threshold, and two clusters are merged when the distance between their centroids is below another prespecified threshold. With this variant, it becomes possible to obtain the optimal partition starting from an arbitrary initial partition, provided that proper threshold values are specified. The well-known ISODATA algorithm employs this technique of merging and splitting clusters. If ISODATA is given the ellipse partitioning shown in Fig. 3–3 as the initial partitioning, it will produce a three-cluster partitioning.
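The four algorithmic steps listed above can be sketched as a short function. This is a minimal illustration rather than a production implementation: the initial centers are passed in explicitly instead of being drawn at random (step 1), the function and variable names are hypothetical, and the data points are the seven individuals of Table 3–1 so that the result can be compared with the worked example that follows.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means: assign to the nearest center, then recompute the means."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Step 2: allocate each pattern to the nearest cluster center.
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Step 4: stop when no pattern is reassigned.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center from its current members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Scores (A, B) of the seven individuals in Table 3-1.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]

# Individuals 1 and 4 serve as the initial centers.
labels, centers = kmeans(points, [points[0], points[3]])
```

Started from individuals 1 and 4, the loop ends with individuals 1 and 2 in one cluster and individuals 3 to 7 in the other, with centroids (1.25, 1.5) and (3.9, 5.1), which agrees with the final partition of the worked example up to rounding.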
FIGURE 3–2 Step-by-step process for finding cluster centroids.
FIGURE 3–3 K-means algorithm sensitive to initial partition.
ISODATA first merges the clusters {A} and {B, C} into one cluster because the distance between their centroids is small, and then splits the cluster {D, E, F, G}, which has a large variance, into the two clusters {D, E} and {F, G}.
• Step-by-step example
To illustrate the k-means algorithm, consider the following dataset containing the scores of seven individuals on two variables, as depicted in Table 3–1.
Table 3–1 Scores of two variables with seven individuals.
Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5

Table 3–2 Values of two individuals using Euclidean distance.
           Individual    Mean vector (centroid)
Group 1    1             (1.0, 1.0)
Group 2    4             (5.0, 7.0)

Table 3–3 Recomputed mean vector.
        Cluster 1                                 Cluster 2
Step    Individual    Mean vector (centroid)      Individual    Mean vector (centroid)
1       1             (1.0, 1.0)                  4             (5.0, 7.0)
2       1, 2          (1.3, 1.5)                  4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)                  4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)                  4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)                  4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)                  4, 5, 6, 7    (4.1, 5.4)

Table 3–4 Obtained clusters.
             Individual    Mean vector (centroid)
Cluster 1    1, 2, 3       (1.8, 2.3)
Cluster 2    4, 5, 6, 7    (4.1, 5.4)
This data should be grouped into two clusters. As a first step, an initial partition must be determined. Let the A and B values of the two individuals that are farthest apart (by the Euclidean distance measure) define the initial cluster means, as given in Table 3–2. The remaining individuals are examined in sequence and assigned to the cluster whose mean is nearest in terms of Euclidean distance. The mean vector is recomputed each time a new member is added, which leads to the sequence of steps depicted in Table 3–3. At this point the initial partition has changed, and the two clusters have the characteristics depicted in Table 3–4.
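The sequential pass summarized in Table 3–3 can be reproduced with the sketch below. It assumes, as the text describes, that the two individuals farthest apart seed the clusters and that each mean is updated as soon as a member joins. The function name and the incremental form of the mean update are choices made for this illustration.

```python
import math
from itertools import combinations


def sequential_pass(points):
    """Seed two clusters with the farthest pair, then assign the rest in order."""
    # Initial means: the two individuals with the largest Euclidean distance.
    a, b = max(combinations(range(len(points)), 2),
               key=lambda ij: math.dist(points[ij[0]], points[ij[1]]))
    members = [[a], [b]]
    means = [tuple(points[a]), tuple(points[b])]
    for i, p in enumerate(points):
        if i in (a, b):
            continue
        # Assign to the cluster whose current mean is nearest.
        j = 0 if math.dist(p, means[0]) <= math.dist(p, means[1]) else 1
        members[j].append(i)
        n = len(members[j])
        # Incremental mean update: new_mean = old_mean + (p - old_mean) / n.
        means[j] = tuple(m + (x - m) / n for m, x in zip(means[j], p))
    return members, means


# Scores (A, B) of the seven individuals in Table 3-1 (0-based indices below).
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
members, means = sequential_pass(points)
```

The pass ends with individuals 1, 2, 3 in the first cluster and 4, 5, 6, 7 in the second, with means of about (1.83, 2.33) and (4.13, 5.38), matching the last row of Table 3–3 up to rounding.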
It is not yet certain that each individual has been allocated to the correct cluster. Hence, the distance of each individual to its own cluster mean and to that of the opposite cluster is compared, as shown in Table 3–5.

Table 3–5 Comparison of each individual distance to its own cluster mean.
Individual    Distance to mean (centroid) of cluster 1    Distance to mean (centroid) of cluster 2
1             1.5                                         5.4
2             0.4                                         4.3
3             2.1                                         1.8
4             5.7                                         1.8
5             3.2                                         0.7
6             3.8                                         0.6
7             2.8                                         1.1

Only individual 3 is nearer to the mean of the opposite cluster (cluster 2) than to that of its own cluster (cluster 1). Since each individual's distance to its own cluster mean should be smaller than its distance to the other cluster mean, individual 3 is relocated to cluster 2, which results in the new partition shown in Table 3–6.

Table 3–6 New partition clusters.
             Individual       Mean vector (centroid)
Cluster 1    1, 2             (1.3, 1.5)
Cluster 2    3, 4, 5, 6, 7    (3.9, 5.1)

The iterative relocation would continue from this new partition until no more relocations occur. In this instance, however, each individual is now nearer to its own cluster mean than to that of the other cluster, so the iteration stops and the latest partition is taken as the final cluster solution.
• Selection of k
The algorithm described above finds the clusters and labels of a dataset for a particular prechosen k. To determine the number of clusters in the data, the user needs to run the k-means clustering technique for a range of k values and compare the outcomes. In general, there is no method for determining the exact value of k, but a reasonable estimate can be obtained with the following techniques. One measure commonly used to compare results across different values of k is the mean distance between the data points and their cluster centroid. Because increasing the number of clusters always reduces the distance to the data points, increasing k always decreases this measure. Hence, this measure cannot be used as the sole target for determining the number of clusters. Instead, the mean distance to the centroid is plotted as a function of k, and the "elbow point," where the rate of decrease shifts sharply, can be used to roughly determine k. Various other methods exist for validating k, including information criteria, cross-validation, the silhouette method, the information-theoretic jump method, and the G-means
algorithm. Moreover, observing how the data points are distributed across the groups offers insight into how the algorithm splits the data for each k.
• Applications
The k-means algorithm is well known and is used in several domains, including document clustering, market segmentation, image compression, and image segmentation.
b. k-medoids clustering
Several solutions exist for scenarios in which a centroid cannot be defined, among them the k-medoids method, in which the prototype of a cluster is the item that is most central to the cluster. k-medoids clustering is a variant of k-means that is more robust to outliers and noise. Instead of using the mean point as the cluster center, k-medoids uses an actual point in the cluster to represent it. The medoid is the most centrally located object of the cluster, with the minimum sum of distances to the other points. The mean is strongly influenced by outliers and hence may not represent the true cluster center, whereas the medoid is robust to outliers and represents the cluster center faithfully. The k-medoids method uses partitioning around medoids to cluster the data. The basic idea is as follows: choose k representative points to form the initial clusters and then repeatedly move to better cluster representatives. All possible combinations of representative points are evaluated, and the quality of the resulting clustering is computed for each pair. An original representative point is replaced by the new point that causes the greatest reduction in the distortion function. At each iteration, the set of best points for each cluster forms the new respective medoids.
c. k-modes technique
Another variant of the k-means algorithm is the k-modes method, which extends k-means to categorical data. It uses a dissimilarity measure, namely the total number of mismatches between two objects, and uses modes in place of means. A mode is a vector of elements that minimizes the dissimilarities between itself and the objects of its cluster. The technique uses as many modes as the number of clusters required, since the modes act as centroids. When the squared error criterion is used with the Mahalanobis or Minkowski metric, clusters take an elliptic shape. Using different prototypes or distance measures for each cluster can remove this restriction.
d. Fuzzy c-means (FCM)
Fuzzy clustering techniques are beneficial when a dataset contains subgroups of points with indistinct boundaries and overlap between clusters. Fuzzy versions of the squared-error methods have been devised, the best known being FCM. Compared with their crisp counterparts, fuzzy techniques are more successful at avoiding local minima of the cost function and can model situations in which clusters actually overlap.
To make the clustering outcome less sensitive to outliers and isolated data items, many fuzzy solutions based on robust statistics or noise clusters have been put forward. FCM is a clustering technique that permits one piece of data to belong to two or more clusters, and it is commonly used in pattern recognition. The algorithm assigns a membership to each data point for each cluster center on the basis of the distance between the data point and the cluster center. The nearer the data point is to a cluster center, the higher its membership toward that cluster center. The goal of FCM is to determine membership values that reduce the within-cluster distances and increase the between-cluster distances. Evidently, the memberships of each data point must sum to one. As mentioned earlier, data are bound to each cluster by means of a membership function, which represents the fuzzy behavior of the algorithm. For this purpose a matrix U is built, whose entries lie between 0 and 1 and indicate the degree of membership between data and cluster centers. Consider, for instance, two clusters, referred to as "A" and "B," identified from the proximity of the data. In the FCM method a given data point does not belong exclusively to one well-defined cluster; it can be placed somewhere in between. Here, the membership function follows a smoother line, indicating that each data point may belong to several clusters with different membership coefficients.
• Algorithmic steps for FCM clustering
Step 1: Initialize the membership matrix randomly, subject to the constraint:
$$\sum_{j=1}^{C} \mu_j(x_i) = 1, \qquad i = 1, 2, \ldots, k \tag{3.1}$$
Step 2: Evaluate the centroids using the equation:
$$C_j = \frac{\sum_i \left[\mu_j(x_i)\right]^m x_i}{\sum_i \left[\mu_j(x_i)\right]^m} \tag{3.2}$$
Step 3: Compute the dissimilarity between the data points and the centroids using the Euclidean distance:
$$D_i = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{3.3}$$
Step 4: Update the membership matrix using the equation:
$$\mu_j(x_i) = \frac{\left[\dfrac{1}{d_{ji}}\right]^{1/(m-1)}}{\sum_{k=1}^{C} \left[\dfrac{1}{d_{ki}}\right]^{1/(m-1)}} \tag{3.4}$$
where m represents the fuzzification parameter. The range of m is [1.25, 2].
Step 5: Go back to Step 2 and repeat until the centroids no longer change.

• Numerical example
Consider Table 3–7 as an example with 6 objects and 2 clusters.

Table 3–7 Example with number of objects = 6 and number of clusters = 2.
X    Y    C1     C2
1    6    0.8    0.2
2    5    0.9    0.1
3    8    0.7    0.3
4    4    0.3    0.7
5    7    0.5    0.5
6    9    0.2    0.8

Step 1: Initialize the membership matrix. The initial memberships are the C1 and C2 columns of Table 3–7.
Step 2: Determine the centroids using the following equation (the memberships are squared, that is, m = 2):

$$C_j = \left[\frac{\sum_i \left[\mu_j(x_i)\right]^m x_i}{\sum_i \left[\mu_j(x_i)\right]^m},\ \frac{\sum_i \left[\mu_j(y_i)\right]^m y_i}{\sum_i \left[\mu_j(y_i)\right]^m}\right] \tag{3.5}$$

$$C_1 = \left[\frac{1 \times 0.8^2 + 2 \times 0.9^2 + 3 \times 0.7^2 + 4 \times 0.3^2 + 5 \times 0.5^2 + 6 \times 0.2^2}{0.8^2 + 0.9^2 + 0.7^2 + 0.3^2 + 0.5^2 + 0.2^2},\ \frac{6 \times 0.8^2 + 5 \times 0.9^2 + 8 \times 0.7^2 + 4 \times 0.3^2 + 7 \times 0.5^2 + 9 \times 0.2^2}{0.8^2 + 0.9^2 + 0.7^2 + 0.3^2 + 0.5^2 + 0.2^2}\right]$$

$$C_1 = \left[\frac{5.58}{2.32},\ \frac{14.28}{2.32}\right] = (2.4, 6.1)$$

$$C_2 = \left[\frac{1 \times 0.2^2 + 2 \times 0.1^2 + 3 \times 0.3^2 + 4 \times 0.7^2 + 5 \times 0.5^2 + 6 \times 0.8^2}{0.2^2 + 0.1^2 + 0.3^2 + 0.7^2 + 0.5^2 + 0.8^2},\ \frac{6 \times 0.2^2 + 5 \times 0.1^2 + 8 \times 0.3^2 + 4 \times 0.7^2 + 7 \times 0.5^2 + 9 \times 0.8^2}{0.2^2 + 0.1^2 + 0.3^2 + 0.7^2 + 0.5^2 + 0.8^2}\right]$$

$$C_2 = \left[\frac{7.38}{1.52},\ \frac{10.48}{1.52}\right] = (4.8, 6.8)$$

Step 3: Determine the distance between each data point and each centroid:

$$D_i = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \tag{3.6}$$

Centroid 1:

$$\begin{aligned}
d\big((1, 6), (2.4, 6.1)\big) &= \sqrt{(1.4)^2 + (0.1)^2} = \sqrt{1.96 + 0.01} = \sqrt{1.97} = 1.40 \\
d\big((2, 5), (2.4, 6.1)\big) &= \sqrt{0.16 + 1.21} = \sqrt{1.37} = 1.17 \\
d\big((3, 8), (2.4, 6.1)\big) &= \sqrt{0.36 + 3.61} = \sqrt{3.97} = 1.99 \\
d\big((4, 4), (2.4, 6.1)\big) &= \sqrt{2.56 + 4.41} = \sqrt{6.97} = 2.64 \\
d\big((5, 7), (2.4, 6.1)\big) &= \sqrt{6.76 + 0.81} = \sqrt{7.57} = 2.75 \\
d\big((6, 9), (2.4, 6.1)\big) &= \sqrt{12.96 + 8.41} = \sqrt{21.37} = 4.62
\end{aligned}$$

Centroid 2:

$$\begin{aligned}
d\big((1, 6), (4.8, 6.8)\big) &= \sqrt{14.44 + 0.64} = \sqrt{15.08} = 3.88 \\
d\big((2, 5), (4.8, 6.8)\big) &= \sqrt{7.84 + 3.24} = \sqrt{11.08} = 3.32 \\
d\big((3, 8), (4.8, 6.8)\big) &= \sqrt{3.24 + 1.44} = \sqrt{4.68} = 2.16 \\
d\big((4, 4), (4.8, 6.8)\big) &= \sqrt{0.64 + 7.84} = \sqrt{8.48} = 2.91 \\
d\big((5, 7), (4.8, 6.8)\big) &= \sqrt{0.04 + 0.04} = \sqrt{0.08} = 0.28 \\
d\big((6, 9), (4.8, 6.8)\big) &= \sqrt{1.44 + 4.84} = \sqrt{6.28} = 2.50
\end{aligned}$$
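Steps 2 and 3 of the numerical example, together with the membership update of Eq. (3.4), can be sketched as a single FCM pass. This is a minimal illustration rather than a reference implementation: the function name `fcm_iteration` and the variable names are hypothetical, the data are copied from Table 3–7, and m = 2 is assumed. The sketch keeps the centroids unrounded, whereas the text truncates them to one decimal, so the computed distances differ slightly from the hand calculation.

```python
import math

# Data and initial memberships copied from Table 3-7.
POINTS = [(1, 6), (2, 5), (3, 8), (4, 4), (5, 7), (6, 9)]
U0 = [(0.8, 0.2), (0.9, 0.1), (0.7, 0.3), (0.3, 0.7), (0.5, 0.5), (0.2, 0.8)]


def fcm_iteration(points, memberships, m=2.0):
    """One FCM pass: centroids (Eq. 3.2), distances (Eq. 3.3), memberships (Eq. 3.4)."""
    n_clusters = len(memberships[0])
    n_dims = len(points[0])

    # Step 2: each centroid is the mean of the points weighted by membership**m.
    centroids = []
    for j in range(n_clusters):
        weights = [u[j] ** m for u in memberships]
        total = sum(weights)
        centroids.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(n_dims)
        ))

    # Step 3: Euclidean distance from every point to every centroid.
    distances = [[math.dist(p, c) for c in centroids] for p in points]

    # Step 4: normalized inverse distances with exponent 1/(m-1), as in Eq. (3.4).
    # Assumes m > 1 and that no point coincides with a centroid (distance 0).
    exponent = 1.0 / (m - 1.0)
    updated = []
    for row in distances:
        inverse = [(1.0 / d) ** exponent for d in row]
        norm = sum(inverse)
        updated.append([v / norm for v in inverse])
    return centroids, distances, updated


centroids, distances, memberships = fcm_iteration(POINTS, U0)
for point, row in zip(POINTS, memberships):
    print(point, [round(v, 2) for v in row])
```

Calling `fcm_iteration` repeatedly on its own output, until the centroids stop moving, corresponds to Step 5 of the algorithm.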
The distances between the data points and the two centroids are given in Table 3–8.

Table 3–8 Distance between data points and centroids.
Data point    Distance to centroid 1    Distance to centroid 2
(1, 6)        1.40                      3.88
(2, 5)        1.17                      3.32
(3, 8)        1.99                      2.16
(4, 4)        2.64                      2.91
(5, 7)        2.75                      0.28
(6, 9)        4.62                      2.50

Step 4: Update the membership value

$$\mu_j(x_i) = \frac{\left[1/d_{ji}\right]^{1/(m-1)}}{\sum_{k=1}^{C} \left[1/d_{ki}\right]^{1/(m-1)}} \tag{3.7}$$

where m = 2. In the calculations below, the first subscript denotes the cluster and the second the data point, so that μ12 is the membership of the second data point in the first cluster and d12 is its distance from the first centroid.

Cluster 1

$$\begin{aligned}
\mu_{11} &= \frac{1/d_{11}}{1/d_{11} + 1/d_{21}} = \frac{1/1.40}{1/1.40 + 1/3.88} = \frac{0.71}{0.71 + 0.25} = \frac{0.71}{0.96} = 0.7 \\
\mu_{12} &= \frac{1/d_{12}}{1/d_{12} + 1/d_{22}} = \frac{1/1.17}{1/1.17 + 1/3.32} = \frac{0.85}{0.85 + 0.30} = \frac{0.85}{1.15} = 0.7 \\
\mu_{13} &= \frac{1/d_{13}}{1/d_{13} + 1/d_{23}} = \frac{1/1.99}{1/1.99 + 1/2.16} = \frac{0.50}{0.50 + 0.46} = \frac{0.50}{0.96} = 0.5 \\
\mu_{14} &= \frac{1/d_{14}}{1/d_{14} + 1/d_{24}} = \frac{1/2.64}{1/2.64 + 1/2.91} = \frac{0.37}{0.37 + 0.34} = \frac{0.37}{0.71} = 0.5 \\
\mu_{15} &= \frac{1/d_{15}}{1/d_{15} + 1/d_{25}} = \frac{1/2.75}{1/2.75 + 1/0.28} = \frac{0.36}{0.36 + 3.57} = \frac{0.36}{3.93} = 0.1 \\
\mu_{16} &= \frac{1/d_{16}}{1/d_{16} + 1/d_{26}} = \frac{1/4.62}{1/4.62 + 1/2.50} = \frac{0.21}{0.21 + 0.40} = \frac{0.21}{0.61} = 0.3
\end{aligned}$$

Cluster 2

$$\begin{aligned}
\mu_{21} &= \frac{1/d_{21}}{1/d_{11} + 1/d_{21}} = \frac{1/3.88}{1/1.40 + 1/3.88} = \frac{0.25}{0.71 + 0.25} = \frac{0.25}{0.96} = 0.3 \\
\mu_{22} &= \frac{1/d_{22}}{1/d_{12} + 1/d_{22}} = \frac{1/3.32}{1/1.17 + 1/3.32} = \frac{0.30}{0.85 + 0.30} = \frac{0.30}{1.15} = 0.3 \\
\mu_{23} &= \frac{1/d_{23}}{1/d_{13} + 1/d_{23}} = \frac{1/2.16}{1/1.99 + 1/2.16} = \frac{0.46}{0.50 + 0.46} = \frac{0.46}{0.96} = 0.5 \\
\mu_{24} &= \frac{1/d_{24}}{1/d_{14} + 1/d_{24}} = \frac{1/2.91}{1/2.64 + 1/2.91} = \frac{0.34}{0.37 + 0.34} = \frac{0.34}{0.71} = 0.5 \\
\mu_{25} &= \frac{1/d_{25}}{1/d_{15} + 1/d_{25}} = \frac{1/0.28}{1/2.75 + 1/0.28} = \frac{3.57}{0.36 + 3.57} = \frac{3.57}{3.93} = 0.9 \\
\mu_{26} &= \frac{1/d_{26}}{1/d_{16} + 1/d_{26}} = \frac{1/2.50}{1/4.62 + 1/2.50} = \frac{0.40}{0.21 + 0.40} = \frac{0.40}{0.61} = 0.7
\end{aligned}$$

The new membership values are given in Table 3–9.

Table 3–9 New membership value.
X    Y    C1     C2
1    6    0.7    0.3
2    5    0.7    0.3
3    8    0.5    0.5
4    4    0.5    0.5
5    7    0.1    0.9
6    9    0.3    0.7

Step 5: The process is repeated until the same centroids are obtained.

Merits: FCM gives the best results for overlapping datasets and is moderately superior to the k-means algorithm.

Numerous classical techniques assume that the number of clusters is known before the clustering process. Techniques for determining the appropriate number of clusters still need to be developed, and this is a significant issue for partitional clustering. For techniques based on the squared error, the issue is partly addressed by adding a regularization term to the cost function.

e. Graph-theoretic clustering
Graph-theoretic techniques for clustering have led to enhanced performance. The widespread use of graph-based representations in pattern recognition and machine learning is natural because of their flexibility and power. Although such representations have been in use for a long time, interest in them has been renewed recently. This renewed interest coincides with the proverbial information explosion, which demands improved techniques for understanding the huge volumes of data produced in different settings. The attention can be explained by the fact that abstracting a problem into graph-theoretic terms allows it to be crystallized in a setting with solid theoretical underpinnings. These strong foundations allow several graph techniques to be used, making manipulation simpler and more accurate and thus improving understanding. The domains of Deep Learning and Graphical Models are instances of this phenomenon in machine learning.
• Graph-theoretic clustering: The popular graph-theoretic divisive clustering technique is based on constructing a minimal spanning tree of the data and then
eliminating the minimum spanning tree (MST) edges with the greatest length to generate more clusters. Fig. 3–4 illustrates the MST generated from nine 2D points. By breaking the link labeled CD, which with a length of 6 units is the edge of maximum Euclidean length, two clusters {A, B, C} and {D, E, F, G, H, I} are generated. The second cluster can be further partitioned into two clusters by breaking the edge EF, which has a length of 4.5 units.

f. Network flow theory-based clustering
Numerous graph-theoretic methods have been devised for cluster analysis. One such method is based on network flow theory. Here, minimum cuts of an undirected adjacency graph are used to partition the data into different clusters. The data to be clustered are expressed as an undirected adjacency graph G in which each vertex corresponds to a data point and an arc links two vertices if the corresponding data points are neighbors under the given neighborhood system. A flow capacity is assigned to each arc of G and is chosen to reflect the similarity between the pair of linked vertices. The clustering is obtained by removing arcs of G to form mutually exclusive subgraphs. For the unconstrained optimal k-subgraph partition of G, the arcs chosen for removal are those in the k − 1 minimum cuts with the smallest k − 1 values among all possible minimum cuts separating all pairs of vertices. The resulting k-subgraph partition is the globally optimal k-partition of the adjacency graph G: it minimizes the largest inter-subgraph maximum flow among all possible k-partitions of G and thus minimizes the similarity between the subgraphs. The difficulty in reaching a globally optimal solution for a partitional clustering arises from the huge number of possible k-partitions. Locally optimal solutions are found with hill-climbing and iterative techniques.
Efforts have also been made to recognize and reject large numbers of evidently nonoptimal partitions with techniques such as branch-and-bound, conditional clustering, and dynamic programming.
FIGURE 3–4 Formation of clusters using minimal spanning tree.
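The MST-based procedure illustrated in Fig. 3–4 can be sketched as follows: build the MST, delete its longest edges, and read the clusters off the connected components. The coordinates of the nine points in the figure are not given in the text, so the points below are hypothetical, as are the function names; Kruskal's algorithm is used here only as one convenient way to obtain the MST.

```python
import math
from itertools import combinations


def find(parent, i):
    """Return the representative of i's component (union-find with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def minimum_spanning_tree(points):
    """Kruskal's algorithm on the complete Euclidean graph; returns (length, i, j) edges."""
    parent = list(range(len(points)))
    candidates = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for length, i, j in candidates:
        root_i, root_j = find(parent, i), find(parent, j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((length, i, j))
    return tree


def mst_clusters(points, k):
    """Drop the k - 1 longest MST edges and return the remaining connected components."""
    tree = minimum_spanning_tree(points)  # edges come out sorted by increasing length
    kept = tree[: len(points) - k]        # n - 1 edges minus the k - 1 longest
    parent = list(range(len(points)))
    for _, i, j in kept:
        parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


# Hypothetical nine 2D points forming three visually separate groups.
POINTS = [(0, 0), (1, 0), (0, 1), (6, 0), (7, 0), (6, 1), (12, 5), (13, 5), (12, 6)]
print(mst_clusters(POINTS, 2))
print(mst_clusters(POINTS, 3))
```

Asking for one more cluster simply removes the next-longest MST edge, which mirrors the way the text first breaks CD and then EF.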
The clustering method can be executed efficiently to handle huge graphs with several hundred thousand vertices. Moreover, the method generates not only the optimal k-subgraph partition but also a nested sequence of partitions that are optimal for cluster numbers from 2 to k. This is particularly attractive because the number of clusters has to be discovered from the data. To turn the method into a data clustering method, two important issues need to be resolved. The first is finding an efficient implementation that makes the clustering technique practical, and the second is constructing an adjacency graph G that yields meaningful clusters. The minimum cuts of the undirected graph G are evaluated with the flow and cut equivalent tree T of G, which is built with the Gomory–Hu algorithm, originally devised to solve the multiterminal maximum flow problem for undirected graphs. The Gomory–Hu algorithm consists of the successive solution of exactly M − 1 maximum flow problems, with M denoting the number of vertices in G. After T has been computed, the best k-partition of G is generated by simply removing the k − 1 arcs of T with the k − 1 smallest arc capacities. This direct implementation is satisfactory for graphs of moderate size. However, the method quickly becomes infeasible as the size of G increases, owing to the polynomial complexity of the algorithm. To overcome this issue, a fast hierarchical technique has been devised that requires the construction and partition of a partially equivalent tree Tc of reduced size. The technique still yields the best solution, namely the one generated by splitting the equivalent tree of G. The algorithm is based on the observation that most of the minimum cuts of G are never used, because their associated values (the value of a cut is defined as the sum of the capacities of its arcs) are large enough that the cuts will never be removed to form subgraphs.
The majority of minimum cuts with large values can be detected within small local subgraphs, so the vertices linked by them can be condensed before the equivalent tree is built. As a result, the Gomory–Hu algorithm is applied to graphs of small size, but without compromising the overall optimality of the clustering technique. The clustering method can therefore be applied to partitioning huge graphs. There are many other instances of graph-based methods; striking examples include the significant advances in the study of social and biological networks using ideas from random graphs, and the rapid evolution of web-search techniques using link analysis, such as PageRank and the HITS algorithm, and spectral graph theory.

g. Spectral clustering
Spectral clustering is based on results from spectral graph theory and has generated much interest in recent years, as it considerably improves on conventional clustering algorithms. Many solutions have been devised for unsupervised and semisupervised learning. Spectral clustering is a graph-theoretic method of metric adaptation: it provides a more global notion of similarity among data points than other clustering techniques such as k-means. The data are represented in such a way that it becomes simpler to find meaningful clusters in the representation. It is particularly beneficial for complicated datasets in which conventional clustering techniques may fail to find groupings. Spectral
clustering does not make assumptions about how the data were generated. Instead, it finds groupings by evaluating the top eigenvectors of the affinity matrix and thus usually gives improved results. In k-means the data are used directly for processing. In spectral clustering, the data are first given a representation that provides a global encoding of the similarities between points. The similarity used in spectral clustering is expressed in the form of a graph, namely, the similarity graph. Two points are connected in the graph if the similarity, or weight, between them is nonzero or above some threshold value. The clustering problem is then restated in terms of the similarity graph: graph partitions are sought such that the weights between points of the same group are high and those between points in different groups are low.
The hierarchical methods are also related to graph-theoretic clustering. Single-link clusters are subgraphs of a minimal spanning tree of the data and correspond to its connected components. Complete-link clusters are maximal complete subgraphs and are related to the node colorability of graphs. Maximal complete subgraphs are regarded as the strictest definition of a cluster. Graph-oriented techniques for nonhierarchical structures and overlapping clusters have also been widely adopted for clustering. The Delaunay graph (DG) is generated by connecting all pairs of points that are Voronoi neighbors. The DG contains all the neighbor information present in the MST and the relative neighborhood graph.
3.2 Mode seeking and mixture-resolving algorithms
The mixture-resolving approach to cluster analysis has been addressed in numerous ways. The fundamental assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each distribution and, possibly, their number. The finite mixture model offers a flexible method for the statistical modeling of phenomena characterized by unobserved heterogeneity in various application domains. The data may also be complicated. For example, they can be multimodal, meaning that there are several modes, or regions of high probability mass, separated by regions of lower probability mass. Such data can be modeled as a mixture of several components in which each component has a simple parametric form, such as a Gaussian. It is assumed that each data point belongs to one component, and the distribution of each component is inferred separately. Mixture-resolving techniques thus assume that the data in the clusters are drawn from different distributions and try to estimate the parameters of all of them. The introduction of the expectation-maximization (EM) algorithm was an important step in solving the parameter estimation problem. Mixture-resolving methods make assumptions about the distribution of the data. The choice of the number of clusters for these techniques has been analyzed in recent work, and noise models can be considered explicitly.
3.2.1 Gaussian mixture models
Gaussian mixture models are probabilistic models that represent normally distributed subpopulations within an overall population. Mixture models learn the subpopulations automatically, without knowing which subpopulation a data point belongs to. Because the subpopulation assignment is unknown, this constitutes a form of unsupervised learning. Gaussian mixture models are widely used in data mining, pattern recognition, machine learning, and statistical analysis. In many applications, their parameters are estimated by maximum likelihood with the EM algorithm, and the component assignments are modeled as latent variables.

• Learning the model
If the number of components k is known, EM is the method most frequently used to estimate the parameters of the mixture model. In frequentist probability theory, models are typically learned by maximum likelihood estimation, which seeks to maximize the probability, or likelihood, of the observed data given the model parameters. However, finding the maximum likelihood solution for mixture models by differentiating the log-likelihood and solving for 0 is usually analytically impossible. EM is a numerical technique for maximum likelihood estimation and is employed when closed-form expressions can be computed for updating the model parameters. EM is an iterative technique with the convenient property that the likelihood of the data does not decrease from one iteration to the next, so it is guaranteed to approach a local maximum.

• EM for Gaussian mixture models
EM for mixture models consists of two steps. The first, the expectation step or E-step, computes the expectation of the component assignments for each data point given the current model parameters. The second, the maximization step or M-step, maximizes the expectations computed in the E-step with respect to the model parameters. Each step updates the corresponding values. The iterative process continues until the algorithm converges, giving a maximum likelihood estimate. Intuitively, the algorithm works because knowing the component assignment of each data point makes solving for the parameters easy, and knowing the parameters makes inferring the component assignments easy. The expectation step corresponds to the latter case, while the maximization step corresponds to the former.
Thus, by alternating between which values are assumed fixed, or known, estimates of the nonfixed values can be computed efficiently.
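The alternation between the E-step and the M-step can be sketched for the simplest case, a one-dimensional mixture of Gaussians. This is an illustrative sketch under stated assumptions, not a production implementation: the data, the starting parameters, the fixed number of iterations, the variance floor, and the function names are all hypothetical choices made for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: expected component assignment (responsibility) of each point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate each component from its responsibility-weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate zero variance
            weights[j] = nj / len(data)
    return means, variances, weights


# Hypothetical data: two well-separated groups around 1 and 5.
DATA = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
means, variances, weights = em_gmm_1d(
    DATA, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
print([round(v, 3) for v in means], [round(v, 3) for v in weights])
```

A fuller implementation would stop when the log-likelihood no longer improves instead of running a fixed number of iterations.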
3.2.2 Hierarchical clustering
The goal of hierarchical clustering is to produce a hierarchy of clusters, called a dendrogram, which shows how the clusters are related to each other. These techniques proceed either by iteratively merging small clusters into larger ones or by splitting large clusters. A partition of the data items can be obtained by cutting the dendrogram at the required level. Agglomerative techniques require a criterion for merging small clusters into larger ones. Most criteria concern the merging of pairs of clusters and are variants of the classical single-link, complete-link, or minimum-variance criteria. The single-link criterion is related to density-based techniques, but it often produces undesirable effects: clusters that are linked by a line of items cannot be separated, or most items are merged one at a time. The complete-link and minimum-variance criteria are related to squared-error techniques.
Several hierarchical techniques are based on density information and do not constrain the shape of the clusters. These methods often reflect the interest of the database community in handling large datasets and speeding up access. CURE uses several representatives per cluster to obtain clusters of arbitrary shape while avoiding the problems of the single-link criterion. OPTICS does not build an explicit clustering of the collection of items but rather produces an ordered representation of the data that reflects its clustering structure.
In a hierarchical classification the data are not partitioned into a particular number of clusters in a single step. Instead, the classification consists of a sequence of partitions that may run from a single cluster containing all individuals to n clusters each containing a single individual. Hierarchical clustering methods are divided into agglomerative techniques, which proceed by a series of successive fusions of the n individuals into groups, and divisive techniques, which separate the n individuals successively into finer groupings. Both kinds of hierarchical clustering can be viewed as attempting to find the optimal step at each stage. With hierarchical techniques, fusions or divisions once made are irrevocable: when an agglomerative technique has joined two individuals they cannot subsequently be separated, and when a divisive technique has made a split it cannot be undone. As an illustration, assume a single variable is measured on eight objects, giving the results (-2.2, -2, -1.8, -0.1, 0.1, 1.8, 2, 2.2). The data contain three clusters, but if the first split were into two clusters on the basis of the size of the t-statistic, the middle cluster (-0.1, 0.1) would be divided, producing the two clusters (-2.2, -2, -1.8, -0.1) and (0.1, 1.8, 2, 2.2). Recovering the three clusters would then require a nonhierarchical method: obtain a four-cluster solution and then merge its two middle clusters.
All agglomerative hierarchical methods ultimately reduce the data to a single cluster containing all individuals, and the divisive methods finally split the complete set of data into n groups, each consisting of a single individual. The structure of a hierarchical classification is shown in Fig. 3–5. The structure in the figure resembles an evolutionary tree, and it is in biological applications that hierarchical classifications are most relevant. Other domains in which hierarchical classifications are particularly suitable are social systems and librarianship.

• Agglomerative methods
Agglomerative methods are the most widely used of the hierarchical techniques. They produce a series of partitions of the data: the first consists of n single-member clusters, and the last consists of a single group containing all individuals. The basic operation of all these methods is similar and is illustrated by two specific examples, single linkage and centroid linkage. At each stage, the methods fuse the individuals or groups of individuals that are closest. The differences between the techniques arise from the different ways of defining the distance between an individual and a group containing several individuals, or between two groups of individuals.
FIGURE 3–5 An instance of the hierarchical technique.
• Agglomerative single-link clustering. Put each sample in its own cluster. Build a list of interpattern distances for all distinct unordered pairs of patterns and sort the list in ascending order. Step through the sorted list of distances, forming for each distinct dissimilarity value dk a graph on the patterns in which pairs of patterns closer than dk are connected by a graph edge. If all the patterns are members of a connected graph, stop; otherwise, repeat this step. The output of the algorithm is a nested hierarchy of graphs that can be cut at a desired dissimilarity level, forming a partition identified by the simply connected components in the corresponding graph.
• Descriptive instances of agglomerative techniques. Single linkage serves to illustrate the general procedure of a hierarchical method, and in the example below it is applied as an agglomerative method. However, it could equally well be applied as a divisive method, starting with a cluster containing all objects and then splitting it into the two clusters whose nearest-neighbor distance is a maximum. Assume the given distance matrix is:

$$D_1 = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{matrix}
\begin{pmatrix}
0.0 & & & & \\
2.0 & 0.0 & & & \\
6.0 & 5.0 & 0.0 & & \\
10.0 & 9.0 & 4.0 & 0.0 & \\
9.0 & 8.0 & 5.0 & 3.0 & 0.0
\end{pmatrix} \tag{3.8}$$
The smallest nonzero entry in the matrix is that for individuals 1 and 2, so these are joined to form a two-member cluster. The distances between this cluster and the other three individuals are:

$$d_{(12)3} = \min(d_{13}, d_{23}) = d_{23} = 5.0 \tag{3.9}$$

$$d_{(12)4} = \min(d_{14}, d_{24}) = d_{24} = 9.0 \tag{3.10}$$

$$d_{(12)5} = \min(d_{15}, d_{25}) = d_{25} = 8.0 \tag{3.11}$$

A new matrix is then formed whose entries are the inter-individual and cluster-individual distances:

$$D_2 = \begin{matrix} (12) \\ 3 \\ 4 \\ 5 \end{matrix}
\begin{pmatrix}
0.0 & & & \\
5.0 & 0.0 & & \\
9.0 & 4.0 & 0.0 & \\
8.0 & 5.0 & 3.0 & 0.0
\end{pmatrix} \tag{3.12}$$

The smallest entry in D2 is that for individuals 4 and 5, so these form a second two-member cluster, and a new set of distances is obtained:

$$d_{(12)3} = 5.0 \ \text{as before} \tag{3.13}$$

$$d_{(12)(45)} = \min(d_{14}, d_{15}, d_{24}, d_{25}) = d_{25} = 8.0 \tag{3.14}$$

$$d_{(45)3} = \min(d_{34}, d_{35}) = d_{34} = 4.0 \tag{3.15}$$

These are arranged in the matrix D3:

$$D_3 = \begin{matrix} (12) \\ 3 \\ (45) \end{matrix}
\begin{pmatrix}
0.0 & & \\
5.0 & 0.0 & \\
8.0 & 4.0 & 0.0
\end{pmatrix} \tag{3.16}$$
The smallest entry is now d(45)3, so individual 3 is added to the cluster containing individuals 4 and 5. Finally, the groups containing individuals 1, 2 and 3, 4, 5 are combined into a single cluster. The dendrogram illustrating the process, and the partitions produced at each stage, are shown in Fig. 3–6. The height in the diagram indicates the distance at which each fusion is made. Single linkage operates directly on the proximity matrix.

3. Centroid clustering
Another kind of clustering is centroid clustering, which requires access to the original data. To illustrate this type of method, it is applied to the following set of bivariate data. Table 3–10 gives the data for the centroid clustering example.
FIGURE 3–6 Dendrogram of worked example of single linkage showing partition at each step.
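The single-linkage example summarized in Fig. 3–6 can be replayed with a short sketch that starts from the distance matrix D1 of Eq. (3.8) and records the height of each fusion. The function and variable names are hypothetical, and the brute-force search over cluster pairs is chosen for readability rather than efficiency.

```python
# Lower triangle of the distance matrix D1 from Eq. (3.8); row i holds d(i+1, 1..i+1).
LOWER = [
    [0.0],
    [2.0, 0.0],
    [6.0, 5.0, 0.0],
    [10.0, 9.0, 4.0, 0.0],
    [9.0, 8.0, 5.0, 3.0, 0.0],
]


def single_linkage(lower):
    """Agglomerate with the nearest-neighbor rule; returns (cluster_a, cluster_b, height)."""
    def d(a, b):  # distance between individuals a and b (1-based labels)
        hi, lo = max(a, b), min(a, b)
        return lower[hi - 1][lo - 1]

    clusters = [(i,) for i in range(1, len(lower) + 1)]
    merges = []
    while len(clusters) > 1:
        # Cluster-to-cluster distance = smallest distance between any pair of members.
        height, a, b = min(
            (min(d(x, y) for x in a for y in b), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, height))
    return merges


merges = single_linkage(LOWER)
for a, b, height in merges:
    print(a, "+", b, "at distance", height)
```

The recorded heights correspond to the fusion levels read off the dendrogram.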
Table 3–10 An instance of centroid clustering.
Object    Variable-1    Variable-2
1         1.0           1.0
2         1.0           2.0
3         6.0           3.0
4         8.0           2.0
5         8.0           0.0
Assume Euclidean distance is chosen as the interobject distance measure, which gives the distance matrix:

$$C_1 = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{matrix}
\begin{pmatrix}
0.00 & & & & \\
1.00 & 0.00 & & & \\
5.39 & 5.10 & 0.00 & & \\
7.07 & 7.00 & 2.24 & 0.00 & \\
7.07 & 7.28 & 3.61 & 2.00 & 0.00
\end{pmatrix} \tag{3.17}$$

Examination of C1 shows that c12 is the smallest entry, so objects 1 and 2 are combined to form a group. The mean vector (centroid) of the group is computed as (1, 1.5), and a new Euclidean distance matrix is computed:

$$C_2 = \begin{matrix} (12) \\ 3 \\ 4 \\ 5 \end{matrix}
\begin{pmatrix}
0.00 & & & \\
5.22 & 0.00 & & \\
7.02 & 2.24 & 0.00 & \\
7.16 & 3.61 & 2.00 & 0.00
\end{pmatrix} \tag{3.18}$$
The smallest entry in C2 is c45, so objects 4 and 5 are fused to form a second group, whose mean vector is (8.0, 1.0). A further distance matrix C3 is computed:

$$C_3 = \begin{matrix} (12) \\ 3 \\ (45) \end{matrix}
\begin{pmatrix}
0.00 & & \\
5.22 & 0.00 & \\
7.02 & 2.83 & 0.00
\end{pmatrix} \tag{3.19}$$

In C3 the smallest entry is c(45)3, so objects 3, 4, and 5 are combined into a three-member cluster. The final stage consists of the fusion of the two remaining groups into one.
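The centroid clustering example can be reproduced with a sketch that works from the raw data of Table 3–10, since the method needs the original coordinates to recompute mean vectors after each fusion. The names below are hypothetical, and the distances are rounded to two decimals only to ease comparison with the matrices C1 to C3.

```python
import math

# Bivariate data from Table 3-10: object label -> (variable 1, variable 2).
OBJECTS = {1: (1.0, 1.0), 2: (1.0, 2.0), 3: (6.0, 3.0), 4: (8.0, 2.0), 5: (8.0, 0.0)}


def centroid_linkage(objects):
    """Repeatedly fuse the two clusters whose mean vectors are closest."""
    centroids = {(label,): coords for label, coords in objects.items()}
    merges = []
    while len(centroids) > 1:
        keys = list(centroids)
        distance, a, b = min(
            (math.dist(centroids[a], centroids[b]), a, b)
            for idx, a in enumerate(keys)
            for b in keys[idx + 1:]
        )
        merged = a + b
        del centroids[a], centroids[b]
        # The new centroid is the mean of the original points in the fused cluster.
        centroids[merged] = tuple(
            sum(objects[label][dim] for label in merged) / len(merged)
            for dim in range(2)
        )
        merges.append((a, b, round(distance, 2)))
    return merges


centroid_merges = centroid_linkage(OBJECTS)
for a, b, distance in centroid_merges:
    print(a, "+", b, "at centroid distance", distance)
```

The first three fusions match the smallest entries of C1, C2, and C3; the last line gives the distance between the centroids of the two remaining groups, which the text does not state.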
3.2.3 Hierarchical divisive algorithms
Hierarchical divisive techniques begin with a single cluster containing all the given objects and keep splitting clusters on the basis of some criterion until a partition into singleton clusters is obtained. Divisive techniques work in the opposite direction to agglomerative techniques: they begin with one large cluster and successively split it. They are computationally demanding if all $2^{k-1} - 1$ possible divisions of a cluster of k objects into two subclusters are considered. However, for data consisting of p binary variables, there are relatively simple and computationally efficient techniques known as monothetic divisive methods. These methods split clusters according to the presence or absence of each of the p variables, so that at each stage a cluster consists of members with certain attributes all present or all absent. The data for these techniques are in the form of a two-mode binary matrix.

• Monothetic divisive methods
The term monothetic refers to the use of a single variable on which to base the split at a given stage. The choice of variable in monothetic divisive techniques is based on optimizing a criterion that reflects either cluster homogeneity or association with the other variables. Divisive methods are attractive because the main structure of the data is revealed from the outset. Choosing the split variable in this way tends to minimize the number of splits that have to be made. An example of a homogeneity criterion is the information content C, a measure of disorder or chaos, defined by p variables and n objects:

$$C = pn \log n - \sum_{k=1}^{p} f_k \log f_k - \sum_{k=1}^{p} (n - f_k) \log (n - f_k) \tag{3.20}$$

where fk is the number of individuals having the kth attribute. If group X is to be divided into two sets A and B, then the reduction in C is CX − CA − CB. The ideal set of clusters has members with identical attributes and C = 0. Thus clusters are split at each stage according to possession of the attribute that leads to the greatest reduction in C. Instead of cluster homogeneity, the attribute used at each step can be chosen according to its overall association with all the attributes remaining at that step. This is also called association analysis, particularly in ecology. For instance, for one pair of variables Vi and Vj with values 0 and 1, the observed frequencies are shown in Table 3–11.
Table 3–11 Association analysis with one pair of variables Vi and Vj.
           Vi = 1    Vi = 0
Vj = 1     a         b
Vj = 0     c         d
Most familiar measures of association are as follows:

$$|ad - bc| \tag{3.21}$$

$$(ad - bc)^2 \tag{3.22}$$

$$\frac{(ad - bc)^2 \, n}{(a + b)(a + c)(b + d)(c + d)} \tag{3.23}$$

$$\sqrt{\frac{(ad - bc)^2 \, n}{(a + b)(a + c)(b + d)(c + d)}} \tag{3.24}$$

$$\frac{(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)} \tag{3.25}$$
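The two kinds of split criterion described in this subsection can be sketched in a few lines. The sketch assumes that Eq. (3.20) is read as C = pn log n − Σ fk log fk − Σ (n − fk) log (n − fk), with 0 log 0 taken as 0, and that a, b, c, and d are the cell counts of Table 3–11 with n = a + b + c + d. The function names, dictionary keys, and example counts are hypothetical.

```python
import math


def xlogx(value):
    """Return value * log(value), using the convention 0 * log(0) = 0."""
    return value * math.log(value) if value > 0 else 0.0


def information_content(table):
    """Information content C of Eq. (3.20) for a list of 0/1 attribute rows."""
    n, p = len(table), len(table[0])
    total = p * xlogx(n)
    for k in range(p):
        f_k = sum(row[k] for row in table)  # number of objects having attribute k
        total -= xlogx(f_k) + xlogx(n - f_k)
    return total


def association_measures(a, b, c, d):
    """Measures (3.21) to (3.25) from the 2 x 2 frequency counts."""
    n = a + b + c + d
    cross = a * d - b * c
    margins = (a + b) * (a + c) * (b + d) * (c + d)
    result = {"abs_cross": abs(cross), "squared_cross": cross ** 2}
    if margins > 0:  # the last three measures are undefined if a marginal total is zero
        result["chi_squared"] = cross ** 2 * n / margins
        result["sqrt_chi_squared"] = math.sqrt(result["chi_squared"])
        result["squared_correlation"] = cross ** 2 / margins
    return result


homogeneous = information_content([[1, 0], [1, 0], [1, 0]])
mixed = information_content([[1, 0], [0, 1]])
measures = association_measures(10, 2, 3, 15)  # hypothetical counts
print(homogeneous, round(mixed, 3), measures)
```

A cluster whose members share all attributes gives C = 0, and the guard on the marginal totals shows why the first two measures can always be computed while the last three cannot.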
The split at each stage is made according to the presence or absence of the attribute whose association with the others is a maximum. The first two criteria, (3.21) and (3.22), have the advantage that there is no risk of computational problems if any marginal total is zero. The last three, (3.23), (3.24), and (3.25), are related to the chi-squared statistic, its square root, and the Pearson correlation coefficient, respectively. The attractive features of monothetic divisive techniques are the easy classification of new members and the inclusion of cases with missing values. The latter can be handled as follows: if there are missing values for a particular variable V1, the nonmissing variable with the highest absolute association with it, say V2, is determined. The missing value for V1 is then replaced by the value of V2 for the same observation (positive association) or by 1 − V2 (negative association). A further benefit of monothetic divisive techniques is that it is clear which variable produces the split at any stage of the process. However, a general problem with these techniques is that the possession of a particular attribute that is either rare or rarely found in combination with others may take an individual down the “wrong” path. Typical uses of the method include medicine and mortuary studies in archaeology, where it can be argued that social stratum in life is reflected by the possession of grave goods. A newer divisive clustering technique has been devised for sequence data such as life-course histories. The method uses the logic of classification and regression tree (CART) analysis and borrows some of its useful features, such as tree pruning, to determine the number of clusters. Two new types of data are derived from the original sequences: state permanence sequences and auxiliary variables. The variables are defined so that dependent and independent variables can be distinguished. The characterization of the data into clusters is derived from the data sequences.
The splits in the method are made with the goal of producing pure clusters. In CART, purity is defined in terms of the dependent, or outcome, variable. In this method the impurity is defined as the summed optimal matching analysis (OMA) distance between all pairs of sequences. The auxiliary variables are used for splitting
the samples and can be defined in different ways depending on the subject matter. For instance, an auxiliary variable may be the time at which a specific state is reached for the first time, the second time, and so on. The divisive process works by selecting, at each split, the auxiliary variable that leads to the greatest improvement in within-cluster purity. This produces a tree of nodes that can be simplified by pruning back branches. As in CART, pruning comes at the expense of within-cluster purity, which must be balanced against increased simplicity. Once an acceptable solution has been found, an exemplar can be used to summarize each cluster with a representation that retains the key features of the original sequences; the state permanence sequence records the length of time spent in each state.
• Polythetic divisive techniques
Polythetic divisive techniques are similar to agglomerative techniques in that they use all variables simultaneously and can work from a proximity matrix. To illustrate the procedure, consider the following distance matrix for seven individuals:
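\[
D = \begin{pmatrix}
0 & & & & & & \\
10 & 0 & & & & & \\
7 & 7 & 0 & & & & \\
30 & 23 & 21 & 0 & & & \\
29 & 25 & 22 & 7 & 0 & & \\
38 & 34 & 31 & 10 & 11 & 0 & \\
42 & 36 & 36 & 13 & 17 & 9 & 0
\end{pmatrix}
\tag{3.26}
\]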
The individual used to initiate the splinter group is the one whose average distance from the other individuals is maximal. Here this is individual 1, giving the initial groups (1) and (2, 3, 4, 5, 6, 7). Next, for each individual in the main group, two quantities are computed: the average distance to the individuals in the splinter group (A) and the average distance to the other individuals in the main group (B). The differences B − A are shown in Table 3–12. The maximal difference is 16.4, for individual 3, which is therefore added to the splinter group, giving the two groups (1, 3) and (2, 4, 5, 6, 7). The process is repeated, as shown in Table 3–13. Individual 2 then joins the splinter group, giving the groups (1, 3, 2) and (4, 5, 6, 7), and the process is repeated once more, as shown in Table 3–14.
Table 3–12 Differences between the average distances of each individual in the main group to the splinter group and to the rest of the main group.

Individual in main group    Average distance to splinter group (A)    Average distance to main group (B)    B − A
2                           10.0                                      25.0                                   15.0
3                            7.0                                      23.4                                   16.4
4                           30.0                                      14.8                                  −15.2
5                           29.0                                      16.4                                  −12.6
6                           38.0                                      19.0                                  −19.0
7                           42.0                                      22.2                                  −19.8
Table 3–13 Obtained groups after processing.

Individual in main group    Average distance to splinter group (A)    Average distance to main group (B)    B − A
2                            8.5                                      29.5                                   21.0
4                           25.5                                      13.2                                  −12.3
5                           25.5                                      15.0                                  −10.5
6                           34.5                                      16.0                                  −18.5
7                           39.0                                      18.7                                  −20.3
Table 3–14 Final received splinter group and main groups.

Individual in main group    Average distance to splinter group (A)    Average distance to main group (B)    B − A
4                           24.7                                      10.0                                  −14.7
5                           25.3                                      11.7                                  −13.6
6                           34.3                                      10.0                                  −24.3
7                           38.0                                      13.0                                  −25.0
All differences are now negative, so no further individuals are moved to the splinter group. The process can then be continued (if desired) on each subgroup separately.
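The splinter-group procedure worked through in Tables 3–12 to 3–14 can be reproduced with a short script. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names (`dist`, `mean_dist`, `splinter_split`) are illustrative, and the distances are those of matrix (3.26).

```python
# Lower triangle of distance matrix (3.26); individuals are labeled 1..7.
LOWER = [
    [0],
    [10, 0],
    [7, 7, 0],
    [30, 23, 21, 0],
    [29, 25, 22, 7, 0],
    [38, 34, 31, 10, 11, 0],
    [42, 36, 36, 13, 17, 9, 0],
]


def dist(a, b):
    """Distance between individuals a and b (1-based labels)."""
    hi, lo = max(a, b), min(a, b)
    return LOWER[hi - 1][lo - 1]


def mean_dist(x, group):
    """Average distance from x to the members of group, excluding x itself."""
    others = [g for g in group if g != x]
    return sum(dist(x, g) for g in others) / len(others)


def splinter_split(items):
    """Perform one polythetic divisive split and return (splinter, main)."""
    main = list(items)
    # Seed the splinter group with the individual that is farthest,
    # on average, from all the others.
    seed = max(main, key=lambda x: mean_dist(x, main))
    splinter = [seed]
    main.remove(seed)
    while len(main) > 1:
        # B - A for every individual still in the main group.
        diffs = {x: mean_dist(x, main) - mean_dist(x, splinter) for x in main}
        best = max(diffs, key=diffs.get)
        if diffs[best] <= 0:
            break  # every individual is closer to the main group: stop
        splinter.append(best)
        main.remove(best)
    return splinter, main


splinter, main = splinter_split(range(1, 8))
print("splinter group:", splinter)
print("main group:", main)
```

Running the sketch moves individual 3 and then individual 2 into the splinter group started by individual 1, and stops once every difference B − A is negative. Applying `splinter_split` again to either subgroup would continue the division.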
3.3 Conclusion
This chapter presented a taxonomy of unsupervised learning methods for clustering data into groups. In general, unsupervised learning methods group cases on the basis of similar attributes, and they include clustering techniques that split the data into groups in different ways. Several examples of unsupervised learning algorithms were outlined. The chapter described the intuitive idea behind cluster analysis and the fundamental steps of a complete cluster analysis through which the final result is obtained. In addition, the application of clustering methods was described with a step-by-step process for finding cluster centroids. Interest in the subject has been renewed by the proverbial information explosion, which requires improved techniques for understanding the huge volumes of data being produced in different settings. Part of this attention stems from the fact that abstracting the problems into graph-theoretic terms allows them to be stated crisply in a setting with theoretical underpinning. These strong foundations have allowed several clustering techniques to be designed and new cluster analysis strategies to be devised.
4 Heuristic methods for data clustering
Rajasekhar Butta¹, M. Kamaraju¹, V. Sumalatha²
¹ ECE DEPARTMENT, GUDLAVALLERU ENGINEERING COLLEGE, GUDLAVALLERU, INDIA
² ECE DEPARTMENT, JNTUACEA, ANANTAPURAMU, INDIA
4.1 What is the heuristic method?
Heuristic methods are strategies for finding the solution to an issue. The term derives from the ancient Greek word “eurisko,” which means to find, determine, or search. A heuristic is a sensible technique that does not necessarily need to be perfect. Heuristic strategies are used in particular to speed up the process of reaching an acceptable solution. They draw on earlier experience with analogous issues, whether those issues concern people, machines, or abstract questions. The founder of heuristics is the Hungarian mathematician György Pólya, who formulated several principles for solving issues.
1. Heuristic method
The principles used in heuristic methods are portrayed in Fig. 4–1, and a detailed description is given below:
   1. Try to comprehend the issue
   2. Formulate a plan
   3. Carry out the plan
   4. Analyze and adapt
2. The first principle considered in the heuristic method is to comprehend the issue
This is more difficult than it seems, precisely because it appears obvious. In reality, people often struggle to find a suitable initial approach to the issue. It helps to draw the issue and to look at it from different angles. The following questions should be asked: What is the issue? What is happening? Can the issue be stated in other words? Is adequate information available? These questions are used to evaluate the issue.
3. The second principle considered in the heuristic method is to formulate a plan
There are numerous ways of addressing an issue. This step is about selecting the strategy that best fits the issue at hand. Working backward can help: one assumes a solution and uses it as the starting point for working toward the issue. It can also be beneficial to outline the possibilities,
DOI: https://doi.org/10.1016/B978-0-12-820601-0.00003-3 © 2021 Elsevier Inc. All rights reserved.
FIGURE 4–1 Principles utilized in the heuristic methods.
eliminate some of them immediately, work with comparisons, and apply symmetry. Creativity also comes into play here and can improve the capacity for judgment.
4. The third principle considered in the heuristic method is to carry out the plan
Once a strategy has been selected, the plan can be carried out quickly to obtain a solution. However, it is essential to pay attention to time and to be patient, as the solution will not always appear immediately. If the plan does not work, it is advisable to devise a new plan.
5. The fourth principle considered in the heuristic method is to evaluate and adapt
Take ample time to analyze and reflect carefully on the work that has already been done. Things that are working well should be maintained, and those that lead to poorer solutions should be adjusted. Some methods work straightforwardly, whereas others need more processing time to yield improved results.
4.1.1 Heuristic method and the formulation of exact solutions
A heuristic strategy is a mathematical strategy that delivers a good solution to an issue. A huge number of different issues could benefit from good solutions. Heuristic methods are used when the speed of processing is as important as the quality of the solution. Heuristic strategies try to find a good, but not necessarily optimal, solution. This distinguishes them from exact solution techniques, which aim to determine the optimum solution to the issue. Exact techniques, however, are time-consuming, which often makes a heuristic method preferable. A heuristic is much faster and more flexible than an exact method, but it has to satisfy a number of criteria. First, the heuristic should require a realistic computational effort. Furthermore, there should
be a high probability that the solution is realistic, and lastly, the probability of producing a bad solution should be low. Apart from the need to find good solutions to complex issues in little time, there are other reasons for using heuristic strategies:
• No technique is known that solves the issue to optimality.
• Even though an exact technique exists, it cannot be run on the available hardware.
• Heuristic strategies are more flexible than exact methods, which allows conditions that are difficult to model to be incorporated.
• Heuristic strategies are used as part of a global procedure that guarantees finding the optimal solution of the issue.
A good heuristic algorithm must have the following characteristics:
• The solution can be generated with reasonable computational effort.
• The solution should be near-optimal (with high probability).
• The likelihood of generating a very bad solution (far from optimal) must be low.
Heuristic techniques are very diverse in nature, so it is difficult to provide a full classification. Moreover, most of them were devised for a particular issue, with no possibility of generalization or application to other similar issues. The following scheme provides broad categories into which the best-known heuristics can be placed:
1. Dividing strategy
The original issue is split into smaller subissues that can be solved more easily. These subissues may be linked with each other, and their solutions are combined to solve the original issue.
2. Inductive strategy
This involves issues that have already been solved but that are smaller than the original issue. Generalizations are drawn from the previously solved issues, which helps to solve the larger original issue. The fundamental idea is to generalize from smaller or simpler versions to the complete case. Properties or techniques identified in these simpler cases, which are easier to analyze, can be adapted to solve the whole issue.
3. Reduction strategy
Because issues are often larger than assumed and are subject to numerous causes and factors, this technique sets limits on the issue in advance. This reduces the freedom of the original issue and thereby makes it simpler to solve. It involves identifying properties that good solutions mainly satisfy and imposing them as boundaries of the issue. The goal is to restrict the solution space so that the issue becomes easier to solve. The obvious risk is that the optimal solution of the original issue is left out.
4. Constructive strategy
This is about working on the issue step by step. A small partial solution is treated as a first success, and successive steps build on it. In this way, the best choice is made at each step, which ultimately leads to a successful end result. These methods build a solution to the issue from scratch, step by step. They are usually deterministic and tend to be based on the best choice at each iteration. These techniques are widely used in classical combinatorial optimization.
5. Local search strategy
This is about searching for the best solution to the issue, improving the solution along the way. In contrast to the previous methods, local search (or local improvement) starts from some feasible solution of the issue and tries to improve it progressively. Each step moves from one solution to another with a better value. The technique stops when no accessible solution improves on the current one.
6. Decomposition strategy
The original issue is split into subissues that are easier to solve, bearing in mind that the subproblems belong to the same class of problems.
There is a wide variety of metaheuristics, and a number of properties by which they can be classified.
4.1.2 Clustering-based heuristic methodologies
Clustering is the division of data, or of a set of objects, into homogeneous clusters and is a basic operation in data science. It is needed in many data analysis tasks, including unsupervised classification, data summarization, and the segmentation of large heterogeneous datasets into smaller homogeneous subsets that can be easily managed, modeled, and analyzed. The heuristic methods employed for clustering data are described in this section.
1. Local search versus global search
One way to classify a metaheuristic is by its search strategy. One type of search strategy is an improvement on simple local search algorithms. The best-known local search algorithm is hill climbing, which is used to find local optima. However, hill climbing does not guarantee globally optimal solutions. Numerous metaheuristic ideas have been proposed to improve local search heuristics so that they find better solutions. These metaheuristics include iterated local search and Tabu search.
• Hill climbing
In numerical analysis, hill climbing is a mathematical optimization method that belongs to the local search family. It is an iterative technique that begins with an arbitrary solution to an issue and then tries to find a better solution by making
incremental changes to the solution. If a change produces a better solution, another incremental change is made to the new solution, and so on until no further improvement can be found. For instance, hill climbing can be applied to the traveling salesman issue. It is easy to find an initial solution that visits all the cities, but it is likely to be very poor compared with the optimum. The technique starts from such a solution and makes small improvements to it, such as switching the order in which two cities are visited, which produces shorter routes. Hill climbing finds optimal solutions for convex issues; for other issues it finds only local optima, that is, solutions that cannot be improved by any neighboring configuration but that are not necessarily the best possible solution (the global optimum) among all possible solutions. To avoid getting stuck in local optima, one can use restarts (repeated local search), memory-less stochastic modifications, or more complex schemes.
Hill-climbing methods try to maximize (or minimize) a target function $f(x)$, where $x$ is a vector of discrete or continuous values. At each iteration, hill climbing adjusts a single element of $x$ and checks whether the change improves the value of $f(x)$. This differs from gradient descent techniques, which adjust all values of $x$ at each iteration according to the gradient of the hill. In hill climbing, any change that improves $f(x)$ is accepted, and the process repeats until no change can be found that improves the value of $f(x)$. At that point $x$ is said to be locally optimal. In discrete vector spaces, each possible value of $x$ can be modeled as a vertex of a graph. Hill climbing then follows the graph from vertex to vertex, always increasing (or decreasing) the value of $f(x)$, until a local maximum (or local minimum) is reached.
Steps of the hill-climbing algorithm for clustering:
• First, a set of $N$ clusters is defined for the data.
• The initial hill climb begins by placing each data item randomly in one of the $N$ clusters.
• The clustering is then evaluated using modularization quality (MQ), which rewards high cohesion within clusters and low coupling between clusters.
• At each stage of the algorithm, each hill climber tries to move to a nearest-neighbor clustering with a higher MQ. MQ is given by
\[
\mathrm{MQ} = \sum_{k=1}^{n} \frac{i_k}{i_k + \frac{1}{2} j_k}
\]
where $k$ indexes the clusters, $n$ is the total number of clusters, $i_k$ is the sum of the inner edge weights of cluster $k$, and $j_k$ is the sum of its outer edge weights.
• The nearest neighbors of a clustering are generated by moving a single data item from one cluster to another.
• Once a fitter neighbor is found, the hill climber continues its search from that neighbor, that is, from the newly formed clustering.
• The search terminates when none of the nearest neighbors of the current clustering yields a better MQ.
• Tabu search
Fred W. Glover devised Tabu search, a metaheuristic search method that employs local search methods for mathematical optimization. Local searches take a potential solution to an issue and check its immediate neighbors in the hope of finding an improved solution. Local search methods tend to become trapped in suboptimal regions or on plateaus where many solutions are equally fit. Tabu search improves the performance of local search by relaxing its basic rule. First, at each step, worsening moves can be accepted if no improving move is available. In addition, prohibitions are introduced to discourage the search from returning to previously visited solutions. Tabu search is implemented with memory structures that describe the visited solutions or user-provided sets of rules.
Tabu search is based on a procedure that crosses the boundaries of feasibility or local optimality, which are usually treated as barriers, and releases constraints to permit exploration. It thus allows a local heuristic search to explore the solution space beyond local optimality. A basic feature of Tabu search is its use of flexible memory structures. The main way of using memory in Tabu search is to classify a subset of the moves in a neighborhood as tabu. The basic elements of Tabu search are as follows:
1. Configuration: an assignment of values to the variables; it is a solution of the optimization issue.
2. Move: a specific procedure for obtaining a trial solution that is feasible for the optimization issue and is related to the current configuration.
3. Neighborhood: the set of all neighbors, that is, the adjacent solutions that can be reached from the current configuration. It may include neighbors that do not satisfy the required feasibility conditions.
4. Candidate subset: a subset of the neighborhood that is examined instead of the complete neighborhood, particularly for large issues in which the neighborhood has many elements.
5. Tabu restrictions: constraints that prevent selected moves from being repeated or reversed. They act as the memory of the search by marking the forbidden moves as tabu. The tabu moves are stored in a list called the Tabu list.
6. Aspiration criteria: rules that determine when tabu restrictions can be overridden, removing the tabu classification otherwise applied to a move. If a move is forbidden by a tabu restriction, satisfying the aspiration criteria makes the move allowable.
The clustering algorithm based on Tabu search is given in Table 4–1.
• Iterated local search
Iterated local search produces a sequence of solutions in the search space by iterating two operators, a local search operator and a perturbation operator. Here, the local search
Table 4–1 Algorithmic steps of Tabu search algorithm.

Step 1: Initialization. Let $Z^u$ be the random centers and $F^u$ the corresponding objective function value. Choose values for the following parameters: NTLM, the Tabu list size; $P$, the probability threshold; NH, the number of trial solutions; IMAX, the maximum number of iterations; and $\gamma$, the iteration reducer. Let $h = 1$, NTL $= 0$, and $r = 1$. Go to step 2.
Step 2: Using $Z^u$, fix all centers and move center $z_r^u$ by producing NH neighbors $z_1^t, z_2^t, \ldots, z_{NH}^t$, and compute their fitness values $F_1^t, F_2^t, \ldots, F_{NH}^t$. Go to step 3.
Step 3:
(a) Sort $F_i^t$, $i = 1, \ldots, NH$, in ascending order and denote them $F_{[1]}^t, \ldots, F_{[NH]}^t$. Let $e = 1$. If $F_{[1]}^t \ge F^b$, then replace $h$ by $h + 1$. Go to step 3(b).
(b) If $z_{[e]}^t$ is not tabu, or if it is tabu but $F_{[e]}^t < F^b$, then let $z_r^u = z_{[e]}^t$ and $F^u = F_{[e]}^t$ and go to step 4. Otherwise, generate $u$ from a uniform distribution between 0 and 1. If $F^b < F_{[e]}^t < F^u$ and $u > P$, then let $z_r^u = z_{[e]}^t$ and $F^u = F_{[e]}^t$ and go to step 4; otherwise go to step 3(c).
(c) Check the next neighbor by letting $e = e + 1$. If $e \le NH$, go to step 3(b); otherwise go to step 3(d).
(d) If $h > \mathrm{IMAX}$, go to step 5; otherwise choose a new set of neighbors by going to step 2.
Step 4: Insert $z_r^u$ at the bottom of the Tabu list. If NTL $=$ NTLM, remove the top of the Tabu list; otherwise let NTL $=$ NTL $+ 1$. If $F^b > F^u$, then let $F^b = F^u$ and $Z^b = Z^u$. Go to step 3.
Step 5: If $r < k$, then let $r = r + 1$, reset $h = 1$, and go to step 2. Otherwise set $\mathrm{IMAX} = \gamma(\mathrm{IMAX})$. If $\mathrm{IMAX} > 1$, then let $r = 1$, reset $h = 1$, and go to step 2; otherwise stop.
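To make the roles of the Tabu list and the aspiration criterion concrete, the following Python sketch implements a deliberately simplified variant of the procedure in Table 4–1 for one-dimensional data. It is an illustration under stated assumptions rather than the algorithm of the table: the probability threshold $P$ and the iteration reducer $\gamma$ are omitted, a trial center is treated as tabu when it lies within a hypothetical distance `radius` of a recently accepted center, and all function and parameter names are illustrative.

```python
import random
from collections import deque


def objective(points, centers):
    """Sum of squared distances from each 1-D point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


def tabu_search(points, k, nh=10, tabu_size=5, imax=200,
                step=1.0, radius=0.05, seed=0):
    rng = random.Random(seed)
    current = rng.sample(points, k)              # current centers (Z^u)
    f_current = objective(points, current)       # F^u
    best, f_best = list(current), f_current      # best centers so far (Z^b, F^b)
    tabu = deque(maxlen=tabu_size)               # oldest entry is dropped when full
    for h in range(imax):
        r = h % k                                # index of the center to move
        # Generate NH trial solutions by perturbing center r only.
        scored = []
        for _ in range(nh):
            z = current[r] + rng.uniform(-step, step)
            trial = current[:r] + [z] + current[r + 1:]
            scored.append((objective(points, trial), z))
        scored.sort()                            # ascending objective value
        for f_trial, z in scored:
            is_tabu = any(abs(z - t) < radius for t in tabu)
            # Aspiration: a tabu move is allowed if it beats the best so far.
            if not is_tabu or f_trial < f_best:
                current[r], f_current = z, f_trial   # may be a worsening move
                tabu.append(z)
                break
        if f_current < f_best:
            best, f_best = list(current), f_current
    return sorted(best), f_best


points = [0.9, 1.0, 1.1, 4.8, 5.0, 5.2]
centers, f_best = tabu_search(points, k=2)
print(centers, f_best)
```

Because the best admissible neighbor is accepted even when it is worse than the current solution, the search can leave a local optimum, while the Tabu list discourages it from immediately returning to positions it has just visited.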
Table 4–2 Algorithmic steps of iterated local search algorithm.

Procedure ILS($P$)
    Select initial solution $x_0$;
    Initialize the optimal solution $x^* = \mathrm{LocalSearch}(P, x_0)$;
    Set proceed = True;
    while (proceed) do
        Choose a neighborhood structure $N: S \to \rho(S)$;
        $x = \mathrm{Perturb}(N(x^*), \mathrm{history})$;
        $x' = \mathrm{LocalSearch}(P, x)$;
        $x^* = \mathrm{AcceptanceCriterion}(x^*, x', \mathrm{history})$;
        Update proceed;
    end while
    Return $x^*$
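The procedure of Table 4–2 can be sketched in Python for a small clustering task. The sketch below rests on several assumptions that are not taken from the table: a solution is a list of cluster labels, the objective is the within-cluster sum of squared distances, the local search operator is a simple hill climber that moves one object at a time, the perturbation randomly reassigns a few objects, and the acceptance criterion keeps only improvements. All names are illustrative.

```python
import random


def sse(points, labels, k):
    """Sum of squared Euclidean distances of points to their cluster centroids."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


def local_search(points, labels, k):
    """Hill climbing: move one object at a time while the objective improves."""
    labels = list(labels)
    best = sse(points, labels, k)
    improved = True
    while improved:
        improved = False
        for i in range(len(points)):
            keep = labels[i]                 # best label found so far for object i
            for c in range(k):
                if c == keep:
                    continue
                labels[i] = c
                cost = sse(points, labels, k)
                if cost < best - 1e-12:
                    best, keep, improved = cost, c, True
            labels[i] = keep
    return labels, best


def perturb(labels, k, rng, strength=2):
    """Randomly reassign a few objects to escape the current local optimum."""
    labels = list(labels)
    for i in rng.sample(range(len(labels)), strength):
        labels[i] = rng.randrange(k)
    return labels


def iterated_local_search(points, k, iterations=50, seed=0):
    rng = random.Random(seed)
    x0 = [rng.randrange(k) for _ in points]          # initial solution
    best, best_cost = local_search(points, x0, k)    # x* = LocalSearch(P, x0)
    for _ in range(iterations):
        candidate, cost = local_search(points, perturb(best, k, rng), k)
        if cost < best_cost:                         # acceptance criterion
            best, best_cost = candidate, cost
    return best, best_cost


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, cost = iterated_local_search(points, k=2)
print(labels, round(cost, 3))
```

On this toy dataset the two well-separated groups of points end up in different clusters. Other acceptance criteria, for example occasionally accepting a worse candidate, fit into the same loop.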
operator is used to reach local optima, and the perturbation operator is used to escape from bad local optima. The algorithmic steps of the iterated local search algorithm are given in Table 4–2. A similar search method is known as fixed neighborhood search.
2. Single-solution versus population-based searches
Another dimension of classification is single-solution versus population-based search. Single-solution methods concentrate on modifying and improving a single candidate solution; they include variable neighborhood search (VNS), simulated annealing
(SA), and guided local search (GLS). Population-based methods maintain and improve multiple candidate solutions, using population characteristics to guide the search; they include particle swarm optimization (PSO) and evolutionary computation. Another class of metaheuristics is swarm intelligence, which is the collective behavior of self-organized agents in a swarm. Swarm intelligence methods include PSO, ant colony optimization (ACO), and social cognitive optimization.
• Simulated annealing
SA is a probabilistic method for approximating the global optimum of a given function. Specifically, it is a metaheuristic for approximating global optimization in a large search space, and it is often used when the search space is discrete. For issues in which finding an approximate global optimum within a fixed amount of time is more important than finding a precise local optimum, SA may be preferable to alternatives such as gradient descent. SA is inspired by annealing in metallurgy, in which heating and cooling are controlled to increase the crystal size and reduce the defects. Both are attributes of the material that depend on its thermodynamic free energy. Heating and cooling the material affects both the temperature and the thermodynamic free energy, and the same idea can be used to find the global minimum of a function of several variables. This connection between the algorithm and mathematical minimization is the basis of an optimization method for combinatorial (and other) problems. The advantage of SA over other techniques is its ability to avoid becoming trapped in local minima. The technique uses a random search that accepts not only changes that decrease the objective function but also some changes that increase it. The basic elements of SA are:
1. Configuration: a candidate solution.
2. Set of candidate moves: the feasible directions of the search.
3. Move: the chosen feasible direction of the search.
4. Annealing schedule: the initial temperature and the cooling speed.
5. Stopping criterion: termination.
SA has been applied to partitional clustering to solve the multisensor fusion issue. The clustering technique starts with an initial solution $x$, which represents the cluster centroids, and a large initial temperature $T$. The fitness $f(x)$ of the initial solution represents the internal energy of the system. The heuristic either moves to a new solution $x'$ chosen from the neighborhood or remains in the old state $x$, according to an acceptance probability function.
• Variable neighborhood search
VNS is a technique for solving global optimization and combinatorial issues. The basic idea is a systematic change of neighborhood within a local search algorithm. The set of neighborhoods is usually induced from a metric function defined on the solution space. The technique centers the search around the same solution until another solution better than the incumbent is found, and then recenters the search there.
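Returning to simulated annealing for partitional clustering, the acceptance step can be illustrated with a short Python sketch. The text does not specify the acceptance probability function, so the sketch assumes the widely used Metropolis rule, a geometric cooling schedule, one-dimensional data, and a Gaussian nudge of one centroid as the neighborhood move; all names and parameter values are illustrative.

```python
import math
import random


def sse(points, centers):
    """Fitness f(x): sum of squared distances from each point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


def acceptance_probability(f_old, f_new, temperature):
    """Metropolis rule (assumed): 1 for improvements, exp(-increase / T) otherwise."""
    if f_new <= f_old:
        return 1.0
    return math.exp((f_old - f_new) / temperature)


def simulated_annealing(points, k, t0=10.0, cooling=0.95, steps=500, seed=0):
    rng = random.Random(seed)
    x = rng.sample(points, k)            # initial solution: k cluster centroids
    fx = sse(points, x)                  # internal energy of the initial state
    best, f_best = list(x), fx
    t = t0                               # large initial temperature
    for _ in range(steps):
        y = list(x)
        y[rng.randrange(k)] += rng.gauss(0.0, 0.5)   # neighbor: nudge one centroid
        fy = sse(points, y)
        # Move to the new state or stay in the old one.
        if rng.random() < acceptance_probability(fx, fy, t):
            x, fx = y, fy
            if fx < f_best:
                best, f_best = list(x), fx
        t *= cooling                     # geometric cooling schedule (assumed)
    return sorted(best), f_best


points = [0.9, 1.0, 1.1, 4.8, 5.0, 5.2]
centroids, f_best = simulated_annealing(points, k=2)
print(centroids, f_best)
```

At high temperature almost every move is accepted, which lets the search leave poor regions; as the temperature falls, the rule approaches plain hill climbing.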
• Guided local search
GLS is a metaheuristic search method that sits on top of a local search algorithm. GLS builds up penalties during the search and uses them to help the local search algorithm escape from local minima and plateaus. Whenever the given local search algorithm settles in a local optimum, GLS modifies the objective function using a specific scheme. The local search then operates on an augmented objective function, which is designed to bring the search out of the local minimum. To apply GLS, solution features must be defined for the given issue. Solution features are defined to distinguish between solutions with different characteristics, so that regions of similarity around local optima can be recognized and avoided. The choice of solution features depends on the type of issue and, to a certain extent, on the local search algorithm. For each feature $F_i$, a cost function $C_i$ is defined. Each feature is also associated with a penalty $P_i$, initially set to 0, which records the number of times the feature has occurred in local minima. The features and costs often come directly from the objective function. For instance, in the traveling salesman issue, "whether the tour goes directly from city X to city Y" can be defined as a feature, and the distance between city X and city Y as its cost. In SAT and weighted MAX-SAT issues, the features can be "whether clause C is satisfied by the current assignment." At the implementation level, an indicator function $I_i$ is defined for each feature $i$, indicating whether the feature is present in the current solution: $I_i = 1$ if the solution $x$ has property $i$, and 0 otherwise.
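The penalty mechanism described above can be sketched in a few lines of Python. The text does not give the formulas, so the sketch assumes the usual GLS formulation: an augmented cost equal to the original cost plus a weight (here called `lam`) times the penalties of the features present in the solution, and a utility $C_i / (1 + P_i)$ that selects which features to penalize at a local optimum. The three-city tour, its edge costs, and all names are made up for illustration.

```python
def augmented_cost(cost, features, penalties, lam):
    """Original cost plus lam times the penalties of the features present."""
    return cost + lam * sum(penalties[f] for f in features)


def penalize(features, costs, penalties):
    """At a local optimum, increment the penalty of the feature(s) with maximal utility."""
    utility = {f: costs[f] / (1 + penalties[f]) for f in features}
    top = max(utility.values())
    for f, u in utility.items():
        if u == top:
            penalties[f] += 1


# Traveling salesman illustration: features are edges, costs are edge lengths.
costs = {("X", "Y"): 5.0, ("Y", "Z"): 2.0, ("Z", "X"): 4.0}
penalties = {edge: 0 for edge in costs}        # every penalty starts at 0
tour = list(costs)                             # features present in the local optimum
tour_length = sum(costs[e] for e in tour)

penalize(tour, costs, penalties)               # the longest edge X-Y is penalized first
h_after_first = augmented_cost(tour_length, tour, penalties, lam=0.3)

penalize(tour, costs, penalties)               # X-Y now has utility 2.5, so Z-X is chosen
print(penalties, h_after_first)
```

Dividing the cost by one plus the current penalty keeps the scheme from penalizing the same expensive feature indefinitely, so the penalties spread over several features as the search revisits local optima.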
• Ant colony optimization
This section describes an ant algorithm for the clustering issue, in which the goal is to obtain an optimal assignment of $N$ objects in $\mathbb{R}^n$ to one of $K$ clusters such that the sum of squared Euclidean distances between each object and the center of the cluster to which it belongs is minimized. The technique uses $R$ agents to build solutions. An agent starts with an empty solution string $S$ of length $N$, in which each element corresponds to one of the test samples. The value assigned to an element of $S$ is the number of the cluster to which that test sample is assigned. For instance, a representative solution string for $N = 8$ and $K = 3$ is 21322321: the first element is assigned to cluster 2, the second to cluster 1, and so on. To construct a solution, the agent uses the pheromone trail information to assign each element of $S$ to a suitable cluster label. At the start, the pheromone matrix $\tau$ is initialized to a small value $\tau_0$. The trail value $\tau_{i,j}$ at location $(i, j)$ is the pheromone concentration of sample $i$ associated with cluster $j$. For the issue of separating $N$ samples into $K$ clusters, the pheromone matrix has size $N \times K$, so each sample is associated with $K$ pheromone concentrations. The pheromone trail matrix evolves as the number of iterations increases.
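The solution string and the pheromone matrix described above map directly onto simple data structures. The Python sketch below decodes the example string 21322321 and lets one agent build a trial solution. The selection rule is an assumption: each sample is assigned to a cluster with probability proportional to its pheromone concentration, and the matrix is started at a small positive value (`TAU0`, hypothetical) because proportional sampling needs nonzero weights. All names are illustrative.

```python
import random

N, K = 8, 3          # number of samples and clusters, as in the example string
TAU0 = 0.01          # assumed small initial pheromone value


def build_solution(pheromone, rng):
    """One agent assigns each sample to a cluster label in 1..K."""
    solution = []
    for row in pheromone:                       # K concentrations for one sample
        labels = list(range(1, len(row) + 1))
        # Assumed rule: probability proportional to the pheromone concentration.
        solution.append(rng.choices(labels, weights=row, k=1)[0])
    return solution


def decode(solution, k):
    """Map each cluster label to the (1-based) samples assigned to it."""
    return {c: [i + 1 for i, label in enumerate(solution) if label == c]
            for c in range(1, k + 1)}


# The N x K pheromone matrix: one row per sample, one column per cluster.
pheromone = [[TAU0] * K for _ in range(N)]

example = [int(ch) for ch in "21322321"]        # solution string from the text
members = decode(example, K)
print(members)

rng = random.Random(0)
trial = build_solution(pheromone, rng)          # uniform choice while all trails are equal
print(trial)
```

While all trail values are equal, every cluster is equally likely for every sample; once the matrix has been updated, larger entries in a row make the corresponding cluster more likely to be chosen for that sample.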
At each iteration, the agents (software ants) develop such trial solutions using pheromone-mediated communication, with a view to obtaining a near-optimal partition of the given $N$ test samples into $K$ groups that satisfies the defined objective. After a population of $R$ trial solutions has been generated, a local search is carried out to improve the fitness of the solutions. The pheromone matrix is then updated according to the quality of the solutions produced by the agents. Using the updated pheromone matrix, the agents construct improved solutions, and these steps are repeated for a fixed number of iterations. Fig. 4–2 shows the flowchart of the ACO algorithm for data clustering.
3. Hybridization
A hybrid metaheuristic combines a metaheuristic with other optimization approaches, such as mathematical programming, constraint programming, and machine learning. The two components of a hybrid metaheuristic may run concurrently and exchange information to guide the search. Memetic algorithms, on the other hand, represent the synergy of evolutionary or other population-based approaches with separate individual learning or local improvement procedures for problem search. An example of a memetic algorithm is the use of a local search algorithm instead of a basic mutation operator in an evolutionary algorithm. In recent years, numerous clustering techniques based on evolutionary techniques such as genetic algorithms, Tabu search, and SA have been proposed. However, most of these techniques, such as Tabu search and genetic algorithms, are slow to find the optimal solution. Researchers have therefore developed newer evolutionary techniques, such as particle swarm and ant colony algorithms, for hard optimization issues; these not only give better results but also converge more quickly than conventional evolutionary techniques. Studies have indicated that PSO should be combined with other effective methods so that it can handle different types of nonlinear optimization issues.
• Hybrid PSO and K-means clustering algorithm
The K-means algorithm tends to converge faster than PSO, but usually with a less accurate clustering. The performance of PSO can therefore be improved by seeding the initial swarm with the result of the K-means algorithm. The hybrid technique first runs the K-means algorithm once. K-means is stopped when the maximum number of iterations is exceeded or when the average change in the centroid vectors is less than a user-specified parameter. The result of the K-means algorithm is used as one of the particles, while the rest of the swarm is initialized randomly, and PSO then searches for the global best solution.
• Hybrid PSO-ACO in clustering
The hybrid evolutionary optimization technique based on ACO and PSO, called PSO-ACO, optimally clusters $N$ objects into $K$ clusters; it not only gives better results but also converges more rapidly than classical evolutionary techniques. Combining ACO and PSO provides an intelligent decision-making structure for solving optimization issues and finding global best solutions. The application of PSO-ACO to data clustering proceeds as follows:
Chapter 4 • Heuristic methods for data clustering 75
FIGURE 4–2 Flowchart of the ACO algorithm for data clustering. ACO, ant colony optimization.
Step 1: Produce the initial population and initial velocity. The initial population and the initial velocity of each particle are generated randomly.
Step 2: Produce the intensity of the initial trails. In the initialization phase, the trail intensity between each pair of swarms is the same.
Step 3: Evaluate the fitness function. The fitness value is computed for each individual.
Step 4: Sort the initial population. The individuals are arranged in ascending order of fitness value.
Step 5: Choose the global best solution. The individual with the minimal fitness value is chosen as the global best solution (Gbest).
Step 6: Choose the best local position. The best local position (Pbest) is chosen for each individual.
Step 7: Choose the ith individual. The ith individual is selected, and the neighbors of the particle are defined dynamically.
Step 8: Evaluate the next position of the ith individual. The new position of the ith individual is checked against the limits.
Step 9: If all individuals have been chosen, go to the next step; otherwise set i = i + 1 and go back to Step 7.
Step 10: Check the termination criterion. If the current iteration number reaches the predetermined maximum, the search process is stopped; otherwise the initial population is replaced with the new population of swarms and the process goes back to Step 3. The last Gbest obtained is taken as the solution to the problem.

4. Parallel metaheuristics
This section briefly reviews the conventional parallel models described in the literature. The review matters because new methods for parallel metaheuristics frequently use these conventional models as their foundation. It is helpful to distinguish between population-based and trajectory-based metaheuristics, as the parallel models adopted for each are completely different. For trajectory-based techniques, parallelism is the principal way to attain high computational efficiency.
Numerous parallel models have been devised for trajectory-based metaheuristics, and three of them are commonly used in the literature:
1. the parallel evaluation and exploration of the neighborhood (or parallel moves model);
2. the parallel multistart model; and
3. the parallel evaluation of a single solution (or move acceleration model).

1. Parallel moves model
This is a low-level model that does not change the behavior of the method. A sequential search would compute the same result, only more slowly. At the beginning of each iteration, the master copies the current solution to the distributed nodes. Each node handles its copy independently, and the results are returned to the master.

2. Parallel multistart model
This model launches several trajectory-based methods concurrently to obtain better and more robust solutions. The methods can be homogeneous or heterogeneous, cooperative or independent, can start from the same or different solutions, and can be configured with the same or different parameters.

3. Move acceleration model
The quality of each move is evaluated in a parallel, centralized way. This model is particularly interesting when the evaluation function can itself be parallelized because it is central processing unit (CPU) time-consuming or input/output (I/O) intensive. In that case, the function can be viewed as a combination of a certain number of partial functions that can be processed in parallel.

When evaluating a parallel technique, it is important to consider the computing platform on which it is executed, as the hardware design affects the time needed for communication, synchronization, computation, and data sharing. Until the last decade, parallel metaheuristics concentrated on conventional supercomputers and clusters of workstations. Today, parallel computing architectures such as graphics processing units, multicore processors, and grid environments offer new opportunities to design parallel methods that improve problem solving and reduce computation time. This section introduces popular parallel computing platforms, gives the main ideas regarding the execution of parallel metaheuristics on this hardware, and points to the most effective recent works for programmers and users. The software tools for adapting parallel metaheuristics are also described. Numerous implementations of parallel metaheuristics have been devised on cluster architectures, and the majority of them follow the cooperative parallel model, which uses multiple populations.
This model provides a cooperative search that often yields better results than the sequential model, and it outperforms other parallel metaheuristics by exploiting the benefits of multiple searches and the increased diversity that the multipopulation model offers across a group of processors. Cluster computing platforms offer the most natural means of parallelizing metaheuristics on conventional hardware with a good performance/cost ratio. Moreover, libraries for parallel computing on clusters are well established, and numerous high-level models for parallel/distributed metaheuristics are available. In cluster computing platforms, the synchronization and cooperation between the different processes rely on the message-passing model. The MapReduce model and its related techniques offer a simpler route to parallel computing than the message-passing interface (MPI). Because the majority of data clustering techniques are devised as centralized techniques, some earlier studies tried to enhance them using the MapReduce model. The most important task is to determine the most time-consuming operator of the algorithm and then use the map and reduce functions to distribute the computing tasks to different nodes and to combine the results from these nodes. Some methods used Python to develop a coclustering algorithm in the Hadoop environment. Considerable research has also been devoted to a parallel k-means (PKM) clustering technique on Hadoop. The index of the closest center for an input pattern is the key in
which the input pattern is the value; this key-value pairing is a natural fit for the map and reduce functions. The map function computes the distance between each center and the input pattern and thereby determines the closest center of the input pattern. The reduce function sums and counts the input patterns of each cluster and computes the new cluster means. The main issues when running clustering techniques with MapReduce on huge datasets are the I/O cost and the network cost among the processing nodes. A two-phase clustering technique has therefore been proposed, in which sampling is used to reduce the number of patterns to be clustered in the first phase, and the remaining input patterns are clustered in the second phase. To enhance the results of data analytics on big data, a critical research problem is how to adapt metaheuristic techniques to the cloud computing platform. In contrast to United Data Management (UDM), which handles one search solution at each iteration, a parallel metaheuristic data clustering framework handles several solutions at each iteration. As the majority of metaheuristic techniques use transition, evaluation, and determination operators during convergence, the key to adapting a metaheuristic algorithm to cloud computing infrastructures is to decide which operators should be implemented as the map function and which as the reduce function. How to parallelize the tasks of data analysis is a major question both for deterministic data mining techniques and for metaheuristic techniques. The transition operator of a metaheuristic is used to alter the current solution so that the search can reach other solutions in the search space. The evaluation operator computes the objective value of each solution, whereas the determination operator decides the search directions for later iterations.
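As a rough illustration of the parallel k-means formulation described earlier in this passage (index of the closest center as key, input pattern as value), the following single-process Python sketch mimics one MapReduce iteration. The function names, the dictionary standing in for the shuffle phase, and the toy data are assumptions made for illustration; this is not Hadoop code.

```python
from collections import defaultdict

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_map(point, centers):
    """Map: emit (index of the closest center, (point, 1))."""
    key = min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))
    return key, (point, 1)

def kmeans_reduce(key, values):
    """Reduce: sum the points and counts of one cluster and return its new mean."""
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for point, n in values:
        total = [t + p for t, p in zip(total, point)]
        count += n
    return key, [t / count for t in total]

def kmeans_iteration(points, centers):
    """One assign/update round expressed as map, shuffle, and reduce."""
    groups = defaultdict(list)           # stands in for the shuffle phase
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    new_centers = list(centers)          # an empty cluster keeps its old center
    for key, values in groups.items():
        _, new_centers[key] = kmeans_reduce(key, values)
    return new_centers

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    centers = kmeans_iteration(points, centers)
```

In a real deployment the map calls would run on the nodes that hold the data partitions, and only the per-cluster sums and counts would travel over the network, which is where the I/O and network costs mentioned above arise.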
The common solution in these studies is to implement the transition and evaluation operators as the map function and the determination operator as the reduce function.

4. The parallel data clustering model
To enhance the results of big data analysis, a critical line of research applies metaheuristic techniques in a cloud computing model. The extension of UDM makes it possible to generate different solutions at each iteration. As the majority of metaheuristic techniques use transition, evaluation, and determination operators in the convergence process, the key to adapting a metaheuristic algorithm to cloud computing models is to decide which operators should be implemented as the map function and which as the reduce function. Like parallel data clustering algorithms, which must consider how to parallelize the tasks of data analysis, deterministic data mining and metaheuristic algorithms can be adapted to the MapReduce framework. The transition operator of a metaheuristic algorithm is typically used to adjust the current solution so that the search can reach other solutions in the search space. The evaluation operator is commonly used to compute the objective value of each solution, whereas the determination operator decides the search directions for later iterations. A common solution illustrated in different studies is to implement the transition and evaluation operators as the map
function and the determination operator as the reduce function. Table 4–3 shows the parallel metaheuristic data clustering model.

• The parallel fuzzy c-means algorithm
The parallel fuzzy c-means (PFCM) algorithm is designed to run on parallel computers of the single program multiple data (SPMD) type with message passing. A typical platform is a cluster of networked workstations with MPI software installed. MPI is a widely accepted standard that is both portable and easy to use. The parallel program is written in C, C++, or FORTRAN 77, compiled, and linked with the MPI library. The resulting object code is distributed to each processor for parallel execution. Table 4–4 gives the algorithmic steps of PFCM in terms of the SPMD model and message passing. In the algorithm, subroutine calls with the MPI prefix are calls to the MPI library; these calls are made whenever messages are passed between processes, that is, whenever a transfer of data or a computation requires more than one process. Line 1 indicates that P processors are allocated to the parallel processing job, and each process is assigned an identity number myid = 0, ..., P − 1. Each data point is represented by the vector variable x[i], where i = 1, ..., n, and each cluster is identified by the index j, where j = 1, ..., c. The algorithm requires the dataset to be split evenly so that each process computes with its n/P data points loaded into its own local memory. If a computation needs data points stored in other processes, an MPI call is required. Similarly, the fuzzy membership function u_{j,i} is split among the processes, with the local representation my_u[j][i] storing the membership of the local data.
This divide-and-conquer strategy of parallelizing the storage of data and variables allows the heavy computations to be carried out entirely in main memory without accessing secondary storage such as the disk. This improves performance compared with serial algorithms when the dataset is too large to reside in main memory. In line 3, my_uOld[j][i] represents the old value of my_u[j][i], and its values are initialized with random numbers in [0, 1]. Lines 4–30 are the parallelization of the iterative computation.
Table 4–3
The parallel metaheuristic data clustering model.
Input data: D
Obtain the preliminary solution r
While the stopping criterion is not met
    d = Data_Scan(D)
    v, f = Map_Transition_Evaluation(r)
    r = Reduce_Determination(v, f)
End
Output rules r
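The loop in Table 4–3 can be imitated in plain Python as a minimal sketch. Everything concrete here is an assumption made for illustration: candidate solutions are lists of centroids, the transition is a random perturbation, the evaluation is the sum of squared errors, and the determination keeps the best candidates. The names `map_transition_evaluation`, `reduce_determination`, and `cluster` are hypothetical, and the data scan step is reduced to passing the in-memory list `data`.

```python
import random

def sse(data, centroids):
    """Evaluation: sum of squared distances from each point to its nearest centroid."""
    return sum(min(sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centroids)
               for p in data)

def map_transition_evaluation(solution, data, rng, step=0.5):
    """Map task: perturb one candidate solution (transition) and score it (evaluation)."""
    candidate = [[c + rng.uniform(-step, step) for c in ctr] for ctr in solution]
    return candidate, sse(data, candidate)

def reduce_determination(scored, keep):
    """Reduce task: keep the best `keep` solutions for the next iteration (determination)."""
    return sorted(scored, key=lambda pair: pair[1])[:keep]

def cluster(data, k, n_solutions=8, iterations=50, seed=0):
    rng = random.Random(seed)
    population = [[list(rng.choice(data)) for _ in range(k)] for _ in range(n_solutions)]
    scored = [(s, sse(data, s)) for s in population]
    for _ in range(iterations):
        # Each map call is independent, so it could run on a different node.
        mapped = [map_transition_evaluation(s, data, rng) for s, _ in scored]
        scored = reduce_determination(scored + mapped, keep=n_solutions)
    return scored[0]   # (best centroids, best SSE)

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
best_centroids, best_sse = cluster(data, k=2)
```

Because the reduce step selects from both the old and the new candidates, the best score never gets worse from one iteration to the next; other determination rules would trade this guarantee for more exploration.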
Table 4–4
Algorithmic steps of parallel fuzzy c-means.
P = MPI_Comm_size();
myid = MPI_Comm_rank();
randomize my_uOld[j][i] for each x[i] in fuzzy cluster j;
do {
    myLargestErr = 0;
    for j = 1 to c
        myUsum[j] = 0;
        reset vectors my_v[j] to 0;
        reset my_u[j][i] to 0;
    end for;
    for i = myid * (n/P) + 1 to (myid + 1) * (n/P)
        for j = 1 to c
            update myUsum[j];
            update vectors my_v[j];
        end for;
    end for;
    for j = 1 to c
        MPI_Allreduce(myUsum[j], Usum[j], MPI_SUM);
        MPI_Allreduce(my_v[j], v[j], MPI_SUM);
        update centroid vectors: v[j] = v[j] / Usum[j];
    end for;
    for i = myid * (n/P) + 1 to (myid + 1) * (n/P)
        for j = 1 to c
            update my_u[j][i];
            myLargestErr = max{|my_u[j][i] - my_uOld[j][i]|};
            my_uOld[j][i] = my_u[j][i];
        end for;
    end for;
    MPI_Allreduce(myLargestErr, Err, MPI_MAX);
} while (Err >= epsilon)
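The structure of Table 4–4 can be imitated in a single Python process as a rough sketch. The helper `allreduce` is only a stand-in for MPI_Allreduce() that combines one value per simulated process; it is not an MPI binding. The fuzzifier `m`, the iteration cap `max_iter`, the shared membership list, and the requirement that n be divisible by P are assumptions of this sketch rather than details given in the text.

```python
import random

def allreduce(values, op):
    """Stand-in for MPI_Allreduce: combine one value per simulated process."""
    return op(values)

def pfcm(x, c, P, m=2.0, epsilon=1e-4, max_iter=100, seed=0):
    rng = random.Random(seed)
    n, dim = len(x), len(x[0])
    # Each simulated process owns a contiguous block of n/P points (n divisible by P assumed).
    blocks = [range(p * (n // P), (p + 1) * (n // P)) for p in range(P)]
    u_old = [[rng.random() for _ in range(n)] for _ in range(c)]
    v = []
    for _ in range(max_iter):
        v = []
        for j in range(c):
            # Local partial sums in every process, followed by a global reduction.
            my_usum = [sum(u_old[j][i] ** m for i in blk) for blk in blocks]
            my_v = [[sum(u_old[j][i] ** m * x[i][d] for i in blk) for d in range(dim)]
                    for blk in blocks]
            usum = allreduce(my_usum, sum)
            vsum = [allreduce([mv[d] for mv in my_v], sum) for d in range(dim)]
            v.append([s / usum for s in vsum])
        my_err = []
        for blk in blocks:
            largest = 0.0
            for i in blk:
                # Squared distances to every centroid, clipped to avoid division by zero.
                dist = [max(sum((x[i][d] - v[j][d]) ** 2 for d in range(dim)), 1e-12)
                        for j in range(c)]
                for j in range(c):
                    u_new = 1.0 / sum((dist[j] / dist[k]) ** (1.0 / (m - 1)) for k in range(c))
                    largest = max(largest, abs(u_new - u_old[j][i]))
                    u_old[j][i] = u_new
            my_err.append(largest)
        if allreduce(my_err, max) < epsilon:   # global error over all processes
            break
    return v, u_old

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, memberships = pfcm(points, c=2, P=2)
```

In an actual SPMD program each process would hold only its own block of points and memberships, and the three reductions per iteration would be the only communication between processes.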
Lines 5–10 reset several quantities to 0: the variable myUsum[j] stores the local summation of (my_uOld[j][i])^m, my_v[j] stores the vector value of cluster centroid j, and my_u[j][i] stores the membership function for the next iteration. Lines 11–21 compute the cluster centroids v[j]. The first half of the computation (lines 11–16) deals with the intermediate calculations carried out within each process using only its local data points. These calculations are the local summations. As the computation of v[j] requires putting together all the intermediate results stored locally in each process, two MPI calls are needed for the second half of the computation. The MPI_Allreduce() subroutine performs a parallel computation over the P processes, using the first argument as its input and storing the output in the second argument. Here, a summation is performed by specifying MPI_SUM.
For instance, in line 18, each process holds a different value of myUsum[j] after its local computations, and the MPI_Allreduce() call combines these values into the single sum Usum[j]. Lines 22–28 compute the fuzzy membership function my_u[j][i]. In line 25, the largest error is first computed locally within each process, and the global value Err over all processes is then obtained by calling MPI_Allreduce() with MPI_MAX. The algorithm terminates when the error is less than the tolerance epsilon.

• The PKM algorithm
The PKM algorithm was devised by Dhillon and Modha and is suited to parallel computers of the SPMD type with MPI installed. Table 4–5 gives the algorithmic steps of PKM. The algorithm uses the same strategy of splitting the data into partitions, one for each process. Intermediate results computed from the local data of each process are stored in local variables with the my_ prefix. The variable MSE stores the mean squared error used as the convergence criterion. As the k-means algorithm forms crisp partitions, the variable n[j] records the number of data points in cluster j. The centroids are initialized in lines 4–7: the initialization is carried out locally in one process, and the centroids are then broadcast to every process with MPI_Bcast(). This approach is not used to initialize the fuzzy membership function in PFCM for two reasons. First, my_uOld[j][i] is a quantity local to each process, so the initialization can be performed locally. Second, there are c × n membership values in total, which should be computed in parallel across all processes to share the load. It is interesting to note that although the PFCM algorithm appears more complicated, it needs only three MPI_Allreduce() calls, the same number as PKM. This is desirable because extra MPI_Allreduce() calls mean more time spent by the processes exchanging information with each other.
1.
PKM using MapReduce framework
To reduce computation time and to manage distributed data, research has used the MapReduce framework, in which the computational burden is reduced through effective feature selection mechanisms that lower the dimension of the features. The method achieves improved classification accuracy through effective parallelism across servers that process the partitions of big data in parallel. The two functions of MapReduce map the input data to relevant features and reduce the intermediate data of the mappers to obtain the required output. Fig. 4–3 gives a simple example of how PKM works with the map and reduce functions. When PKM is run in Spark, the algorithm must convert the input data to resilient distributed datasets before the map function can use the data. Each node handles its allocated tasks on a subset of the input patterns. After all the input patterns have been assigned to their clusters, the reduce function, which is the update
Table 4–5
Algorithmic steps of parallel k-means.
P = MPI_Comm_size();
myid = MPI_Comm_rank();
MSE = LargeNumber;
if (myid = 0)
    select k initial cluster centroids m[j], j = 1...k;
endif;
MPI_Bcast(m[j], 0), j = 1...k;
do {
    OldMSE = MSE;
    my_MSE = 0;
    for j = 1 to k
        my_m[j] = 0; my_n[j] = 0;
    end for;
    for i = myid * (n/P) + 1 to (myid + 1) * (n/P)
        for j = 1 to k
            compute squared Euclidean distance d_Sq(x[i], m[j]);
        end for;
        find the closest centroid m[r] to x[i];
        my_m[r] = my_m[r] + x[i]; my_n[r] = my_n[r] + 1;
        my_MSE = my_MSE + d_Sq(x[i], m[r]);
    end for;
    for j = 1 to k
        MPI_Allreduce(my_n[j], n[j], MPI_SUM);
        MPI_Allreduce(my_m[j], m[j], MPI_SUM);
        n[j] = max(n[j], 1); m[j] = m[j] / n[j];
    end for;
    MPI_Allreduce(my_MSE, MSE, MPI_SUM);
} while (MSE < OldMSE)
operator, collects the information needed to compute the new means of the clusters. The assign and update operators are repeated until the stop condition is met.

5. Nature-inspired techniques
Research works have employed nature-inspired metaheuristics to design different optimization-based data clustering methods. Many modern metaheuristics, particularly evolutionary computation-based techniques, are inspired by natural systems. Nature provides the concepts, principles, and mechanisms for developing artificial computing systems that deal with complex problems. These metaheuristic algorithms include cat swarm optimization (CSO), the Cuckoo Search Algorithm (CSA), the Firefly algorithm (FA), the Invasive Weed Optimization Algorithm (IWO), and the Gravitational Search Algorithm (GSA). A large number of recent metaphor-inspired metaheuristics are
FIGURE 4–3 An example of Parallel K-means with MapReduce.
attracting criticism in the research community for hiding their lack of novelty behind an elaborate metaphor.

1. Cat swarm optimization
CSO was devised by observing the hunting behavior of cats. CSO-based clustering techniques have been applied in the literature to the UCI datasets. The technique finds the optimum solution using two modes, namely seeking mode and tracing mode. The seeking mode is a global search that mimics the resting state of cats, with slow movement, whereas the tracing mode imitates the rapid chase of a cat after its target. CSO has also been used for the optimum deployment and clustering of sensor nodes in wireless sensor networks.

2. Cuckoo search algorithm
The CSA was devised by Yang and Deb, inspired by the breeding behavior of cuckoos, which lay their eggs in the nests of other birds. The three fundamental rules of the CSA are:
1. each cuckoo lays one egg at a time and dumps it in a randomly chosen nest;
2. the nests with high-quality eggs carry over to the next generations;
3. the number of host nests is fixed, and an egg laid by a cuckoo is discovered by the host bird with a probability in the range [0, 1]. In this case, the host bird can either throw the egg away or abandon the nest and build a new one.

A cuckoo search-based clustering algorithm has been devised and applied to extract information on water bodies from remote sensing satellite images.

3. Firefly algorithm
The FA, devised by Yang, is based on the rhythmic flashes of fireflies and has been used for cluster analysis on UCI datasets. The method rests on three rules derived from the glowing behavior of fireflies.
• All fireflies are unisex, so each firefly is attracted to other fireflies regardless of their sex.
• The attractiveness of a firefly is proportional to its brightness. Thus, for any two flashing fireflies, the less bright one moves toward the brighter one. Attractiveness and brightness both decrease as the distance between the fireflies increases.
• The brightness of a firefly is determined by the landscape of the objective function.
The clustering technique begins with all the fireflies dispersed randomly over the complete search space. The algorithm then finds the optimum partitions in two stages:
• Variation of light intensity: the brightness of a firefly at its present position reflects its fitness value.
• Movement toward the attractive firefly: the firefly changes its position by observing the light intensity of neighboring fireflies.
The firefly clustering algorithm has also been used for image segmentation.

4. Invasive weed optimization algorithm
IWO was devised by Mehrabian and Lucas by imitating the colonization behavior of weeds. Here, the weeds reproduce and spread their seeds over a particular area, and the seeds grow into new plants in search of the optimum position.
The automatic clustering technique devised using IWO is modeled on four steps:
1. Weeds initialization using the whole search space
2. Weeds reproduction
3. Seeds distribution
4. Competitive elimination of the weeds (fitter weeds generate more seeds).
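The four steps above can be sketched in Python for a small clustering problem. This is a simplified illustration under several assumptions that the text does not state: each weed encodes k centroids, fitness is the sum of squared errors, the number of seeds grows linearly with fitness, and the dispersal radius `sigma` shrinks over the iterations. All parameter names and default values are hypothetical.

```python
import random

def sse(data, centroids):
    """Fitness: sum of squared distances from each point to its nearest centroid."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in data)

def iwo_cluster(data, k, pop_max=10, iterations=40, seeds_min=1, seeds_max=4,
                sigma_start=2.0, sigma_end=0.05, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dim)]
    hi = [max(p[d] for p in data) for d in range(dim)]
    # Step 1: initialise weeds (each weed is a set of k centroids) over the whole search space.
    weeds = [[[rng.uniform(lo[d], hi[d]) for d in range(dim)] for _ in range(k)]
             for _ in range(pop_max // 2)]
    history = []
    for it in range(iterations):
        fitness = [sse(data, w) for w in weeds]
        best, worst = min(fitness), max(fitness)
        # The dispersal radius shrinks over time so that the search narrows.
        sigma = sigma_end + (sigma_start - sigma_end) * (1 - it / iterations) ** 2
        offspring = []
        for weed, fit in zip(weeds, fitness):
            # Step 2: reproduction; fitter weeds (lower SSE) produce more seeds.
            share = 1.0 if worst == best else (worst - fit) / (worst - best)
            n_seeds = seeds_min + round(share * (seeds_max - seeds_min))
            # Step 3: seeds are scattered around the parent with normally distributed noise.
            for _ in range(n_seeds):
                offspring.append([[c + rng.gauss(0.0, sigma) for c in ctr] for ctr in weed])
        # Step 4: competitive elimination keeps only the pop_max fittest plants.
        weeds = sorted(weeds + offspring, key=lambda w: sse(data, w))[:pop_max]
        history.append(sse(data, weeds[0]))
    return weeds[0], history

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
best_weed, history = iwo_cluster(data, k=2)
```

Because parents compete with their own offspring in step 4, the best fitness recorded in `history` never increases from one iteration to the next.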
The multiobjective IWO method has been employed for image clustering and for cluster analysis that distributes the data among different clusters.

5. Gravitational search algorithm
The GSA follows Newton's law of gravity, which states that each particle in the universe attracts every other particle with a force that is directly proportional to the product of their masses and inversely proportional to the square of the
distance between them. For cluster analysis, a hybrid technique based on K-harmonic means and GSA has been devised.

6. Memetic algorithm-based data clustering
A recent review highlighted the advances in the applications and theory of the Memetic algorithm. The algorithm has been used for cluster analysis of gene expression profiles, with minimization of the sum of squares as the fitness measure. The method starts with a population, which provides a global search that explores different areas of the search space, and combines it with individual solution improvement, in which a local search heuristic provides local refinements. A balance between the local and global mechanisms ensures that the method neither converges prematurely to local solutions nor consumes excessive computational resources in reaching the solution. Memetic partitional clustering techniques have been used for energy-efficient clustering of nodes in wireless sensor networks. In other work, they have been used to segment remote sensing images into different clusters.

• Harmony search
The Harmony search algorithm became well known after it was applied to different engineering optimization problems. A Harmony search-based partitional algorithm, inspired by the harmony played by musicians, has been devised for web page clustering. Each musician represents a decision variable of the solution to the problem. The musicians try to find the best harmony over time through improvisation and variation of their pitches. The variation is evaluated with the cost function in order to reach the global optimal solution. Numerous methods in the literature use a hybrid Harmony K-means algorithm for document clustering.
These clustering techniques have also been used effectively to develop clustering protocols in wireless sensor networks.

• Shuffled frog-leaping algorithm
The Shuffled frog-leaping (SFL) algorithm mimics the behavior of frogs using memeplexes. The algorithm has been applied to the partitional clustering problem and has yielded better solutions than ACO, genetic k-means, and SA on different synthetic and real-life datasets. The initial population is a set of frogs, which is divided into subsets termed memeplexes. Frogs that belong to different memeplexes are treated as separate groups and do not interact during the local search. Within each memeplex, the individual frogs share information with one another, so that the group evolves with new ideas. After a predefined number of steps, ideas are exchanged among the memeplexes through a shuffling process. The local and global searches are repeated until the optimum fitness is attained. Clustering techniques based on SFL have been used for color image segmentation and web text mining.
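The interplay of local search inside memeplexes and periodic shuffling can be sketched as follows. This is a deliberately simplified version, not the published SFL procedure: each frog is a list of k centroids, the worst frog of a memeplex leaps toward the best frog of the same memeplex, and a frog that fails to improve is replaced by a random one. The function name, parameters, and toy data are assumptions made for illustration.

```python
import random

def sse(data, centroids):
    """Fitness: sum of squared distances from each point to its nearest centroid."""
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
               for p in data)

def sfl_cluster(data, k, n_frogs=12, n_memeplexes=3, local_steps=5, shuffles=20, seed=0):
    rng = random.Random(seed)

    def fitness(frog):
        return sse(data, frog)

    # Each frog is one candidate solution: k centroids drawn from the data points.
    frogs = [[list(rng.choice(data)) for _ in range(k)] for _ in range(n_frogs)]
    history = []
    for _ in range(shuffles):
        frogs.sort(key=fitness)                          # best frog first
        # Deal the sorted frogs into memeplexes like a deck of cards.
        memeplexes = [frogs[i::n_memeplexes] for i in range(n_memeplexes)]
        for mem in memeplexes:
            for _ in range(local_steps):                 # local search within one memeplex
                mem.sort(key=fitness)
                best, worst = mem[0], mem[-1]
                step = rng.random()
                # The worst frog leaps toward the best frog of its memeplex.
                trial = [[w + step * (b - w) for w, b in zip(wc, bc)]
                         for wc, bc in zip(worst, best)]
                if fitness(trial) < fitness(worst):
                    mem[-1] = trial
                else:                                    # no improvement: random replacement
                    mem[-1] = [list(rng.choice(data)) for _ in range(k)]
        # Shuffling: merge the memeplexes so that ideas spread across the population.
        frogs = [frog for mem in memeplexes for frog in mem]
        history.append(min(fitness(f) for f in frogs))
    return min(frogs, key=fitness), history

data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
best_frog, history = sfl_cluster(data, k=2)
```

Only the worst frog of a memeplex is ever replaced, so the best solution found so far is always retained across shuffles.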
7. Bioinspired algorithms in partitional clustering
Bioinspired, short for biologically inspired, techniques are natural metaheuristics derived from the living phenomena and characteristics of biological organisms. The intelligence obtained from bioinspired techniques is distributed, decentralized, self-organizing, and mostly adaptive in nature. The major techniques in this domain include Bacterial foraging optimization, artificial immune systems (AIS), the Krill herd algorithm, and the Dendritic cell algorithm. These techniques have been employed effectively to address the partitional clustering problem. A closer look at AIS helps in evaluating its prospective applications. The four key concepts for imitating the principles of the biological immune system are the clonal selection algorithm, danger theory, the negative selection algorithm, and the immune network model. Among these four, the clonal selection method has been combined with optimization and machine learning methods for clustering massive datasets.
4.2 Summary
This chapter presents a survey of heuristic methods for data clustering. The major application of heuristics is to solve combinatorial problems using a series of algorithms. Each heuristic technique requires a decision among alternatives, which should be explicitly available for the classification of heuristic techniques. Different heuristic clustering strategies are presented whose goal is to give effective results when clustering massive data. Furthermore, the principles used by the heuristic methods are illustrated to give a brief idea of heuristic techniques. The heuristic algorithms are divided into five types: local search versus global search, single-solution versus population-based, hybridization, parallel metaheuristics, and nature-inspired techniques. For the parallel metaheuristics, numerous implementations have been devised on cluster architectures, and the majority of them follow the cooperative parallel model, which uses multiple populations. All of these heuristic methods are briefly elaborated, and their application to data clustering is explained in the chapter, together with illustrations of different data clustering algorithms and related examples.
5 Deep learning methods for data classification
Arul V. H.
DEPARTMENT OF ECE, THEJUS ENGINEERING COLLEGE, THRISSUR, INDIA
5.1 Data classification
Data classification is the process of organizing data into relevant categories so that the data can be protected and used more efficiently. It is the data management process of categorizing and sorting data into different forms, types, or other distinct classes. It enables data to be classified and separated according to the requirements of the dataset for various personal and business objectives. Data classification makes data easier to store and retrieve, which is essential for compliance and risk management. It organizes the data into tiers of information for organizational purposes, and it requires the roles and responsibilities of the employees in the organizational structure to be specified. Data classification is important for compliance, data security, and risk management. It eliminates data duplication, reduces backup and storage costs, and speeds up the search process. In data mining, classification is a technique in which the training samples or database tuples are analyzed to produce a generalized model of the data. Each tuple in the database belongs to a predefined class, which is determined by one of the attributes, termed the classifying attribute. The classification scheme is then used to classify future data samples and to provide a better understanding of the contents of the database. Accordingly, data classification is a two-step process consisting of a learning or training phase and an evaluation or test phase, in which the actual class of each instance is compared with the predicted class. When the hit rate is acceptable to the analyst, the classifier is considered capable of classifying future instances whose class is unknown.

Reason for data classification
The use of data classification has increased significantly over time. Today, technology uses data classification for various purposes in support of data security initiatives.
Data is classified for a number of reasons, such as meeting personal or business objectives, maintaining regulatory compliance, and ease of access.
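The two-step process described in Section 5.1 (a learning phase, then a test phase in which predicted classes are compared with actual classes) can be illustrated with a small sketch. The nearest-mean classifier, the function names, and the toy tuples are assumptions chosen for brevity; the text does not prescribe a particular classifier.

```python
def train(samples):
    """Learning phase: compute one mean vector per class from labelled tuples."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for d, value in enumerate(features):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, features):
    """Assign the class whose mean vector is closest to the given features."""
    return min(model, key=lambda label: sum((f - m) ** 2
                                            for f, m in zip(features, model[label])))

def hit_rate(model, samples):
    """Test phase: fraction of instances whose predicted class equals the actual class."""
    hits = sum(predict(model, features) == label for features, label in samples)
    return hits / len(samples)

training = [((1.0, 1.0), "low"), ((1.5, 0.5), "low"),
            ((8.0, 9.0), "high"), ((9.0, 8.0), "high")]
testing = [((0.5, 1.5), "low"), ((8.5, 8.5), "high")]

model = train(training)
rate = hit_rate(model, testing)   # 1.0 on this toy data
```

If the hit rate on the held-out tuples is acceptable to the analyst, the same `predict` function would then be applied to new instances whose class is unknown.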
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00001-X © 2021 Elsevier Inc. All rights reserved.
Data classification algorithm
The following are some of the data classification algorithms used in the data mining process:
• decision tree
• clustering
• support vector machine
• Bayesian classifier
• artificial neural network (ANN)
• rules induction.

Applications
Some of the applications of data classification are listed as follows:
• data sorting based on the file or content type, time, and size of data;
• identifying and managing the frequently used data in the memory/disk cache;
• sorting the data for security reasons;
• credit approval;
• medical diagnosis; and
• product marketing.

Benefits of data classification
Some of the benefits of data classification are listed as follows:
• Data security.
• Regulatory compliance.
• Discover the significant trends or patterns inside data.
• Identify the intellectual property and sensitive files.
• Optimize the search capabilities.
5.2 Data mining
Data mining is the automated discovery of potentially useful, previously unknown, and nontrivial patterns embedded in a database. With the enormous growth of computerization in every aspect of life, storing and handling large amounts of information has become a tedious process. Large-scale data mining applications involve decision-making strategies that access billions of bytes of data. The key role of data mining is to find novel, understandable, potentially useful, and valid patterns and correlations in existing data. Accordingly, data mining is the process of finding answers in the company's information that an executive or analyst had not thought to ask. Data mining creates both insight and data that add to the knowledge of the organization. Data mining can proceed top-down (searching to test hypotheses) or bottom-up (exploring the facts to find connections between data). It usually leads to steady, incremental changes rather than major transformations. Data mining has seen enormous growth in various applications, such as product design, credit card fraud detection, automatic abstraction, organic compounds, medical diagnosis, and so on. It analyses
large quantities of information that are recorded on computers. It is not specific to any type of data or media; it is applicable to any category of information repository. Data mining has been studied for various kinds of databases, such as relational, object-relational, object-oriented, and transactional databases and data warehouses, as well as semistructured and unstructured repositories, such as the World Wide Web.
5.2.1 Steps involved in data mining process
The data mining process follows these steps:
• Develop an understanding of the application, the relevant knowledge, and the end-user goal.
• Create the target dataset.
• Preprocess and clean the data (handling missing data, removing noise, and accounting for known changes and time-series data).
• Reduce the number of variables and find invariant representations of the data.
• Select the data mining task, such as clustering, regression, or classification.
• Select the data mining approach.
• Search for patterns of interest.
• Interpret the mined patterns.
• Consolidate the discovered knowledge and generate the report.
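As a rough illustration of how a few of these steps chain together, the following minimal Python sketch cleans a toy dataset, reduces it to one variable, and searches for frequent values. The records, field names, and helper functions (clean, select_features, mine_frequent_values) are hypothetical and chosen only for illustration; a real project would use a dedicated library and a task-specific mining method.

```python
from collections import Counter


def clean(records):
    """Preprocessing step: drop records that contain a missing value (None)."""
    return [r for r in records if all(v is not None for v in r.values())]


def select_features(records, keep):
    """Variable-reduction step: keep only the chosen fields of each record."""
    return [{k: r[k] for k in keep} for r in records]


def mine_frequent_values(records, field, min_count):
    """Pattern-search step: values of `field` seen at least `min_count` times."""
    counts = Counter(r[field] for r in records)
    return {value: n for value, n in counts.items() if n >= min_count}


# Hypothetical raw data; the second record has a missing price.
raw = [
    {"customer": "a", "item": "milk", "price": 1.2},
    {"customer": "b", "item": "milk", "price": None},
    {"customer": "c", "item": "bread", "price": 2.0},
    {"customer": "d", "item": "milk", "price": 1.3},
]

target = select_features(clean(raw), keep=["item"])
patterns = mine_frequent_values(target, field="item", min_count=2)
print(patterns)
```

Each function corresponds to one step of the list, so individual steps can be swapped out (for example, imputing missing values instead of dropping records) without touching the others.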
5.3 Background and evolution of deep learning
Deep learning is a set of machine learning techniques that automatically construct a model with multiple levels of representation by mapping a high-dimensional input into a low-dimensional output. Since 2006, deep structured learning, commonly termed hierarchical learning or deep learning, has emerged as a new area of machine learning research. In the years since, the methods developed in deep learning research have influenced an extensive range of information and signal processing tasks. Deep learning is a key aspect of machine learning and artificial intelligence (AI). Historically, the concept of deep learning is derived from research on ANNs. Before describing deep learning in detail, we briefly review some of its definitions.
• A class of machine learning methods that exploit many layers of information processing for supervised or unsupervised feature extraction and transformation, and for classification and pattern analysis.
• The hierarchy of low-level and high-level features in the unsupervised learning model is termed a deep architecture.
• Deep learning is a set of machine learning approaches that learn multiple levels of representation corresponding to different levels of abstraction, which helps to make sense of data such as sound, text, and images. Deep learning uses the concept of the ANN.
A deep learning model contains a number of neurons that are organized hierarchically in layers. Each neuron receives input from a set of neurons in the previous layer, and each link connecting two neurons has a parameter, its weight. For each connection, the input value obtained from the neuron of the previous layer is multiplied by the weight; the neuron aggregates these weighted values and passes the result through an activation function, which computes the output of the neuron. The parameters are optimized with the gradient descent algorithm, which reduces the loss function: the parameters are updated after the gradients of the loss function are propagated back through the network. The hierarchical models in deep learning can learn multiple levels of data representation corresponding to different levels of abstraction, which allows the data to be represented densely [1]. Deep learning methods have been used extensively in recent decades in various automatic classification processes. From the image analysis perspective, the usual classification procedure extracts features with a set of convolutional layers and performs the classification with fully connected layers. Once the classifier has been trained with the optimization algorithm, the quality of the predicted output is verified against the true values recorded in the labeled dataset. These labels are taken as the reference standard, as they reflect the knowledge of human experts. Deep learning is a data mining process that uses deep NN structures; it is a distinct type of machine learning and AI method that has grown rapidly in the past decades. It allows humans to teach machines to complete tasks without the intervention of a programmer.
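The weighted-sum-and-activation computation of a neuron and the gradient descent update described above can be sketched for a single neuron. The example below is only a minimal sketch under stated assumptions: a sigmoid activation, a squared-error loss, a learning rate of 0.5, and the logical OR function as a toy training set are all choices made here for illustration and are not taken from the text.

```python
import math


def sigmoid(z):
    """Logistic activation; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, bias, inputs):
    """Multiply each input by its weight, sum, add the bias, then activate."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def train(samples, lr=0.5, epochs=2000):
    """Stochastic gradient descent on the loss 0.5 * (out - target) ** 2."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            out = neuron(weights, bias, inputs)
            # Chain rule: dLoss/dz = (out - target) * sigmoid'(z),
            # where sigmoid'(z) = out * (1 - out).
            delta = (out - target) * out * (1.0 - out)
            weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
            bias -= lr * delta
    return weights, bias


# Toy labeled dataset (assumption): the logical OR of two binary inputs.
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train(samples)
predictions = [round(neuron(weights, bias, x)) for x, _ in samples]
print(predictions)
```

In a multilayer network the same update is applied to every weight, with the gradient for each layer obtained by propagating delta values backward from the output layer.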
Thus we have entered the era of AI and machine learning. In the future, machines will do many of the tasks that humans do today. In past years, humans explicitly programmed the computer step by step to solve complex problems on the given data. Owing to the exponential growth of complex datasets, deep learning techniques have been widely used in recent years to offer accurate and robust data classification. Deep learning methods have achieved better results than earlier machine learning methods on tasks such as natural language processing, image classification, and face recognition. The success of deep learning methods relies greatly on their capacity to model nonlinear and complex relationships in the data.
5.4 Deep learning methods
Deep learning methods construct a computational model that uses multiple layers to represent abstractions of the data. Deep learning methods such as the convolutional neural network (CNN), generative adversarial networks (GANs), and model transfer have entirely changed the perception of data processing. In recent decades, deep learning methods have demonstrated remarkable performance on various tasks. The basic deep learning architectures take various feature space models as their input layers. In general, for feature extraction from text documents, the deep NN is highly
suitable when used with term frequency-inverse document frequency (TF-IDF) features. It achieves state-of-the-art results in speech processing and machine translation. Deep learning methods improve the belief propagation (BP) decoding of high-density parity check codes by means of a weighted BP decoder. Here, the BP algorithm is formulated as a neural network, which results in high decoding performance, and the decoder network can be trained efficiently using a single codeword. Deep learning methods aim to learn feature hierarchies, where the features at higher levels of the hierarchy are built from the features of lower levels. Many deep learning methods are based on deep CNNs. Deep learning methods focus on learning multiple levels of abstraction and representation that help to make sense of hidden state information. They perform feature extraction automatically, which allows researchers to obtain discriminative features with minimal human effort and domain knowledge. Deep learning methods use a layered architecture for data representation, such that high-level features are extracted from the later layers of the network. These network architectures are inspired by AI that simulates the sensory regions of the human brain. The human brain can automatically extract data representations from various real-world scenes: the input is the scene information received from the eyes, and the output is the classified objects. This indicates the major benefit of deep learning methods, which mimic the working of the human brain. The classification and categorization of complex data, such as documents, videos, and images, are major challenges for the data science community.
In recent decades, deep learning structures have advanced considerably on these problems. However, deep learning structures are usually modeled for a specific type of domain or data. Accordingly, there is a need for information processing techniques that perform data classification across a wide variety of data types. Various researchers have successfully used deep learning architectures for classification problems, and the major goal of these architectures is to solve the problems of the application accurately and efficiently. Some of the deep learning methods used for data classification are as follows:
• Fully connected neural network (FCNN)
• Deep NN
• Deep CNN
• Deep recurrent neural network (deep RNN)
• Deep GAN
• Deep reinforcement learning (DRL)
• Deep recursive neural network (deep RvNN)
• Deep long short-term memory (deep LSTM)
• Hierarchical deep learning for text (HDLTex)
• Deep autoencoder
• Random multimodel deep learning (RDML)
Applications of deep learning methods
Deep learning methods have been demonstrated to achieve effective results in various kinds of applications, such as
• speech and audio processing
• natural language processing
• visual data processing
5.4.1 Fully connected neural network
The FCNN is the standard network architecture used in most neural network applications. Fig. 5–1 shows the architecture of FCNN. Fully connected means that each neuron of a layer is connected to every neuron of the successive layer. Feed forward means that the neurons of a preceding layer are only ever connected to the neurons of the subsequent layer. Each neuron in this network has an activation function, which determines the output of the neuron from its input value. The activation functions used in the neural network are described as follows:
Linear function: a straight line that multiplies the input value by a constant.
Nonlinear function: the nonlinear functions used in the network architecture are listed as follows:
1. Sigmoid function: an S-shaped curve that ranges from 0 to 1.
2. Hyperbolic tangent function (tanh): an S-shaped curve that ranges from -1 to +1.
3. Rectified linear unit function (ReLU): a piecewise function that outputs 0 when the input is lower than a threshold value (usually zero) and a linear multiple of the input when the input is higher than the threshold.
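The activation functions listed above can be written in a few lines of Python. This is a minimal sketch that assumes the usual textbook definitions, with the ReLU threshold fixed at zero and a slope of one above it; the function names are chosen here for illustration.

```python
import math


def linear(x, c=1.0):
    """Linear activation: the input multiplied by a constant c."""
    return c * x


def sigmoid(x):
    """S-shaped curve with values in (0, 1), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """S-shaped curve with values in (-1, +1)."""
    return math.tanh(x)


def relu(x):
    """Zero for inputs below the threshold 0, the input itself otherwise."""
    return max(0.0, x)


for x in (-2.0, 0.0, 2.0):
    print(x, linear(x, 0.5), sigmoid(x), tanh(x), relu(x))
```

Printing the values at a negative, zero, and positive input makes the different output ranges of the four functions visible side by side.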
FIGURE 5–1 Architecture of FCNN. FCNN, Fully connected neural network.
FIGURE 5–2 Activation function.
Each activation function has its pros and cons; hence, different functions can be used in different layers of a deep NN, depending on the features. Nonlinearity allows the deep NN to model various complex functions. Fig. 5–2 shows the activation functions. Thus the network is built from input, hidden, and output layers, the neurons connected within the hidden layers, and various activation functions. These numerous connections make it possible to create a powerful deep NN, which is widely used to solve complex problems. When neurons are added to the hidden layers, the network becomes wider. Adding neurons also increases the complexity of the NN; hence, the NN must be trained with optimization algorithms.
5.4.2 Deep neural network
A multilayer perceptron (MLP) with many hidden layers, or feed forward neural network, is referred to as a deep NN. A deep NN consists of multiple layers of nonlinear operations, such as neural nets with many hidden layers. Its nodes are arranged in layers that transform the input vector into the output vector. Each unit takes its input, applies a function to it, and passes the result as output to the next layer. In general, a standard NN consists of three layers: the first is the input layer, which reads the input data; the second is the hidden layer, where the activation function is applied; and the third is the output layer, which specifies the class. Each connection between neurons in the deep NN architecture has a weight. The weights are applied to the values passing from one unit to another, and during the training phase they are modified based on the output value [2]. Deep NN architectures learn through multiple layers of connections, where each layer accepts data from the previous layer and provides a connection to the next layer of hidden neurons. The number of neurons in the output layer equals the total number of classes in multiclass classification, whereas binary classification needs only one output neuron. The deep NN is a discriminatively trained network, implemented with standard activation functions, such as ReLU and sigmoid, and the backpropagation algorithm. For multiclass classification, the output layer applies the softmax function to the outputs of the hidden neurons. Deep NN [3] is quite successful in the
unsupervised learning scheme, using sparse representations of high-dimensional data. A deep NN can be pretrained using restricted Boltzmann machines (RBMs) before it is trained as an MLP. Deep NN provides efficient computation by combining nonlinear processing elements organized in layers. This organizational structure allows the network to generalize, that is, to predict new data accurately. The deep NN is based on the architecture of a standard neural network but with many hidden layers. These network architectures are used in the field of data classification. Fig. 5–3 shows the architecture of deep NN. The deep NN is an important deep learning model that is applicable in various fields. It is inspired by the structure of the visual system: biologists found that the visual system consists of many layers, like an NN. The visual system processes image information layer by layer from the retina to the visual center; it extracts edge features, shape features, and part features, and finally forms an abstract concept. Accordingly, the depth of a deep NN is greater than or equal to four. A deep NN extracts features layer by layer and combines low-level features into high-level features, which are then used to find distributed representations of the data. It is more expressive in feature extraction and can perform complex mappings. Because of the gradient diffusion problem, however, a deep NN cannot always be trained effectively. Initializing the weights by unsupervised pretraining effectively resolves this issue.
Applications of deep NN
Some of the applications of deep NN are as follows:
• natural language processing
• image analysis
• letter recognition
FIGURE 5–3 Architecture of deep NN. NN, Neural network.
• information retrieval
• face recognition
• object recognition
• speech recognition
Advantages
Some of the greatest advantages of deep NN are:
• It can be trained in an end-to-end fashion.
• It removes the need for manual feature engineering.
• It reduces the effort needed to adapt to new domains and tasks.
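To make the layer-by-layer transformation of this section concrete, the following minimal sketch runs a forward pass through a small deep NN with ReLU hidden layers and a softmax output layer for three classes. The layer sizes and all weight values are hypothetical numbers picked by hand; in practice they would be learned with backpropagation.

```python
import math


def relu(values):
    """Apply the ReLU activation to every element of a vector."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1 (shifted for stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(weights, biases, inputs):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(layers, inputs):
    """Hidden layers use ReLU; the last layer uses softmax for multiclass output."""
    activation = inputs
    for weights, biases in layers[:-1]:
        activation = relu(dense(weights, biases, activation))
    weights, biases = layers[-1]
    return softmax(dense(weights, biases, activation))


# Hypothetical, hand-picked parameters: 2 inputs -> 3 hidden -> 2 hidden -> 3 classes.
layers = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, 0.0]),
    ([[0.2, 0.7, -0.5], [0.6, -0.1, 0.3]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]),
]

probs = forward(layers, [1.0, 2.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

The output layer has one neuron per class, as described above, and the predicted class is simply the index of the largest softmax probability.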
5.4.3 Deep convolutional neural network
Deep CNN is a type of deep NN architecture that is specially designed for specific tasks, such as image classification. Deep CNN is inspired by the organization of neurons in the visual cortex. Accordingly, deep CNN offers useful properties for processing certain types of data, such as audio, video, and images. In a basic image processing system, the input to the deep CNN is a two-dimensional array composed of the pixels of the image. The output layer of the network is typically a one-dimensional set of output neurons. Deep CNN processes the input through a combination of connected convolutional layers. Moreover, it contains downsampling layers, termed pooling layers, which reduce the number of neurons in the subsequent layers of the NN. Finally, it contains a fully connected layer, which connects the pooling layer to the output layer. Convolution is a technique that extracts visual features from small chunks of the image. Each neuron in a convolutional layer is connected to a small cluster of neurons in the preceding layer. Deep CNN provides various architectural models for learning the network structure. The key idea is a feed forward network built from convolutional layers together with global and local pooling layers. It is a special kind of neural network for processing data that has a grid-like topology. Time-series data can be regarded as a 1D grid of samples taken at regular time intervals, whereas image data can be viewed as a 2D grid of pixels. It is highly successful in practical applications. As the name indicates, this network employs the mathematical operation called convolution.
Convolution is a type of linear operation. The network uses convolution in place of matrix multiplication in its layers. The best deep learning architectures are consistently composed of layers of neurons as building blocks. Deep CNN is highly suitable for image classification and is widely applied to image data. Fig. 5–4 represents the schematic diagram of deep CNN. Deep CNN [4] uses kernels, or filters, to determine the neuron clusters. The input of the convolutional layer is mathematically modified to detect some specific types of features
FIGURE 5–4 Schematic diagram of deep CNN. CNN, Convolutional neural network.
from the image. The filters can return a modified image, for example, a sharpened or blurred image or one in which the edges are detected. This is done by multiplying the input image values by the convolution matrix. Pooling in a convolutional network, also termed downsampling or subsampling, reduces the number of neurons of the convolutional layer while retaining the important information. Deep CNN supports different types of pooling. Deep CNN is a deep learning mechanism that can perform document classification in a hierarchical manner, and it is effectively used for text classification. Deep CNN is modeled with a set of kernels; the outputs of the convolutional layers are termed feature maps, which are stacked to apply multiple filters to the input data. Deep CNN uses the pooling layer to reduce the size of the output passed from the present layer to the successive layer and thereby to decrease the computational complexity. Various pooling models are used to reduce the output while preserving the features in the CNN. One of the most commonly used pooling methods is max pooling, where the maximum element of the pooling window is selected. To pass the pooled output from the previous layer to the successive layer, the feature maps are flattened into a single column. The final layer of the deep CNN is the fully connected layer. During backpropagation, the feature detector filters are adjusted in addition to the weights of the neural network. Significant issues in the CNN are the size of the feature space and the number of channels; that is, the dimensionality for text classification in deep CNN is very high. Fig. 5–5 represents the architecture of deep CNN. Deep CNN is modeled on the structure of the visual cortex, where the neurons are spatially distinct but not fully connected.
It provides excellent results in the classification of objects in real-world images. Many research works have also used deep CNN for text mining. Compared with deep NN, deep CNN requires a large training set to perform text classification based on character-level features. Deep CNN is a popular network model in the deep learning framework. Like traditional network architectures, the structure of deep CNN is inspired by the neurons of human and animal brains. In particular, it resembles the visual cortex of the cat’s brain, which contains a sequence of complex cells. The major advantages of deep CNN structures are sparse interactions, equivariant representations, and parameter sharing. To exploit the two-dimensional structure of input data, such as an image signal, the network uses shared weights and local connections instead of the full connections of traditional networks. This process
FIGURE 5–5 Architecture of deep CNN. CNN, Convolutional neural network.
requires fewer parameters, which makes the network easier to train and faster to operate. This function is similar to the operation of visual cortex cells, which act as local filters over the input and extract the local correlations present in the data. A typical deep CNN consists of several convolutional layers, followed by pooling or subsampling layers, and, in the final stage, a fully connected layer. Let us describe the deep CNN architecture by considering an input y arranged in the three-dimensional form $[a \times a \times t]$, where a denotes the height and width of the input and t represents the number of channels, or depth. The convolutional layer has several kernels, or filters, of size $[b \times b \times s]$, where b must be smaller than the input image and s can be equal to or smaller than t. The filters act as the base for local connections; they are convolved with the input data and share parameters, namely the bias $g_m$ and weight $w_m$, to generate the feature maps $F_m$. The convolutional layer calculates the dot product between the inputs and the weights, and the nonlinearity or activation function f is applied to the output of the convolutional layer:
$$x_m = f(w_m \times y + g_m) \qquad (5.1)$$
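The computation in Eq. (5.1), a kernel applied to the input, plus a bias, followed by an activation function, can be sketched for the simplest case of a single-channel input and a single kernel (t = s = 1). The image, the kernel values, and the choice of ReLU for f are assumptions made for this illustration. As in most deep learning libraries, the kernel is slid over the input without being flipped.

```python
def conv2d(image, kernel, bias=0.0, activation=lambda v: max(0.0, v)):
    """Slide a b x b kernel over an a x a image (no padding, stride 1).

    At each position the dot product of the kernel and the image patch is
    taken, the bias is added, and the activation function is applied.
    """
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(rows - k_rows + 1):
        out_row = []
        for j in range(cols - k_cols + 1):
            s = sum(kernel[u][v] * image[i + u][j + v]
                    for u in range(k_rows) for v in range(k_cols))
            out_row.append(activation(s + bias))
        feature_map.append(out_row)
    return feature_map


# Hypothetical 4 x 4 image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 2 x 2 kernel that responds to a dark-to-bright vertical edge.
kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

The resulting 3 x 3 feature map is nonzero only in the middle column, where the window straddles the edge; the same four kernel weights are reused at every position, which is the weight sharing discussed in this section.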
In the subsampling layers, the feature maps are downsampled to reduce the number of network parameters, which helps to control overfitting and speeds up training. The pooling operation is performed on every feature map over $g \times g$ contiguous regions, where g is the filter size. Finally, the fully connected layer is the final-stage layer
observed in the CNN structure. This layer takes the previous low-level and midlevel features and generates a high-level abstraction of the input data. Moreover, it generates the classification scores, where each score specifies the probability of a certain class for the given instance. A nonlinear transformation is used to screen the features obtained from the feature map; the mapping functions commonly used in the CNN structure are tanh, ReLU, and sigmoid. The pooling layer acts independently on each feature map. Weight sharing in deep CNN reduces the training cost and the number of parameters, and the backpropagation (BP) algorithm increases the training efficiency of the CNN.
Applications of deep CNN
Some of the applications of deep CNN are as follows:
• image recognition
• image processing
• video analysis
• image segmentation
• natural language processing
• speech processing
• computer vision
• image classification
• speech recognition
• face recognition
• satellite orbit prediction
• nonferrous metal entity recognition
Advantages
The major advantages of deep CNN compared with its predecessors are as follows:
• It automatically detects the features without human supervision.
• Weight sharing.
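The max pooling operation described in this section, in which each feature map is downsampled over contiguous g × g regions, can be sketched as follows. The feature map values are hypothetical, and non-overlapping windows (stride equal to g) are assumed.

```python
def max_pool(feature_map, g):
    """Keep the maximum of each non-overlapping g x g window of the feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[i + u][j + v] for u in range(g) for v in range(g))
         for j in range(0, cols - g + 1, g)]
        for i in range(0, rows - g + 1, g)
    ]


# Hypothetical 4 x 4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool(feature_map, 2)
print(pooled)
```

With g = 2 the 4 × 4 map shrinks to 2 × 2, so the next layer sees a quarter of the values while the strongest response in each region is retained.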
5.4.4 Deep recurrent neural network
Deep RNN [5] operates effectively on sequences of data with varying input length. It uses the knowledge of the previous state as an input to the current prediction, and this process can be repeated for an arbitrary number of steps, propagating information through the hidden state over time. This feature makes the deep RNN very effective for sequences of data that are received over time, for example, the stream of characters being typed on a mobile phone, or time-series data such as stock prices. Imagine a deep RNN created to predict the next character that a person is likely to type, based on the characters already typed. The character that is being typed as
well as the previously typed characters are significant for predicting the next character. When the user first types the letter “h,” the network, based on its training, predicts the next letter as “i” to form the word “hi.” When the user types the second letter as “e,” the network uses the new character along with the previous hidden state to predict the next character. When the next letter “l” is typed, the network predicts the word “help.” When the user adds another “l,” the network predicts the next letter to be typed as “o,” forming “hello.” The two variants of deep RNN that address the common issues in training it are the LSTM and the gated RNN. Both variants use memory to make predictions on sequences of data over time. The major difference between them is that the gated RNN has two gates for controlling the memory, a reset gate and an update gate, whereas the LSTM has three gates, namely, the input, forget, and output gates. Deep RNN connects the output of a layer back to its input. This architecture is particularly suited to learning time-dependent structures, such as characters or words in text. In the deep RNN structure the output from a layer of nodes reenters the same layer as input. It is highly effective for text processing in documents. It can be regarded as a sequential neural network processor, as it has an internal memory that updates the state of the neurons based on the previous input. In general, the deep RNN is trained using the backpropagation algorithm. Stacking more layers in the network architecture yields further benefits. Unlike the traditional neural network structure, the deep RNN exploits sequential patterns in the data. This property is useful in applications where the structure embedded in the information sequence carries useful knowledge.
For example, to understand a word in a sentence, it is necessary to know its context. Hence, the deep RNN can be viewed as a unit of short-term memory, which includes an input layer, a state or hidden layer, and an output layer. Fig. 5–6 represents the architecture of deep RNN.
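The way a recurrent layer feeds its previous state back in as input can be sketched with a single recurrent unit. This is a minimal sketch assuming the common update h_t = tanh(w_x x_t + w_h h_(t-1) + b) with scalar inputs and a scalar hidden state; the weight values are hypothetical and are not taken from the text.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """One time step: combine the current input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.9, b=0.0):
    """Process a sequence of any length, returning the hidden state at each step."""
    h = 0.0  # initial hidden state
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# Two sequences that differ only in their first element.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])
print(states_a)
print(states_b)
```

Although the last two inputs are identical, the final hidden states differ, because the first input of the first sequence is still carried in the state. The values in states_a also shrink at each step, which illustrates how a plain recurrent unit gradually forgets early inputs.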
FIGURE 5–6 Architecture of deep RNN. RNN, Recurrent neural network.
Applications of deep RNN
Some of the applications of deep RNN are as follows:
• natural language processing
• speech recognition
• language translation
• image captioning
• conversation modeling
Deep RNN is a network architecture mainly intended for data classification and text mining. It assigns weights to the previous points of the data sequence, which makes it powerful for string, data, and text classification. In deep RNN the neural net takes the output received from the previous node into account to achieve better semantic analysis of the dataset.
Gated recurrent unit (GRU)
GRU is a gating model for RNN that differs slightly from the LSTM architecture. GRU consists of two gates and does not possess a separate internal memory.
LSTM
LSTM is a category of RNN that preserves long-term dependencies in a highly effective manner. LSTM addresses the vanishing gradient problem by using multiple gates to regulate the information passed into each state.
Disadvantages
The drawbacks of the deep RNN are as follows:
• sensitivity to vanishing and exploding gradients and
• forgetting of the initial input.
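The two gates of the GRU can be illustrated with a scalar version of the cell. The text does not give the GRU equations, so the sketch below assumes one commonly used formulation: an update gate z, a reset gate r, a candidate state, and a new state that interpolates between the old state and the candidate. Note that some references swap the roles of z and 1 - z. All parameter names and values are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function used for both gates; outputs lie in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One GRU time step with scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate scales how much of the old state is used.
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # The update gate blends the old state with the candidate.
    return (1.0 - z) * h_prev + z * h_cand


base = {"w_z": 0.0, "u_z": 0.0, "w_r": 1.0, "u_r": 1.0, "b_r": 0.0,
        "w_h": 1.0, "u_h": 1.0, "b_h": 0.0}
keep = dict(base, b_z=-20.0)      # update gate almost closed: state is kept
overwrite = dict(base, b_z=20.0)  # update gate almost open: state is replaced

h_kept = gru_step(1.0, 0.7, keep)
h_new = gru_step(1.0, 0.7, overwrite)
print(h_kept, h_new)
```

With the update gate nearly closed, the previous state 0.7 passes through almost unchanged, which is how gating lets the unit preserve information over many steps; with the gate nearly open, the state is replaced by the candidate. The hidden state itself serves as the memory, consistent with the statement that the GRU has no separate internal memory.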
5.4.5 Deep generative adversarial network
Deep GAN is the integration of two deep learning NNs, namely, the generator network and the discriminator network. Four deep generative networks are considered here: deep belief network (DBN), deep Boltzmann machine (DBM), GAN, and variational autoencoder (VAE). The generator network generates synthetic data, while the discriminator network determines whether data is synthetic or real. The two networks are adversaries in the sense that each tries to outperform the other: the generator tries to generate synthetic data that resembles the real data, whereas the discriminator becomes progressively better at detecting the fake data. The discriminator network acts as a convolutional NN that distinguishes real images from fake images. Initially, the generator produces random noise, as it does not yet know how to create images that fool the detector. At each iteration the generator becomes progressively better at generating realistic images, and the detector becomes better at detecting the fake images. In recent years the GAN has gained considerable popularity for generating realistic images. GAN is a class of neural nets introduced by Ian Goodfellow in
the year 2014. It is an important network in deep learning architecture. Yann LeCun, who is regarded as the father of deep CNN, said that the GAN is the coolest idea to arise in deep learning in the past 20 years.
Applications of deep GAN
Some of the applications of deep GAN are as follows:
• image generation
• image enhancement
• text generation
• drug discovery
• speech synthesis
• medical imaging
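The competition between the generator and the discriminator can be expressed through their loss functions. The text does not state the losses, so this sketch assumes the binary cross-entropy formulation commonly used for GANs, with the non-saturating variant for the generator. The inputs are discriminator outputs, that is, its estimated probabilities that a sample is real; the numbers are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and generated samples score near 0."""
    real_term = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Low when the discriminator scores generated samples near 1 (is fooled)."""
    return sum(-math.log(p) for p in d_fake) / len(d_fake)


# Early in training (assumption): the discriminator easily spots the fakes.
early = {"d_real": [0.9, 0.8], "d_fake": [0.1, 0.2]}
# Later (assumption): the fakes look real and the discriminator is unsure.
late = {"d_real": [0.5, 0.5], "d_fake": [0.5, 0.5]}

print(discriminator_loss(**early), generator_loss(early["d_fake"]))
print(discriminator_loss(**late), generator_loss(late["d_fake"]))
```

As the generator improves, its loss falls while the discriminator loss rises toward 2 ln 2, the value reached when the discriminator can do no better than guessing. Training alternates gradient steps on the two losses.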
1. DBN
DBN is a hybrid generative model that consists of RBMs: the top two layers have undirected connections, while the lower layers have directed connections that receive input from the layer above. The lowest layer acts as the visible layer, which specifies the state of the input vector, called the data vector [6]. Each individual layer of the DBN is an undirected graphical model, the RBM, which typically uses binary units. The bottom layer consists of visible units, while the top layer consists of hidden units, and the two are fully and bidirectionally connected with symmetric weights. The major difference between the RBM and the Boltzmann machine is that in the RBM the units within the same layer are not interconnected, which simplifies learning and inference in the graphical model. DBN learns to reconstruct its input with an unsupervised approach, such that the layers of the DBN act as feature detectors on their inputs. With subsequent supervised training, the DBN can perform classification tasks. DBN is assembled by connecting several RBM layers, where the hidden layer of one RBM serves as the visible layer of the next. The RBM layers are generative stochastic networks that learn a probability distribution over their inputs. The distribution is computed from the unit biases and connection weights, given the state vector k of the visible layer, through the energy function
$$E(k, l) = -p^{T} k - q^{T} l - k^{T} w l \qquad (5.2)$$
where k and l denote the binary configurations of the visible and hidden layer units, p and q are the biases of the visible and hidden layer units, and w is the matrix of connection weights between the two layers. The energy function determines the joint probability of a visible and hidden configuration, which is specified as:
$$p(k, l) = \frac{e^{-E(k, l)}}{A} \tag{5.3}$$
where A represents the partition function, the sum of $e^{-E(k, l)}$ over all possible configurations. DBN uses a greedy algorithm to improve the generative model, letting each subnetwork receive its data from the one below, as a single RBM cannot model the entire original data. Once the weights of the first layer are learned, the data is mapped through the transposed weight matrix to create higher-level data for the next layer. The log probability of the input data vector can be bounded from below using an approximating distribution, and each new layer added to the DBN improves this variational bound. Fig. 5–7 represents the architecture of DBN.
DBN is a class of deep generative model that consists of a number of RBM layers. The key component of DBN is the greedy layer-by-layer learning algorithm, which optimizes the weights of the DBN with time complexity linear in the depth and size of the network. Moreover, initializing the weights of an MLP with those of a pretrained DBN produces better results than random initialization. An MLP with several hidden layers, that is, a deep neural network (DNN), trained by unsupervised DBN pretraining followed by backpropagation is sometimes also termed a DBN. In addition to providing better initialization points, DBN has other attractive properties. First, the learning method makes effective use of unlabeled data. Second, the DBN can be used as a generative model. Third, the overfitting problem is effectively alleviated by the pretraining step. As the DBN uses hidden layers with many neurons, it significantly enhances the modeling power of the DNN and helps to create good configurations.
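As a minimal sketch of Eqs. (5.2) and (5.3), the following Python code evaluates the energy and joint probability of a tiny RBM. The function names, the layer sizes, and the random parameter values are illustrative assumptions, not part of the original text. The partition function A is computed here by brute-force enumeration, which is feasible only for a handful of units.

```python
import itertools
import numpy as np

def energy(k, l, p, q, w):
    """Energy of a joint configuration, Eq. (5.2): E = -p^T k - q^T l - k^T w l."""
    return -(p @ k) - (q @ l) - (k @ w @ l)

def joint_probability(k, l, p, q, w):
    """Joint probability, Eq. (5.3): exp(-E(k, l)) / A."""
    n_visible, n_hidden = w.shape
    # Partition function A: sum of exp(-E) over every binary configuration.
    A = sum(
        np.exp(-energy(np.array(kv), np.array(lv), p, q, w))
        for kv in itertools.product([0, 1], repeat=n_visible)
        for lv in itertools.product([0, 1], repeat=n_hidden)
    )
    return np.exp(-energy(k, l, p, q, w)) / A

rng = np.random.default_rng(0)
p = rng.normal(size=3)        # visible biases (illustrative values)
q = rng.normal(size=2)        # hidden biases (illustrative values)
w = rng.normal(size=(3, 2))   # connection weights (illustrative values)

prob = joint_probability(np.array([1, 0, 1]), np.array([0, 1]), p, q, w)
```

Because the enumeration grows as 2 to the power of the number of units, practical RBM training relies on approximations rather than on computing A exactly.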
FIGURE 5–7 Architecture of DBN. DBN, Deep belief network.
2. DBM
Similar to DBN, DBM can learn complex internal representations. DBM is regarded as a robust deep learning approach for object and speech recognition tasks, and its approximate inference procedure allows it to handle uncertain inputs robustly. Unlike DBN, the lower-level layers in the DBM are also undirected RBMs. The factorial approximation to the posterior can take the result obtained either from the first RBM layer or from the second layer; by averaging these two distributions, the approximation to the posterior can be balanced.
3. GAN
GAN consists of a discriminative model and a generative model. The discriminator differentiates the real data from the samples drawn from the generator's distribution. At each iteration of backpropagation the discriminator and generator compete against each other, as in a game of cat and mouse: the generator tries to generate more realistic data to confuse and fool the discriminator, while the discriminator tries to identify the real data.
4. VAE
VAE uses the log-likelihood of the data and derives a lower-bound estimator for directed graphical models with continuous latent variables. The generative parameters of the generative model assist the learning of the variational parameters in the variational scheme. It uses autoencoding variational Bayes to optimize the parameters of the neural network.
5.4.6 Deep reinforcement learning
DRL is an area of AI that represents a step toward building autonomous systems with a higher-level understanding of the visual world. Recently, deep learning has enabled DRL to solve problems that were previously intractable, such as learning to play video games directly from image pixels. DRL is also applied to robotics, where it allows the control policies of robots to be learned directly from camera input in the real world. Some of the algorithms used in DRL are the deep Q-network, asynchronous advantage actor-critic, and trust region policy optimization. The major advantage of DRL is its focus on visual understanding through reinforcement learning. DRL addresses complexity issues, such as computational complexity and memory complexity. The growth of DRL relies on the representation learning properties and the powerful function approximation of DNNs. The arrival of deep learning has had a significant impact on many areas of machine learning, considerably advancing the state of the art in tasks such as speech recognition, language translation, and object detection [7]. The most important property of deep learning is that it automatically identifies low-dimensional features (representations) of high-dimensional data, such as audio, text, and images. By crafting inductive biases into the network architecture, particularly that of hierarchical representations, deep learning has made efficient progress in addressing the curse of dimensionality. Deep learning has likewise accelerated progress in reinforcement learning.
Moreover, deep learning enables reinforcement learning to scale to decision-making problems with high-dimensional state and action spaces. A landmark of DRL was the development of an algorithm that learns to play a range of Atari 2600 video games at a superhuman level, directly from the image pixels. This work demonstrates that the function approximation in DRL allows an agent to be trained on high-dimensional observations using only a reward signal. AlphaGo is a hybrid DRL system that comprises neural networks trained using supervised and reinforcement learning. DRL is used to solve a wide range of problems, such as robotics, where the control policies of robots are learned from camera input.
DRL involves an agent that interacts with an environment. The environment has a state, which the agent observes. The agent performs actions that modify the state of the environment, and it receives reward signals as it achieves the goal. The task of the agent is to decide how to interact with the environment so as to achieve the goal. DRL trains an NN with input, output, and multiple hidden layers, where the input is the state of the environment. For example, consider a car that picks up passengers and moves toward a destination, with inputs such as direction, speed, and position. The output is a series of actions, such as slow down, turn left, speed up, and turn right. Reinforcement learning predicts the future reward of each action based on the present state of the environment. Fig. 5–8 represents the architecture overview of DRL.
FIGURE 5–8 Architecture overview of deep reinforcement learning.
Applications of DRL
The following list specifies the applications of DRL:
• robotics, such as teaching robots;
• management tasks, such as logistics and resource allocation; and
• financial tasks, such as asset pricing, portfolio design, and investment decisions.
Advantages
• It achieves ideal behavior within a specific context.
• It maintains the balance between exploitation and exploration.
• It is bound to learn from experience, even in the absence of a training set.
• It corrects errors during the training process.
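The agent and environment interaction described in this section can be sketched with tabular Q-learning on a toy problem. This is a deliberate simplification: in DRL a deep network replaces the table, and the corridor environment, the `step` and `train` functions, and all hyperparameter values below are hypothetical choices made only for illustration.

```python
import random

N_STATES = 5            # positions 0..4; the goal is the right-most state
ACTIONS = [-1, +1]      # move left or move right

def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # q[state][action index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Balance exploration and exploitation (epsilon-greedy).
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(q[state])
                a = rng.choice([i for i in range(2) if q[state][i] == best])
            next_state, reward, done = step(state, ACTIONS[a])
            # Move the estimate toward reward + discounted best future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy policy for the non-terminal states 0..3.
policy = [ACTIONS[max(range(2), key=lambda i: q_table[s][i])]
          for s in range(N_STATES - 1)]
```

The agent starts with no training set at all and learns only from the reward signal, which is the behavior listed under the advantages above.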
5.4.7 Deep recursive neural network
Deep RvNN makes predictions over a hierarchical structure and classifies the output using compositional vectors [4]. RvNN was inspired by recursive autoassociative memory, an architecture for processing objects that are structured in arbitrary shapes, such as graphs and trees. The key role of deep RvNN is to take recursive data of variable size and generate a distributed representation of fixed width. Deep RvNN is trained using backpropagation through structure (BTS), which follows a procedure similar to the standard backpropagation algorithm but supports tree-like structures. Autoassociation is used to train the network to reproduce the pattern of the input layer at the output layer. Deep RvNN builds the syntactic tree by calculating a score for each possible pair of units: for each pair, a plausibility score for merging the segmented units is computed, and the pair with the highest score is combined into a compositional vector. After each merge, the deep RvNN generates a larger region of segmented units, the compositional vector that represents it, and its class label.
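The greedy merging procedure can be illustrated with a short sketch. It assumes that only adjacent units are candidate pairs, that the compositional vector is a tanh of a linear map of the two children, and that the plausibility score is a dot product with a vector `u`; these choices, the function names, and the random untrained parameters are assumptions made for illustration, and the class-label prediction is omitted.

```python
import numpy as np

def compose(left, right, W, b):
    """Compositional (parent) vector with the same width as each child."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

def greedy_parse(leaves, W, b, u):
    """Repeatedly merge the adjacent pair with the highest plausibility score."""
    nodes = [(vec, str(i)) for i, vec in enumerate(leaves)]   # (vector, tree label)
    while len(nodes) > 1:
        candidates = []
        for i in range(len(nodes) - 1):
            parent = compose(nodes[i][0], nodes[i + 1][0], W, b)
            candidates.append((float(u @ parent), i, parent))
        score, i, parent = max(candidates, key=lambda c: c[0])
        label = f"({nodes[i][1]} {nodes[i + 1][1]})"
        nodes[i:i + 2] = [(parent, label)]   # the merged pair becomes one unit
    return nodes[0]

rng = np.random.default_rng(0)
d = 4                                        # fixed representation width
W = rng.normal(scale=0.5, size=(d, 2 * d))   # untrained, illustrative weights
b = np.zeros(d)
u = rng.normal(size=d)                       # scoring vector (assumption)
leaves = [rng.normal(size=d) for _ in range(5)]
root_vector, tree = greedy_parse(leaves, W, b, u)
```

Whatever the number of leaves, the root vector has the same fixed width `d`, which is the property the section attributes to deep RvNN.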
5.4.8 Deep long short-term memory
Deep LSTM [8] is designed to overcome the vanishing gradient problem, so that it provides longer-term memory. The deep LSTM contains internal loops, which are used to store information. It consists of five important elements: the input gate, cell, forget gate, state output, and output gate. The gate operations, namely erasing, writing, and reading, act on the memory state of the cell. The memory state of the deep LSTM is modified according to the decision of the input gate, which uses the sigmoid function to switch between on and off states. When the value of the input gate is zero or close to zero, the state of the cell memory does not change. Deep LSTM effectively handles long-term dependency issues, as it can remember values over both short and long durations. In a driving environment, for example, the on-ramp scenario involves interaction with nearby vehicles that is not predictable from the perspective of the ego vehicle; the historical driving information is integrated into the internal state, which serves as an adequate representation of the interactive environment. Deep LSTM also performs surprisingly well in sentiment classification. It is a modified version of the RNN that handles the difficulty of modeling long-range dependencies. The major element of this network structure is the memory cell, which records the information.
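A single time step of a standard LSTM cell can be sketched as follows. The stacked weight layout, the function name `lstm_step`, and the bias values are assumptions chosen for illustration. The example forces the input gate closed and the forget gate open to show that the cell memory then stays unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev; x] to the four stacked pre-activations."""
    n = h_prev.size
    z = W @ np.concatenate([h_prev, x]) + b
    i = sigmoid(z[0 * n:1 * n])     # input gate: how much to write
    f = sigmoid(z[1 * n:2 * n])     # forget gate: how much to keep (1) or erase (0)
    o = sigmoid(z[2 * n:3 * n])     # output gate: how much to read
    g = np.tanh(z[3 * n:4 * n])     # candidate cell content
    c = f * c_prev + i * g          # new memory state of the cell
    h = o * np.tanh(c)              # state output
    return h, c

n_hidden, n_input = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)
b[0:n_hidden] = -50.0               # push the input gate toward 0 (closed)
b[n_hidden:2 * n_hidden] = 50.0     # push the forget gate toward 1 (keep)
c_prev = np.array([0.5, -1.0, 2.0])
h, c = lstm_step(rng.normal(size=n_input), np.zeros(n_hidden), c_prev, W, b)
```

With these biases the new cell state `c` is numerically equal to `c_prev`, regardless of the input `x`.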
5.4.9 Hierarchical deep learning for text
HDLTex combines deep learning models to allow specialized learning at each level of a document hierarchy [9], so that each model performs document classification over a limited number of classes. It creates an architecture that specializes in each level of the hierarchy of the text documents. Each layer of a network receives its input from the previous layer and generates its output, and the layers are fully connected. Here, the input layer consists of text features, while the output layer has one node per classification label, or only one node for binary classification. HDLTex provides stacks of deep learning structures to give a hierarchical understanding of documents, and it is a commonly used baseline for hierarchical datasets. When the documents are organized hierarchically, multiclass classification is difficult to perform with standard supervised learning approaches: traditional classification methods work well only with a limited number of classes, and their performance drops as the number of classes increases. HDLTex addresses this issue with a neural network architecture that combines various deep learning methods to classify documents hierarchically and generate better classification results. Fig. 5–9 represents the architecture of HDLTex.
FIGURE 5–9 Architecture of HDLTex. HDLTex, Hierarchical deep learning for text.
FIGURE 5–10 Architecture of random multimodel deep learning.
5.4.10 Deep autoencoder
Deep autoencoder is a feedforward neural network that maps the input neurons to the output neurons through a single hidden layer or, as in the stacked autoencoder, multiple hidden layers. The major parts of the autoencoder network structure are the encoder function and the decoder function, which together reconstruct the data. The basic architecture of the autoencoder consists of three layers: input, hidden, and output. Deep autoencoders are used especially for data reconstruction, dimensionality reduction, and feature learning. The autoencoder variants used for these tasks are called sparse, denoising, and undercomplete autoencoders [10].
Applications of deep autoencoder
Some of the applications of the deep autoencoder are listed as follows:
• dimensionality reduction
• image denoising
• feature extraction
• recommendation system
• image compression
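The encoder and decoder structure can be illustrated with a very small undercomplete autoencoder. This sketch makes several simplifying assumptions that are not in the original text: the data is synthetic, both the encoder and the decoder are linear (no activation function), and training uses plain gradient descent with a hand-picked learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples with 4 features that vary along only 2 directions.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 4))

W_enc = rng.normal(scale=0.1, size=(4, 2))   # encoder: input -> hidden code
W_dec = rng.normal(scale=0.1, size=(2, 4))   # decoder: hidden code -> output

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared reconstruction error per sample."""
    return float(np.mean(np.sum((X @ W_enc @ W_dec - X) ** 2, axis=1)))

initial_loss = reconstruction_loss(X, W_enc, W_dec)
lr = 0.005                                   # illustrative learning rate
for _ in range(2000):
    code = X @ W_enc                         # encode to 2 dimensions
    residual = code @ W_dec - X              # decode and compare with the input
    grad_dec = 2.0 * code.T @ residual / len(X)
    grad_enc = 2.0 * X.T @ residual @ W_dec.T / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
final_loss = reconstruction_loss(X, W_enc, W_dec)
```

The hidden code `X @ W_enc` has two columns instead of four, which is the dimensionality reduction use listed above; a deep autoencoder would add nonlinear hidden layers on both sides of the code.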
5.4.11 Random multimodel deep learning
The RMDL [11] method uses three different architectures: deep CNN, deep RNN, and deep NN. The novelty of RMDL is its use of multiple random models to perform image and text classification. RMDL can be applied to any kind of dataset for classification, and it can process various data types, such as images, videos, and text. The numbers of nodes and layers in the deep learning models are randomly generated. RMDL addresses the problem of finding the best deep learning structure and architecture, while simultaneously improving accuracy and robustness through an ensemble of deep learning methods. Fig. 5–10 represents the architecture of RMDL. RMDL includes three kinds of random models: deep NN at the left, deep CNN at the center, and deep RNN at the right, where each unit of the RNN can be either a GRU or an LSTM. The hyperparameters, namely the numbers of nodes and hidden layers, are searched randomly; for the deep CNN, the number of feature maps is also chosen at random. The deep CNN used by RMDL has 1D convolutional layers for text data, 2D for images, and 3D for video processing. The deep RNN uses two specific structures, LSTM and GRU, and the number of LSTM or GRU units and hidden layers is likewise determined by the random hyperparameter search.
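The random generation of architectures and the ensemble step can be sketched as below. The hyperparameter ranges, the function names, and the use of a simple majority vote to combine the models are assumptions for illustration; no model is actually trained, and the per-model predictions are hard-coded placeholders.

```python
import random
from collections import Counter

def random_architecture(rng, kind):
    """Draw hypothetical hyperparameters for one model of the given kind."""
    config = {
        "kind": kind,
        "hidden_layers": rng.randint(1, 4),                  # assumed range
        "nodes_per_layer": rng.choice([64, 128, 256, 512]),  # assumed options
    }
    if kind == "RNN":
        config["cell"] = rng.choice(["GRU", "LSTM"])
    return config

def majority_vote(predictions):
    """Combine the class labels predicted by all models for one sample."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
ensemble = [random_architecture(rng, kind)
            for kind in ("DNN", "CNN", "RNN") for _ in range(2)]

# Placeholder outputs of the six models for a single document.
label = majority_vote(["sports", "politics", "sports", "sports", "tech", "politics"])
```

Because each model is drawn independently, the ensemble does not depend on any single architecture being well tuned, which is the robustness argument made in this section.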
References
[1] de La Torre J, Puig D, Valls A. Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recognit Lett 2018;105:144–54.
[2] Amarasinghe K, Marino DL, Manic M. Deep neural networks for energy load forecasting. In: IEEE 26th international symposium on industrial electronics (ISIE). June 2017. pp. 1483–8.
[3] Yi H, Shiyu S, Xiusheng D, Zhigang C. A study on deep neural networks framework. In: IEEE advanced information management, communicates, electronic and automation control conference (IMCEC). October 2016. pp. 1519–22.
[4] Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, et al. A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR) 2019;51(5):92.
[5] Murad A, Pyun J-Y. Deep recurrent neural networks for human activity recognition. Sensors 2017;17(11).
[6] Lopes N, Ribeiro B. Deep belief networks (DBNs). Machine learning for adaptive many-core machines: a practical approach. Cham: Springer; 2015. p. 155–86.
[7] Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: a brief survey. IEEE Signal Process Mag 2017;34(6):26–38.
[8] Li X, Wu X. Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition. In: Proceeding of IEEE international conference on acoustics, speech and signal processing (ICASSP), Brisbane. 2015.
[9] Kowsari K, Brown DE, Heidarysafa M, Meimandi KJ, Gerber MS, Barnes LE. Hdltex: hierarchical deep learning for text classification. In: IEEE international conference on machine learning and applications (ICMLA). December 2017. pp. 364–71.
[10] Almalaq A, Edwards G. A review of deep learning methods applied on load forecasting. In: IEEE international conference on machine learning and applications (ICMLA). December 2017. pp. 51151.
[11] Kowsari K, Heidarysafa M, Brown DE, Meimandi KJ, Barnes LE. RMDL: random multimodel deep learning for classification. In: Proceedings of the 2nd international conference on information system and data mining. 2018. pp. 19–28.
6 Neural networks for data classification
Balachandra Kumaraswamy
DEPARTMENT OF ELECTRONICS AND TELECOMMUNICATION, B.M.S. COLLEGE OF ENGINEERING, BENGALURU, INDIA
6.1 Neural networks
A neural network (NN) is a circuit or network of neurons; in the modern view, the artificial neural network (ANN) is composed of artificial nodes or neurons. The term NN thus refers either to a network of biological neurons or to an ANN, the latter being intended to solve learning and optimization problems. The major reason for using NN methods for multisource geographic data classification is that NN classifiers are distribution-free: NN does not require any explicit modeling of the data from each source. Unlike in many statistical models, it is not necessary to treat the data sources independently, and NN resolves the issue of multisource analysis by determining how much influence each data source should have on the classification. For these reasons, NN is widely applied in various fields. NN can be used for high-dimensional datasets, as it can have many inputs and hidden neurons. However, the training time required for a large NN classifier is long. NN classifiers are trained by estimating the weights and biases of the network. When the classifier is large, many parameters must be estimated from the training samples. In such a case, overfitting is likely, which means that the NN fits the training data but does not generalize well enough to achieve good classification accuracy on test data. For high-dimensional datasets, the Hughes phenomenon, or curse of dimensionality, may occur. Hence, it is necessary to reduce the dimensionality of the input data in order to obtain a smaller network that performs well in terms of classification accuracy on both training and test data. Feature extraction is therefore used with the NN classifier to compute an optimal representation of the input data in a lower-dimensional space, one that does not reduce the classification accuracy compared with the original feature space.
Several feature extraction techniques are available for the NN classifier, including discriminant analysis (DA) and principal component analysis (PCA). For geographical and remote sensing data, the NN classifier has proved more attractive than the conventional classifiers for data classification.
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00011-2 © 2021 Elsevier Inc. All rights reserved.
FIGURE 6–1 Architecture of NN classifier. NN, neural network.
NN is highly significant for data classification when no suitable statistical model is available. Most NN techniques operate by minimizing a cost function, represented as
$$X = \sum_{x=1}^{m} \lambda_x = \frac{1}{2} \sum_{x=1}^{m} \sum_{k=1}^{n} \left(p_{xk} - a_{xk}\right)^{2} \tag{6.1}$$
where x represents the pattern number, m denotes the sample size, $p_{xk}$ indicates the desired output of the kth neuron for pattern x, $a_{xk}$ represents the actual output of the kth neuron for pattern x, and n represents the number of output neurons. The most frequently used optimization method in the NN classifier is gradient descent. Fig. 6–1 represents the architecture of NN. NNs are algorithms used mainly for learning and optimization, based on the concept of the human brain. In general, the NN consists of the following five components:
1. A directed graph, also called the network topology, whose arcs are referred to as links.
2. A variable associated with each node.
3. A weight associated with each link.
4. A bias associated with each node.
5. A transfer function at each node, which specifies the state of the node as a function of its weights and bias. The transfer function is typically either a step or a sigmoid function that generates the output.
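The cost function of Eq. (6.1) and one gradient descent step on it can be sketched as follows. The function names and the small desired and actual output arrays are illustrative assumptions, and the step is taken directly on the outputs rather than on network weights to keep the example short.

```python
import numpy as np

def sum_squared_error(desired, actual):
    """Eq. (6.1): half the sum over patterns x and output neurons k of (p_xk - a_xk)^2."""
    desired = np.asarray(desired, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return 0.5 * float(np.sum((desired - actual) ** 2))

def cost_gradient(desired, actual):
    """Gradient of Eq. (6.1) with respect to the actual outputs: a_xk - p_xk."""
    return np.asarray(actual, dtype=float) - np.asarray(desired, dtype=float)

p = np.array([[1, 0], [0, 1], [1, 1]])                 # desired outputs, m = 3, n = 2
a = np.array([[0.8, 0.1], [0.2, 0.7], [0.9, 0.6]])     # actual outputs

cost = sum_squared_error(p, a)
# One gradient descent step with an illustrative learning rate of 0.5.
a_next = a - 0.5 * cost_gradient(p, a)
cost_after = sum_squared_error(p, a_next)
```

In a real NN classifier the same gradient is propagated back to the weights and biases, which are the quantities actually updated.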
6.1.1 Background and evolution of neural networks
In 1982 John Hopfield presented an approach for creating more useful machines using bidirectional lines. In the same year, Reilly and Cooper used a hybrid network with multiple layers, where each layer used a different problem-solving strategy. In 1986 the Widrow-Hoff rule was extended to NNs with multiple layers. David Rumelhart, a former member of Stanford's psychology department, was among those who developed the backpropagation network. The hybrid networks use only two layers, whereas the backpropagation network uses many layers. The drawback of backpropagation networks is that they are slow learners, which may require thousands of iterations to learn. Today, NNs are used in various applications. The fundamental idea behind NNs is that if it works in nature, it must be able to work in computers.
6.1.2 Working of neural networks
An NN classifier consists of a number of processors that operate in parallel and are arranged in tiers. The first tier receives the raw input data, much as the optic nerve receives raw data in human visual processing. Each succeeding tier receives its input from the previous tier and sends its output to the next tier. Finally, the last tier processes the data and generates the output. The nodes in the network are strongly interconnected with the nodes in the previous tier and the nodes in the next tier. Each node in the NN holds its own knowledge, which includes the rules it was programmed with and the rules it has learned by itself. The benefit of the NN classifier is that it can learn features quickly and is far more adaptive than conventional classifiers, such as the decision tree and the support vector machine. Each node weighs the significance of the input data received from the previous nodes: the inputs that contribute most toward the right output are given the highest weight.
6.1.3 Neural networks for data classification
Classification is one of the most important decision-making tasks of human activity. A classification problem arises when an object has to be assigned to a predefined class or group on the basis of a number of observed attributes related to that object. Many problems in industry, science, medicine, and business can be treated as classification problems. Traditional classification procedures, such as DA, are built on Bayesian decision theory, in which a probabilistic model is used to compute the posterior probability on which the classification decision is based. The major drawback of conventional statistical models is that they perform well only if their underlying assumptions are satisfied; their effectiveness depends on the extent to which the various conditions and assumptions hold. Users must therefore have good knowledge of both the model capabilities and the data properties before applying such a model.
NN has emerged as an attractive tool for data classification. Recent research has shown that the NN classifier is a promising alternative to the traditional classification techniques. NNs have the capability to learn by themselves and to generate output. The information in an NN is stored across the network itself rather than in a database; hence, the loss of some data does not stop it from working. Such networks can learn from examples and apply what they have learned to similar events, which lets them work on real-time events. When a neuron is not functioning or a piece of data is missing, the NN can detect the fault and still generate the proper output. It can perform multiple tasks in parallel without degrading the system performance. Fig. 6–2 depicts the block diagram of NN.
FIGURE 6–2 Block diagram of NN. NN, neural network.
6.1.3.1 Advantages
The trained NN classifier has various advantages over the classical methods, which are listed below:
1. Adaptive learning: It can learn how to perform various tasks from the training data.
2. Self-organization: It creates its own organization or representation of the information it receives during learning.
3. Real-time operation: Its computations can be carried out in parallel, and hardware devices are being designed and manufactured to take advantage of this capability.
4. Complex problem solving: It has the capability to learn from faults and improves its performance in solving complex issues.
Moreover, the major theoretical benefits of the NN classifier are listed as follows:
Data-driven self-adaptive model: It can adjust itself to the data it sees without any explicit specification of a functional or distributional form.
Universal functional approximator: It can approximate any function to arbitrary accuracy.
6.1.3.2 Applications
NN is widely applied to various real-world data classification processes, such as
• Bankruptcy prediction and credit scoring
• Speech recognition
• Product inspection
• Fault detection
• Bond rating
• Medical diagnosis
• Handwriting recognition
• Financial trend prediction
• Optical character recognition
• Face recognition
• Image analysis
6.1.4 Characteristics of neural networks
The architecture of NN is modeled on the human brain, in that the processing tasks are distributed over numerous neurons, also called units, processing elements, or nodes. Each single neuron can perform only simple data processing; the strength of the network results from the connectivity and collective behavior of these simple nodes. NN records information in the strengths of its synaptic connections, from which the information can be reproduced. NN is trained by adjusting the connection strengths to capture the abstract relations between the specified input and output data. The NN classifier is characterized by the following elements, namely
• Set of processing elements
• Connections between the elements
• Rule of signal propagation through the network
• Transfer or activation functions
• Training algorithms, also called learning algorithms or learning rules
The configuration of the NN classifier is defined by its set of processing elements. The processing elements, that is, the units, neurons, or nodes, are simple elements that perform the signal processing needed to determine the output signal. Each node receives input signals from its neighboring nodes and transmits its output signal to the destination nodes. NN is considered a parallel model, as its numerous processing elements can carry out their computations simultaneously; the NN classifier thus processes data in a parallel, distributed manner. NN consists of three different types of nodes: input, hidden, and output nodes. An input node receives its input signal from sources that lie outside the network, and an output node forwards its output value to the outside of the network. Each node transmits its signal to the neighboring nodes with varying strength. Most of the currently available NN classifier methods have a fixed network structure, even though the network connections are in principle very flexible; the network structure is simply the number of nodes and the types of connection that exist between the nodes. The generalization, representation, and modeling capabilities of the ANN are completely determined by its activation function, learning rule, and network structure.
6.2 Different types of neural networks
Different types of NN classifier use different principles to determine their learning rules, and each has its own strengths. Fig. 6–3 represents the types of NN classifiers, which are listed below:
• Feedforward NN
  • Multilayered feedforward NN
• Radial basis function NN
• Multilayer perceptron (MLP) NN
• Convolutional NN
• Recurrent NN
• Modular NN
• Artificial NN
• Fuzzy NN
• Probabilistic NN
FIGURE 6–3 Types of NN classifier. NN, neural network.
6.2.1 Feedforward neural network
Feedforward NN [1] is the basic type of NN classifier. In the feedforward NN, the data passes through the input nodes and onward until it reaches the output, moving in a single direction from the first tier to the output node. This process is termed a front-propagated wave, and it is achieved using the activation function of the network classifier. Unlike more complex network classifiers, it involves no backward propagation of the data. A feedforward NN may have a single layer or several hidden layers. The sum of the products of the inputs and their weights is evaluated and forwarded to the output node. The feedforward NN does not contain any closed path in its topology: an input node is one with no arcs leading into it, and an output node is one with no arcs leading away from it. Fig. 6–4 represents the architecture of the feedforward NN classifier. The feedforward NN is used to deal with data that contains a lot of noise. This classifier is relatively simple to maintain. Its continuous activation function is differentiable, which allows training by gradient descent; typical choices are the sigmoid activation functions, such as the binary and bipolar sigmoid.
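The forward pass described above, a weighted sum of the inputs followed by an activation function at each layer, can be sketched as follows. The layer sizes, the random weights, and the function names are illustrative assumptions; the binary and bipolar sigmoid are written in their usual forms.

```python
import numpy as np

def binary_sigmoid(z):
    """Sigmoid with output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def bipolar_sigmoid(z):
    """Sigmoid with output in (-1, 1); equal to tanh(z / 2)."""
    return 2.0 / (1.0 + np.exp(-z)) - 1.0

def forward(x, layers, activation=binary_sigmoid):
    """Pass x through each (weights, bias) pair in one direction only."""
    out = np.asarray(x, dtype=float)
    for weights, bias in layers:
        out = activation(weights @ out + bias)   # weighted sum, then activation
    return out

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),   # 3 inputs -> 4 hidden nodes
    (rng.normal(size=(2, 4)), np.zeros(2)),   # 4 hidden nodes -> 2 output nodes
]
y = forward([0.5, -1.0, 2.0], layers)
y_bipolar = forward([0.5, -1.0, 2.0], layers, activation=bipolar_sigmoid)
```

There is no loop back from a later layer to an earlier one, which corresponds to the absence of closed paths in the topology.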
FIGURE 6–4 Architecture of feedforward NN classifier. NN, neural network.
6.2.1.1 Applications
Because of the difficulty in classifying the target classes, the feedforward NN is used in various technologies, such as
• Computer vision
• Face recognition
6.2.1.1.1 Multilayered feedforward neural network
The multilayered feedforward NN [2] is a network in which the input signals are forwarded through several layers until they reach the output. Each layer holds a number of nodes, and each node is a processing element with an associated activation function, through which the weighted sum of its input values is sent onward. The output obtained from the nodes of one layer is sent as input to the next layer. Each node contains a weight vector by which the input vector is transformed linearly. Fig. 6–5 depicts the architecture of multilayered feedforward NN.
6.2.2 Radial basis function neural network The radial basis function neural network classifier (RBFNN) [3] is the most popular topology structure in the NN classifier, which has the property of universal approximation. Due to the simple structure and fastest learning process, RBFNN significantly outperformed than the classic MLP, especially for the approximation applications. The RBFNN is a three-layered feedforward NN model. RBFNN owns the single-hidden layer as it does not require multiple layers in order to achieve nonlinear behavior in the classification stage. However, the neurons of the RBFNN classifier are considered as the nonlinear Gaussian function, as the results of the network are used with the MLP layer. The lightest network with one input, one hidden, and one output layer is selected in the RBFNN classifier. However, the input layer executes the state vector A^ j21 5 ½A1 ; A2 ; . . .; At T , where j 2 1 specifies the time constant. The hidden layer executes the nonlinear mapping functions based on the input value, while the
FIGURE 6–5 Architecture of multilayered feedforward NN classifier. NN, neural network.
FIGURE 6–6 Architecture of RBFNN classifier. RBFNN, radial basis function neural network classifier.
output layer contains the linear combination of the hidden neurons, which is transformed to the output space. The output is termed the disturbance vector and is represented as $\hat{b} = [b_1, b_2, \ldots, b_t]^T$. The RBFNN classifier is used for estimating the unmodelled disturbances that arise from the nonlinearities present in the dynamics of the system. Fig. 6–6 shows the architecture of the RBFNN classifier. The network topology of the RBFNN consists of three layers: the input layer, the hidden layer, and the output layer. The input vector A is employed in
order to derive the structure of the RBF network. The hidden layer specifies the radial basis function $\varphi$ using the centered Gaussian, expressed as
$$\varphi_r(A) = e^{-\alpha \|A - l_r\|^2} \tag{6.2}$$
Here, r indexes the neurons, and $l_r$ indicates the center of neuron r, which is randomly selected. The number of neurons is a user-defined, application-dependent parameter that is selected by weighing the reconstruction accuracy against the computation time. The parameter α defines the shape of the Gaussian function: a high value of α sharpens the bell shape, whereas a low value of α spreads it over the real space. A narrow Gaussian function maximizes the responsiveness of the RBFNN classifier; however, if the bell is too narrow, the output generated by the classifier vanishes. Hence, the parameter α is selected with respect to the order of magnitude of the exponential argument. The radial function obtained as the output of the network classifier is normalized as
$$\varphi_{\mathrm{norm}}(A) = \frac{\varphi(A)}{\sum_{r=1}^{g} \varphi_r(A)} \tag{6.3}$$
The classical RBFNN structure exhibits localized characteristic features, whereas the normalized RBFNN exhibits better generalization properties by reducing the curse of dimensionality.
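A small sketch may help to make Eqs. (6.2) and (6.3) concrete. It assumes that the normalization is applied to each hidden neuron, that is, each $\varphi_r(A)$ is divided by the sum over all neurons; the centers, the value of α, and the output weights are hypothetical.

```python
import math

def gaussian_rbf(a, center, alpha):
    """Eq. (6.2): exp(-alpha * ||a - center||^2)."""
    sq_dist = sum((ai - ci) ** 2 for ai, ci in zip(a, center))
    return math.exp(-alpha * sq_dist)

def hidden_activations(a, centers, alpha, normalize=True):
    """Evaluate every hidden neuron; optionally normalize as in Eq. (6.3)."""
    phi = [gaussian_rbf(a, c, alpha) for c in centers]
    if not normalize:
        return phi
    total = sum(phi)
    if total == 0.0:
        # Very narrow Gaussians: every activation has vanished.
        return [0.0] * len(phi)
    return [p / total for p in phi]

def rbfnn_output(a, centers, alpha, out_weights):
    """Output layer: linear combination of the (normalized) hidden activations."""
    phi = hidden_activations(a, centers, alpha)
    return [sum(w * p for w, p in zip(row, phi)) for row in out_weights]

centers = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # hypothetical centers l_r
phi_norm = hidden_activations([1.0, 1.0], centers, alpha=1.0)
b_hat = rbfnn_output([1.0, 1.0], centers, alpha=1.0,
                     out_weights=[[0.2, 1.0, -0.3]])  # illustrative weights
```

With a very large α and an input far from every center, all activations underflow to zero, which is the vanishing output that the text warns about for overly narrow Gaussians.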
6.2.3 Multilayer perceptron
The concept of the perceptron was first introduced by Rosenblatt in 1958 and started the development of NNs [4]. The perceptron element consists of a single node, which accepts weighted inputs and a threshold value and generates the result based on a rule. The perceptron effectively classifies linearly separable data, but fails to handle nonlinear data. A variation of the perceptron is the single-layered delta rule, which was developed by Widrow and Hoff in 1960 and in which supervised learning is modeled using the least mean square algorithm. In the following decades, the development of NN classifiers was very slow because of their inability to solve nonlinearly separable problems. In the 1980s, renewed attention to network structures, together with increasing computing power, led to various new topologies and network algorithms. The MLP is now widely used in various applications. It consists of various interconnected nodes, which are analogous to biological neurons. Each node in the network acts as a processing element that receives inputs and generates an output. The arrangement of nodes in the MLP is called the network architecture. The MLP effectively separates nonlinear data because the network is multilayered. The MLP consists of three or more types of network layers. The number of network layers specifies the number of nodes in the
layer, but not the number of weights of the network. The input layer is the first type of layer in the network, where the nodes are the elements of the feature vector. The input vector may consist of the wavebands of the dataset, texture features of the image, or other more complex parameters. The hidden or internal layer is the second type of layer, so called because it has no output units. There are no fixed rules for it, but theory shows that one hidden layer can represent any Boolean function. An increase in the number of nodes in the hidden layer enables the network to learn more complex problems, whereas its capacity to generalize is reduced. The number of nodes in the second hidden layer is three times the number of nodes in the first hidden layer. In 1997, Foody and Arora assessed various factors that affect the accuracy of NNs, and the number of nodes in the hidden layer is one of them. The output layer is the third type of layer, which presents the output data. In image classification, the number of nodes in the output layer equals the number of classes. Each node is interconnected with the nodes in both the following and the preceding layers by means of connections. Fig. 6–7 represents the architecture of the MLP, which consists of four layers. Every unit in a layer is connected to all units in the successive layer. Data flow through the network in the forward direction, that is, from left to right, and the last layer is the output layer that returns the result. In a fault diagnosis system, for example, the input is the set of features extracted from the mechanical vibration signals. The features provide the discriminative
FIGURE 6–7 Architecture of MLP. MLP, multilayer perceptron.
data for signals of various health conditions by eliminating unrelated variations. The labels are the actual outputs that specify the health conditions for the given inputs. The major aim of the MLP is to reduce the discrepancy between the desired labels and the output result. This is achieved by training the MLP with an optimization algorithm that adjusts the parameters.
6.2.4 Convolutional neural network
The convolutional neural network (CNN) [5,6] utilizes the convolution operation rather than general matrix multiplication. The output generated by one layer is given as the input to the next layer. The output of a layer is determined by the activation function, the offset of the feature map, the weights of the feature map, and the dimension of the convolutional kernel. The CNN consists of three types of layers, namely the convolution layer, the pooling layer, and the fully connected layer. The pooling layer compresses the data and parameters in order to reduce overfitting. The maximum pooling layer outputs the maximum value within each pooling window. The convolution and pooling layers are stacked at the top level and form the hidden layers of the NN; they are used to extract complex features from the original information. The fully connected layer then takes the feature vector as input and processes the result of the convolution operations based on the activation function. The classical CNN has numerous parameters and a complex structure; hence, it requires a large number of samples to support network training and achieve effective parameter identification. Fig. 6–8 represents the flowchart of the CNN. The CNN is fundamentally a mapping between the input units and the output units. It can learn a large number of mapping relations between the input layer and the output layer. It does not require an explicit mathematical model; because the CNN is trained on known patterns, information is not lost along the way. Training is divided into two stages. The first stage is forward propagation, where the input vectors are fed to the network and the output values are calculated at each layer.
In the second stage, the difference between the ideal output and the actual output is computed, the weights are adjusted to minimize the error, and the actual output value is recalculated with the adjusted weights. When the difference between the ideal output and the actual output
FIGURE 6–8 Flowchart of CNN. CNN, convolution neural network.
FIGURE 6–9 Architecture of CNN. CNN, convolution neural network.
meets the accuracy requirements, the CNN is considered trained; otherwise, the weights are adjusted until the requirements are met. The architecture of the CNN classifier is depicted in Fig. 6–9. The CNN is a machine learning model based on supervised learning. It is a feedforward NN model based on convolution computation and a deep structure. A translation-invariant network structure is obtained by connecting weighted neurons to various regions of the upper-layer network. The CNN propagates in both the forward and backward directions. In forward propagation, the output is generated from the input vector. In backward propagation, the error value, computed as the difference between the expected output and the actual output, is propagated back through the network. In the CNN structure, the input of each neuron is relatively small, which makes it possible to increase the number of network layers through which gradients can be propagated. The connections between the network layers are highly effective for identification tasks. Tuning the network parameters improves the training performance of the CNN. The CNN captures only a few dependencies from the input data, and these are local dependencies, because the convolution filters have a very short length.
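The convolution and maximum pooling operations described above can be sketched in a few lines. The example assumes a stride of 1, no padding, a ReLU activation, and the cross-correlation form of convolution that most deep learning libraries use; the image and the kernel are hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide the kernel over the image (stride 1, no padding), then apply ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = bias + sum(
                kernel[u][v] * image[i + u][j + v]
                for u in range(kh) for v in range(kw)
            )
            row.append(max(0.0, s))  # ReLU activation (an assumption)
        feature_map.append(row)
    return feature_map

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    h, w = len(feature_map) // size, len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + u][j * size + v]
                for u in range(size) for v in range(size))
            for j in range(w)
        ]
        for i in range(h)
    ]

image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 3, 1, 0, 1],
]
kernel = [[1, 0], [0, 1]]                # adds each pixel to its lower-right neighbor
features = conv2d_valid(image, kernel)   # 4 x 4 feature map
pooled = max_pool2d(features)            # 2 x 2 map after pooling
```

The 2 x 2 kernel only ever sees a small neighborhood of the image, which illustrates why the dependencies captured by a convolution filter are local.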
6.2.4.1 Application
CNN is widely applicable in various vision-related tasks, such as
• Video compression
• Object and action recognition
• Image retargeting
• Surveillance system
• Computer vision and neuroscience
• Natural language processing
• Speech recognition
• Acoustic modeling
6.2.5 Recurrent neural network
The recurrent neural network (RNN) [7,8] is a type of NN that is widely used for sequence analysis, as it is designed to extract contextual information by modeling the dependencies between time stamps. The RNN consists of several successive recurrent layers, which are modeled sequentially in order to map one sequence to another. The RNN has a strong capability to capture contextual data from the sequence. The contextual cues in the network structure are stable and are used effectively in data classification. The RNN can operate on sequences of arbitrary length. Fig. 6–10 represents the architecture of the RNN classifier.
FIGURE 6–10 Architecture of RNN classifier. RNN, recurrent neural networks.
The RNN is an extension of the feedforward NN with loops in the hidden layers. The RNN takes a sequence of samples as input and identifies the time relationships between the samples. Long short-term memory (LSTM) solves the classification issues by adding parameters to the hidden node that hold or release the state based on the input values. The LSTM achieves better performance than the regular RNN by activating the states based on network events. The RNN is evaluated using the gated recurrent unit and LSTM. A one-to-one network configuration is formed, where each time step of the input data generates an output at the corresponding time step. The regular RNN node consists of a single bias and weight, whereas the LSTM consists of four sets of biases and weights, as specified below:
• Forget gate layer
• Input gate layer
• Output gate layer
• State gate layer
The input and forget gates control how much the previous hidden state and the present input contribute to the cell state. The input, output, and forget gate activations are scaled using the sigmoid function, and the output of the hidden state is filtered using the hyperbolic tangent function. The network parameters are optimized by stochastic gradient descent based on the sequence of input data. The hyperparameters are the structure of the network (size and layers), sequence length, batch size, momentum, and learning rate. They are set through a stochastic or manual search. The input of the RNN is the sequence of vectors $\{y_1, y_2, \ldots, y_M\}$, with the sequence of hidden states $\{z_1, z_2, \ldots, z_M\}$ and the output units $\{v_1, v_2, \ldots, v_M\}$. The recurrent layer consists of the recurrent function d, which takes the input vector $y_x$ and the hidden state of the previous step $z_{x-1}$ as input and generates the hidden state as
$$z_x = d(y_x, z_{x-1}) = \tanh(P y_x + Q z_{x-1}) \tag{6.4}$$
Moreover, the output unit is calculated as
$$v_x = \mathrm{softmax}(R z_x) \tag{6.5}$$
where P, Q, and R represent the weight matrices, and tanh denotes the hyperbolic tangent activation function. The RNN can use more complicated recurrent functions to learn and control the information flow in the recurrent layer in order to capture long-term dependencies. The structure of the RNN is represented in Fig. 6–11.
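Equations (6.4) and (6.5) translate almost directly into code. The following sketch uses plain Python lists; the weight matrices P, Q, and R, the initial hidden state, and the input sequence are hypothetical values chosen only for illustration.

```python
import math

def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def softmax(scores):
    """Numerically stable softmax: shift by the maximum before exponentiating."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(ys, P, Q, R, z0):
    """Run the recurrence over the whole input sequence."""
    z = z0
    hidden, outputs = [], []
    for y in ys:
        pre = [a + b for a, b in zip(matvec(P, y), matvec(Q, z))]
        z = [math.tanh(p) for p in pre]          # Eq. (6.4)
        hidden.append(z)
        outputs.append(softmax(matvec(R, z)))    # Eq. (6.5)
    return hidden, outputs

# Hypothetical weights: 2 inputs, 2 hidden units, 2 output classes.
P = [[0.5, -0.3], [0.1, 0.8]]
Q = [[0.2, 0.0], [-0.4, 0.3]]
R = [[1.0, -1.0], [-0.5, 0.5]]
ys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden, outputs = rnn_forward(ys, P, Q, R, z0=[0.0, 0.0])
```

Because the same P, Q, and R are reused at every step, the loop works for sequences of any length, in line with the earlier remark on arbitrary sequence lengths.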
6.2.6 Modular neural network
The modular neural network [9] operates on the decomposition principle, where complex tasks are decomposed into simpler subtasks. The entire task is separated into various subtasks such that each module makes the process simpler. The subtasks are
FIGURE 6–11 Structure of RNN. RNN, recurrent neural networks.
executed through a series of models. Each model performs its own operations based on its characteristic features with respect to the problem. The solution of the overall task is achieved by integrating the results of the individual local systems in a task-dependent manner. The division of the entire task into smaller subtasks can be achieved by either hard or soft subdivision. In the first case, more than two subtasks are simultaneously assigned to the local system, whereas in the latter case, a single computing model is used for executing each task. The modular system consists of various submodules that work on the main task. Each module in the computing system has certain characteristic features, which are listed below:
• The domain modules have specialized architectures for recognizing and responding to certain subsets of the entire task.
• Each module is independent of the other modules and does not influence them.
• The computing modules have a simpler structure than the entire system.
• A module can respond to the input faster than a monolithic system.
• The responses obtained from the modules are integrated in order to generate the overall response of the system.
The human visual system is an example of a modular system, where various modules are responsible for specific tasks, such as color recognition, shape detection, and motion detection. The central nervous system receives the responses from the individual modules and builds a complete realization of the object. The modular NN classifier is especially effective in certain types of applications, like classification or forecasting problems, in contrast with the existing monolithic NN classifier. However, the
classification problem includes various characteristic problems in various modules. For example, in function approximation, a conventional NN fails to model the whole continuous function at the same time, whereas the modular network classifier solves this issue effectively. The modular NN is made up of subsystems separated on the basis of functionality and structure. Each subsystem is a separate network structure, which performs the operations of its subtask individually. Various learning algorithms are integrated to achieve better training performance with respect to their tasks. Moreover, modularity allows prior knowledge of the problem to be introduced into the network structure in order to provide a structural representation.
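The function approximation example can be made concrete with a toy sketch. Here each "module" is deliberately reduced to a least-squares line fit standing in for a trained subnetwork, and the integrator routes every input to the module responsible for its region; the target function |x| and the split at zero are assumptions chosen for illustration.

```python
def fit_line(points):
    """Least-squares fit of y = m*x + c; stands in for training one module."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx
    return m, mean_y - m * mean_x

def predict_modular(x, modules):
    """Integrator: route the input to the module that owns its region."""
    for (low, high), (m, c) in modules:
        if low <= x < high:
            return m * x + c
    raise ValueError("input outside every module's domain")

def mse(predict, points):
    """Mean squared error of a predictor over the data points."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)

# Target: y = |x| on [-1, 1], which a single line cannot follow.
data = [(i / 10.0, abs(i / 10.0)) for i in range(-10, 11)]
left = [p for p in data if p[0] < 0]
right = [p for p in data if p[0] >= 0]
modules = [((-float("inf"), 0.0), fit_line(left)),
           ((0.0, float("inf")), fit_line(right))]
mono = fit_line(data)  # monolithic model trained on the whole task

modular_error = mse(lambda x: predict_modular(x, modules), data)
monolithic_error = mse(lambda x: mono[0] * x + mono[1], data)
```

Each module solves only its own simple subtask, and the combined response is far more accurate than that of the single model trained on the whole task.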
6.2.6.1 Advantages of modular neural networks
Some of the advantages of the modular NN classifier are listed below:
1. Simplification of the traditional NN system: The complexity of a monolithic NN grows with its size, and scaling up the task increases the size of the NN. The hybrid model separates the task into subtasks in order to simplify the process.
2. Immunity: The homogeneous communication of the conventional network structure leads to susceptibility to interference and poor stability. The hybrid network model increases fault tolerance and reliability.
3. Extensibility: Scalability is a major feature of the hybrid network model. If the network needs to be retrained, only the necessary retraining is carried out; owing to the design of the network structure, it is not required to retrain the hybrid network model completely.
4. Retraining is no longer needed: The hybrid network model supports integration and is capable of learning with both the controlled (supervised) and uncontrolled (unsupervised) paradigms. Each module can be pretrained individually on its specific subtask and then integrated into the unit, or the modules can be trained together with the integrating unit. In the latter case, the individual modules cooperate or compete during training in order to reach the desired objective. The training model is thus a combined function of the uncontrolled and controlled learning paradigms.
5. Efficiency: Partitioning the entire task into simpler subtasks helps to minimize the computation cost. The hybrid network model learns various functional maps more quickly than the traditional monolithic network structure, as each module learns one functional part of the general mapping. The hybrid network model improves the training time and learning ability by decomposing the decomposable tasks.
6. Speed training: To enhance its survival, a biological system integrates new features with the existing ones in order to adapt to changing conditions. Likewise, modularity enables economy when the operating conditions change, because no changes to the entire system are needed. Accordingly, it is possible to reuse a specialized module of the same nature in different applications.
7. Biological analogy: The modular or hybrid network model has an analogy with the biological nervous system. The visual tasks can be broken into various subtasks, and the
subtasks can be optimized for different situations. Moreover, the same structure can be replicated many times, which gives the visual cortex its accuracy and robustness. Some of the major benefits of the modular network structure are
• Scalability
• Constant adaptation
• Computational efficiency
• Additional training
• Economics of education and training
6.2.7 Artificial neural network
The ANN [10] is a system of software or hardware that is modeled on the working of neurons in the nervous system and the human brain. The ANN was first proposed in 1943, but it has only recently come to prominence in artificial intelligence. ANN covers a wide variety of network technologies within the field of artificial intelligence. Deep learning is a branch of machine learning that uses various categories of NNs. Because the learning algorithms are inspired by the function of the brain, many experts regard them as a step toward real artificial intelligence. ANN is widely used to solve problems in the following two main areas:
• Learning a complicated nonlinear mapping
• Classification of a large amount of data
ANN is currently used to solve various complex problems, and hence the demand for ANN has increased over time. ANN is a powerful tool to bridge the gap in data classification, and its use grows as more complex processes are handled. Its capability to approximate unknown functions is used to model system nonlinearities and unmodeled disturbances in the environment. The major advantage in estimating uncertain behavior comes from the use of a nonlinear adaptive network control model. Traditionally, the ANN is used solely for disturbance estimation, within a navigation filter that estimates the state of the system. The extended Kalman filter (EKF) can be used to train the ANN by augmenting the state with the weights of the ANN. For a large-scale network, this configuration increases the computational burden. Furthermore, the coupled structure does not provide any estimate of the uncertainty unless the disturbance vector is added to the state vector and estimated as a constant parameter. Alternatively, the output of the ANN is fed to the propagation step of the filter for estimating the disturbance vector. Instead of augmenting the state vector, the dynamics of the EKF is augmented with the ANN, which effectively captures the unmodeled dynamics of the system. The ANN learns the function that describes the disturbance, which means that the mismatch between the a priori guess and the measurement is the signal selected from the EKF. However, the accuracy
of the augmented dynamical model changes with time, and hence the covariance matrix is used to capture the variation. The ANN classifies the training data accurately by adjusting the weight factors, and hence it can classify unknown data. The ANN requires a long training time and has a high tolerance to incomplete and noisy data. Some of the salient features of ANN are as follows:
• Real-time operation
• Massive parallelism
• Adaptive learning and self-organization
• Learning and generalizing ability
• Distributed representation
• Error acceptance through redundant information coding
6.2.7.1 Advantages
The ANN used for data classification has the advantages listed below:
• Stronger generalization by eliminating the redundant nodes through the removal of saturation
• Better robustness
• Small deviation and accurate, fast real-time processing
• High prediction and classification precision
• Scalable algorithm
6.2.7.2 Applications
ANN is widely used in various fields, namely
• Language processing and translation
• Speech recognition
• Route detection
• Forecasting
• Image processing
• Face recognition
6.2.8 Fuzzy neural network
Among the various intelligent methods, the fuzzy neural network (FNN) [11] has become one of the most popular in the past decades. The FNN is a hybrid model that integrates the learning capabilities of the NN with the reasoning abilities of the fuzzy inference system. The NN component of the FNN relies on a sound learning algorithm, namely error backpropagation, in order to estimate the parameters. The fuzzy inference system supports the reasoning ability and offers an effective way of representing linguistic expert knowledge in a formal manner. The FNN integrates the advantages of fuzzy sets with neurocomputing to solve the issues in data classification.
6.2.9 Probabilistic neural network
The probabilistic neural network (PNN) [12] is composed of four layers, namely the input layer, the model (pattern) layer, the cumulative layer, and the output layer. The first layer is the input layer, in which each neuron is a single-input single-output unit with a linear transfer function. The second layer is the model layer, also called the pattern layer, which is connected to the input layer through weights. The third is the cumulative layer, which performs a linear summation. The fourth is the output layer, which generates the output based on the decision function. The PNN is trained by setting the weight vectors at the pattern layer, whose outputs are connected to the cumulative layer. The structure of the PNN is simple, and its mathematical principle is clear. Even for a small training set, the PNN achieves high classification accuracy. Fig. 6–12 represents the architecture of the PNN.
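The four layers can be sketched as follows. The chapter does not specify the kernel used in the pattern layer, so the sketch assumes the common choice of a Gaussian kernel with a smoothing parameter sigma, and it assumes that the decision function picks the class with the largest averaged score; the training samples are hypothetical.

```python
import math

def pnn_classify(x, training_set, sigma=0.5):
    """Classify x; training_set is a list of (feature_vector, label) pairs."""
    sums, counts = {}, {}
    for sample, label in training_set:
        # Pattern (model) layer: one Gaussian kernel per stored training sample.
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, sample))
        k = math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Cumulative layer: linear summation of kernel outputs per class.
        sums[label] = sums.get(label, 0.0) + k
        counts[label] = counts.get(label, 0) + 1
    # Average per class so that larger classes are not favored automatically.
    scores = {label: sums[label] / counts[label] for label in sums}
    # Output layer: the decision function picks the highest-scoring class.
    return max(scores, key=scores.get), scores

training_set = [
    ([0.0, 0.0], "a"), ([0.0, 1.0], "a"),
    ([5.0, 5.0], "b"), ([5.0, 6.0], "b"),
]
label, scores = pnn_classify([0.2, 0.3], training_set)
```

Training amounts to storing the samples as pattern-layer weights, which is one reason the structure is simple and works with small training sets.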
6.3 Training of neural network
An ANN is capable of modifying its behavior in response to the environment. When a set of data is presented as input, the network adapts itself to generate suitable responses. Various training algorithms have been introduced, each with its own merits and demerits. The structure of the network model influences the rate of convergence of the training algorithm and determines the category of learning to be used. The training algorithm is a simple mechanism
FIGURE 6–12 Architecture of PNN. PNN, probabilistic neural network.
used to adapt the weights in the network branches. It does not require a powerful computing configuration, and it does not involve any complex computations.
6.4 Training algorithms in neural network for data classification
Some of the algorithms used to train the NN classifier are as follows:
• Backpropagation algorithm
• Genetic algorithm (GA)
• Levenberg–Marquardt (LM) algorithm
These training algorithms are briefly described below.
6.4.1 Backpropagation algorithm
The backpropagation training algorithm finds the optimal weights and biases by tuning the optimization [13]. Backpropagation is a variation of gradient search. It uses the least-squares optimality criterion, and its key role is to calculate the gradient of the error with respect to the weights by propagating the error backward through the network. It works well on simple training problems. However, an increase in the dimensionality or complexity of the information increases the problem complexity, which degrades the performance of backpropagation. This makes the algorithm infeasible for various real-world applications. The performance degradation arises because complex spaces have a global minimum that is sparse among the local minima. The gradient search tends to get trapped at local minima. With enough momentum or gain, backpropagation can escape from a local minimum, but it leaves without knowing whether the next one is better. If the global minimum is hidden among the local minima, backpropagation ends up bouncing between local minima without any improvement, which makes the training process very slow. The drawbacks of the backpropagation algorithm are as follows:
• Differentiability is required to compute the gradient
• Scaling problem
Hence, backpropagation cannot handle discontinuous optimality criteria or discontinuous node transfer functions. Fig. 6–13 represents the architecture of the backpropagation algorithm. Backpropagation is the most popular algorithm for training the MLP. In 1986, backpropagation was rediscovered independently by Geoffrey Hinton, David Rumelhart, Ronald Williams, and David Parker. The algorithm was popularized by the book Parallel Distributed Processing, published in 1986 by James McClelland and David Rumelhart. Hence, backpropagation became the most popular algorithm for training NNs. The key idea of
FIGURE 6–13 Architecture of the backpropagation algorithm.
backpropagation is that the parameters, namely the biases and weights, are updated to minimize a prescribed error. The error is the discrepancy between the desired output and the network output. The training process of the backpropagation algorithm is carried out in two phases:
• In the forward phase, the network parameters are fixed, and the input data are sent through the network layers until they reach the last layer. In this phase, changes occur only in the activations and outputs of the MLP units.
• In the backward phase, the error signal is computed from the network result and the desired label. The error signal is introduced at the output layer and then propagated through the network layers in the backward direction. Here, adjustments are made to the biases and weights of the MLP. The adjustments are straightforward to calculate for the output layer, but challenging for the middle layers.
This algorithm provides an approximation to the trajectory in bias and weight space that would be computed by the gradient descent approach. When the learning rate is small, the changes made to the biases and weights in the network are smaller. The process to
update the weights and biases is therefore slower at low learning rates. When a high learning rate is used to speed up learning, the resulting changes in the biases and weights of the network may become unstable. Hence, backpropagation requires a simple method that avoids the instability and speeds up training.
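The two phases can be illustrated with a minimal sketch for a 2-2-1 MLP. It assumes sigmoid activations, a squared-error criterion, full-batch gradient descent, and the XOR problem as toy data; the initial weights, the learning rate, and the number of steps are arbitrary choices, and the sketch does not claim to solve XOR, only to reduce the error.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, net):
    """Forward phase: parameters fixed, signal propagated layer by layer."""
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(net["W1"], net["b1"])]
    o = sigmoid(sum(w * hi for w, hi in zip(net["W2"], h)) + net["b2"])
    return h, o

def loss(data, net):
    """Least-squares criterion summed over the data set."""
    return 0.5 * sum((forward(x, net)[1] - t) ** 2 for x, t in data)

def train_step(data, net, lr):
    """Backward phase: propagate the error signal, then adjust weights and biases."""
    gW1 = [[0.0] * len(ws) for ws in net["W1"]]
    gb1 = [0.0] * len(net["b1"])
    gW2 = [0.0] * len(net["W2"])
    gb2 = 0.0
    for x, t in data:
        h, o = forward(x, net)
        delta_o = (o - t) * o * (1.0 - o)            # error signal at the output
        gb2 += delta_o
        for j, hj in enumerate(h):
            # Error signal propagated back to hidden node j.
            delta_h = delta_o * net["W2"][j] * hj * (1.0 - hj)
            gW2[j] += delta_o * hj
            gb1[j] += delta_h
            for i, xi in enumerate(x):
                gW1[j][i] += delta_h * xi
    # Gradient descent update, scaled by the learning rate.
    for j in range(len(gW2)):
        net["W2"][j] -= lr * gW2[j]
        net["b1"][j] -= lr * gb1[j]
        for i in range(len(gW1[j])):
            net["W1"][j][i] -= lr * gW1[j][i]
    net["b2"] -= lr * gb2

rng = random.Random(0)  # fixed seed so the run is repeatable
net = {
    "W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)],
    "b1": [0.0, 0.0],
    "W2": [rng.uniform(-1, 1) for _ in range(2)],
    "b2": 0.0,
}
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
initial_loss = loss(xor_data, net)
for _ in range(1000):
    train_step(xor_data, net, lr=0.2)
final_loss = loss(xor_data, net)
```

The output-layer error signal is a one-line expression, whereas the hidden-layer signal has to be assembled from the downstream weights, which reflects the remark that the middle layers are the harder case.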
6.4.2 Genetic algorithm
The GA [1,14] is an optimization algorithm that performs learning and optimization based on biological principles. A GA is a global search algorithm that operates on a population of rules. Following the mechanisms of natural genetics and selection, it promotes over time the rules that perform well in the given environment. The rules are coded as binary strings of finite length, and the performance of a rule is measured by the fitness function. The five major components of the GA are:
• A way to encode solutions to the problem as chromosomes.
• An evaluation function, which returns a rating for each chromosome.
• A way to initialize the population of chromosomes.
• Operators, such as mutation and crossover, which are applied to the parents to alter their genetic composition.
• Parameter settings for the algorithm and the operators.
Based on these five components, the GA operates using the following steps:
• The population is initialized, which results in a set of chromosomes.
• Each member of the population is evaluated using the objective function, and the evaluations are normalized.
• The population undergoes reproduction until the stopping criterion is met.
The reproduction process of the GA consists of a number of iterations of the following steps:
• One or more parents are selected for reproduction. The selection is stochastic, but the parents with the highest evaluations are more likely to be selected.
• The children are produced by applying the operators to the parents.
• The new children are added to the population.
In some versions of the GA, the whole population is replaced at each reproduction cycle. In other versions, only a subset is replaced.
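The steps above can be sketched as a short program. The sketch assumes fitness-proportional selection, single-point crossover, bit-flip mutation, full replacement of the population at each cycle, and an elitism step that carries the best chromosome forward; the toy objective (counting the ones in the string) and all parameter values are hypothetical.

```python
import random

def one_max(chromosome):
    """Toy evaluation function: the number of 1 bits in the string."""
    return sum(chromosome)

def run_ga(fitness, n_bits=20, pop_size=30, generations=40,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Initialization: random binary chromosomes of finite length.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        # Evaluation; random.choices normalizes the weights internally.
        # The small offset keeps the total weight positive.
        weights = [fitness(c) + 1e-9 for c in pop]
        children = [best[:]]  # elitism: keep the best so far (an assumption)
        while len(children) < pop_size:
            # Stochastic selection that favors highly rated parents.
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            c1, c2 = p1[:], p2[:]
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)      # single-point crossover
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                for i in range(n_bits):
                    if rng.random() < mutation_rate:  # bit-flip mutation
                        child[i] = 1 - child[i]
                children.append(child)
        pop = children[:pop_size]  # the whole population is replaced
        best = max(pop + [best], key=fitness)
        history.append(fitness(best))
    return best, history

best, history = run_ga(one_max)
```

To train an NN in this way, the chromosome would encode the network weights and the evaluation function would measure the classification error, which requires no gradient information.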
6.4.3 Levenberg–Marquardt algorithm
The LM algorithm [15] has grown in popularity in the NN community in the past decades. However, the difference between NN applications and the
Chapter 6 • Neural networks for data classification
131
optimization comes under the fact that the network parameters are required to be estimated. The optimization performance achieved by this algorithm is significantly better. LM uses the concept of neural neighborhood in order to increase the behavior of both the memory and time constraints. The LM algorithm adaptively varies by updating the parameter between the Gauss Newton update and gradient descent update. When comparing the applications of LM with the NNs, LM suffers the list of drawbacks as, • It is both memory and time-consuming. • It was not biologically feasible.
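The way LM moves between the two updates can be written compactly. The following is a sketch in standard notation; the symbols are assumptions, since the text does not define them: $r(w)$ is the vector of residuals for the parameters $w$, $J$ is its Jacobian, $\lambda > 0$ is the damping parameter, and $\delta$ is the step.

```latex
\left(J^{\top} J + \lambda I\right)\delta = -J^{\top} r(w), \qquad w \leftarrow w + \delta
```

For small $\lambda$ the step approaches the Gauss–Newton update, and for large $\lambda$ it approaches $\delta \approx -\tfrac{1}{\lambda} J^{\top} r(w)$, a short gradient descent step on $\tfrac{1}{2}\lVert r(w)\rVert^{2}$. Typically $\lambda$ is decreased after a step that reduces the error and increased otherwise. Forming and solving with $J^{\top} J$ is one reason for the memory and time cost noted above.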
References
[1] Montana DJ, Davis L. Training feedforward neural networks using genetic algorithms. In: Proceedings of the eleventh international joint conference on artificial intelligence 1989, p. 762–767.
[2] Tang CZ, Kwan HK. Multilayer feedforward neural networks with single powers-of-two weights. IEEE Transactions on Signal Processing 1993;41(8):2724–2727.
[3] Pesce V, Silvestrini S, Lavagna M. Radial basis function neural network aided adaptive extended Kalman filter for spacecraft relative navigation. Aerosp Sci Technol 2019;96:105527.
[4] Rosenblatt F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol Rev 1958;65(6):386–408.
[5] Yu H, Wang G, Zhao Z, Wang H, Wang Z. Chicken embryo fertility detection based on PPG and convolutional neural network. Infrared Phys Technol 2019;103:103075.
[6] Chen C, Zhang P, Liu Y, Liu J. Financial quantitative investment using convolutional neural network and deep learning technology. Neurocomputing 2019;390:384–90.
[7] Wu H, Prasad S. Convolutional recurrent neural networks for hyperspectral data classification. Remote Sens 2017;9(3):298.
[8] Simão M, Neto P, Gibaru O. EMG-based online classification of gestures with recurrent neural networks. Pattern Recognit Lett 2019;128:45–51.
[9] Yang XS, Deb S. Engineering optimization by cuckoo search. Int J Math Model Numer 2010;1(4):330–43.
[10] Basheer IA, Hajmeer M. Artificial neural networks: fundamentals, computing, design, and application. J Microbiol Methods 2000;43(1):3–31.
[11] Chien Chen Y, Teng C. A model reference control structure using a fuzzy neural network. Fuzzy Sets Syst 1995;73(3):291–312.
[12] Gu XF, Liu L, Li JP, Huang YY, Lin J. Data classification based on artificial neural networks. In: Proceedings of the international conference on apperceiving computing and intelligence analysis, IEEE, December 2008, p. 223–226.
[13] Hirose Y, Yamashita K, Hijiya H. Back-propagation algorithm which varies the number of hidden units. Neural Netw 1991;4(1):61–6.
[14] Arifovic J, Gencay R. Using genetic algorithms to select architecture of a feedforward artificial neural network. Phys A Stat Mech Appl 2001;289(3–4):574–94.
[15] Lourakis M. A brief description of the Levenberg-Marquardt Algorithm implemented by levmar. Found Res Technol 2005;1:1–6.
7 Application of artificial intelligence in the perspective of data mining
D. Menaga1, S. Saravanan2
1 KCG COLLEGE OF TECHNOLOGY, CHENNAI, INDIA
2 HCL TECHNOLOGIES LIMITED, CHENNAI, INDIA
7.1 Artificial intelligence
Artificial intelligence (AI) is widely used by programmers and computer systems to perform tasks that are too complex to program in a straightforward way. AI consists of various branches, namely fuzzy logic (FL), artificial neural network (ANN), expert system (ES), genetic algorithm (GA), and hybrid system (HS), where an HS is formed by integrating two or more AI branches. These technologies have a natural synergy and can be combined to build powerful computing systems. AI is used to overcome the deficiencies of traditional approaches, and its key role is to produce more effective and more efficient computing systems. Sometimes, AI requires additional features of human intelligence, like the ability to interpolate and to learn from current knowledge. The primary benefit of intelligent technology is that it leads to useful systems with enhanced performance and characteristics that could not be achieved by existing techniques. Accordingly, AI methods are widely used in various applications and domains, some of which are listed as follows:
• Forecasting
• Power system and optimization
• Social or physiological sciences
• Signal processing
• Manufacturing
• Medicine
• Robotics
• Pattern recognition
Moreover, AI is particularly used in system modeling, such as system identification and the implementation of complex mappings [1].
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00006-9 © 2021 Elsevier Inc. All rights reserved.
7.1.1 Artificial neural networks
An ANN is a collection of interconnected processing devices called units. Information is sent between the units along the interconnections. Each incoming connection has two values associated with it, namely a weight and an input value, and the output of a unit is a function of the weighted sum of its inputs. When an ANN is implemented on a computer, it is not programmed to perform a specific task; instead, it is trained on a dataset to learn patterns, with the patterns acting as the input. After training, new patterns are presented to the trained units to perform classification and prediction [2]. Fig. 7–1 portrays the structure of an ANN. An ANN can learn patterns automatically from real-world systems, computer programs, physical models, and other sources. It can handle a number of inputs and generate the output for the respective inputs based on the system design. The functioning of the ANN is modeled on the understanding of the human brain and its connected nervous system. It uses processing elements that are connected by weights to form a system with a black box representation. A typical ANN contains a number of layers, each made up of neurons that are interconnected with the neurons of the adjacent layers. The input data is passed to the network through the input layer, whereas the output layer delivers the response. One or more hidden layers exist between the input and output layers. The hidden and output layers process their input by multiplying each input by its weight and summing the products. A nonlinear transform is then applied to the sum to generate the result.
FIGURE 7–1 Structure of ANN. ANN, artificial neural network.
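The weighted-sum-then-nonlinear-transform computation described above can be sketched as a forward pass. This is an illustrative assumption rather than the network of Fig. 7–1: the sigmoid transform, the bias terms, the layer sizes, and the weight values are all made up for the example.

```python
import math

def sigmoid(x):
    """One possible nonlinear transform (an assumption; others are common)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Each unit multiplies its inputs by its weights, sums the products,
    adds a bias, and applies the nonlinear transform."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs

def network_forward(inputs, layers):
    """Pass the input through the hidden layer(s) and then the output layer."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output unit.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),   # hidden layer
    ([[1.2, -0.7]], [0.05]),                    # output layer
]
output = network_forward([1.0, 0.5], layers)
print(output)
```

Training, which the sketch omits, would adjust the entries of `layers` so that the outputs match the target patterns.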
7.1.2 Fuzzy logic
The fuzzy system (FS) was developed by Zadeh in the year 1972 based on fuzzy set theory. The key role of the FS is to mimic the aspect of human cognition called approximate reasoning. The FS is less precise than a traditional system, but it resembles the experience of human decision-making. FL is mainly used in control engineering, where FL reasoning uses linguistic rules in the form of if-then statements. A feature of fuzzy control and FL is the relative simplicity of the description of the control methodology, which allows human language to be used for describing problems and their fuzzy solutions. In various control applications, the system model or its parameters are unknown, highly unstable, or variable. In such cases, fuzzy controllers can be used. The rules of a fuzzy controller are easy to modify and understand, as they not only capture the strategy of human operators but are also specified in natural linguistic terms. FL is commonly used to model imprecise and complex systems. In fuzzy set theory, each element of a fuzzy set is mapped by a membership function to a membership value in the interval from 0 to 1. A major step of the fuzzy method is the assessment of the membership function, which is comparable to estimating the probabilities of a stochastic model. Accordingly, the membership function used in fuzzy set theory is effective for modeling the preferences of a decision-maker. Modeling AI based on FL is a simple strategy that operates using the if-then principle, where “if” specifies the fuzzy explanatory vector or the fuzzy sets of the membership function, whereas “then” specifies the consequent fuzzy set. Fig. 7–2 portrays the components of the FS.
FIGURE 7–2 Components of FS. FS, fuzzy system.
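A membership function and a small set of if-then rules can be sketched as follows. Everything specific here is an assumption for illustration: the triangular membership shape, the temperature and fan-speed variables, the break points, and the weighted-average defuzzification are not taken from the text.

```python
def triangular(x, a, b, c):
    """Triangular membership function: 0 outside (a, c), 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def fan_speed(temperature):
    """Hypothetical fuzzy controller with three linguistic rules."""
    # Fuzzification: degree to which the input belongs to each fuzzy set.
    cold = triangular(temperature, -10.0, 0.0, 20.0)
    warm = triangular(temperature, 10.0, 20.0, 30.0)
    hot = triangular(temperature, 20.0, 40.0, 50.0)
    # Rules: IF cold THEN speed 0; IF warm THEN speed 50; IF hot THEN speed 100.
    rules = [(cold, 0.0), (warm, 50.0), (hot, 100.0)]
    total = sum(strength for strength, _ in rules)
    if total == 0.0:
        return 0.0          # no rule fires for this input
    # Defuzzification: average of rule outputs weighted by firing strength.
    return sum(strength * out for strength, out in rules) / total

print(fan_speed(20.0))   # only the "warm" rule fires
print(fan_speed(25.0))   # "warm" and "hot" both fire partially
```

At 25 degrees the input is partly warm (0.5) and partly hot (0.25), so the result lies between the two rule outputs, which is the approximate reasoning the section describes.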
7.1.3 Genetic algorithm
GA is inspired by the way living organisms adapt to the harsh realities of life in a hostile environment through inheritance and evolution. The GA imitates the evolution of a population by selecting the fitter individuals for reproduction. Hence, GA is considered an optimum search method that uses the concepts of survival of the fittest and natural selection. GA operates on a population of fixed size, whose members are called individuals and evolve over time. GA uses three genetic operators, namely the crossover operator, mutation operator, and selection operator. In GA, the stronger individuals in the population have a greater chance of creating offspring. The GA is implemented as a computerized search and optimization procedure based on the principles of natural selection and natural genetics. The major intention of search optimization is to find the set of possible solutions to the search problem, encoded as strings of zeros and ones. Different portions of the bit string specify the parameters of the search problem. When the problem-solving strategy can be represented in a compact form, the GA method can be effectively applied to maintain a population of knowledge structures that define the candidate solutions. The population evolves over time through competition (survival of the fittest) and controlled variation. The genetic operations of GA are used to find the solution and to select suitable offspring for succeeding generations. The GA considers various points in the search space simultaneously, and it is therefore found to offer rapid convergence toward a global optimum solution. GA results in better performance, but it suffers from the issue of excessive complexity. GA has also started to be used with other intelligent technologies, like case-based reasoning, ES, and neural networks [3]. Fig. 7–3 represents the life cycle of the population.
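The statement that different portions of the bit string specify different parameters can be illustrated with a small decoder. This is a sketch under assumptions: the field names, bit widths, and value ranges are hypothetical, and a linear mapping from integer to real value is only one possible encoding.

```python
def decode(bits, fields):
    """Split a bit string into consecutive fields and map each field
    linearly onto the real interval [low, high]."""
    values = {}
    position = 0
    for name, width, low, high in fields:
        chunk = bits[position:position + width]
        integer = int(chunk, 2)                      # binary string -> integer
        values[name] = low + (high - low) * integer / (2 ** width - 1)
        position += width
    return values

# Hypothetical layout: the first 4 bits encode one parameter, the next 4 another.
fields = [
    ("learning_rate", 4, 0.0, 1.5),
    ("hidden_units", 4, 0.0, 15.0),
]
params = decode("10100011", fields)
print(params)
```

Crossover and mutation act on the raw bit string, while the fitness function sees only the decoded parameters.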
FIGURE 7–3 Life cycle of population.
7.1.4 Expert system
The terms knowledge-based expert system (KBES) and ES can be used interchangeably. KBES is one of the earliest outcomes of research in the AI field. An ES usually contains five components, namely the knowledge base, acquisition component, user interface (UI), explanation component, and inference mechanism. The knowledge base consists of two databases, namely a static and a dynamic database. The static database contains the knowledge of the domain, represented in a certain format. It is created once by the user when the system is developed, and it cannot be modified during runtime. The dynamic database, in contrast, is enriched during program execution, but its data is lost when execution terminates. It records all the data obtained from the user so that facts can be inferred during the reasoning process. Fig. 7–4 represents the architecture of KBES.
FIGURE 7–4 Architecture of KBES. KBES, knowledge-based expert system.
The knowledge base consists of the procedures, rules, and facts that are essential for solving problems in specific applications. Accordingly, the two major components of the ES are specified as follows:
Knowledge base: It is the collection of knowledge that is used to solve problems in real-world scenarios.
Control mechanism: It verifies the availability of facts, selects the suitable knowledge source from the knowledge base, and matches the facts against this knowledge to generate additional facts.
Some of the domains where the ES is used to solve problems are listed as follows:
• Medicine
• Engineering
• Geography
• Business
• Defense and education
• Mathematics
• Chemistry
• Computer science and law
7.1.5 Hybrid systems
The HS has become increasingly popular in recent decades owing to its extensive success on various complex real-world problems. The key to this success lies in the components of computational intelligence, like neural networks, GA, FL, and machine learning. Each of these methods provides the HS with complementary reasoning and searching techniques and allows it to use empirical data and domain knowledge for solving complex problems. The HS thus integrates neural networks, ES, GA, and FL and demonstrates their effectiveness in real-world scenarios.
7.2 Artificial intelligence versus data mining
• Data mining and AI methods are widely used in various domains to solve segmentation, prediction, diagnosis, classification, and association problems.
• AI is the branch of science that deals with the creation of intelligent machines. These machines are called intelligent because they have their own decision-making and thinking capability, like human beings.
• Machine learning is a large area of AI, and AI uses machine learning classifiers to realize its intelligent behavior.
• AI automatically learns the models that are used to perform a task.
• Data mining, otherwise called the knowledge discovery process, is the field of science used to find the properties of datasets. Large sets of information collected from data warehouses, such as spatial and time series data, are mined to extract interesting patterns and correlations. These results are then used to improve business processes, which in turn yields business insight.

Factors | Data mining | Artificial intelligence
Scope | It is used to identify how various attributes of information are related to other instances using data visualization and pattern strategies. The key role of data mining is to identify the relationship between two or more attributes to predict actions or outcomes. | It is used to make predictions, like the approximation of a time duration or a price. It learns the model automatically based on experience and provides effective feedback in real time.
Working | It is the method used to dig deep to find useful information. | It is the technology used to improve machines using a training dataset.
Uses | It is highly used in various research fields, like text mining, web mining, and fraud detection. | It is highly useful for recommending prices and products and for estimating the time required to deliver services.
Concept | It extracts the essential information using mining methods and finds patterns and trends. | Here, the concept is learned from the existing data, and the behavior changes for future input.
Method | It performs the analysis in batch format and produces the results at a certain time rather than on a continuous basis. | It runs continuously, increases the system performance automatically, and analyzes whether any failure may occur in the future.
Nature | It requires human intervention for extracting the information using the mining techniques. | It is entirely different from data mining, as it learns the patterns automatically.
Learning capability | It requires the analysis to be initiated by a human, and hence data mining is a manual method. | It is a step ahead of data mining, as AI uses the same methods as data mining but adapts to changes and learns the patterns automatically. The result of AI is more accurate than the output of data mining.
Implementation | It involves building a model to which the data mining methods can be applied. It uses the data mining engine, pattern evaluation, and a database to discover knowledge. | It can be implemented using AI algorithms, such as decision trees, neuro FS, and neural networks. It uses the neural network for predicting the outcomes.
Accuracy | The accuracy depends on how the information is gathered. It generates an accurate result, as it involves human intervention. | It is reported that AI results are more accurate than data mining outcomes.
Applications | It can generate results with a smaller volume of data. | Here, the data collected from various sources is transformed into a standard format for the machine to understand. Moreover, it requires a large volume of data to produce an accurate result.
Examples | The data mining concept is used in various places, like the analysis of sales trends or patterns. | It is widely used in fields such as image recognition, medical diagnosis, and marketing campaigns.
7.3 Modeling theory based on artificial intelligence and data mining
This section describes the modeling theory based on the AI and data mining approach using database knowledge.
7.3.1 Modeling, prediction, and forecasting of artificial intelligence using solar radiation data
Solar radiation data is widely used in the study of solar energy as a renewable energy source. It is needed to formulate estimation models and to forecast meteorological data, and it plays a key role in various domains. The applications of AI include modeling hourly, daily, and monthly solar radiation data, as well as insolation forecasting and index modeling with solar data. AI methods are applied to solar radiation data to solve technical issues in the data analysis field. Moreover, they offer an alternative to the existing statistical methods in the scientific fields.
7.3.2 Modeling of artificial intelligence in the environmental systems
AI or knowledge-based methods are widely used, alongside classical methods, to model environmental systems. The applicability of some of these methods to environmental systems using data mining approaches is discussed in this section. The methods used in environmental systems include:
• ANNs
• GA
• Multiagent system (MAS)
• Reinforcement learning (RL)
• Cellular automata
• Fuzzy models
• Rule-based systems (RBSs)
• Case-based reasoning (CBR)
• Swarm intelligence
• HSs
7.3.2.1 Case-based reasoning
CBR solves a problem by recalling the solutions of past problems. It requires a number of past cases in order to generate a solution for a new problem. It recognizes problems quickly and improves its problem solving through accrued learning and repeated attempts. CBR involves four steps:
• Retrieve the related past cases from the database.
• Reuse the knowledge of the previous cases to generate a solution to the new problem, based on repeated attempts.
• Revise the solution through test execution or simulation.
• Retain the solution for future use after a successful adaptation.
Retrieval relies on either the semantic or the syntactical similarity to the new case. Here, semantic refers to the meaning, whereas syntactical refers to the grammatical structure. Syntactic similarity can be easily applied in the environmental system using the data mining approach, whereas semantic matching is used by the CBR model depending on the context. The major retrieval methods used by CBR are the knowledge-guided, inductive, and nearest neighbor methods. The nearest neighbor method finds the solution for a new case based on the features of the stored cases and their weights. Determining these weights is a challenging task in the environmental system. The method also overlooks the fact that the way features influence one another is case-specific. Its retrieval time increases linearly with the size of the database, hence this method consumes more time. The inductive method decides which features discriminate the cases best. Here, the cases are organized in the form of a decision tree (DT) based on these features, which effectively reduces the retrieval time. However, this method demands a case database of reasonable quality and size.
The knowledge-guided retrieval approach uses the existing knowledge to identify the essential features of each case, and thereby, it accesses all the cases independently. However, it is highly expensive for large-volume databases. Fig. 7–5 depicts the process of CBR. CBR uses past knowledge to find the solution for a new case. The adaptation in CBR can be derivational or structural. Structural adaptation generates the solution by modifying the solution of a previous case, whereas derivational adaptation reuses the methods, rules, or algorithms that produced the previous solution to find the solution of the new case. Once the solution is confirmed, it is recorded in the database.
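Nearest neighbor retrieval, followed by reuse and retention, can be sketched as below. The case base, the water-related feature names, the weights, and the weighted Euclidean distance are all hypothetical choices made for illustration; the text does not prescribe a particular similarity measure.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature dictionaries."""
    return math.sqrt(sum(w * (a[k] - b[k]) ** 2 for k, w in weights.items()))

def retrieve(case_base, query, weights):
    """Linear scan over all cases, so the cost grows with the database size."""
    return min(case_base,
               key=lambda case: weighted_distance(case["features"], query, weights))

# Hypothetical past cases.
case_base = [
    {"features": {"ph": 7.0, "turbidity": 1.0}, "solution": "no action"},
    {"features": {"ph": 5.5, "turbidity": 4.0}, "solution": "add lime"},
    {"features": {"ph": 7.2, "turbidity": 9.0}, "solution": "increase filtration"},
]
weights = {"ph": 2.0, "turbidity": 1.0}      # assumed feature weights
query = {"ph": 5.8, "turbidity": 3.5}

best = retrieve(case_base, query, weights)   # retrieve the most similar case
proposed = best["solution"]                  # reuse its solution unchanged
# A revise step (test or simulation) would go here; it is omitted in this sketch.
case_base.append({"features": query, "solution": proposed})   # retain
print(proposed)
```

The `min` over the whole list makes the linear growth of retrieval time explicit; an inductive method would replace this scan with a decision tree lookup.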
FIGURE 7–5 Process of CBR. CBR, case-based reasoning.
7.3.2.2 Rule-based system
This section describes the RBS in modeling the AI technique.
• The RBS solves a problem by using rules derived from expert knowledge. The rules have a condition part and an action part, namely if and then, and are passed to the inference engine, which comprises the working memory holding the knowledge of the current problem, the pattern matcher, and the rule applier.
• The pattern matcher examines the working memory to determine which rules are relevant and then selects the rule to be applied by the rule applier. The new information generated by the action part of the rule is added to the working memory. The iteration is repeated to find the relevant rules between the knowledge base and the working memory.
• The rules and facts may be imprecise. Uncertainty can be integrated into the RBS using methods like Prospector’s subjective Bayesian model, certainty factors, Dempster-Shafer theory, subjective probability theory, and qualitative methods like Cohen’s theory of endorsements.
• Human experts assign uncertainty values, like membership values, belief functions, and probabilities, to the rules and facts.
RBS uses two different rule systems, namely forward and backward chaining.
Forward chaining: It is data-driven and draws conclusions from the initial facts using the rules.
Backward chaining: It is goal-driven and looks for rules by starting from the hypothesis. It seeks the justification for a decision.
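The two chaining strategies can be sketched with a tiny rule set. The rules and fact names are hypothetical, uncertainty values are left out, and the backward chainer assumes the rules contain no cycles.

```python
def forward_chain(facts, rules):
    """Data-driven: fire every rule whose conditions hold in working memory
    until no rule adds a new fact."""
    working_memory = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if (conclusion not in working_memory
                    and all(c in working_memory for c in conditions)):
                working_memory.add(conclusion)    # action part adds a new fact
                changed = True
    return working_memory

def backward_chain(goal, facts, rules):
    """Goal-driven: a goal holds if it is a known fact or if some rule
    concludes it and all of that rule's conditions can be justified."""
    if goal in facts:
        return True
    return any(all(backward_chain(c, facts, rules) for c in conditions)
               for conditions, conclusion in rules if conclusion == goal)

# Hypothetical if-then rules: (conditions, conclusion).
rules = [
    (("low_oxygen", "high_temperature"), "algal_bloom_risk"),
    (("algal_bloom_risk",), "issue_warning"),
]
facts = {"low_oxygen", "high_temperature"}

derived = forward_chain(facts, rules)
print(sorted(derived))
print(backward_chain("issue_warning", facts, rules))
```

Forward chaining derives everything that follows from the facts, while backward chaining explores only the rules needed to justify the chosen hypothesis.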
7.3.2.3 Reinforcement learning
RL is a learning process that takes place between a learning agent and its environment. The agent learns about the system and achieves its goal through trial and error. The RL problem consists of three parts, namely
• Environment
• Reinforcement function
• Value function
Here, the environment is dynamic and contains a set of possible states. Each state offers a set of possible actions at each time step. The agent selects the actions that maximize the total reward collected from the initial state to the goal state. Reinforcement functions fall into three classes, namely minimum time to goal, games, and pure delayed reward.
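The three parts can be made concrete with tabular Q-learning on a toy problem. This is a sketch under assumptions: the five-state corridor, the reward of 1 at the goal (a pure delayed reward), the purely random exploration, and the learning parameters are all invented for the example, and Q-learning is only one of several RL algorithms.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right

def step(state, action):
    """Environment plus reinforcement function: returns (next_state, reward, done).
    The reward is delayed: 1.0 on reaching the goal, 0.0 otherwise."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]       # value function Q(s, a)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)            # trial and error: random moves
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
# Greedy policy: in each non-goal state pick the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Because Q-learning is off-policy, the greedy policy can be read from the value table even though the agent explored at random.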
7.3.2.4 Multiagent systems
The application of AI for modeling the environment using the MAS is discussed in this section.
• The MAS contains a network of agents that interact with each other to achieve their goals.
• Here, an agent is a software component that contains data and code. It is infeasible for a single agent to solve the whole problem assigned to the MAS.
• The agents therefore communicate with one another using a high-level agent communication language (ACL) to share data and request services.
• Knowledge Query and Manipulation Language (KQML) is the most commonly used ACL.
Communication in the MAS is organized in layers:
• Communication layer: It covers the lower level parameters, like the communication identifier, recipient, and sender.
• Message layer: It defines the performative and the interpretation protocol.
• Content layer: It holds further information regarding the performative.
The coordination of agents is essential in the MAS, as messages are relayed and responses are processed asynchronously between the agents. The coordination depends on the infrastructure of the agent system, which governs the flow of information, the resource usage, the nature of the interaction, and the degree of concurrency between the agents. The simplest infrastructure used in the MAS is the peer-to-peer network, where the agents communicate directly with one another. It can be effectively used to solve problems by concurrent processing when no conflict resolution is needed. The additional infrastructures used in the MAS are the federated agent and the multiagent blackboard.
Federated agent: It is a network that contains facilitator agents, which act as mediators between the agents.
Multiagent blackboard: It is a network that contains a central controller for coordinating the activity of the agents and a public blackboard space, where the data is shared.
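The layered message and the peer-to-peer infrastructure can be sketched as follows. This is a loose illustration and not a KQML implementation: the class names, the field names, the agent names, and the content dictionaries are hypothetical, and only the performative names `ask-one` and `tell` are borrowed from KQML.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    # Communication layer: who is talking to whom, and a message identifier.
    sender: str
    receiver: str
    message_id: str
    # Message layer: the performative that says how to interpret the content.
    performative: str
    # Content layer: the information the performative refers to.
    content: dict

@dataclass
class Agent:
    name: str
    inbox: list = field(default_factory=list)

    def send(self, other, performative, content, message_id):
        """Peer-to-peer delivery: the message goes directly to the other agent."""
        other.inbox.append(
            Message(self.name, other.name, message_id, performative, content))

# Hypothetical agents exchanging a query and its answer.
sensor = Agent("sensor_agent")
planner = Agent("planner_agent")
planner.send(sensor, "ask-one", {"query": "river_level"}, "m1")
sensor.send(planner, "tell", {"river_level": 2.4}, "m2")
print(sensor.inbox[0].performative, planner.inbox[0].content)
```

In a federated infrastructure, `send` would hand the message to a facilitator agent instead; in a blackboard infrastructure, agents would read and write a shared store.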
7.3.3 Integration of artificial intelligence into water quality modeling
In the analysis of coastal water processes, different methods are used to simulate the flow of water and to solve water quality problems. The rapid growth of numerical methods has produced a large number of models for use in environmental and engineering problems, and many of the available water quality methods have become quite mature. The numerical methods are based on the finite element method, the Eulerian–Lagrangian method, the boundary element method, and the finite difference method. The time-stepping scheme can be characteristic based, explicit, or implicit, and the shape function can be of first, second, or higher order. Water quality models are divided according to their spatial dimensions into:
• Single-dimensional model
• Two-dimensional depth averaged method
• Two-dimensional layered method
• Three-dimensional method
Moreover, the analysis of water quality and coastal hydraulics generally involves empirical and heuristic experience and is carried out using modeling and simplification methods chosen on the basis of the experience of specialists. The accuracy of the prediction depends on the accuracy of the open boundary conditions, the parameters used, and the numerical methods adopted.
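As a small illustration of an explicit finite difference scheme in a single-dimensional model, the sketch below diffuses a pollutant concentration along a channel. It is an assumption-laden toy, not one of the water quality models discussed here: it includes diffusion only (no flow), uses a periodic domain, and all numerical values are invented.

```python
def diffuse_explicit(conc, diffusivity, dx, dt, steps):
    """Explicit (forward-time, centered-space) finite difference update of
    1D diffusion on a periodic domain."""
    r = diffusivity * dt / dx ** 2
    if r > 0.5:
        # Stability limit of this explicit scheme; implicit schemes avoid it.
        raise ValueError("explicit scheme is unstable for r > 0.5")
    c = list(conc)
    n = len(c)
    for _ in range(steps):
        # c[i - 1] wraps around for i = 0, giving the periodic boundary.
        c = [c[i] + r * (c[(i + 1) % n] - 2.0 * c[i] + c[i - 1])
             for i in range(n)]
    return c

# Hypothetical initial condition: a single spike of pollutant mid-channel.
initial = [0.0] * 21
initial[10] = 100.0
final = diffuse_explicit(initial, diffusivity=1.0, dx=1.0, dt=0.25, steps=40)
print(round(max(final), 3), round(sum(final), 6))
```

The spike spreads out while the total mass is conserved. The time step restriction shows how a parameter choice, here `dt`, directly affects whether the simulation is usable.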
Numerical modeling is delineated as the process of transforming knowledge about physical phenomena into a digital format, simulating the behavior, and translating the result back into a knowledge format. An inherent problem is the sensitivity of model manipulation: a slight modification of a parameter can lead to quite different results. The knowledge involved in model manipulation includes real-world observations, water quality, the mathematical description of water movement, and the discretization of the chemical and physical processes, so that the discretized equations can be solved and the output generated more accurately. An experienced modeler detects model failure by comparing the simulated result with real data and by applying a heuristic judgment of the environmental behavior. Many modelers do not have the requisite knowledge for cleaning the input data and evaluating the result. This may result in underutilization of the model, inferior designs, or even total failure. The primary goal of model manipulation is to obtain a satisfactory simulation in coastal engineering. Because the computer has limited speed and memory, these resources must be used effectively to strike a balance between modeling accuracy and computational cost. On the other hand, the modelers use fundamental parameters during the manipulation process.
7.3.4 Modeling of the offset lithographic printing process
This section elaborates on the modeling of the offset printing process.
• Offset lithographic printing is the most commonly used commercial printing method. In general, it is used to produce large numbers of printed materials, like newspapers, catalogs, and magazines. Color pictures are printed using cyan (C), magenta (M), yellow (Y), and black (K) dots of varying size on the surface of metal plates. The metal plates are mounted on the plate cylinders.
• The printing and nonprinting (empty) areas lie in the same plane of the plate and are separated by being ink receptive and water receptive, respectively.
• During thin-layer printing, the ink keys used in the printing process are approximately 4 cm wide, so that 36 ink keys are found across the paper web.
• Each ink key is uniquely responsible for feeding the ink in its respective zone. Adjusting the ink keys is the major control action in the printing process. Here, the color of the printed result can be predicted from the set of input values.
• To determine the color of the printed result, a special test area called a double gray bar is printed in each ink zone at the bottom of the page.
• A color camera records a Red Green Blue (RGB) image of the gray bar, and the RGB values are directly used for representing the colors in RGB color space.
• However, different acquisition equipment provides different RGB values for similar incident light. Moreover, the Euclidean distance in RGB space does not express color differences on a uniform scale, which makes it more complex to evaluate the similarity between two colors from their distance in this space.
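The distance referred to in the last point can be computed as below. The two RGB readings are invented values for illustration only; as the text cautions, this number is not a uniform measure of perceived color difference and depends on the acquisition equipment.

```python
import math

def rgb_distance(color_a, color_b):
    """Euclidean distance between two (R, G, B) triples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(color_a, color_b)))

# Hypothetical gray bar readings: a target value and a measured value.
target = (120, 118, 121)
measured = (126, 118, 113)
print(rgb_distance(target, measured))
```

Equal distances in different regions of RGB space can correspond to visibly different color shifts, which is why such a value needs careful interpretation when it is used to adjust ink keys.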
The offset printing press is equipped with an online press monitoring system to obtain the parameters required for modeling the AI [4]. The printing system consists of the following parts, namely
• Color camera to capture the gray bar image
• Linear transmission unit to traverse the camera across the paper web
• Ink temperature sensor
• Database
• Barcode reader
Fig. 76 represents the schematic diagram of the inking system. However, the computer has a connection with all the functions, and the equipment called the master unit. The master unit is used to control the data sampling process and synchronizes it more effectively. It controls the transmission system and traverses the camera to take the snapshots of print on the paper web. However, it merges the information that is extracted from the image using the status of the printing press, ink temperature, and the data collected from the database. Accordingly, the ink is applied to some specific areas in the printing plate. The ink is shifted from the printing plate to the paper through a blanket cylinder. However, the ink of each color is separately applied to the paper using the inking system. With the data mining method, the printing process can be continuously sampled filtered, and the data is analyzed, and the dataset with the reasonable sizes are continuously processed and updated. However, changes in the dataset are detected, and the processing strategies are retrained in the batch mode. The following steps are required by this model to perform the data analysis process: • Design and train the filter to achieve data analytics procedure • Execute online data mining process based on filter
FIGURE 7–6 Schematic diagram of inking system.
• Detect when the models need to be updated
• Detect whether the filter needs to be modified
• Update the process adaptively
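The steps listed above can be sketched as a simple monitoring loop. This is an illustrative assumption and not the SOM-based strategy of the cited study [4]: a windowed mean stands in for the process model, a fixed plausibility range stands in for the filter, and batch retraining is triggered when the mean absolute error over a recent window exceeds a threshold. Detecting when the filter itself must be modified is omitted. The class name, parameters, and threshold values are all hypothetical.

```python
from collections import deque


class AdaptiveModel:
    """Toy stand-in for a process model that is retrained in batch mode."""

    def __init__(self, window=5, error_threshold=1.0, valid_range=(0.0, 100.0)):
        self.window = window                    # samples kept for retraining
        self.error_threshold = error_threshold  # hypothetical drift limit
        self.valid_range = valid_range          # hypothetical filter limits
        self.recent = deque(maxlen=window)
        self.errors = deque(maxlen=window)
        self.prediction = None
        self.retrain_count = 0

    def passes_filter(self, value):
        low, high = self.valid_range
        return low <= value <= high

    def retrain(self):
        # Batch retraining: here simply the mean of the recent window.
        self.prediction = sum(self.recent) / len(self.recent)
        self.errors.clear()
        self.retrain_count += 1

    def update(self, value):
        """Process one sample; return True if the model was retrained."""
        if not self.passes_filter(value):
            return False                        # sample rejected by the filter
        self.recent.append(value)
        if self.prediction is None:
            self.retrain()                      # initial training
            return True
        self.errors.append(abs(value - self.prediction))
        window_full = len(self.errors) == self.window
        if window_full and sum(self.errors) / self.window > self.error_threshold:
            self.retrain()                      # change detected in the data
            return True
        return False


model = AdaptiveModel(window=3, error_threshold=1.0)
samples = [10, 10, 10, 10, 500, 20, 20, 20, 20]  # 500 is an outlier
retrained = [model.update(s) for s in samples]
```

In this run the outlier is discarded by the filter, and the shift from 10 to 20 triggers retraining until the windowed model has caught up with the new level.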
7.3.5 Modeling human teaching tactics and strategies for tutoring systems
Here, tutoring system modeling based on the AI and data mining approach is explained as follows:
• The tutoring system aims to reproduce faithfully the teaching repertoire of human expert teachers who work with students to achieve a goal.
• A theoretical model is used as the starting point for designing the tutoring system.
• A limited set of teaching actions and symbolic procedures, such as multiplying fractions, introduces some complexity into the tutoring system; these issues can be addressed using an expert machine teacher.
More expert teaching tactics and strategies are developed through theoretical derivation from learning theories, empirical observation of humans and simulated students, and observation of expert teachers. Drawing on these sources, certain methods are used to address central issues, such as motivating students and dealing with student errors [5]. The theoretical methods enable an intelligent learning environment to evaluate the learning process and make decisions using the following four properties:
• Constructiveness
• Reflectiveness
• Self-regulatedness
• Cumulativeness
Learning opportunities are provided to the learners by taking the learning properties and the learning situations into consideration.
7.3.6 Modeling of artificial intelligence for engine idle speed system and control optimization
In recent decades, automotive engines have been controlled by an electronic control unit (ECU), so the performance of the engine at idle speed is affected by the setup of the control parameters. In spark ignition engines, efficient idle speed performance is required to meet the increasing demands on fuel consumption, pollutant emissions, and vehicle driveability. In general, idle speed control is a major problem at low engine speed, where the aims are to reject disturbances, reduce emissions, and save fuel. From the control point of view, idle speed control is difficult because the engine at idle may be subjected to disturbances from unknown accessory loads and external loads, such as the power steering load, air conditioning, and so on. These factors reduce the engine speed rapidly, and hence the disturbances must be rejected. Currently, the control parameters for engine idle speed in the ECU of cars are formulated as lookup tables called control maps. The engineer must set various maps, including those for target idle speed, ignition, and fuel. Moreover, the idle performance of the engine depends on the experience of the automotive engineer, who must handle a large number of control parameters. Accordingly, idle speed tuning is achieved empirically using dynamometer (dyno) tests. Obtaining the optimal control parameters more systematically would allow resources and time to be used more effectively. In recent decades, researchers have described several idle speed controllers:
• Adaptive control algorithm
• Online proportional integral derivative tuning controller
• Robust control algorithm
• Model-based control algorithm
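A control map of the kind mentioned above can be pictured as a lookup table with interpolation between calibrated breakpoints. The sketch below is purely illustrative: the choice of coolant temperature as the map input, the breakpoint values, and the function name are hypothetical and are not taken from any real ECU calibration.

```python
from bisect import bisect_right

# Hypothetical calibration: coolant temperature (deg C) -> target idle speed (rpm).
TEMPERATURE_AXIS = [-20.0, 0.0, 40.0, 80.0]
TARGET_IDLE_RPM = [1200.0, 1100.0, 900.0, 800.0]


def lookup_map(x, axis, values):
    """Linearly interpolate a 1-D control map, clamping outside the axis."""
    if x <= axis[0]:
        return values[0]
    if x >= axis[-1]:
        return values[-1]
    hi = bisect_right(axis, x)      # index of first breakpoint greater than x
    lo = hi - 1
    fraction = (x - axis[lo]) / (axis[hi] - axis[lo])
    return values[lo] + fraction * (values[hi] - values[lo])


target = lookup_map(20.0, TEMPERATURE_AXIS, TARGET_IDLE_RPM)
```

Tuning such a map on a dyno amounts to choosing every entry of `TARGET_IDLE_RPM` (and of the ignition and fuel maps) by hand, which is why the number of parameters grows quickly.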
7.3.7 Data mining approach for modeling sediment transport
Estimating the rates of sediment transport using the data mining approach is highly important in the context of water management. Even though much research has focused on predicting the suspended load, total load, and bed load, predictive accuracy remains a major issue in sediment transport modeling. Sediment transport is an immensely complex process, so expressing it with a deterministic mathematical model is not feasible in the foreseeable future. Research on applying the data mining approach to sediment transport has opened a new opportunity for processing at the level of a knowledge-based system. When little knowledge is available, it is very difficult to incorporate the relevant information into a mathematical model. The data mining approach is widely used in different branches of science in which existing physically based modeling is too complex. In sediment transport, this practical issue can be addressed using the ANN. The data mining approach has been applied successfully to water engineering problems, which makes it a potential candidate for modeling sediment transport and has led to the development of total load and bed load models using neural networks. The data mining approach uses a nonlinear parametric function whose coefficients are generated from input-output pairs using systematic rules and a specified topology structure. The parametric description is thus approximated, and the general rules are computed through systematic learning. ANNs and model trees (MTs) are used as modeling methods for sediment transport in the water sector. The MT automatically splits the input domain and assigns a multivariate linear regression model to each subarea, which makes the model more accurate for the individual subareas. During training, the relevant information is used to create the structure of the tree, which contains decision nodes that hold an attribute name and branches to other DTs, one for each attribute value; the leaves are answer nodes that hold a linear model. Hence, an MT is a combination of linear models, each of which applies to a specific domain of the input space. To improve generalization, the tree must be pruned to a simpler one. The values obtained at a leaf node are adjusted using a smoothing operation to reflect the values along the path from the root to that leaf node.
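The idea of a model tree can be sketched with a single split on one input variable and a least-squares line in each leaf. This is a simplified illustration and not the MT algorithm used in the sediment transport literature: pruning and smoothing are omitted, the split is chosen by exhaustive search over midpoints, and the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return 0.0, mean_y          # all x equal: fall back to a constant
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def squared_error(xs, ys, model):
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def fit_model_tree(xs, ys, min_leaf=2):
    """Return (threshold, left_model, right_model) for the best single split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                # cannot split between identical x values
        lx, ly = zip(*pairs[:i])
        rx, ry = zip(*pairs[i:])
        left_model, right_model = fit_line(lx, ly), fit_line(rx, ry)
        error = squared_error(lx, ly, left_model) + squared_error(rx, ry, right_model)
        if best is None or error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (error, threshold, left_model, right_model)
    if best is None:
        raise ValueError("not enough distinct points to split")
    return best[1:]


def predict(tree, x):
    """Route x to a leaf and evaluate that leaf's linear model."""
    threshold, left_model, right_model = tree
    slope, intercept = left_model if x <= threshold else right_model
    return slope * x + intercept


# Hypothetical data with two linear regimes: y = x, then y = 20 - 2x.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 1, 2, 3, 12, 10, 8, 6]
tree = fit_model_tree(xs, ys)
```

A full model tree repeats the split recursively and on several attributes; the single split here is enough to show how each leaf carries its own linear model for one region of the input space.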
7.3.8 Modeling of artificial intelligence for monitoring flood defense structures
This section elaborates on the modeling of AI for monitoring flood defense structures. The increasing number of floods caused by the failure of dikes and dams requires some implementation model for detecting failure conditions. The most commonly used methods for detecting the development of failure and critical water levels are
• Data-driven or model-free methods
• Model-based, numerical, or physical methods
7.3.8.1 Data-driven approach
This approach requires a monitoring system that measures the essential characteristics of the object as defined by experts. Data-driven methods are widely used in applications such as hydro-informatics, geodetic deformation analysis, and online real-time computations. These methods depend on the availability of data. In the case of a dike, historical measurements are required to learn the behavior of the object. The class of data-driven methods includes the following:
• Data mining
• Machine learning methods
• Statistical methods, such as clustering, linear correlation, and central moments
• Soft computing
7.3.8.2 Model-based approach
Numerical modeling requires data about the monitored object, but the constructed model does not depend on online measurements. However, sensor values can be used to validate the constructed model or as the initial conditions for modeling. Examples of model-based methods are the finite volume method and the finite element method. Moreover, physical modeling is a time-consuming process that requires more computational resources. Because of this high complexity, the model-based approach cannot be used for processing data in real time.
In the absence of data, the model-based approach is used to generate the required data. For failure patterns, sensor data could otherwise be gathered only at the time of a dike collapse. Accordingly, the physical model is used to generate patterns of failure development, and these data can be used to train data-driven methods for tasks such as classification or pattern recognition [6].
7.3.9 Modeling of artificial intelligence in intelligent manufacturing system
AI is widely applied in the area of intelligent manufacturing through an intelligent manufacturing system. Applying AI and data mining outside such an intelligent system makes little sense [7]. Accordingly, the intelligent manufacturing system can be characterized by the following factors:
• Autonomous intelligent sensing
• Interconnection
• Learning
• Collaboration
• Cognition
• Decision-making
• Execution of human, material, and machine
• Information and environment
The intelligent manufacturing system consists of several layers, which are listed as follows:
• Resources or capabilities layer
• Ubiquitous network layer
• Service platform
• Intelligent cloud service application layer
• Security management and specification system
7.3.9.1 Resources or capabilities layer
The resources or capabilities layer includes hard manufacturing resources, soft manufacturing resources, and manufacturing capabilities.
Hard manufacturing resources: These include robots, machining centers, simulation test equipment, machine tools, computing equipment, energy, and materials.
Soft manufacturing resources: These include information, knowledge, software, and big data of the manufacturing process.
Manufacturing capabilities: These include the design, simulation, production, management, experiment, operation, sales, demonstration, integration, and maintenance of the manufacturing process. Moreover, they include interconnection, network, and digital products.
7.3.9.2 Ubiquitous network layer
The ubiquitous network layer contains a physical network layer, a virtual network layer, a business arrangement layer, and an intelligent sensing or access layer.
• The physical layer contains programmable switches, ground base stations, communication satellites, optical broadband, boats, wireless base stations, aircraft, and so on.
• The virtual layer realizes an open network using northbound and southbound interfaces for equipment management, topology management, data transmission and reception, IPv4 or IPv6 protocol management, quality of service management, and host management.
• The business layer offers network functions in the form of software by decoupling hardware from software and abstracting the functions. This enables quick development and deployment of new business and provides virtual firewalls, payload balancing, traffic monitoring, virtual routers, and virtual wide area networks.
• The intelligent sensing layer senses objects, such as industry, material, machine, and enterprise, through intelligent sensing units, including electronic sensors, radio frequency identification sensors, light, sound, wireless sensor networks, radar, and barcodes.
7.3.9.3 Service platform layer
The service platform layer contains an intelligent UI layer, a core intelligent support function layer, and a virtual intelligent capacities or resources layer.
• The virtual intelligent capacities or resources layer offers the intelligent details and virtual settings of manufacturing capacities or resources, mapping the physical capacities or resources into a pool of virtual intelligent capacities or resources through logical intelligent capacities or resources.
• The intelligent support function layer contains the common cloud platform and the intelligent manufacturing platform. These offer middleware functions, such as intelligent system construction management, intelligent system operation management, intelligent system service evaluation, and the AI engine, as well as manufacturing functions, namely big data, intelligent human hybrid production, intelligent management of decision-making, swarm intelligence design, and online service support.
• The intelligent UI layer supports the interaction equipment of operators, users, and service providers to achieve a customized user environment.
Fig. 7–7 represents the categories of intelligent manufacturing system technology. The intelligent manufacturing system technology is divided into five categories:
• Basic platform technology
• Intelligent platform technology
• Product lifecycle technology
• Supporting technology
• Ubiquitous network technology
FIGURE 7–7 Categories of intelligent manufacturing system technology.
7.3.9.4 General technology
Some of the general technologies included in the intelligent manufacturing system are listed as follows:
• Software-defined networking technology
• Manufacturing architecture technology
• Ground system technology
• Enterprise modeling
• Business model
• Application technology
• Evaluation technology
• Intelligent manufacturing services
• Intelligent standardization technology
7.3.9.5 Intelligent manufacturing platform technology
The intelligent manufacturing platform technology includes:
• Intelligent resources/capacity sensing
• Internet of Things technology
• Intelligent virtual/resource services and capacity technology
• Big data network connection technology
• Intelligent service environment operation/construction/management evaluation technology
• Intelligent model/big data/knowledge management
• Mining and analysis technology
• Hybrid production technology
7.3.9.6 Ubiquitous network technology
The ubiquitous network technology comprises the space-air-ground network technology and the integrated fusion network technology.
7.3.9.7 Product life cycle manufacturing technology
Some of the technologies included in the product life cycle manufacturing technology are
• Cloud product design
• Cloud production equipment technology
• Cloud management and operation technology
• Cloud innovation design technology
• Intelligent cloud simulation
• Intelligent cloud service guarantee technology
7.3.9.8 Supporting technology
The supporting technology comprises communication and information technology, cloud computing technology, manufacturing technologies, such as print and electrochemical technology, and professional technologies, such as aerospace, automobile, and shipbuilding technology.
7.3.10 Constitutive modeling of cemented paste backfill
Mine tailings are the major byproducts that remain after mineral processing, and in certain cases they are disposed of in tailings dams without any reclamation. Thousands of tailings dams are scattered all over the world; they occupy land resources, pollute the mining environment, and pose a risk to human health. In general, over 10 billion tons of mine tailings are discharged every year, which makes the remediation of mine tailings a major challenge in environmental science. Moreover, the remediation of mine tailings incurs enormous financial costs. In the United States, the cost of remediation for nearly 63 national priority mining sites is estimated to exceed 7.8 billion dollars, and 16.5 billion dollars are needed for future sites. These environmental concerns have increased the use of cemented paste backfill (CPB) methods for recycling mine tailings. CPB is a composite mine material produced from dewatered tailings, binders, and water. It is commonly used in various metal mines worldwide. The growth of CPB technology relative to other backfilling categories is primarily related to its economic, environmental, and technical benefits. CPB is also used as ground support for preventing and alleviating the potential risks associated with underground mining. CPB offers an environmentally friendly and safe way to dispose of waste tailings and reduces the volume of tailings deposited on the surface. It can also reduce ore dilution and surface subsidence and speed up production. The advantage of CPB rests on its mechanical stability: it must remain stable even when adjacent stopes are excavated. Various researchers have studied the mechanical properties and response of CPB. Most previous research has focused on the experimental characterization of the mechanical behavior of CPB. However, constitutive modeling of CPB is essential for predicting and assessing its mechanical stability.
Constitutive modeling with the data mining approach has been verified for various types of minerals in the existing literature. The major benefit of the data mining approach is its higher efficiency compared with mathematical expressions. However, data mining approaches suffer from a “black box” nature that makes the physical interpretation of constitutive modeling more difficult. The DT is a nonparametric algorithm that uses a flowchart-like structure to support decision-making. The feature space of the DT is partitioned into subspaces so that the samples within each subspace are homogeneous.
7.3.11 Spontaneous reporting system modeling for data mining methods evaluation in pharmacovigilance
The key role of pharmacovigilance is to detect the adverse effects of marketed drugs. Pharmacovigilance relies on the spontaneous reporting of events that are suspected to be adverse effects of drugs. The detection of drug-event couples is currently human-based and is supported by heuristic rules implemented in basic software. Because of the mass of data, some useful features may be lost, and hence general methods cannot effectively exploit the information present in the database. Data mining methods have been developed to generate signals automatically and to support pharmacovigilance experts. These methods are based on association measures combined with detection thresholds. The measures quantify the difference between the observed number of reports and the number expected under the assumption of drug-event independence. Data mining approaches have been developed for various databases, such as the information component (IC) for the World Health Organization (WHO) database, the empirical Bayes method for the Food and Drug Administration (FDA), and the proportional reporting ratio (PRR) for the Medicines Control Agency (MCA) in the United Kingdom. Moreover, these methods are evaluated using real data from pharmacovigilance databases. The Spontaneous Reporting System relies on the subjective appreciation of the medical community, so it does not offer exhaustive reporting of adverse effects. An adverse effect must first be diagnosed and then be judged worth reporting, with sufficient information. Consequently, it is very difficult to determine the proportion of adverse events that are reported. The reported events are related to the prescribed drugs, yet the co-occurrence of an adverse event and a drug may be coincidental. On the other hand, adverse effects are not always properly reported, and not all reported adverse events are adverse drug reactions. The number of patients exposed to drugs is increasing, and the background incidence of adverse events in the population is poorly known. Acquiring this knowledge requires deep investigation and is not conceivable at the scale of the database. Moreover, the relative risk of a drug-event couple in a real database is unknown.
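The association measures described in this section compare the observed number of reports for a drug-event couple with the number expected under independence. As a minimal sketch, the expected count and the proportional reporting ratio can be computed from a 2×2 table of report counts. The counts and function names below are hypothetical, and the detection thresholds applied in practice are not shown.

```python
def expected_count(a, b, c, d):
    """Expected reports for the couple if drug and event were independent.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with the event, without the drug
    d: reports with neither
    """
    total = a + b + c + d
    return (a + b) * (a + c) / total


def proportional_reporting_ratio(a, b, c, d):
    """PRR = [a / (a + b)] / [c / (c + d)]."""
    if a + b == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


# Hypothetical database of 10,000 reports.
observed = 20
expected = expected_count(20, 80, 100, 9800)
prr = proportional_reporting_ratio(20, 80, 100, 9800)
```

Here the couple is reported far more often than independence would predict, so the PRR is well above 1. When the observed count equals the expected count, the PRR equals 1.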
7.3.11.1 Summary
This chapter covers the background of AI and the modeling of AI with data mining approaches. The primary concepts of AI and the applications involved in AI modeling are elaborated, along with the branches of AI modeling and their computing processes. The modeling theories adopted on the basis of AI concepts and data mining methods are also discussed.
References
[1] Mellit A. Artificial intelligence technique for modelling and forecasting of solar radiation data: a review. Int J Artif Intell Soft Comput 2008;1(1):52–76.
[2] Chen SH, Jakeman AJ, Norton JP. Artificial intelligence techniques: an introduction to their use for modelling environmental systems. Math Comput Simul 2008;78(2–3):379–400.
[3] Mellit A, Kalogirou SA. Artificial intelligence techniques for photovoltaic applications: a review. Prog Energy Combust Sci 2008;34(5):574–632.
[4] Englund C, Verikas A. A SOM-based data mining strategy for adaptive modelling of an offset lithographic printing process. Eng Appl Artif Intell 2007;20(3):391–400.
[5] Du Boulay B, Luckin R. Modelling human teaching tactics and strategies for tutoring systems: 14 years on. Int J Artif Intell Educ 2016;26(1):393–404.
[6] Pyayt AL, Mokhov II, Kozionov A, Kusherbaeva V, Melnikova NB, Krzhizhanovskaya VV, et al. Artificial intelligence and finite element modelling for monitoring flood defence structures. IEEE Workshop Environ Energy Struct Monit Syst 2011;1–7.
[7] Li BH, Hou BC, Yu WT, Lu XB, Yang CW. Applications of artificial intelligence in intelligent manufacturing: a review. Front Inf Technol Electron Eng 2017;18(1):86–96.
8 Biomedical data mining for improved clinical diagnosis
G. Nalinipriya1, M. Geetha2, R. Cristin3, Balajee Maram3
1 Department of Information Technology, Saveetha Engineering College, Chennai, India
2 SRM Institute of Science and Technology, Vadapalani Campus, Chennai, India
3 Department of CSE, GMR Institute of Technology, Rajam, Andhra Pradesh, India
8.1 Introduction
Data mining is defined as the process of identifying patterns, associations, or relationships among data using various analytical approaches that involve the creation of a model; the final outcome is useful knowledge or information. Data mining has grown rapidly in various research fields because of its wide range of approaches and applications for mining data in an appropriate manner. Nowadays, there is a need for data mining researchers to collaborate with biologists and clinical scientists. Thus data mining researchers have the opportunity to contribute to the development of the clinical and life sciences by introducing computational techniques for discovering useful knowledge from large-scale biomedical data. The ultimate goal of biological data mining in healthcare applications is to seek hidden trends and patterns in voluminous data, and it is broadly used to predict various diseases in the medical industry. This motivates researchers to employ the data to improve services for the public and to predict disease before it is too late. In addition, recent developments in data mining research have led to various scalable and efficient methods for predicting diseases.
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00012-4 © 2021 Elsevier Inc. All rights reserved.
8.2 Descriptions and features of data mining
The healthcare information system aims to capture data on the health status of people and on individual satisfaction in society. Data mining has been introduced to serve medical science and has been shown to be a sensitive, valid, and reliable method for discovering relationships and patterns. Thus data mining tools are employed to identify unknown patterns and relationships in large datasets. Data mining represents a significant advance in the kind of analytical tools currently available, and medical research benefits from its application in several areas of interest. Data mining involves a number of statistical approaches, such as link analysis, disproportionality assessment, and deviation detection. The advantages of data mining in the medical field are illustrated below.
• Healthcare insurers can detect fraud and abuse.
• Physicians can identify effective treatments and best practices.
• Patients receive better and more affordable healthcare services.
• One of the major benefits of data mining is that it speeds up work on huge datasets. Quicker report creation and faster analysis maximize operational efficiency at lower operating cost.
• Data mining is used to extract predictive information from vast databases, which is one of its most significant uses.
8.3 Revolution of data mining
Data mining has been a branch of applied artificial intelligence (AI) since the 1960s. As an approach to data analysis and knowledge discovery, it is a relatively novel concept that developed in the mid-1990s. In its earliest stage, data mining worked on file processing. The next stage was the database management systems that emerged from the early 1970s to the 1980s, which were based on data modeling tools, Online Transactional Processing, and query processing. Data mining has made a breakthrough in processing and analyzing data, and it was rated as one of the most promising techniques in the 2001 Massachusetts Institute of Technology review [1]. In the mid-1980s, advanced database systems were introduced, featuring application-oriented processes and data models. In 1995, the first Association for Computing Machinery conference on Knowledge Discovery and Data Mining was held in the United States, and in late 2009, the term data mining was first registered for the 2010 Medical Subject Headings. While data mining originated mainly from work in statistics and machine learning, it has advanced from these beginnings to include database design, pattern recognition, visualization, AI, and so on. In the last 10 years, the number of papers with data mining in the title that are indexed in Medical Literature Analysis and Retrieval System Online has grown 10-fold. The activities of interdisciplinary researchers promoting clinical data mining approaches are likely to be one of the reasons for this explosion. A workshop on intelligent data analysis and data mining in biomedicine and pharmacology has been held since 1996. This workshop, termed IDAMAP, is an opportunity for practitioners and researchers to discuss data analysis techniques in the biomedical area.
8.4 Data mining for healthcare
Data mining is a motivating research area for identifying meaningful information in large datasets. Nowadays, data mining has become very popular in healthcare for detecting unknown and valuable information in health data using efficient analytical methods. In the health industry, data mining provides various advantages, such as fraud detection in health insurance, identification of the causes of disease, availability of medical solutions to patients at limited cost, and identification of medical treatments. In addition, data mining techniques help healthcare experts to make the best healthcare policies, establish health profiles of individuals, design drug recommendation systems, and so on. The data produced by health organizations are very complex and huge, which makes it tough to analyze them in order to make significant decisions about patient health. Here, the data comprise details about patients, hospitals, treatment costs, medical claims, and so on. Hence, a powerful tool is required to analyze and extract the important information from the complex data. Health data analysis enhances healthcare by improving the management of patient-related tasks. The output of data mining technologies thus allows a healthcare organization to group patients with the same kind of health issues so that it can deliver better treatment. It is also used to predict the length of patients' stay in the hospital and to plan for effective system management. In addition, data mining techniques are employed to analyze several factors that are responsible for diseases, such as working environments, types of food, living conditions, healthcare services, availability of pure water, agricultural factors, and cultural environment [2]. The mapping of data mining techniques to applications is depicted in Fig. 8–1. In the figure, the data mining tasks are classified into predictive and descriptive. The predictive task is used to predict the future or the outcome of interest. In healthcare, predictive analytics enables the best decisions to be made, allowing care to be personalized to each patient, whereas the descriptive tasks refer to the analysis of data collected concerning the patient’s behavior and diagnosis, clinical data, and other healthcare activities. Here, the data are collected from electronic health records and are further employed for making clinical and business decisions. The main tasks of predictive and descriptive data mining are clustering, association, and classification.
Clustering: Clustering is used to find similarities between data points. Data points within the same cluster are more similar to one another than to data points in another cluster. Clustering needs little prior information to analyze the data. Hence, it is broadly used for analyzing microarray data, for which the details of genes are limited.
Classification: The classification technique is used to predict the target class of each data point. In medicine, a patient is classified as a “low-risk” or “high-risk” patient based on the disease pattern.
Association rule: In the medical field, association is used to detect relationships between health states or symptoms. This approach is also very useful for finding improper prescriptions and fake or irregular patterns in the medical claims made by hospitals, physicians, patients, and so on.
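The association task described above can be made concrete with the two standard rule measures, support and confidence. The sketch below uses a handful of hypothetical patient records, each a set of observed conditions; the records, the rule, and the function names are invented for illustration only.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for record in records if items <= record) / len(records)


def confidence(records, antecedent, consequent):
    """Estimated P(consequent | antecedent) over the records."""
    base = support(records, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs")
    return support(records, set(antecedent) | set(consequent)) / base


# Hypothetical records: each set lists conditions noted for one patient.
RECORDS = [
    {"obesity", "hypertension", "diabetes"},
    {"obesity", "diabetes"},
    {"hypertension"},
    {"obesity", "hypertension"},
    {"diabetes"},
]

rule_support = support(RECORDS, {"obesity", "diabetes"})
rule_confidence = confidence(RECORDS, {"obesity"}, {"diabetes"})
```

For the rule "obesity implies diabetes", two of the five records contain both conditions and three contain obesity, so the support is 2/5 and the confidence is 2/3. Rule mining algorithms search for all rules whose support and confidence exceed chosen thresholds.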
8.4.1 Applications for mining healthcare data
Data mining is very useful in healthcare research and in the healthcare industry for making valuable decisions. Some of the data mining applications in healthcare are illustrated below.
FIGURE 8–1 Mapping of data mining techniques versus application.
8.4.1.1 Hospital infection control
A nosocomial infection is an infection acquired in a hospital or healthcare facility. Such infections affect millions of patients in the United States every year, and the rate of drug-resistant infections is very high. Early recognition of outbreaks and of emerging resistance requires proactive surveillance. Here, computer-assisted surveillance is used for deviation detection and for identifying high-risk patients on the occurrence of predefined events. The surveillance system employs data mining approaches to find interesting patterns in the infection control data. This system collects data from information management systems, applies association rules to the patient care data, and produces monthly patterns for infection control.
8.4.1.2 Ranking hospitals
Data mining techniques are employed to study the details of hospitals for ranking purposes. Organizations rank hospitals all over the world based on their capability to handle patients with severe illness, so that higher-ranked hospitals are considered more appropriate for managing high-risk patients. Standardized reporting is very important because a hospital that underreports risk factors receives lower predictions for patient mortality. Even if its success rate equals that of other hospitals, such a hospital is ranked low because it shows a greater difference between the actual and the predicted mortality.
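The effect of underreporting can be illustrated with a small sketch. It assumes that the ranking compares observed deaths with the deaths expected from per-patient predicted risks, as an observed-to-expected ratio; the function name `observed_to_expected` and all numbers are hypothetical and chosen only for illustration.

```python
def observed_to_expected(deaths, predicted_risks):
    """Ratio of observed deaths to the deaths expected from predicted risks.

    predicted_risks holds one predicted mortality probability per patient.
    A ratio above 1 means more patients died than the risk model predicted.
    """
    expected = sum(predicted_risks)
    if expected <= 0:
        raise ValueError("expected mortality must be positive")
    return deaths / expected


# Hypothetical example: two hospitals treat equally sick patients and both
# observe 10 deaths among 100 patients (the same success rate).
# Hospital A reports all risk factors, so each patient has predicted risk 0.10.
ratio_full_reporting = observed_to_expected(10, [0.10] * 100)
# Hospital B underreports risk factors, so the model predicts only 0.05.
ratio_underreporting = observed_to_expected(10, [0.05] * 100)
```

In this made-up case the underreporting hospital ends up with a ratio of about 2 instead of about 1, so it looks worse despite identical outcomes.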
8.4.1.3 Identification of high-risk patients
American Healthways provides hospitals with diabetes disease management services in order to enhance the quality and reduce the cost of care for diabetic patients. American Healthways uses predictive modeling to differentiate low-risk from high-risk patients. Based on predictive modeling, healthcare providers identify the high-risk patients, who require more attention to their health. Here, the patient information is explored and combined to predict short-term health issues and to intervene proactively for better long- and short-term results.
8.4.1.4 Diagnosis and prediction of diseases
The prediction and diagnosis of disease are very important in the healthcare industry. Doctors use data mining techniques to improve the health services they provide. No patient can afford to waste money and time on an incorrect treatment that harms his or her health.
8.4.1.5 Effective treatments
Data mining analyzes the effectiveness of treatments by comparing several factors, such as symptoms, causes, treatment costs, and side effects. For instance, effective treatments are found by comparing the treatment results of different patients who are affected by a similar disease but treated with different drugs.
8.4.1.6 Best quality services provided to the patients
Data mining helps provide better-quality services to patients. Here, data mining is applied to large medical datasets in order to extract interesting, previously unknown patterns. Using these patterns, the quality of the services and the care provided to the patients are improved. In addition, data mining is very helpful for understanding patients’ needs and thereby enhancing the services provided by healthcare organizations.
8.4.1.7 Insurance abuse and fraud reduction
Healthcare insurers build models to identify unusual patterns in the claims submitted by patients, hospitals, and physicians. In 1998, the Texas Medicaid Fraud and Abuse Detection System saved millions of dollars by identifying fraud and abuse using data mining techniques.
8.4.1.8 Appropriate hospital resource management
Hospital resource management is very important in the healthcare industry. Here, data mining is used to construct models for managing hospital resources. Group Health Cooperative employs data mining to provide services to hospitals at a limited cost. In addition, Blue Cross uses data mining to manage diseases efficiently by reducing costs and improving outcomes.
8.4.1.9 Better treatment approaches
Data mining helps both the patient and the doctor to select the best treatment by comparing entire treatment processes. Here, better treatment approaches are selected in terms of cost and effectiveness. In addition, data mining is employed to identify the side effects of treatments and hence to reduce the risk to patients.
8.5 Data mining for biological application
Biologists and clinicians are stepping up their efforts to unravel the biological processes that underlie disease pathways in clinical contexts. This has resulted in a flood of clinical and biological data, ranging from genomic sequences to DNA microarrays, biomolecular interactions, protein and small-molecule structures, biomedical images, gene ontology, disease pathways, and electronic health records. Biomedical data are now generated at an unprecedented rate, faster than they can be analyzed effectively. The focus of biomedical research is therefore shifting from data generation to knowledge discovery, termed biological data mining [3]. The emphasis is no longer on enabling clinicians and biologists to generate more data rapidly, but on converting the data into useful knowledge through data mining and acting on it in a timely manner. In this case, biological data mining is developed to understand intrinsic disease mechanisms so that new drugs can be introduced for the patients’ benefit. A biological data set consists of several kinds of data, which store the operational details of a single-cell organism. The information within the database is listed below:
Genome: Gene locations and DNA sequence.
Proteome: The organism’s full complement of proteins, which is not necessarily a direct mapping from its genes.
Metabolic pathways: Linked biochemical reactions that involve multiple proteins, small molecules, and protein–protein interactions.
Regulatory pathways: The mechanisms by which the expression of some genes into proteins, such as transcription factors, influences the expression of other genes; these include protein–DNA interactions.
8.5.1 DNA sequence analysis
DNA is a very important molecule that contains the genetic information of all living organisms. The human genome comprises about three billion base pairs, which are arranged into 23 chromosomes. DNA consists of two biopolymer strands that form a double helix. These two strands are built from simpler units, termed nucleotides. The nucleotides in a DNA sequence are denoted by their bases: cytosine (C), adenine (A), thymine (T), and guanine (G). The analysis of DNA sequences underlies many medical and scientific advances, because it reveals the similarities and differences between organisms and allows the evolutionary relationships between them to be explored. This process requires comparing related DNA sequences, for instance, checking whether one sequence is a subsequence of another or comparing the occurrences of a specific subsequence across related DNA sequences. Identifying and comparing DNA sequences enables vital information to be extracted from reported data. It is used to infer the biological function and the evolutionary history of a query sequence. Comparing a query with known RNA or DNA sequences has become a popular tool for drug design, biodiversity studies, phylogenetic analysis, pharmacogenomics, epidemiology, and genetic disease detection. With the development of computational biology and bioinformatics, the volume of collected DNA sequences is growing rapidly, doubling every 18 months. Because of the large scale and complex structure of DNA datasets, DNA sequence analysis is a very challenging task in computational biology and bioinformatics. Sophisticated, fast, and parallel computing techniques are needed to analyze a large number of DNA sequences within a reasonable time. Thus parallel techniques have been introduced to solve large-scale DNA sequence computation problems.
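The two comparisons mentioned above, checking whether one sequence occurs inside another and comparing the occurrences of a specific subsequence across related sequences, can be sketched in a few lines. This is a minimal illustration that treats a subsequence as a contiguous run of nucleotides; the function names and example sequences are hypothetical, and real tools use indexed or parallel search rather than this linear scan.

```python
def find_occurrences(sequence, pattern):
    """Return the start positions of every (possibly overlapping)
    occurrence of pattern inside sequence."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        # Move one position forward so overlapping matches are found too.
        start = sequence.find(pattern, start + 1)
    return positions


def occurrence_counts(sequences, pattern):
    """Count the occurrences of pattern in each named sequence."""
    return {name: len(find_occurrences(seq, pattern))
            for name, seq in sequences.items()}


# Hypothetical sequences, used only to show the calls.
related = {"seq_a": "ACGTACGT", "seq_b": "ATATAT", "seq_c": "GGGG"}
counts = occurrence_counts(related, "CGT")
```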
8.5.2 Protein sequence analysis
Protein sequence analysis is the process of examining a peptide or protein sequence in order to study its function, features, evolution, and structure. The most sensitive sequencing method in use is based on Edman degradation chemistry. In this approach, the compound phenyl isothiocyanate reacts with the free N-terminus of the peptide or protein to give a thiourea derivative. Subsequent treatment with acid cleaves the modified N-terminal amino acid from the protein. This material is collected and analyzed to identify the amino acid. Protein sequencing is more difficult than DNA sequencing, but it occupies a key position between modern molecular biology and classical biochemistry. In many cases, protein sequencing is the path to the gene: identifying the sequence of a short region of the protein permits oligonucleotide probes for the related gene to be synthesized and used to isolate the gene by recombinant DNA methods.
After the gene has been isolated, molecular biological approaches are used to find the complete protein sequence and the corresponding gene structure and to analyze expression in different tissues. Sequence comparison, pattern finding, and similarity search are the basic operations of protein sequence analysis. The basic algorithms and the mathematical theory of sequence analysis were developed to predict the phylogenetic relationships of protein sequences during evolution. In addition, several statistical methods, algorithms, models, and computational techniques are applied to protein sequence analysis. Several kinds of sequence alignment tools, namely pairwise alignment and multiple sequence alignment tools, are used for the analysis of protein sequences. Here, pairwise alignment tools, such as the Basic Local Alignment Search Tool (BLAST), and multiple sequence alignment tools, such as Clustal W, build on dynamic programming. Building on these, tools have been established for constructing phylogenetic trees based on sequence alignment principles and various probability models. Hidden Markov models (HMMs) are widely used to study protein families, to identify protein structural motifs, and to predict gene structure. HMMER is a popular HMM tool, which is used to find the conserved sequence domains in a set of protein sequences and the spacer regions between them. In addition, several stochastic and probabilistic models have been introduced to solve search problems.
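The dynamic programming idea behind pairwise alignment can be sketched with the classic Needleman–Wunsch global alignment recurrence. This is a minimal sketch under simple assumptions: the match, mismatch, and gap scores are arbitrary illustrative values, and production tools such as BLAST and Clustal W add heuristics and amino acid substitution matrices on top of this principle.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of sequences a and b.

    score[i][j] holds the best score for aligning a[:i] with b[:j].
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            # Align a[i-1] with b[j-1] (match or mismatch) ...
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch)
            # ... or insert a gap in one of the two sequences.
            gap_in_b = score[i - 1][j] + gap
            gap_in_a = score[i][j - 1] + gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    return score[-1][-1]
```

With these assumed scores, two identical sequences of length four score 4, and deleting one residue from one of them lowers the score to 1 (three matches and one gap).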
8.5.3 Gene expression analysis
Gene expression analysis studies the activity of genes, that is, the formation of gene products from their coding genes. It is a sensitive indicator of biological activity, because a change in the gene expression pattern reflects a change in the biological process. The expression of many genes is determined by measuring mRNA levels with techniques such as expressed cDNA sequence tag sequencing, microarrays, massively parallel signature sequencing, and serial analysis of gene expression (SAGE) tag sequencing. Several tools have been introduced for separating signal from noise in high-throughput gene expression studies. Microarray data analysis is partitioned into gene selection, clustering, and classification. A microarray dataset is a repository of microarray gene expression data. Because such a dataset typically measures far more genes than samples, the likelihood of identifying false positives increases. In gene expression analysis, the genes are the features, and gene selection is an important step for identifying the genes that characterize a particular class and for reducing the feature dimension. In addition, the vast majority of genes are irrelevant to the classification step, and performing gene selection reduces the danger that they overshadow the relevant genes. Another prominent method in gene expression analysis is clustering. Clustering methods are categorized into one-way and two-way clustering. One-way clustering groups either the genes with the same behavior or the samples with the same gene expression, whereas two-way clustering clusters the genes and the samples simultaneously. Hierarchical clustering is among the most frequently used techniques in gene expression analysis.
An important problem in applying clustering methods to microarray data is the quality of the clusters. Several techniques, such as repeated measurements, bootstrapping, subsampling, and mixture model-based techniques, have been developed for assessing cluster reliability. In microarray analysis, classification is performed to discriminate or predict diseases from gene expression patterns and to identify the best treatment for a given genetic signature. The methods used in gene expression analysis include class prediction and class discovery, the identification of genes with the same expression patterns, and the analysis of gene expression data with plaid models. Most of these methods deal with microarray data analysis.
8.5.4 Gene association analysis
Gene association analysis (GAA) reveals associations between genes that a one-shot clustering strategy would not place next to each other. GAA relies on the application of sophisticated association data mining methods. The discovered associations are expressed as implication rules, which describe how the expression of one gene is associated or linked with the expression of a set of genes. In addition, gene networks can be generated from the discovered associations. In the last few decades, the research community has mainly focused on association mining methods for integrative genomic studies, but these have failed to reveal interesting gene relationships. Here, the gene associations are computed by linking information obtained from several biological data sources. Whether GAA is applied to a single source (microarray) or to an enriched one (microarray and other biological information), identifying interesting gene associations is not a trivial task. The intrinsic characteristics of microarray data expose GAA to the curse of dimensionality, which becomes even more pronounced when other biological information is integrated to enrich the final data model.
8.5.5 Macromolecule structure analysis
The analysis of macromolecules includes the prediction of the secondary structure of proteins and RNA, protein structure classification, the comparison of protein structures, and the visualization of protein structures. Software tools employed for the analysis of macromolecule structure include Cn3D, DALI, and Mfold. Cn3D is used to view 3D structures, DALI to align structures, and Mfold to predict RNA secondary structure. In addition, protein structure databases and their associated tools play a very significant role in structure analysis. Some of the best protein resources used in structure analysis are the Protein Data Bank, the Class, Architecture, Topology, and Homology (CATH) database, the Molecular Modeling Database, the Swiss-Model resource, and the Structural Classification of Proteins database. With the rapid growth of high-throughput structural biology and proteomics, new techniques and tools are required to solve the open problems in structure prediction.
8.5.6 Genome analysis
Genome analysis consists of three components: DNA sequencing, the assembly of the sequence into a representation of the original chromosome, and the analysis and annotation of that representation. Sequencing a complete genome and subsequently annotating its features pose several challenges. The first is how to put the whole genome together from many small pieces of sequence, which relates to sequence assembly and genome mapping. The second is where the genes are located on the chromosome. To address the first challenge, researchers have introduced several tools for assembling a large number of sequences, based on the same algorithms used in basic sequence analysis. The most commonly employed tools are Consed/PHRAP and CAP3. The second challenge corresponds to gene structure prediction, particularly in eukaryotic genomes. The DNA sequence is searched for open reading frames that may encode proteins. Gene prediction in prokaryotic organisms is easier and more accurate than in eukaryotic organisms, because eukaryotic gene structure is more complex owing to its exon–intron structure. GeneMark and Glimmer are software tools employed for predicting genes in prokaryotic genomes.
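The search for open reading frames can be sketched as a scan of the three forward reading frames for a start codon followed, in the same frame, by a stop codon. This is a deliberately simplified sketch: it assumes the standard start codon ATG and stop codons TAA, TAG, and TGA, ignores the reverse strand and introns, and is not how GeneMark or Glimmer work internally (they rely on statistical models). The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for open reading frames on the
    forward strand; end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Codons before the stop codon, counting the start codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # close the candidate
    return orfs
```

For the made-up sequence `CCATGAAATAGGG`, the only frame containing a start codon followed by an in-frame stop is the third one, which yields the ORF `ATGAAATAG` at positions 2 to 11.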
8.5.7 Pathway analysis
In pathway analysis, gene sets related to biological pathways are tested for significant relationships with a phenotype. The primary data for pathway analysis are commonly sourced from genotyping or gene expression arrays. The biological processes in a cell form complex networks of gene products. Pathway analysis tries to build, model, and visualize these networks. Pathway tools are normally associated with a database that stores information about the involved genes, molecules, and biochemical reactions. Several databases and tools, such as the KEGG database, GenMAPP, and EcoCyc/MetaCyc, have been introduced and are used in pathway analysis. The KEGG database contains a huge collection of metabolic pathway graphs, whereas GenMAPP is a pathway building tool for building and viewing pathways that is designed to work with microarray data. With the rapid growth of functional proteomics and genomics, pathway tools are becoming more valuable for understanding biological processes at the system level.
8.5.8 Microarray analysis
Microarray analysis allows biologists to monitor genome-wide patterns of gene expression in a high-throughput manner. Microarray applications generate a large volume of gene expression data with various levels of experimental complexity. For instance, a simple experiment that uses 10,000-gene microarrays with samples collected at five time points for five treatments with three replicates creates a dataset with 0.75 million data points. Hierarchical clustering was the first method applied to find similar gene expression patterns in microarray data. Since then, several data mining methods, such as self-organizing maps, k-means, association rules, support vector machines (SVMs), and neural networks, along with the GeneSpring and Spotfire software packages, have been used for microarray analysis. Nowadays, microarray analysis goes far beyond clustering. By integrating a priori biological knowledge, microarray analysis has become a popular way to model biological systems at the molecular level. For instance, sequence analysis approaches are combined with clustering algorithms to identify common promoter motifs in the clusters found in microarray data.
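The dataset size quoted in the example experiment of this section follows from multiplying the experimental factors, assuming one measurement per gene for every combination of time point, treatment, and replicate:

```latex
\[
10{,}000\ \text{genes} \times 5\ \text{time points} \times 5\ \text{treatments} \times 3\ \text{replicates}
= 750{,}000\ \text{data points} = 0.75\ \text{million}
\]
```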
8.5.9 Computational modeling of biological networks
Computational modeling of biological networks has received great attention following the development of high-throughput techniques for studying gene expression (microarray technology) and proteomics (2D protein gels, mass spectrometry, and protein chips). Gene microarrays and proteomics generate large amounts of data, which provide rich resources for modeling complex biological systems.
8.5.9.1 Biological networks
The molecular interactions in a cell can be represented as a graph of network connections, similar to a network of power lines. A set of connected molecular interactions is referred to as a pathway. The cellular system involves complex interactions between DNA, RNA, proteins, and smaller molecules, and its networks are categorized into metabolic networks (or pathways), protein networks, and gene (or genetic) regulatory networks. The metabolic network represents the enzymatic processes in the cell, which provide the energy and the building blocks of the cell. It is formed by combining a substrate and an enzyme in a biosynthesis or degradation reaction. Usually, the mathematical representation of the network is a graph in which the vertices are the compounds (substrates) and the edges connect two adjacent substrates. The catalytic activity of an enzyme is regulated in vivo by multiple processes, which include allosteric interactions, extensive feedback loops, reversible covalent modifications, and reversible peptide-bond cleavage. For well-known organisms, such as E. coli, considerable information about metabolic reactions has been accumulated over many years and organized into large online databases. The protein network describes communication and signaling networks in which the basic reaction takes place between two proteins. These protein–protein interactions are involved in signal transduction cascades, such as the p53 signaling pathway. Here, the proteins are functionally connected into biochemical circuits by posttranslational, allosteric, or other mechanisms. The regulatory or genetic network represents the functional inference of direct causal gene interactions. According to the Central Dogma, DNA → RNA → Protein → functions, gene expression is regulated at many molecular levels, and the gene products interact at various levels.
Large-scale gene expression can be conceptualized as a genetic feedback network. The main aim of microarray analysis is the complete reverse engineering of the genetic network. Genetic network modeling is discussed in the next section.
8.5.9.2 Modeling of networks
A systematic approach is essential for modeling regulatory networks and understanding their dynamics. Network modeling has been used for many years in economics and the social sciences. Several high-level models, such as Boolean networks, continuous systems of coupled differential equations, and probabilistic models, have been developed for regulatory networks; these models are summarized by Baldi and Hatfield. In a Boolean network, a gene or protein is in either an active or an inactive state, represented by 1 or 0. This binary state depends on the states of the other genes and proteins in the network and is expressed through the discrete equation
$$Y_j(\tau + 1) = E_j\left[Y_1(\tau), \ldots, Y_M(\tau)\right] \tag{8.1}$$
where $E_j$ is the Boolean function that updates the $j$th element and $Y_j(\tau)$ is the state of that element at time $\tau$. A simple example of a Boolean network description is depicted in Fig. 8–2. Gene expression patterns contain the state information of the genetic network and can be measured experimentally. Genes with the same temporal expression pattern share common genetic control processes and may, therefore, be functionally related. Clustering gene expression patterns based on a similarity or distance measure is the first step toward constructing the wiring diagram of the genetic network.
Continuous differential equations are an alternative to the Boolean network. Here, the state variables $Y_j$ are continuous and satisfy the system of differential equations
$$\frac{dY_j}{d\tau} = E_j\left[Y_1(\tau), \ldots, Y_M(\tau), J(\tau)\right] \tag{8.2}$$
where $J(\tau)$ is the external input to the system, and the variable $Y_j$ is interpreted as the concentration of an mRNA or a protein. Such a model is used to describe the biochemical reactions in metabolic pathways and gene regulation. Most of these models do not consider spatial structure: each network element is characterized by a single time-dependent concentration level. However, several biological processes rely heavily on spatial structure and compartmentalization. It is therefore necessary to model the concentrations in both time and space with a continuous formalism based on partial differential equations.
Bayesian networks (BNs) come from the theory of graphical models in statistics. The main aim is to approximate a complex multidimensional probability distribution by a product of simple local probability distributions. The BN model of a genetic network is represented by a Directed Acyclic Graph with $M$ nodes. Here, the nodes correspond to genes or proteins, and the random variables $Y_j$ to their levels of activity. The parameters of the network model are the local conditional distributions of each random variable given the random variables of its parent nodes:
$$Q(Y_1, \ldots, Y_M) = \prod_{j} Q\left(Y_j \mid Y_i : i \in M(j)\right) \tag{8.3}$$
where $M(j)$ denotes the parents of the $j$th vertex. Given a dataset $B$ of expression levels derived from DNA microarray experiments, it is possible to use learning techniques with heuristic approximation methods to infer the network architecture and the parameters. However, the data from microarray experiments are still limited and insufficient for determining a single model, and thus researchers have introduced heuristics to learn classes of models rather than a single model.

FIGURE 8–2 Target Boolean network of reverse engineering (A) network wiring, (B) determination of logical rules, and (C) dynamic output.
(A) Network wiring: the states D, E, and F determine the next states D', E', and F'.
(B) Logical rules:
D' = E
E' = D OR F
F' = (D AND E) OR (E AND F) OR (D AND F)
(C) Dynamic output:
Input      Output
D  E  F    D'  E'  F'
0  0  0    0   0   0
0  0  1    0   1   0
0  1  0    1   0   0
0  1  1    1   1   1
1  0  0    0   1   0
1  0  1    0   1   1
1  1  0    1   1   1
1  1  1    1   1   1
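The Boolean network formalism of Eq. (8.1) can be made concrete with a short simulation of the three-element network of Fig. 8–2. This is a minimal sketch, not a reverse engineering method. It assumes the rule E' = D OR F, because that rule is the one consistent with the truth table printed in the figure; the function names are hypothetical.

```python
from itertools import product


def step(state):
    """Apply the update rules once: state (D, E, F) -> (D', E', F')."""
    d, e, f = state
    d_next = e
    e_next = d or f      # assumed rule, chosen to match the printed truth table
    f_next = (d and e) or (e and f) or (d and f)   # majority of D, E, F
    return (int(d_next), int(e_next), int(f_next))


def truth_table():
    """Map every input state to its output state (the dynamic output)."""
    return {state: step(state) for state in product((0, 1), repeat=3)}


def trajectory(state, n_steps):
    """Iterate the network from an initial state for n_steps updates."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states
```

Starting from the state (0, 0, 1), for example, three updates give (0, 1, 0), (1, 0, 0), and (0, 1, 0) again, so the network settles into a cycle of length two.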
8.6 Data mining for disease diagnosis
This section presents the data mining techniques employed for the diagnosis of disease. A wealth of data is available in healthcare, but there are few effective tools for discovering the hidden relationships in the data. Many technologies and applications exist for generating new information, and among them data mining is particularly useful for extracting novel information from huge preexisting datasets, especially in the medical field [4]. In the last few decades, millions of people have died of various diseases. Predicting the outcome of a disease is one of the most challenging and interesting problems for which data mining applications are developed. With the use of computers and automated tools, huge volumes of medical data are being collected and made available to medical research groups. As a result, data mining techniques have expanded substantially in the medical industry for predicting various diseases, such as heart disease, diabetes, cancer, and liver disease.
8.6.1 Neural network for heart disease diagnosis
Heart disease has been a leading cause of death worldwide over the last few years, and its diagnosis is a difficult task in medicine. The diagnosis depends on a thorough and accurate study of the patient’s clinical test data and health history. Data mining aims to develop an intelligent automated system to predict the disease. Such an automated system for medical diagnosis would enable timely medical care followed by proper subsequent treatment, thereby saving a significant number of lives. The European Public Health Alliance reported that strokes, heart attacks, and other circulatory diseases account for about 41% of deaths. The Economic and Social Commission for Asia and the Pacific reported that in one-fifth of Asian countries, most lives are lost to noncommunicable diseases, such as cardiovascular diseases, cancers, diabetes, and chronic respiratory diseases. The Australian Bureau of Statistics reported that heart and circulatory system diseases are the leading cause of death in Australia, causing 33.7% of all deaths. Statistics South Africa reported that heart and circulatory system diseases are the third leading cause of death in South Africa. The worldwide increase in the mortality of heart disease patients each year and the availability of huge amounts of patient data motivate the use of data mining to help healthcare professionals diagnose heart disease. It is important to develop a tool that can be embedded in a hospital’s management system to advise healthcare professionals in diagnosing heart disease patients and providing suitable treatment. Several data mining techniques are used in the diagnosis of heart disease, such as Naïve Bayes, Decision Tree, neural network, kernel density, automatically defined groups, bagging algorithm, and SVM, and they show different levels of accuracy.
K. Subhadra and Vikas B. [5] developed a multilayer perceptron neural network with a backpropagation algorithm, as a practitioner needs to make a decision from multiple inputs, such as the current and previous medical history of a patient. Neural networks have proved effective in making decisions by predicting from data. Because many inputs are used in predicting the disease and the diagnosis has to be performed at different stages, multilayer perceptron-based neural networks are used in several studies. A neural network extends its predictive capability across different hierarchical levels in a multilayered structure. This multilayered structure helps in selecting features from the dataset at different scales and refining them into more specific features. To facilitate this, the multilayer perceptron neural network is trained with a backpropagation algorithm for efficient diagnosis of heart disease. Thus 14 attributes are used as inputs for training the neural network to diagnose heart disease risk levels with a multilayered network. Traditional diagnostic approaches have no proper automated tools for heart disease diagnosis. The commonly used data mining algorithms for predicting diseases are:
• Genetic algorithm
• k-means algorithm
• MAFIA algorithm
Heart diseases caused by atherosclerosis have become the world’s most important cause of mortality. However, atherosclerosis develops over a long time before pain and other symptoms become perceptible. Thus the prevention of atherosclerosis is a key public health issue and an important target for medical data mining. Several main risk factors of atherosclerosis have been identified, and in Ref. [6] an effective risk prediction model was developed to support the diagnosis and prevention of cardiovascular diseases. That study employs the STULONG dataset, which comes from a 20-year longitudinal study of the risk factors of atherosclerosis in a population of 1417 middle-aged men and was collected at IKEM and the Faculty of Medicine at Charles University.
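The multilayer perceptron trained with backpropagation that is described in this section can be sketched in plain Python. This is a minimal sketch under stated assumptions and not the model of Ref. [5]: it uses one hidden layer, sigmoid units, a squared-error loss, and per-sample gradient updates, and it is trained here on a made-up toy dataset with two binary inputs rather than the 14 clinical attributes. The class and function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden layer of sigmoid units and a single sigmoid output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight list ends with the bias of that unit.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
                  for ws in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out[:-1], hidden))
                      + self.w_out[-1])
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr):
        """One backpropagation update for the loss 0.5 * (out - target)**2."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights before they are updated.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, ws in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                ws[i] -= lr * delta_hidden[j] * v
            ws[-1] -= lr * delta_hidden[j]


def mean_squared_error(net, data):
    return sum((net.predict(x) - t) ** 2 for x, t in data) / len(data)


def train(net, data, epochs, lr):
    for _ in range(epochs):
        for x, target in data:
            net.train_step(x, target, lr)


# Toy data (hypothetical): two binary risk flags, label 1 if either is set.
toy_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
```

A network would be created with `TinyMLP(n_inputs=2, n_hidden=3)` and trained with `train(net, toy_data, epochs=2000, lr=0.5)`; for real clinical data, the inputs would first have to be scaled and the model validated on held-out patients.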
8.6.2 Apriori algorithm for frequent disease
The Apriori algorithm is the most classical and important algorithm for mining frequent itemsets. Apriori is used to find all frequent itemsets in a given database. According to the Apriori principle, any subset of a frequent itemset must also be frequent. For example, if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets. The key idea of the Apriori algorithm is to make multiple passes over the database. It employs an iterative approach known as breadth-first search (level-wise search) through the search space, where the l-itemsets are used to explore the (l + 1)-itemsets. In the beginning, the set of frequent one-itemsets is found. This set, which contains the single items that satisfy the support threshold, is denoted by K1. Each subsequent pass starts with a seed set of itemsets found to be large in the previous pass. This seed set is used to generate new potentially large itemsets, called candidate itemsets, and the actual support of these candidate itemsets is counted during the pass over the data. At the end of the pass, the candidate itemsets that are actually large (frequent) are determined, and they become the seed for the next pass. Therefore K1 is used to find K2, the set of frequent two-itemsets, which is used to find K3, and so on, until no more frequent itemsets can be found [7].
In Ref. [8], the data come from a Hospital Information System (HIS), which holds sufficient details of each patient, including the patient’s name, age, disease, location, district, and date, obtained from laboratories, and which keeps growing year after year. Having collected the data from the hospital information system, that study finds frequent diseases with the help of association techniques, using the Weka tool applied to the training dataset.
In Ref. [9], the Apriori algorithm is applied to medical data mining for identifying frequent diseases. Here, the association rule-based algorithm identifies frequent diseases; the main difference is that the original Apriori algorithm was modified to make it suitable for finding locally frequent diseases.
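The level-wise search of Section 8.6.2 can be sketched in a few lines of Python. This is a minimal illustration of the classical algorithm, not the modified version of Ref. [9] or the Weka workflow of Ref. [8]. The function name `apriori`, the toy patient records, and the support threshold of 3 are assumptions made only for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        # One pass over the database: count each candidate's support
        # and keep only the candidates that reach the threshold.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_support}

    # First pass: frequent one-itemsets (K1 in the text).
    singletons = {frozenset([item]) for t in transactions for item in t}
    current = count_frequent(singletons)
    all_frequent = dict(current)

    k = 1
    while current:
        k += 1
        seeds = list(current)
        # Join step: merge seed itemsets whose union has exactly k items.
        candidates = {a | b for a in seeds for b in seeds if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count_frequent(candidates)
        all_frequent.update(current)
    return all_frequent


# Hypothetical patient records, each a set of diagnosed diseases.
records = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "asthma"},
    {"hypertension", "asthma"},
    {"diabetes", "hypertension"},
    {"asthma"},
]
frequent = apriori(records, min_support=3)
for itemset, support in sorted(frequent.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With these toy records, the three single diseases and the pair {diabetes, hypertension} reach the threshold, and the search stops at the third pass because no three-itemset candidate can be formed from a single frequent pair.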
8.6.3 Bayesian network modeling for psychiatric diseases
Psychiatry is a medical specialty devoted to the treatment, study, and prevention of mental disorders. Whatever the circumstance of a person’s referral, a psychiatrist first assesses the person’s mental and physical condition. This usually involves interviewing the person and often obtaining information from other sources, such as health and social care professionals, relatives, associates, law enforcement, emergency medical personnel, and psychiatric rating scales. A mental status examination is carried out, and a physical examination is usually performed to establish or exclude certain illnesses, such as thyroid dysfunction or brain tumors, or to identify any signs of self-harm. With the precise objective of helping physicians in the final stage of their decisional process of diagnosis and treatment planning, we developed a Bayesian network (BN) model for analyzing psychiatric diseases, based on observed symptoms and a priori known causal relationships. BNs are probabilistic graphical models that combine domain expert knowledge with observed datasets by mapping out cause-and-effect relationships between key variables and encoding them with numbers that signify the extent to which one variable is likely to influence another. In conjunction with Bayesian inference methods, BN modeling proves to be an efficient instrument in the decision-making process for a variety of medical domains. BNs can also be used to implement decisional systems that address other types of problems connected with the medical domain. In this sense, BNs are implemented based on different learning algorithms [10] and used to represent healthcare systems for patients who arrive at the emergency department of a hospital. This kind of system can make predictions about some variables of interest or can offer support in decision-making regarding the actions that must be performed. Using this type of system, the general activity of a hospital and the medical care it provides can be improved.
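To make the idea of encoding cause-and-effect relationships with numbers concrete, the following sketch performs exact inference in a very small BN with one disorder node and two symptom nodes that depend only on the disorder. It is an illustration only and is not the model of Ref. [10]. The node names and all probabilities are hypothetical.

```python
def posterior(prior, cpt, evidence):
    """P(disorder present | evidence), by enumerating both disorder states.

    prior    : P(disorder present)
    cpt      : {symptom: (P(symptom | disorder), P(symptom | no disorder))}
    evidence : {symptom: True if observed present, False if observed absent}
    """
    joint_present, joint_absent = prior, 1.0 - prior
    for symptom, observed in evidence.items():
        p_given_d, p_given_not_d = cpt[symptom]
        joint_present *= p_given_d if observed else 1.0 - p_given_d
        joint_absent *= p_given_not_d if observed else 1.0 - p_given_not_d
    # Normalize the two joint probabilities (Bayes' rule).
    return joint_present / (joint_present + joint_absent)


# Hypothetical numbers, chosen only for illustration.
prior_disorder = 0.10
symptom_cpt = {
    "insomnia": (0.80, 0.20),
    "low_mood": (0.60, 0.10),
}

p_one = posterior(prior_disorder, symptom_cpt, {"insomnia": True})
p_both = posterior(prior_disorder, symptom_cpt,
                   {"insomnia": True, "low_mood": True})
print(round(p_one, 3), round(p_both, 3))
```

Each additional observed symptom updates the belief in the disorder, which is the kind of support such a system can offer in the final stage of diagnosis. A realistic BN would have many more nodes and would normally be built with a dedicated inference library.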
8.6.4 Adaptive fuzzy k-nearest neighbor approach for Parkinson’s disease diagnosis
Parkinson’s disease (PD) is a degenerative disease of the nervous system that affects a large part of the worldwide population. The cause of PD is still unknown; however, it is possible to alleviate symptoms significantly if the illness is treated at its onset, in the early stage [5]. It is claimed that approximately 90% of patients with PD show vocal impairment. These patients typically exhibit a group of vocal impairment symptoms known as dysphonia. The dysphonic indicators of PD make speech measurements an important part of the diagnosis. Recently, dysphonic measures have been proposed as a reliable tool to detect and monitor PD.
Wan-Li Zuo et al. [11] developed an effective diagnosis system for PD using particle swarm optimization (PSO)-based fuzzy k-nearest neighbor (FKNN). Here, both the binary and the continuous versions of PSO were used to perform parameter optimization and feature selection simultaneously. The neighborhood size and the fuzzy strength parameter in FKNN were specified adaptively by the continuous PSO, while the binary PSO was used for choosing the most discriminative subset of features for the prediction. Zhennao Cai et al. [12] presented an enhanced FKNN for the detection of PD based on vocal measurements. In this case, an evolutionary instance-based learning approach termed CBFO-FKNN was introduced by integrating chaotic bacterial foraging optimization with Gauss mutation (CBFO) with FKNN. This framework solved the parameter tuning issues of FKNN.
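The two FKNN parameters mentioned above, the neighborhood size k and the fuzzy strength m, can be seen in a minimal sketch of the basic fuzzy k-nearest neighbor rule with crisp training labels. Each of the k nearest neighbors votes for its class with weight 1/d^(2/(m−1)), where d is its distance to the query, and the votes are normalized into class memberships. The PSO and CBFO tuning steps of Refs. [11,12] are not shown. The function name, the two-dimensional feature vectors, and the labels are hypothetical.

```python
import math


def fknn_memberships(train_x, train_y, query, k=3, m=2.0):
    """Class memberships of `query` under the basic fuzzy k-NN rule."""
    # Keep the k training samples closest to the query.
    neighbors = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    exponent = 2.0 / (m - 1.0)  # the fuzzy strength m must be greater than 1
    weights = {}
    for distance, label in neighbors:
        # Inverse-distance weight; the floor avoids division by zero.
        weight = 1.0 / max(distance, 1e-12) ** exponent
        weights[label] = weights.get(label, 0.0) + weight
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}


# Hypothetical two-feature voice measurements with known labels.
train_x = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
train_y = ["healthy", "healthy", "PD", "PD"]

memberships = fknn_memberships(train_x, train_y, query=(1.0, 0.0), k=3, m=2.0)
predicted = max(memberships, key=memberships.get)
print(memberships, predicted)
```

Unlike a plain k-nearest neighbor vote, the result is a graded membership per class, so a query near the class boundary receives a less decisive membership than one deep inside a class. Changing k or m changes these memberships, which is why the cited works tune both parameters automatically.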
8.7 Data mining of drug discovery
In this section, the eight stages in the drug discovery process are examined. Fig. 8–3 shows the process of drug discovery in data mining. The drug discovery process, in which compounds are screened and evaluated for therapeutic use, has resulted in safe and effective therapies for a variety of diseases [13]. However, assessments of new compounds for drug development are notoriously long and costly. They typically span more than a decade and cost more than a billion dollars. Furthermore, only a small percentage of candidates actually make it to the market. The focus here is on open-source developments in these areas, the integration of the Konstanz Information Miner (KNIME) with external tools for drug discovery, and examples of real-world applications.
FIGURE 8–3 The pipeline of drug discovery.
8.7.1 Target identification
The early stages of drug discovery, which involve target identification, target validation, and high-throughput screening (HTS), are dominated by the fields of bioinformatics, sequence analysis, and image analysis. The process of drug discovery begins with determining a drug target (a gene, an RNA transcript, or a protein) linked to a disease of interest. Putative targets are identified through several avenues of scientific research, often stemming from academia. Clustered regularly interspaced short palindromic repeats (CRISPR)-based high-throughput screens are commonly used to systematically knock out, inhibit, or activate large numbers of candidate genes. Perturbations that exacerbate or hinder a disease can reveal potential drug targets. Once a putative target is identified, further functional information is collected through in vitro and in vivo studies. CRISPR may aid in these processes by facilitating gene knockouts or protein overexpression in cell lines. If a causative relationship between a target and a disease is established and the target is druggable, then a screening campaign is initiated to search for potential therapeutic candidates.
The identification of a particular molecular target comes from extensive work in biology through genomic and proteomic screening as well as metagenomic analysis. Sequence analysis is performed using high-throughput sequencers, also known as next-generation sequencers (NGS). These technologies generate a large volume of data that needs to be analyzed. For example, DNA sequencing determines the order of the nucleotide bases in a genome, and the human genome alone contains over three billion base pairs. Until recently, the analysis of this volume of data required expert knowledge in bioinformatics, which limited the accessibility of the data. The need for graphical pipelining tools for bioinformatics was identified nearly a decade ago. Since then, this field has become more accessible through a number of web-based tools for life sciences, such as Mobyle, Cyrille, and Galaxy.
The KNIME extensions for NGS offer an alternative way to access and manipulate data that is visually intuitive and is designed to make workflow processes easily understood. Several contributors provide KNIME nodes for sequence analysis, including KNime4Bio, NGS, and Sequime. The KNime4Bio project contains over 70 nodes for manipulating and postprocessing NGS data, particularly Variant Call Format genomic variation text files. There are also nodes for retrieving data from the National Center for Biotechnology Information and visualization tools, such as the IGV browser. The NGS package contains nodes for reading and writing NGS-related file types, such as FastQ, Sequence Alignment/Map (SAM), and BAM (the compressed binary version of SAM) formatted files, and for writing out bedGraph formatted files. Also available are nodes for identifying a chromosomal region that has no gaps (so-called regions of interest), for calculating all of the overlapping regions, and for splitting a sequence into one nucleotide per row. The Sequime package contains a number of amino acid and nucleotide sequencing nodes. It includes a database reader to access sequence data from a given set of internet sources, such as the EBI and UniProt, and read/write nodes that can handle EMBL, GenBank, FASTA, and UniProt file formats. Also useful are the sequence alignment tools and annotation nodes for physicochemical properties, Gene Ontology terms, and principal property scales (Z-scales). These last nodes are particularly interesting for building Quantitative Sequence Activity Models to design biologically active proteins. The NGS tools within KNIME can be used together with the KNIME clustering algorithms to provide workflows that tackle biological data mining problems. This has been illustrated in a study that used KNIME to build classification models of pyruvate dehydrogenase protein sequences from a variety of microbial groups. The models were then employed for classifying and clustering unannotated sequences to ascertain their taxonomic position. In another study, classification models were used for predicting colon cancer diagnosis on the basis of DNA microarray expression data.
After a therapeutic target has been identified, it is crucial to review the literature to establish the scope of the chemical space around the drug molecules acting at that target. Traditionally, this is a very time-consuming task; however, tools are now available to simplify these processes. There exists a vast pool of chemical, chemogenomic, and biological data in numerous data resources, such as the PubChem Substance database, PubChem BioAssay database, ChEMBL, DrugBank, and KEGG. Combining heterogeneous data from these varied sources is possible through a variety of technologies, such as eXtensible Markup Language (XML) for conveying metadata, Web Ontology Language for describing ontologies and taxonomies, and Resource Description Framework, which extends the connectivity of the web for creating relationships between data. These languages are part of the Semantic Web, which allows the large volume of life science datasets in the public domain to be integrated and searched. KNIME provides a platform to perform these data mining techniques. This was demonstrated in a workflow at the KNIME UGM 2012, which used the KNIME XML processing nodes to extract data from DrugBank, performed a variety of text-based and data mining functions using the Text Processing, Itemset Mining, and Network Mining KNIME nodes, and then visualized the results using the BisoNet nodes. This workflow identifies relationships between drug names, drug generic names, drug targets, genes, and drug mode of action.
8.7.2 Target validation and hit identification
The drug discovery pipeline from hit identification through to candidate identification involves an iterative design process. Cheminformatics methods developed during hit identification can be applied during lead optimization. The main difference between early- and late-stage development is that compound progression down the discovery pipeline is a consequence of an increased understanding of the molecular interactions between a compound and its target, the subsequent functional effects, and improved physicochemical properties. This progress is fundamentally a result of taking a large amount of biological and chemical data and extracting the important information in order to improve the profile of a drug molecule. At this early stage of drug discovery, considerable effort is needed to develop the biological assays and to identify the chemical starting points for hit expansion. A good chemical starting point is vital to the success of compound development. The methods employed routinely at this crucial stage of drug discovery, namely HTS, high-content screening (HCS), and virtual high-throughput screening (vHTS), are designed to identify the best chemical matter with which to commence a drug discovery campaign. Within the bioinformatics domain, technological advances in cell imaging and fluorescence microscopy have enabled HCS methods that were once used to characterize preclinical chemical entities to be employed at other stages in the drug discovery cycle. HCS protocols are now being implemented early in the drug discovery process during hit finding, in the analysis of structure–activity relationships (SAR) during hit expansion, and in ADMET (absorption, distribution, metabolism, excretion, and toxicity) profiling of compounds.
There are six main data mining steps taken during HTS analysis: (1) identifying the project objectives, (2) data inspection, (3) data preprocessing and extract, transform, and load (ETL), (4) modeling, (5) model analysis, and (6) knowledge deployment. These are the data mining steps that KNIME was designed to perform. It is therefore not surprising to see the emergence of KNIME nodes to handle data produced from HCS and HTS. The contributions in this field include the collaborative work on the HC/DC nodes from ETH Zurich and the HiTS nodes as well as the community contributions from the MPI-CBG and the high-throughput screening exploration environment (HiTSEE) chemistry-focused analysis tools from the University of Konstanz. There are around 40 HCS nodes from the MPI-CBG. These include plate readers (Envision, GeniusPro, and MSD Sector Imager) and imaging instrument readers (Opera, Operetta, Cell Profiler, and motion tracking), quality control metrics, screening data, and library annotation tools. Standard plate normalization methods, an interactive plate heatmap viewer, and barcode and plate layout utility nodes are also provided. In addition, there are preprocessing and data manipulation nodes as well as integrated R tools for visualization and modeling. The KNIME nodes from the HiTS project are designed to import data from the IN-Cell Analyzer 1000 instrument, to analyze the data using a KNIME node implementation of CellHTS2, and, via additional KNIME plug-ins, to perform CellHTS2 node output analysis. The HiTS package also includes a plate-based heatmap and dendrogram viewer node.
The HiTS nodes are routinely used in academia and have been used to study cytotoxicity in conjunction with data mining nodes from KNIME.
The early stage of the discovery pipeline involves identifying molecules that hit the target with a degree of potency. When a suitable biological assay is available, HTS approaches can be employed to screen compounds. When biological screening is not an option, a complementary approach is to perform docking-based vHTS. Molecular docking (principally applied to protein targets) can be parallelized to process up to millions of compounds on a time scale that is realistic for drug discovery. This process requires a 3D model of the protein receptor and compound libraries to dock. There are nodes available in KNIME to obtain protein structures and to generate compound libraries, including open-source nodes for downloading protein structures from the RCSB PDB. For example, Vernalis has developed the PDB Connector node, and the Active Flow package contains the PDBjMineWeb node for connecting to the PDBj.
8.7.3 Hit to lead
In early-stage drug discovery, library design is aimed primarily at identifying suitable chemical matter to be carried forward into hit expansion and lead identification. One aspect of library design during the hit-to-lead phase of drug discovery is to explore the chemical space around a hit. However, this is done without using property space. A modified version of diversity selection that takes into account a biological or property endpoint, termed Maximum-Score Diversity Selection (MSDS), was developed in [14]. This approach was tested in KNIME using the MoSS node to compute maximum common substructures. The authors showed multiple approaches to defining a diverse set. Compound selection is not a trivial problem. A KNIME package called HiTSEE has been developed to facilitate the visualization and selection of diverse compound clusters. In this approach, diverse chemical space is sampled in an efficient manner around a subset of compounds.
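The trade-off between a score endpoint and diversity can be illustrated with a simple greedy heuristic. This sketch is not the MSDS algorithm of Ref. [14] and does not use the MoSS or HiTSEE nodes. It only shows one plausible way to balance the two goals. The function name, the weighting parameter `alpha`, the one-dimensional descriptors, and the scores are all assumptions made for the example.

```python
def select_diverse(scores, distance, n_select, alpha=0.5):
    """Greedily pick compounds, trading off score against diversity.

    scores   : {compound: score}, higher is better
    distance : function (compound, compound) -> dissimilarity
    alpha    : 1.0 selects by score only, 0.0 by diversity only
    """
    remaining = sorted(scores)           # sorted for a reproducible order
    first = max(remaining, key=lambda c: scores[c])
    selected = [first]
    remaining.remove(first)
    while remaining and len(selected) < n_select:
        def gain(candidate):
            # Diversity term: distance to the closest compound already picked.
            spread = min(distance(candidate, s) for s in selected)
            return alpha * scores[candidate] + (1.0 - alpha) * spread
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical compounds described by a single normalized descriptor.
descriptor = {"A": 0.0, "B": 0.1, "C": 0.9, "D": 0.5}
activity = {"A": 0.90, "B": 0.85, "C": 0.40, "D": 0.60}


def descriptor_distance(x, y):
    return abs(descriptor[x] - descriptor[y])


balanced = select_diverse(activity, descriptor_distance, n_select=2, alpha=0.5)
score_only = select_diverse(activity, descriptor_distance, n_select=2, alpha=1.0)
print(balanced, score_only)
```

Selecting by score alone returns the two near-identical compounds A and B, whereas the balanced setting replaces B with the more distant compound C. In practice the distance would come from a chemical similarity measure, such as one based on fingerprints or common substructures.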
8.7.4 Lead optimization
During hit to lead and lead optimization, there is an emphasis on rational design using knowledge gained from SAR analysis. Routine SAR data mining tasks, such as identifying common scaffolds, substructure searching, and R-group decomposition, are covered by nodes within the community cheminformatics packages. During lead optimization, considerable effort is applied to optimizing synthetic routes and to building focused compound libraries to scope out the SAR around a lead series. Here, KNIME nodes have been developed to allow the analysis of chemical reactions in order to design molecules. Synthetic routes to drug molecules are captured in an electronic laboratory notebook (ELN), and this information is then used to design synthetically tractable chemical entities. Nodes to perform these operations are within the Reaction Generation package in the Erl Wood Cheminformatics node repository. There are also other nodes within the Indigo and RDKit packages to generate one- and two-component reactions, to perform combinatorial reaction enumeration, and to query reactions. The utility of KNIME as a framework for custom node extensions has led to the development of reaction-oriented workflows to build compound libraries. The choice of reactions to use in the design of drug-like compounds was recently reviewed by Roughley and Jordan, who exploited the KNIME framework in their analysis. Another aspect of lead optimization is designing out off-target activity and minimizing adverse compound toxicity. This is achieved by analyzing off-target SAR and by predicting ADMET profiles for compounds using bespoke QSAR models. There are a number of chemical property filtering nodes within KNIME and QSAR datasets that are suitable for this purpose. Thus KNIME is a suitable framework to support model generation for the toxic effects of pharmaceutical compounds.
The toxicity modeling approaches applied in the COSMOS project to predict toxicity for cosmetics with KNIME can similarly be applied to drug discovery. As an example, KNIME was used as the framework to study the toxic effects exhibited by compounds upon binding to DNA. Specific nodes for studying activity cliffs and performing multi-objective optimization are available from the Erl Wood Cheminformatics package. Predictive models can be assessed using nodes from Enalos. More advanced computational chemistry methods, such as the alignment of molecules to bioactive conformations, can be achieved using external connections to other vendor software. The ability of KNIME to interface with ELNs means that spectroscopic and analytical data can be studied using tools in KNIME. There is growing interest in this area, as seen with the recent partnership between CambridgeSoft and KNIME, which serves to further align compound management applications and KNIME.
8.7.5 Late-stage drug discovery and clinical trials
The main aim of late-stage drug discovery is to maximize favorable drug metabolism and pharmacokinetics properties. KNIME is useful in this regard, as it supports packages for the study of pharmacokinetics as well as the Bioconductor libraries for metabolomics and mass spectroscopy. Similarly, as a project progresses into clinical trials, there are several packages to aid in the design, monitoring, and analysis of data from clinical trials. Patient health is also an area that can be studied using KNIME. In related work, KNIME was used as a data mining tool to predict cerebral aneurysms, and it has also been used to study the automated classification of arrhythmic events.
References
[1] Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: a brief survey. IEEE Signal Process Mag 2017;34(6):26–38.
[2] Patel S, Patel H. Survey of data mining techniques used in healthcare domain. Int J Inf Sci Tech 2016;6(1/2).
[3] Page D, Craven M. Biological applications of multi relational data mining. ACM SIGKDD Explor Newsl 2013;5(1):69–79.
[4] Amin MS, Chiam YK, Varathan KD. Identification of significant features and data mining techniques in predicting heart disease. Telemat Inform 2019;36:82–93.
[5] Subhadra K, Vikas B. Neural network based intelligent system for predicting heart disease. Int J Innov Technol Explor Eng 2019;8(5):2278–3075.
[6] Cheng T-H, Wei C-P, Tseng VS. Feature selection for medical data mining: comparisons of expert judgment and automatic approaches. In: Proceedings of 19th IEEE Symposium on Computer-Based Medical Systems (CBMS'06). 2006. pp. 165–170.
[7] Qin Y. Novelty evaluation of association rules. Res Appl Comput 2004;1:17–19.
[8] Ilayaraja M, Meyyappan T. Mining medical data to identify frequent diseases using Apriori algorithm. In: Proceedings of International Conference on Pattern Recognition, Informatics and Mobile Engineering. February 2013.
[9] Khaleel MA, Pradhan SK, Dash GN. Finding locally frequent diseases using modified Apriori algorithm. Int J Adv Res Comput Commun Eng 2013;2(10).
[10] Curiac D-I, Vasile G, Banias O, Volosencu C, Albu A. Bayesian network model for diagnosis of psychiatric diseases. In: Proceedings of the ITI 31st International Conference on Information Technology Interfaces. IEEE, 2009. pp. 61–66.
[11] Zuo W-L, Wang Z-Y, Liu T, Chen H-L. Effective detection of Parkinson’s disease using an adaptive fuzzy k-nearest neighbor approach. Biomed Signal Proc Control 2013;8:364–73.
[12] Cai Z, Gu J, Wen C, Zhao D, Huang C, Huang H, et al. An intelligent Parkinson’s disease diagnostic system based on a Chaotic bacterial foraging optimization enhanced fuzzy KNN approach. Comput Math Methods Med 2018;2018.
[13] Xiao C, Sun J. Tutorial: Data mining methods for drug discovery and development. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. July 2019. pp. 39153196.
[14] Meinl T, Ostermann C, Berthold MR. Maximum-score diversity selection for early drug discovery. J Chem Inform Model 2011;51(2):237–47.
9 Satellite data: big data extraction and analysis
Rahul Kotawadekar
MASTER OF COMPUTER APPLICATIONS (MCA), FINOLEX ACADEMY OF MANAGEMENT AND TECHNOLOGY, UNIVERSITY OF MUMBAI, RATNAGIRI, INDIA
9.1 Remote-sensing data: properties and analysis
Remote sensing (RS) is defined as the science of obtaining information about objects on the Earth’s surface by analyzing data received from a remote platform. RS is used to identify the features of the Earth’s surface and to estimate geobiophysical properties, with electromagnetic radiation as the medium of interaction. The spatial, spectral, temporal, and polarization signatures are the major characteristics of the sensor or target that facilitate target discrimination. RS data provide a better alternative for surveying natural resources than traditional methods. RS of the Earth’s surface has come a long way from 19th-century aerial photography to the newest Unmanned Aerial Vehicle RS. Satellite RS for civilian applications started with the launch of Landsat-1 in 1972 [1]. Seasat-1, launched in 1978, was the first Radio Detection and Ranging (radar) imaging satellite [2] and opened a new domain for RS. Nowadays, satellite series such as those of Planet Labs and Sentinel are revolutionizing the sector by providing data with high temporal, spectral, and spatial resolution at a very low cost. Furthermore, RS data support better Geographical Information Systems (GIS) for land management, natural resources management, and environmental, educational, and aeronautical applications. In addition, government organizations have opened their own portals to provide data to ordinary citizens [3].
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00008-2 © 2021 Elsevier Inc. All rights reserved.
9.1.1 Satellite sensors
Satellites are furnished with one or more instruments or sensors, depending on their purpose. A sensor is a device that collects energy, such as electromagnetic radiation (EMR), and converts it into a signal that carries information about the target under investigation. Satellites are categorized by function, since each is designed and launched into space to perform a specific job: to collect information and communicate it back to the Earth. According to the United States (US) Government, nearly 3000 satellites are operating in Earth orbit, along with rocket parts and 8000 dead satellites. Smaller space junk or orbital debris is of growing concern since it can damage active satellites. Over 20,000 pieces of trackable debris have resulted from collisions and explosions. Because of the extreme velocities involved, debris pieces as small as 1 mm may cause significant damage in a collision. Satellites are designed for several purposes, which are listed as follows:
• Earth Observation and Mapping
• Weather and Atmosphere Monitoring
• Communication
• Astronomical and Planetary Exploration
• Military
• Navigation
A sensor may be passive or active, depending on its source of energy. Active sensors use their own energy source, whereas passive sensors receive solar electromagnetic energy reflected from the surface or energy emitted by the surface itself. Because passive sensors do not have their own energy source, they cannot be used at night, apart from thermal sensors. In other words, passive sensors record reflected solar radiation, such as visible light, and emitted radiation, namely thermal energy, whereas active instruments send pulses of energy, using laser or radar, through the atmosphere to the Earth’s surface and record the returning signal. Some of the sensors used to capture satellite images are as follows:
• Advanced Very High-Resolution Radiometer (AVHRR)
• Landsat Multi-Spectral Scanner
• Landsat Thematic Mapper
• Landsat Enhanced Thematic Mapper Plus
9.1.1.1 Advanced Very High-Resolution Radiometer
AVHRR is used to produce 1 km multispectral data from the National Oceanic and Atmospheric Administration satellite series (1979 to present). This radiometer has four or five spectral bands and is used to map large areas with good temporal resolution. The spatial resolution of AVHRR is 1.1 km, a consequence of the limited onboard storage and downlink capability in the early days. The AVHRR is a multispectral sensor with up to five spectral bands covering the red, near-infrared (IR), mid-IR, and thermal regions, although the ranges of the spectral bands have varied over time. The spectral bands of AVHRR and their ranges are given in Table 9–1.
Table 9–1 Spectral band specifications of AVHRR.
Spectral band    Name          Range (µm)
Band-1           Visible red   0.58–0.68
Band-2           Near IR       0.725–1.10
Band-3           IR            3.55–3.93
Band-4           Thermal IR    10.30–11.30
Band-5           Thermal IR    11.50–12.50
IR, Infrared.
9.1.1.2 Landsat Multi-Spectral Scanner
The first satellite of the Landsat series was launched in 1972. It was called the Earth Resources Technology Satellite (ERTS-1) and was later renamed Landsat-1. Landsat-1 carried two sensing systems: the Return Beam Vidicon and the Multispectral Scanning System. The first five Landsat missions provided continuous and comparable data over the period from 1972 to 1993. The spatial resolution of the Landsat Multi-Spectral Scanner is 57 m, and the spectral region of each band is given in Table 9–2.
Table 9–2 Spectral band specifications of the Landsat Multi-Spectral Scanner.
Spectral band    Name           Range (µm)
Band-1           Visual green   0.50–0.60
Band-2           Visual red     0.60–0.70
Band-3           Near IR        0.70–0.80
Band-4           Near IR        0.80–1.10
IR, Infrared.
9.1.1.3 Landsat Thematic Mapper
The Landsat Thematic Mapper (TM) missions started in 1982 with Landsat-4 and have continued to the present with the Landsat-7 mission. The main advantages of this sensor are as follows:
• The number of spectral bands is increased.
• The spectral and spatial resolution are increased.
• The angle of view is increased from 11.56 to 14.92 degrees.
The spatial resolution of the Landsat Thematic Mapper is about 30 m, and Table 9–3 shows the spectral region of each band.
Table 9–3 Spectral band specifications of the Landsat Thematic Mapper.
Spectral band    Name          Range (µm)
Band-1           Visual blue   0.45–0.52
Band-2           Green         0.52–0.60
Band-3           Red           0.63–0.69
Band-4           Near IR       0.76–0.90
Band-5           Mid IR        1.55–1.74
Band-6           Thermal IR    10.40–12.50
Band-7           Mid IR        2.08–2.35
IR, Infrared.
9.1.1.4 Landsat Enhanced Thematic Mapper Plus
This sensor records data in the same seven bands as the Landsat Thematic Mapper. The new feature added in this sensor is a panchromatic band with 15 m spatial resolution and a bandwidth from 0.52 to 0.90 μm. In addition, the spatial resolution of the thermal band is improved from 100 to 60 m. This sensor was launched in 1999 on the Landsat-7 mission.
9.1.2 Data resolution characteristics
The resolution of an image signifies the potential detail provided by the imagery. There are four different types of resolution:
• spatial
• spectral
• radiometric
• temporal
9.1.2.1 Spatial resolution
Spatial resolution refers to the resolving power of the sensor, that is, its ability to distinguish between closely spaced objects. In other words, the spatial resolution is the size of the smallest object that can be resolved by the sensor, or the ground area imaged within the instantaneous field of view (IFOV) of the sensor, or the linear dimension on the ground represented by each pixel. The IFOV of each sensing element determines the spatial resolution of the sensor, and different sensors have different IFOVs. The contrast of an object with respect to its background also affects the spatial resolution. Spatial resolution is usually reported as the length of one side of a single pixel. A finer spatial resolution gives better image quality but requires more storage space.
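The link between IFOV, pixel size, and storage can be sketched with two small helper functions. For a nadir-looking sensor, the ground-projected pixel size is approximately the platform altitude multiplied by the IFOV in radians (small-angle approximation). The altitude of 705 km, the IFOV of 42.5 microradians, and the 180 km scene size are illustrative assumptions and are not values taken from this chapter.

```python
def ground_pixel_size(altitude_m, ifov_rad):
    """Approximate ground pixel size (m) at nadir, small-angle approximation."""
    return altitude_m * ifov_rad


def pixels_per_scene(width_m, height_m, pixel_size_m):
    """Number of pixels needed to cover a rectangular scene in one band."""
    return round(width_m / pixel_size_m) * round(height_m / pixel_size_m)


# Illustrative values (assumptions): a 705 km orbit and a 42.5 microradian IFOV.
pixel_size = ground_pixel_size(altitude_m=705_000, ifov_rad=42.5e-6)

# A hypothetical 180 km x 180 km scene at two pixel sizes.
pixels_30m = pixels_per_scene(180_000, 180_000, 30)
pixels_15m = pixels_per_scene(180_000, 180_000, 15)
print(round(pixel_size, 2), pixels_30m, pixels_15m)
```

Halving the pixel size from 30 m to 15 m quadruples the number of pixels per band, which is why a finer spatial resolution needs more storage space.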
9.1.2.2 Spectral resolution
Spectral resolution signifies the sampling rate and bandwidth with which the sensor collects information about the scene. A high spectral resolution is characterized by a narrow bandwidth. Spectral resolution also refers to the width and number of spectral bands in the sensor system. Many sensor systems have a panchromatic band, which is a single wide band in the visible spectrum, together with multispectral bands in the visible to near-IR or the thermal-IR spectrum. The main aim of spectral resolution is to acquire data in various wavelengths of the electromagnetic spectrum. The absorption or reflection of Earth surface materials is captured accurately with high spectral resolution sensors.
9.1.2.3 Radiometric resolution
Radiometric resolution refers to the dynamic range, that is, the number of distinct output values in every band of the data, and is determined by the number of bits into which the recorded radiation is partitioned. In 8-bit data the Digital Numbers range from 0 to 255 for every pixel. In contrast, the frequency of coverage by the sensor over any specific area, which depends on pixel size, pointing capabilities, and sensor field of view, is described by the temporal resolution.
Radiometric resolution is important in change detection, global change studies, and Earth monitoring studies. It typically ranges from 8 to 14 bits, which corresponds to between 256 levels of grayscale and 16,384 intensities or shades of color in every band. Radiometric resolution signifies the dynamic range, or the total number of discrete signal strengths that the sensor can record, and is typically expressed as the number of bits (binary digits) per band that indicate the range of available brightness values. For example, the Landsat-7 sensor records 8-bit images and can therefore measure 256 unique gray values of reflected energy, whereas Ikonos-2 has an 11-bit radiometric resolution, that is, 2048 gray values. In other words, the higher the radiometric resolution, the greater the range of intensities that can be recorded and distinguished, which allows low- and high-contrast objects in the scene to be observed simultaneously.
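The relationship between bit depth and the number of recordable gray values is simply a power of two. The short sketch below reproduces the figures quoted in the text; the function name `gray_levels` is chosen here for illustration only.

```python
def gray_levels(bits: int) -> int:
    """Number of distinct digital numbers an n-bit band can record."""
    if bits < 1:
        raise ValueError("bits must be a positive integer")
    return 2 ** bits


# Bit depths mentioned in the text: 8 (Landsat-7), 11 (Ikonos-2), 14 (upper end).
for bits in (8, 11, 14):
    levels = gray_levels(bits)
    print(f"{bits}-bit: {levels} levels, Digital Numbers 0 to {levels - 1}")
```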
9.1.2.4 Temporal resolution
Temporal resolution is a measure of the frequency, or repeat cycle, with which the sensor returns to the same part of Earth’s surface; equivalently, it is the time elapsed between consecutive images of the same ground location gathered by the sensor. This frequency is determined by the satellite sensor and its orbit pattern. The frequency of flyovers by plane or satellite matters only in time-series studies, such as deforestation monitoring. Temporal resolution was first exploited by the intelligence community, for which repeated coverage revealed changes in infrastructure, the deployment of units, or the introduction or modification of equipment. Cloud cover over a given object or area can also make it necessary to repeat the collection for a particular location. Temporal resolution depends primarily on the platform: satellites usually have set return times, whereas sensors mounted on aircraft or unmanned aircraft systems have variable return times. For satellites the return time is mainly determined by orbital characteristics (low vs high orbit). Landsat has a return time of approximately 16 days, whereas other sensors, such as the Moderate Resolution Imaging Spectroradiometer (MODIS), have nearly daily return times.
9.1.3 Data representation
Data representation is about making sense of data. Knowledge of and access to a GIS is useful, but not required, for processing RS data. Landmarks are used to provide a sense of geographic location. Data representation is therefore essential for assessing crop health and status and for finding the factors that affect production in the fields. If the user lacks the skills or the GIS needed to process RS data, it is possible to purchase RS data in ready-to-use form. Modern remote-sensor systems gather huge quantities of data, so the data flow creates data-management issues in transmission, storage and retrieval, input and output, pattern recognition, and image processing. The two most commonly employed GIS data structures are:
• Vector data type
• Raster data type
9.1.3.1 Vector data type
Vector GIS data are collections of points, lines, or polygons, which vary considerably across the landscape and are unlikely to cover it completely. This data type is therefore also referred to as an irregular data structure. Any landscape feature is represented by a point, a line, or a polygon. The nodes and vertices of these features are located in geographic space by X (east-west), Y (north-south), and possibly Z (elevation) coordinates, and can carry a large number of attributes. Points are zero-dimensional objects with a single coordinate pair that model singular, discrete features, such as buildings, walls, power poles, and sample locations. Points have only the property of location; nodes and vertices are other kinds of point features. Databases of line features represent trails, roads, streams, or survey lines, whereas polygon features include timber stands, property boundaries, logging units, wildlife habitat areas, watershed boundaries, or fires.
A vector is a data structure for storing spatial data. Vector data consist of arcs or lines, which are defined by beginning and end points that meet at nodes. The locations of these nodes and the topological structure are stored explicitly. Features are represented by their boundaries, and curved lines are defined by a series of connecting arcs. Vector storage stores topology explicitly, but it stores only the points that define a feature; everything outside these features is treated as “nonexistent.” A vector-based GIS is defined by the vectorial representation of its geographic data. In this data model the geographic objects are represented explicitly, and thematic aspects are associated with their spatial characteristics.
Generally, vectorial systems consist of two components:
• spatial data
• thematic data
This arrangement is termed a hybrid organization system, since it links a relational database for the attributes with a topological one for the spatial data. The key element of such systems is an identifier for each object. The identifier is unique to every object and allows the system to connect the two datasets. The attribute data of the features are stored in a separate database management system, and the attribute information and spatial information are linked through a simple identification number assigned to every feature in the map. The formats utilized for vector data are as follows:
• ESRI shapefile vector file format
• Census 2000 Topologically Integrated Geographic Encoding and Referencing (TIGER)/Line vector file format
9.1.3.1.1 ESRI shapefile vector file format
The shapefile, otherwise called the ESRI shapefile, is the most popular file format for storing the shape, attributes, and location of geographic features. A shapefile can contain a large number of features with associated data and has historically been used in GIS desktop applications, such as ArcMap. The format supports only 255 columns, and the file size is limited to 2 GB. Moreover, the ESRI shapefile does not support all possible geometry types. Because a shapefile consists of several individual files, these are combined into a single zip file before uploading. For example, shapefiles can be uploaded to Mapbox Studio as vector tilesets, which are then used in a custom map style. Fig. 9–1 depicts the model of the shapefile format, which defines the geometry and attributes of geographically referenced features in three or more files with specific file extensions that should be stored in the same project workspace:
• .shp: The main file, which stores the feature geometry. Only one kind of geometry is stored per shapefile. The geometry is stored using a Cartesian reference system, which is compatible with several spatial reference models, such as latitudes and longitudes. Three-dimensional data, such as the altitude of each component of a feature, can also be stored. The file is limited to 2 GB in size, which caps the number of features it can hold (about 70 million point features at best).
• .shx: The index file, which stores the index of the feature geometry.
• .dbf: Stores the attribute information of the features in the dBase IV format, a legacy format with various limitations: a dataset cannot have more than 255 fields, and every field name is limited to 10 characters.
• .prj: Stores the georeferencing system related to the features.
• .xml: Stores the metadata.
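The constraints listed above lend themselves to a quick pre-upload check. The following is a minimal sketch, not part of any GIS library: the helper names `missing_components` and `field_problems` are hypothetical, and treating .shp, .shx, and .dbf as the mandatory trio behind "three or more files" is an assumption based on the usual shapefile convention.

```python
from pathlib import PurePath

REQUIRED = {".shp", ".shx", ".dbf"}  # assumed mandatory components
OPTIONAL = {".prj", ".xml"}          # georeferencing and metadata
MAX_FIELDS = 255                     # dBase limit quoted in the text
MAX_FIELD_NAME = 10                  # characters per field name


def missing_components(filenames, stem):
    """Return the required extensions that are absent for shapefile `stem`."""
    prefix = stem.lower() + "."
    present = {
        PurePath(name).suffix.lower()
        for name in filenames
        if PurePath(name).name.lower().startswith(prefix)
    }
    return sorted(REQUIRED - present)


def field_problems(field_names):
    """List violations of the attribute-table limits described above."""
    problems = []
    if len(field_names) > MAX_FIELDS:
        problems.append(f"{len(field_names)} fields exceeds {MAX_FIELDS}")
    for name in field_names:
        if len(name) > MAX_FIELD_NAME:
            problems.append(
                f"field name '{name}' exceeds {MAX_FIELD_NAME} characters"
            )
    return problems


print(missing_components(["roads.shp", "roads.dbf", "roads.prj"], "roads"))
print(field_problems(["name", "population_density"]))
```

Checking file names rather than opening the files keeps the sketch independent of any shapefile reader.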
FIGURE 9–1 Structure of the ESRI shapefile model.
Files should be stored in the same folder, or else they are not accessible. Topological information is not stored in a shapefile, although a GIS can create topology from the information the shapefile contains. Because no topology is stored, shapefiles are less computing-intensive to display, which is an important consideration when processing speed is limited.
9.1.3.1.2 Census 2000 Topologically Integrated Geographic Encoding and Referencing (TIGER)/Line vector file format
The Census Bureau creates and maintains statistical, legal, and administrative boundaries for all geographic areas. In addition, it publishes the data used to create and maintain geographic features, such as roads, water, and landmarks, for the statistical boundaries. These files are in a vector format created by the Census Bureau and called TIGER. For instance, the census-relevant territories include census tracts, which are composed of block groups, as depicted in Fig. 9–2. Block-level maps are included for each town, city, and village in the US. The files also include geocoded block faces with address ranges of street numbers, which means that they include topology for matching addresses. The maps integrate Geographic Base File/Dual Independent Map Encoding (GBF/DIME) files and Digital Line Graph (DLG) files: the Census Bureau’s 1980 maps are used along with the USGS’s DLG maps, thus merging urban and nonurban areas. TIGER has an arc/node arrangement with separate files for points (0-cells), lines (1-cells), and areas (2-cells), which are linked together by cross-indexing. Cross-indexing means that some features can be encoded as landmarks that allow GIS layers to be tied together. The record types relevant to boundary point extraction are as follows:
• Record Type 1: Edge ID (TIGER/Line ID or TLID), Lat/Long of End Points
• Record Type 2: TLID, Shape Points
• Record Type I: TLID, Polygon ID Left, Polygon ID Right
• Record Type S: Polygon ID, Zip Code, County, Census Tract, Block Group, etc.
• Record Type P: Polygon ID, Internal Point (Lat/Long)
9.1.3.1.3 Data description
TIGER/Line files are based on an elaboration of the chain file structure, where the primary element of information is an edge. Each edge has a unique ID number (TLID) and is defined by two endpoints. Each edge then has polygons associated with its left and right sides, which in turn are associated with a county, zip code, census tract, etc. The edge is also associated with a set of shape points, which provide the actual form an edge takes.
9.1.3.1.4 Software functionality
• Loader for TIGER files
• Conversion function from TIGER data structure to ESRI shapefile (LLS) data structure
• 2D visualization of TIGER files
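The edge-centred structure in the data description can be sketched as a small in-memory model. This is an illustrative assumption about how the record types might be combined after loading, not the actual TIGER/Line file layout; the class `Edge` and the function `boundary_points` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Coord = Tuple[float, float]  # (longitude, latitude)


@dataclass
class Edge:
    tlid: int                                  # Record Type 1: unique edge ID
    start: Coord                               # Record Type 1: first end point
    end: Coord                                 # Record Type 1: second end point
    shape_points: List[Coord] = field(default_factory=list)  # Record Type 2
    left_polygon: Optional[int] = None         # Record Type I
    right_polygon: Optional[int] = None        # Record Type I

    def polyline(self) -> List[Coord]:
        """End points plus shape points give the actual form of the edge."""
        return [self.start, *self.shape_points, self.end]


def boundary_points(edges, polygon_id):
    """Collect the distinct points of every edge that borders polygon_id."""
    points = []
    for edge in edges:
        if polygon_id in (edge.left_polygon, edge.right_polygon):
            for point in edge.polyline():
                if point not in points:
                    points.append(point)
    return points


edges = [
    Edge(1, (0, 0), (1, 0), [(0.5, 0.1)], left_polygon=10, right_polygon=20),
    Edge(2, (1, 0), (1, 1), left_polygon=10),
    Edge(3, (5, 5), (6, 6), left_polygon=30),
]
print(boundary_points(edges, 10))
```

The points are returned in edge order; chaining edges into a closed ring would need an extra step that is omitted here.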
FIGURE 9–2 Hierarchical view of US Census Bureau territories.
9.1.3.1.5 Advantages of vector data
• Vector layers, such as rivers, roads, and land use, are easy to overlay.
• Vector data are easy to re-project, register, or scale.
• Vector data are compatible with relational database management systems.
• Vector files are much smaller than raster image files.
9.1.3.2 Raster data type
Raster GIS databases are arrangements of pixels or grid cells, which are referenced by column and row positions. This data type is sometimes said to be a regular data structure. Pixel locations are resolved using a coordinate system that takes the top-left pixel as the origin. The raster data model consists of columns and rows of equally sized pixels interconnected to form a planar surface. These pixels are the building blocks for creating points, lines, areas, networks, and surfaces. Pixels may be triangles, hexagons, or octagons, but the square pixel is the simplest geometric form, and the vast majority of available raster GIS data are built on square pixels. These squares are typically reshaped into rectangles of various dimensions when the data model is transformed from one projection to another.
Familiar raster GIS databases include Digital Elevation Models (DEMs), satellite images, Digital Raster Graphics (DRGs), and digital orthophotographs. They are produced by sensors on satellites or other space vehicles, by digital cameras mounted in airplanes, or by scanning maps. Satellite images contain the reflectance values of the Earth’s features in each raster cell. Typically, they have a spatial resolution of 1, 2, 5, 10, or 30 m, depending on the satellite system. DEMs provide information about the topography of the landscape, typically in 3, 10, or 30 m grid cells, and several kinds of terrain analysis can be performed with this raster data. Digital orthophotographs are digital versions of vertical aerial photographs that represent a broad area and are registered to a coordinate system. These images are viewed in a GIS in corrected form, with distortion and topographic displacement removed, so that highly accurate area and linear measurements can be made from the photography. DRGs are scanned versions of topographic maps that are used in a GIS to facilitate planning and management efforts.
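To make the top-left-origin convention concrete, the sketch below converts a row and column index to map coordinates. It assumes square pixels, a north-up image with no rotation, and an origin given as the map position of the top-left corner of the top-left pixel; the function name and the sample coordinates are illustrative only.

```python
def pixel_to_map(row, col, x_origin, y_origin, pixel_size):
    """Map coordinates of the centre of pixel (row, col), both 0-based.

    Columns increase eastward, so x grows with col. Rows increase
    downward from the top-left origin, so y shrinks as row grows.
    """
    x = x_origin + (col + 0.5) * pixel_size
    y = y_origin - (row + 0.5) * pixel_size
    return x, y


# A 30 m grid whose top-left corner sits at (500000, 4100000).
print(pixel_to_map(0, 0, 500000.0, 4100000.0, 30.0))
print(pixel_to_map(2, 3, 500000.0, 4100000.0, 30.0))
```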
In addition, other types of raster GIS databases are created within a GIS by means of spatial analysis and manipulation functions. Three data formats are utilized for raster data:
• Band Interleaved by Pixel (BIP)
• Band interleaved by line (BIL)
• Band Sequential
9.1.3.2.1 Band Interleaved by Pixel
Fig. 9–3 shows the data storage sequence in the BIP format. An image of size 3 × 3 × 3 (3 rows and 3 columns, with three bands) is considered. The band, row, and column (pixel) are denoted by b, r, and p, respectively, so that b1, r1, and p1 refer to band-1, row-1, and column (pixel)-1. In BIP, the first pixel of row-1 of band-1 is stored first, followed by the first pixel of row-1 of band-2 and then the first pixel of row-1 of band-3. These are followed by the second pixel of row-1 of band-1, the second pixel of row-1 of band-2, the second pixel of row-1 of band-3, and so on.
FIGURE 9–3 Data storage format of BIP.
9.1.3.2.2 Band interleaved by line
The data storage sequence in the BIL format is depicted in Fig. 9–4. All pixels of row-1 of band-1 are stored first, then all pixels of row-1 of band-2, and then all pixels of row-1 of band-3. These are followed by all pixels of row-2 of band-1, then all pixels of row-2 of band-2, and then all pixels of row-2 of band-3, and the same process is repeated for the remaining rows. What BIL and BIP have in common is that both store the data one line (row) at a time.
FIGURE 9–4 Data storage format of BIL. BIL, Band interleaved by line.
9.1.3.2.3 Band Sequential
This format stores each band of data as a separate file. Fig. 9–5 depicts the sequence of data arrangement in every file. Assume a three-band image of dimension 3 × 3 × 3. All the pixels of band-1 are stored in sequence first, followed by all the pixels of band-2, and then the pixels of band-3.
9.1.3.2.4 Advantages of raster data
• Raster data record a value for every point of the area covered, which requires more data storage than the vector model.
• Raster data are computationally less expensive to create than vector graphics.
• The data structure is simple.
FIGURE 9–5 Data storage format of Band Sequential.
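The three storage formats described above (BIP, BIL, and Band Sequential) differ only in which index varies fastest. The following sketch lists the (band, row, pixel) labels in storage order for the 3 × 3 × 3 example; the function name `storage_order` and the label "BSQ" for Band Sequential are choices made here for illustration.

```python
def storage_order(bands, rows, cols, scheme):
    """Return 1-based (band, row, pixel) labels in storage order."""
    B = range(1, bands + 1)
    R = range(1, rows + 1)
    P = range(1, cols + 1)
    if scheme == "BIP":  # band varies fastest: all bands of one pixel together
        return [(b, r, p) for r in R for p in P for b in B]
    if scheme == "BIL":  # one full row of each band in turn
        return [(b, r, p) for r in R for b in B for p in P]
    if scheme == "BSQ":  # one whole band after another
        return [(b, r, p) for b in B for r in R for p in P]
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("BIP", "BIL", "BSQ"):
    first = storage_order(3, 3, 3, scheme)[:4]
    print(scheme, " ".join(f"b{b}r{r}p{p}" for b, r, p in first))
```

All three orders contain the same 27 values, so converting between them is a pure reordering.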
9.1.4 Data mining or extraction
Data mining is an emerging research field concerned with searching for and discovering valuable information in satellite data and extracting knowledge from huge volumes of data. Data mining techniques have been successfully applied in many different fields, such as marketing, manufacturing, process control, fraud detection, and network management, and to a variety of datasets, such as market basket data, web data, deoxyribonucleic acid (DNA) data, text data, and spatial data. Data mining techniques enable more opportunistic use of data banks of aerial RS images. They also address the type of change, the time of change, the pattern of regrowth, the extent of change, and so on. The three important kinds of data mining for satellite images are as follows:
• spatial data mining
• temporal data mining
• spatiotemporal data mining
9.1.4.1 Spatial data mining
Mining knowledge from a huge amount of spatial data is known as spatial data mining. Large amounts of spatial data are obtained in applications ranging from planning and environmental assessment to computer cartography, RS, and GIS. Spatial data mining extracts spatial relations, existing knowledge, and other patterns stored in spatial databases, and thus requires a synthesis of data mining and spatial data techniques. The data inputs of spatial data mining are more complex than the inputs of classical data mining because they include extended objects, such as points, lines, and polygons. The data inputs have two kinds of attributes: spatial and nonspatial. Nonspatial attributes characterize the nonspatial features of objects, such as the name, population, and unemployment rate of a city, whereas spatial attributes define the spatial location and extent of spatial objects, for example, latitude, longitude, elevation, and shape. Relationships between nonspatial objects are explicit in the data inputs, such as ordering, arithmetic relation, subclass-of, instance-of, and membership-of, whereas relationships between spatial objects, such as intersect, behind, and overlap, are implicit. One way to deal with implicit spatial relationships is to materialize them into traditional data input columns and then apply classical data mining techniques, which results in a loss of information. Another way is to develop models or techniques that incorporate spatial information into the spatial data mining process.
Spatial data mining applies data mining methods to spatial data and spatial relations in order to discover interesting patterns, similarities, and characteristics that exist in huge spatial datasets. It evolved from data mining areas in statistics and computer science, such as clustering, classification, and information visualization. The most broadly utilized spatial data mining methods are decision trees, neural networks, Support Vector Machines, rule-based classification, and machine learning. Spatial data mining differs from general business data mining because of the complexity of spatial data. Some of its characteristics are as follows:
• The quantity of data is very large, the data sources are abundant, the data types are numerous, and the access methods are complicated.
• The application domain is very extensive: any data related to spatial location can be mined.
• Several mining methods and algorithms exist, but most of the algorithms are difficult and complicated.
• Knowledge can be expressed in diverse ways, and its comprehension and evaluation depend on a person’s cognitive understanding of the objective world.
Spatial data mining techniques can be applied to the prediction, detection, and mapping of any phenomenon that manifests a spatial pattern. Drought, forest fire, and flood, which have a huge spatial extent, are a few of the spatial phenomena with predictable spatial patterns that are captured through RS images and products. Fig. 9–6 depicts the structure of spatial data mining. Over the past two decades, spatial data mining has grown in popularity as aerial images of remote locations have been collected. Major space research agencies of several nations, spread across the continents, hold abundant stores of satellite data. Satellite images carry spatial information about the coverage area, such as land cover, land use, crops and cultivation, vegetation, regional population, soil, buildings, unutilized cultivars, roads, and everything else on the surface of the geographical area. Images collected at different time stamps illustrate life systems and their transformations from several perspectives. Hence, it has become an active research area for researchers from diverse disciplines who use it, directly or indirectly, to make decisions.
FIGURE 9–6 Schematic architecture of spatial data mining.
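The first strategy mentioned in this section, materializing an implicit spatial relationship into an ordinary data column, can be sketched as follows. The example assumes each object is reduced to an axis-aligned bounding box, which is exactly the kind of simplification that loses information; the function names and the sample objects are hypothetical.

```python
from itertools import combinations


def boxes_overlap(a, b):
    """True if two boxes (xmin, ymin, xmax, ymax) share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def materialize_overlaps(objects):
    """Turn the implicit 'overlap' relation into explicit table rows.

    `objects` maps an object name to its bounding box. Each returned row
    can be fed to a classical (nonspatial) data mining method.
    """
    rows = []
    for (name_a, box_a), (name_b, box_b) in combinations(sorted(objects.items()), 2):
        rows.append(
            {"a": name_a, "b": name_b, "overlaps": boxes_overlap(box_a, box_b)}
        )
    return rows


objects = {
    "lake": (0, 0, 4, 4),
    "park": (3, 3, 6, 6),
    "road": (10, 0, 11, 9),
}
for row in materialize_overlaps(objects):
    print(row)
```

Once the relation is a plain boolean column, the true shapes of the lake and the park no longer influence the result, which illustrates the loss of information noted in the text.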
9.1.4.2 Temporal data mining
Temporal data mining is defined as the process of knowledge discovery in temporal datasets, which enumerates structures, such as models or temporal patterns, over temporal data. Its major goal is to discover unexpected trends, temporal patterns, or other hidden relations in huge sequential data by integrating statistics, machine learning, and database technologies. A temporal sequence is defined as a sequence of nominal symbols from an alphabet, whereas a sequence of continuous real-valued elements is said to be a time series. Temporal data mining consists of three major tasks: the representation of temporal data, the definition of similarity measures, and the mining tasks themselves.
9.1.4.2.1 Representations of temporal data
Temporal data must be represented efficiently before mining operations are performed. The three major methods for representing temporal data are:
• Time domain-based representations
• Transformation-based representations
• Generative model-based representations
9.1.4.2.2 Time domain-based representations
This is the simplest way to represent temporal data, with minimal manipulation. It keeps the original temporal data as a sequence of the initial samples ordered by occurrence. The temporal sequence can also be segmented into several parts, each of which is represented by a linear function. In general, this representation is easy to implement and prevents loss of the information contained in the temporal data. However, it demands considerable memory and computational power for mining, so it becomes infeasible for real-world applications involving temporal data of high dimensionality and large volume.
9.1.4.2.3 Transformation-based representations
This representation transforms raw temporal data into a representation space in which the features containing the most discriminatory information are extracted to represent the temporal data. It falls into two categories:
• piecewise representation
• global representation
The piecewise representation partitions the temporal data into segments and then models every segment with a concise representation. Examples include curvature-based Principal Component Analysis segments and adaptive piecewise constant approximation. The global representation models the temporal data through a set of basis functions, so that the coefficients in the parameter space form a global representation from which the temporal data can be reconstructed. The most broadly employed global representations are discrete Fourier transforms, discrete wavelet transforms, and spline or polynomial curve fitting. The advantage of transformation-based representations is that they reduce high-dimensional temporal data to a lower-dimensional feature space, which improves computational efficiency.
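As a small illustration of a piecewise representation, the sketch below replaces each segment of a series by its mean. Note that this is a simplified, non-adaptive variant with equal-length segments, used here as an assumption for clarity; it is not the adaptive piecewise constant approximation named in the text, which chooses segment boundaries from the data.

```python
def piecewise_means(series, n_segments):
    """Approximate a series by the mean of each of n near-equal segments."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    means = []
    for k in range(n_segments):
        lo = k * n // n_segments          # first index of segment k
        hi = (k + 1) * n // n_segments    # one past the last index
        segment = series[lo:hi]
        means.append(sum(segment) / len(segment))
    return means


# Six samples reduced to a three-value representation.
print(piecewise_means([1, 1, 3, 3, 5, 5], 3))
```

The reduction from six values to three is the dimensionality saving that the section describes, at the cost of detail within each segment.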
9.1.4.2.4 Generative model-based representations
In generative model-based representations the temporal data are assumed to be generated by a deterministic or statistical model, such as a mixture of first-order Markov chains, a Hidden Markov Model (HMM), an Autoregressive Moving Average model, or dynamic Bayesian networks; the entire temporal dataset is therefore represented by a combination of these models with suitable model parameters. The HMM has an outstanding capability for capturing temporal features whose values change significantly during the observation period while satisfying the Markov property. Basically, the HMM describes the temporal data as an unobservable stochastic process with a finite number of states, each of which is associated with another stochastic process that emits the observations.
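The simplest of the models listed above is a single first-order Markov chain, whose parameters are the transition probabilities between symbols. The sketch below estimates them from one observed sequence by counting; it is an illustration of the idea rather than a mixture model or an HMM, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate first-order transition probabilities from one sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {
        state: {
            following: count / sum(followers.values())
            for following, count in followers.items()
        }
        for state, followers in counts.items()
    }


# The symbols could stand for land-cover classes observed on successive dates.
model = transition_probabilities("AABAB")
for state, row in sorted(model.items()):
    print(state, dict(sorted(row.items())))
```

The whole sequence is then summarized by a handful of probabilities, which is the sense in which the model parameters represent the temporal data.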
9.1.4.3 Spatiotemporal data mining
Spatiotemporal data mining is defined as the process of discovering interesting and previously unknown patterns in huge spatiotemporal datasets. It has several applications, such as ecology and environmental management, transportation, public safety, epidemiology, earth science, and climatology. The complexity of spatiotemporal data and its intrinsic relationships limit the usefulness of conventional data science approaches for extracting spatiotemporal patterns. Fig. 9–7 depicts the spatiotemporal data mining process. The input spatiotemporal data are first fed to a preprocessing step, which corrects errors, noise, and missing data, and to exploratory space-time analysis, which helps in understanding the underlying spatiotemporal distributions. After preprocessing, a suitable spatiotemporal data mining technique is chosen to produce the output patterns. Common output pattern families include associations and tele-couplings, spatiotemporal outliers, partitions and summarization, predictive models, hotspots, and change patterns. Spatiotemporal data mining approaches have statistical foundations and combine them with scalable computational techniques. Finally, the output patterns are postprocessed and then interpreted by domain scientists to identify novel insights and to refine the data mining approaches when required.
FIGURE 9–7 Block diagram of the spatiotemporal data mining process.
Spatiotemporal data mining is crucial for organizations that make decisions using large spatial and spatiotemporal datasets, such as the National Geospatial-Intelligence Agency, the National Cancer Institute, the National Aeronautics and Space Administration (NASA), the National Institute of Justice, and the US Department of Transportation. These organizations are spread over several application domains. In ecology and environmental management, researchers require tools for classifying RS images to map forest coverage. In public safety, crime analysts are interested in discovering hotspot patterns in crime event maps so that police resources can be allocated effectively. In transportation, researchers analyze historical taxi Global Positioning System (GPS) trajectories to recommend fast routes from one place to another. Epidemiologists use spatiotemporal data mining techniques to detect disease outbreaks. Spatiotemporal data mining is also applicable to domains such as climatology, earth science, precision agriculture, and the Internet of Things (IoT).
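One of the output pattern families mentioned above, hotspots, can be illustrated with a deliberately naive sketch: events are binned into space-time cells and cells above a count threshold are reported. The threshold rule, the function name, and the sample events are assumptions made for illustration; practical hotspot detection normally relies on a statistical significance test rather than a raw count.

```python
from collections import Counter


def space_time_hotspots(events, cell_size, window, min_count):
    """Return {(col, row, slot): count} for busy space-time cells.

    `events` is an iterable of (x, y, t). Space is cut into square cells
    of side `cell_size` and time into slots of length `window`.
    """
    counts = Counter(
        (int(x // cell_size), int(y // cell_size), int(t // window))
        for x, y, t in events
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}


# Hypothetical event records: metres east, metres north, hours since start.
events = [(10, 12, 1), (14, 18, 2), (19, 11, 3), (55, 60, 2), (12, 13, 30)]
print(space_time_hotspots(events, cell_size=20, window=24, min_count=3))
```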
9.1.5 Big data mining for remote-sensing data
Big data mining of RS data has several special and concrete characteristics: the data are multiscale, multisource, dynamic-state, high-dimensional, and nonlinear, among other characteristics. Earth observation, airborne, and space-borne sensors from various countries provide a huge amount of remotely sensed data. The multisource character of RS big data is obvious, and its main cause is that several instruments are needed to obtain the data. Moreover, the physical meanings of the multisource data are totally different. From the perspective of the imaging mechanism, the major data types are microwave data, optical data, and point cloud data. Other RS data types are stereographic pairs produced from multiple photographs and gravity data, which show the gravity situation and the amount of water available in a region. Multisource data allow the information to be understood and used from multiple viewpoints; as a result, deciding which data are appropriate for a particular application can be confusing. These data are employed for various applications, such as global climate change, natural hazard monitoring, urban planning, and so on. Furthermore, these data have become an economic asset and an important new resource in numerous applications. RS big data always reflect a dynamic state because of changes on Earth’s surface and the movement of the satellites. The dynamic state of RS big data consists of both stationary and nonstationary parts: the changes caused by the Earth orbiting the Sun and rotating about its own axis are stationary stochastic processes, whereas the changes caused by natural disasters, such as earthquakes and volcanic eruptions, and by human activities are nonstationary stochastic processes. RS data mining has received great attention from many researchers. To strengthen epidemiological surveillance capacity, data mining algorithms have been introduced for finding epidemiological areas of risk based on RS satellite data. For coastal areas, an RS and data mining approach has been introduced for designing a coupled spatiotemporal algal bloom model.
9.1.6 Big data mining methods for social welfare application
This section describes some applications of big data mining methods to social welfare.
9.1.6.1 A data mining approach for heavy rainfall forecasting based on satellite image sequence analysis
For more than 30 years meteorological satellite data have been used operationally in weather services. Over this period, severe weather forecasting using satellite RS data has remained very challenging. Early warnings of severe weather help to prevent the damage and casualties caused by natural disasters. Such early warnings are particularly significant in China’s Yangtze River Basin, which suffers from severe flooding, so that flood control in China has become increasingly urgent and grave. For example, the severe flood in the Yangtze River Basin in 1998 resulted in the deaths of 4150 people and property damage of approximately 32 billion US dollars. As almost all floods are caused by intensive heavy rainfall, the responsible authorities have a clear mandate to provide advance forecasts of possible heavy rainfall. Investigating the evolution of Mesoscale Convective Systems (MCSs) over the Tibetan Plateau on the basis of satellite RS data is very significant for forecasting heavy rainfall. In Ref. [2], spatial data mining is introduced for forecasting possible heavy rainfall by tracking MCSs in RS satellite images. First, an automatic method is established for tracking objects in satellite image sequences in order to find the relevant MCSs, their characteristics, and their moving trajectories. Then a two-phase spatial data mining technique is developed to deduce the causalities and correlations between MCS activities and the occurrence of possible heavy rainfall. This framework lifts the heavy burden of manual rainfall forecasting by automatically interpreting and analyzing massive meteorological RS datasets to assist weather forecasting.
9.1.6.2 Using spatial reinforcement learning to build forest wildfire dynamics models from satellite images
Forest wildfire prediction is another social welfare application addressed with data mining. The costs, impacts, and risks of forest wildfires are a perennial and unavoidable concern in many areas of the world. Several factors will increase the importance and difficulty of this domain in future years, such as urban sprawl growing into areas of high wildfire risk, climate change, and past fire management that focused on immediate suppression at the cost of increased future risk. Forest wildfire spread is a prominent example of a spatially spreading process (SSP), in which local features change over time through a dynamic process. Spatial autocorrelation is an indicator of the presence of an SSP, but the dynamic changes over time among spatial positions do not result in values being similar or inversely correlated in a simple way. In Ref. [4], spatial reinforcement learning is developed for forest wildfire prediction. Here, the fire is treated as an agent that can be located at any cell of the landscape, and at any point in time it may take one of a set of actions from its position: spreading north, south, east, or west, or not spreading. This framework inverts the usual RL setup, so that the dynamics of the corresponding Markov Decision Process represent the immediate wildfire spread, while the agent policy is learned as a predictive model of the dynamics of a complex spatial process. The resulting classification of cells is compared with satellite and other related data. The behavior of five RL approaches is examined for this problem: policy iteration, value iteration, Monte Carlo Tree Search, Q-learning, and the Asynchronous Advantage Actor-Critic (A3C).
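As an illustration of the formulation above, the following minimal sketch treats the fire as an agent on a small grid and solves the resulting decision process with value iteration, one of the five approaches examined in Ref. [4]. It is not the authors' implementation: the grid, the reward scheme (+1 for entering a cell that a later image shows as burned, -1 otherwise), the discount factor, and all function names are assumptions made for this example.

```python
# Actions available to the fire "agent": spread to a neighbor or do not spread.
ACTIONS = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1), "stay": (0, 0)}


def step(cell, action, shape):
    """Deterministic move of one cell; the agent stays put at the grid border."""
    rows, cols = shape
    r, c = cell[0] + ACTIONS[action][0], cell[1] + ACTIONS[action][1]
    return (r, c) if 0 <= r < rows and 0 <= c < cols else cell


def value_iteration(reward, gamma=0.9, tol=1e-6):
    """reward[r][c] is the (assumed) reward for the fire entering cell (r, c)."""
    rows, cols = len(reward), len(reward[0])
    values = [[0.0] * cols for _ in range(rows)]
    while True:
        updated = [[0.0] * cols for _ in range(rows)]
        delta = 0.0
        for r in range(rows):
            for c in range(cols):
                targets = [step((r, c), a, (rows, cols)) for a in ACTIONS]
                updated[r][c] = max(reward[i][j] + gamma * values[i][j] for i, j in targets)
                delta = max(delta, abs(updated[r][c] - values[r][c]))
        values = updated
        if delta < tol:
            return values


def greedy_action(cell, reward, values, gamma=0.9):
    """Action with the highest one-step lookahead value from the given cell."""
    shape = (len(reward), len(reward[0]))

    def score(action):
        r, c = step(cell, action, shape)
        return reward[r][c] + gamma * values[r][c]

    return max(ACTIONS, key=score)


# Hypothetical burn map taken from the next satellite image (1 = burned).
burned_next = [[0, 0, 0],
               [0, 1, 1],
               [0, 0, 1]]
reward = [[1.0 if b else -1.0 for b in row] for row in burned_next]
values = value_iteration(reward)
print(greedy_action((1, 0), reward, values))  # east: toward the burned cells
```

In this toy setting the learned policy simply steers the fire toward cells that the later image shows as burned; a realistic model would condition the policy on local features such as fuel, wind, and terrain.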
9.1.6.3 Improved density-based spatial clustering of applications of noise clustering algorithm for knowledge discovery in spatial data
Data collection is an important and typical step in spatial data mining and knowledge discovery, and the efforts of government agencies, scientific projects, and the private sector have made it possible to collect large datasets with spatial features. For multidimensional data, the selection of dynamic data and moving objects requires advanced mining and knowledge discovery methods. To handle such research activities and challenges, spatial data mining has been developed as a strong tool within geovisualization. In Ref. [5], Improved Density Based Spatial Clustering of Application of Noise (IDBSCAN) is developed for knowledge discovery. The algorithm has been modified from time to time by several project agencies and researchers with respect to several parameters. The steps followed in the developed model are selection of data, collection of data, data cleaning, preprocessing, clustering, classification, and transformation. Clustering methods are very useful in several fields of human life, such as GIS, GPS, air traffic control, weather forecasting, area selection, water treatment, planning of urban and rural areas, cost estimation, very-large-scale integration design, and RS. The approach is designed to add some important attributes in order to generate better clusters for further processing.
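The clustering step can be illustrated with plain DBSCAN, the algorithm that IDBSCAN builds on. The sketch below is a minimal, unoptimized version written for this text; it does not include the additional attributes or other improvements of Ref. [5], and the parameter names eps and min_pts as well as the sample coordinates are assumptions.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise          # may later become a border point
            continue
        cluster += 1                   # point i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster    # border point: joins but is not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


# Two dense groups of (hypothetical) coordinates and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The brute-force neighbor search is quadratic in the number of points; for large spatial datasets a spatial index would normally replace region_query.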
9.1.6.4 Data mining algorithms for land cover change detection: a review
Land cover change detection is an active research topic in the RS community. Land cover change affects hydrology, local climate, biogeochemistry, and radiation. Here, vegetation indices [Enhanced Vegetation Index (EVI)/normalized difference vegetation index (NDVI)] and Terra MODIS data products are used for detecting land cover change. However, these data products pose several challenges, such as spatiotemporal correlation, seasonality of the data, poor-quality measurements, missing values, and high-dimensional, high-resolution data. Land cover change is detected by linking two or more satellite snapshot images obtained on different dates. Ref. [6] describes the cumulative sum (CUSUM) method, which was originally developed for change detection. This approach is a parameter change method, which uses the mean of the observations as the parameter for identifying the dispersion of the data values. The basic CUSUM approach works with the expected value μ: it maintains a running statistic, the cumulative sum of deviations, by adding up the difference between every observation and the expected value. If the CUSUM stays approximately zero, there is no change in the time series. A change in the time series is flagged when the CUSUM exceeds a user-specified threshold. A strongly negative or strongly positive CUSUM implies a decrease or an increase, respectively, in the mean vegetation value.
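The basic CUSUM test described above can be written in a few lines. The following sketch is illustrative only: the function names, the sample vegetation index values, the expected value of 0.60, and the threshold of 0.3 are assumptions rather than values from Ref. [6].

```python
def cusum(series, expected):
    """Running sum of the deviations of each observation from the expected value."""
    total, sums = 0.0, []
    for x in series:
        total += x - expected
        sums.append(total)
    return sums


def first_change(series, expected, threshold):
    """Return (index, direction) where |CUSUM| first exceeds the threshold, else None."""
    for i, s in enumerate(cusum(series, expected)):
        if abs(s) > threshold:
            return i, "increase" if s > 0 else "decrease"
    return None


# Hypothetical vegetation index series: stable around 0.60, then a drop.
ndvi = [0.60, 0.62, 0.58, 0.60, 0.40, 0.38, 0.41]
print(first_change(ndvi, expected=0.60, threshold=0.3))  # (5, 'decrease')
```

The sign of the cumulative sum gives the direction of the change, and the threshold controls the trade-off between early detection and false alarms caused by noise or seasonality.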
9.1.6.5 An autonomous forest fire detection system based on spatial data mining and fuzzy logic
Fire is one of the main causes of surface change, and it occurs in vegetation zones across the world. Forest fires are a key ecological threat to the environment, besides endangering human lives and harming the economy. Moreover, forest fires are a global concern, because they contribute to the deterioration of the Earth's ecosystem and, in particular, to global warming. Disruption of the functions of natural systems, large fires caused by humans, and other factors such as wind, drought, vegetation, and topography have an important indirect influence on the occurrence and spread of fires, which damage the natural environment. Every year, several million hectares (ha) of forest are destroyed all over the world by forest fires. During the 2007 fires, a total of 575,531 ha of land was destroyed in several European nations. From 1980 to 2006, about 1.33 million ha of forest land was ruined by fires. Although fires are a customary component of Canadian forest ecosystems, they still threaten public property and safety. In addition, vulnerable communities occasionally have to be evacuated, and the damage amounts to several million dollars. Indian forests are also in jeopardy from forest fires, which lead to their degradation. During the last few summers, the vegetation cover of the Garhwal Himalayas has gradually deteriorated because of forest fires. In Ref. [7], an approach is developed for automatically detecting forest fires from spatial data of forest regions using clustering and fuzzy logic. First, the digital satellite images are converted into the CIELAB color space, and clustering is then carried out to identify the regions with fire hotspots. A fuzzy set is generated from the color space values of the segmented regions, and fuzzy rules are then derived by fuzzy logic reasoning to detect forest fires.
9.1.6.6 System refinement for content-based satellite image retrieval
Content-based satellite image retrieval (CBSIR) approaches are very important for dealing with traditional images. The major challenge behind CBSIR is to bridge the gap between the low-level features used to describe scenes and the semantic concepts understandable by a human, known as the semantic gap. Moreover, these semantic concepts are defined differently by different people: each person interprets what they see from their own point of view. In Ref. [8], an image retrieval process is developed based on the unique properties of satellite images. This framework employs a query-by-polygon paradigm instead of the conventional rectangular query-by-image approach. The features are first extracted from the satellite images using multiple tiling sizes. Moreover, the system employs multilevel features within a multilevel retrieval system for a better retrieval process.
9.1.6.7 Automated detection of clouds in satellite imagery
Cloud detection in satellite images has several applications in climate and weather studies. Clouds alter the energy budget of the Earth-atmosphere system by absorbing and scattering shortwave radiation. The absorption and scattering characteristics of clouds vary with the microphysical properties of the cloud type. Hence, detecting clouds over a region in satellite imagery is very significant for deriving surface or atmospheric parameters. However, clouds are also a contaminant whose presence interferes with the retrieval of atmospheric or surface information. Identifying the cloud-contaminated pixels in satellite imagery makes it possible to separate them from the cloud-free pixels, from which atmospheric thermodynamic information is retrieved.
Automated detection methods use spatial analysis techniques to detect the contrast between the energy reflected from clouds and that reflected from the surrounding scene in order to determine the extent of cloud cover over a region. This technique has difficulties under low solar illumination conditions, when other reflective noncloud surfaces such as ice, snow, and sand are present, and under other complicating optical conditions, such as scattering by aerosols. Predetermined energy threshold values, computed using solar illumination angles, are also used to delineate cloudy from cloud-free pixels in satellite imagery.
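A threshold test of this kind can be sketched as follows. This is only a hedged illustration of the idea, not a specific operational algorithm: the normalization by the cosine of the solar zenith angle, the threshold of 0.3, and the function name cloud_mask are assumptions made for the example.

```python
import math


def cloud_mask(reflected, solar_zenith_deg, threshold=0.3):
    """Flag pixels whose sun-angle-normalized reflected energy exceeds a threshold.

    reflected: 2-D list of reflected-energy values scaled to the range 0..1.
    solar_zenith_deg: solar zenith angle for the scene, in degrees.
    """
    mu0 = math.cos(math.radians(solar_zenith_deg))
    if mu0 <= 0:
        raise ValueError("sun at or below the horizon: a reflectance test is not usable")
    # Dividing by cos(zenith) compensates for weaker illumination at low sun angles.
    return [[value / mu0 > threshold for value in row] for row in reflected]


scene = [[0.05, 0.40],
         [0.20, 0.10]]
print(cloud_mask(scene, solar_zenith_deg=60.0))  # [[False, True], [True, False]]
```

As noted in the text, a single brightness threshold will also flag snow, ice, or sand, so in practice it would be combined with additional tests.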
9.2 Summary
RS data is well suited to geospatial technologies, which have a growing impact in a wide variety of areas, from commerce to public policy. With advances in sensor system technologies and digital computing, the use of RS data has evolved from the interpretation of aerial photographs to the analysis of satellite imagery, and from local area studies to global analyses. Today, remote-sensor systems can provide data on the energy emitted, reflected, and/or transmitted in all parts of the electromagnetic spectrum. This chapter has provided a brief overview of the background of RS, big data mining for RS data, and its applications to social welfare.
References
[1] Lohani B, Ghosh S. Airborne LiDAR technology: a review of data collection and processing systems. Proc Natl Acad Sci India Sect A: Phys Sci 2017;87(4):567–79.
[2] Yang Y, Lin H, Guo Z, Jiang J. A data mining approach for heavy rainfall forecasting based on satellite image sequence analysis. Comput Geosci 2007;33:20–30.
[3] Roy PS, Behera MD, Srivastav SK. Satellite remote sensing: sensors, applications and techniques. Proc Natl Acad Sci India Sect A: Phys Sci 2017;87:465–72.
[4] Subramanian SG, Crowley M. Using spatial reinforcement learning to build forest wildfire dynamics models from satellite images. Front ICT 2018;5:6.
[5] Sharma A, Gupta RK, Tiwari A. Improved density based spatial clustering of applications of noise clustering algorithm for knowledge discovery in spatial data. Math Probl Eng 2016;3(2):1–9.
[6] Panigrahi S, Verma K, Tripathi P. Data mining algorithms for land cover change detection: a review. Sadhana 2017;42(12):2081–97.
[7] Prasad KSN, Ramakrishna S. An autonomous forest fire detection system based on spatial data mining and fuzzy logic. IJCSNS: Int J Comput Sci Netw Secur 2008;8(12).
[8] Laban N, ElSaban M, Nasr A, Onsi H. System refinement for content based satellite image retrieval. Egypt J Remote Sens Space Sci 2012;15:91–7.
10 Advancement of data mining methods for improvement of agricultural methods and productivity
Anush Prabhakaran1, Chithra Lekshmi K. S.2, Ganesh Janarthanan3
1 DEPARTMENT OF MECHATRONICS ENGINEERING, KUMARAGURU COLLEGE OF TECHNOLOGY, SARAVANAPATTI, COIMBATORE, INDIA
2 DEPARTMENT OF VISUAL COMMUNICATION (ELECTRONIC MEDIA), PSG COLLEGE OF ARTS & SCIENCE, COIMBATORE, INDIA
3 L&T TECHNOLOGY SERVICES LTD, BANGALORE, INDIA
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00010-0 © 2021 Elsevier Inc. All rights reserved.
10.1 Agriculture data: properties and analysis
Agriculture plays an important role in the socioeconomic framework of India, and it is a unique business that depends on factors such as the economy and the climate. More than 70% of rural households depend on agriculture. The sector provides employment to more than 60% of the total population, but it faces several challenges, such as producing more and better while improving sustainability through the reasonable use of natural resources, minimizing environmental degradation, and adapting to climate change. Meanwhile, according to a United Nations report, the world population is expected to reach 9.8 billion in 2050 and 11.2 billion in 2100 [1]. Hence, in the next 20 years, world food production must increase by 50% to feed the projected world population. Agricultural intensification is therefore needed to provide for a growing and increasingly demanding human population. However, agricultural intensification has a profound effect on the environment, such as soil degradation caused by water and wind erosion, water and air pollution due to excessive nutrients and agrochemicals, and loss of ecological and biological diversity. To minimize the negative impact of production, agricultural production processes must be transformed in a more sustainable direction. Moreover, agriculture enables farmers to grow the ideal crops in keeping with the environmental balance. In India, rice and wheat are the major food crops, along with potatoes, oilseeds, sugarcane, and so on. In addition, farmers grow nonfood items, such as cotton, jute, and rubber. India ranks second worldwide in farm output, and agriculture is the broadest economic sector in India's socioeconomic fabric. Farming depends on several climatic and economic factors, such as irrigation, temperature, soil, cultivation, rainfall, fertilizers, and pesticides. Historical information on crop yield is a major input for the companies involved in this domain. These companies use agricultural products as raw materials, for example in paper production and animal feed. Information on crop production helps these companies plan supply chain decisions, such as production scheduling.
10.1.1 Data representation
Agricultural data is highly heterogeneous. The heterogeneity concerns both the subject of the collected data and the ways in which the data are generated. Data obtained from the field or farm includes information on planting, spraying, materials, yields, soil types, in-season imagery, and weather. A large amount of information is available online, such as drone-based data on cultivation and details of fertilizer production and consumption, crop production, and productivity, and this information can be used to make farming practices better and more efficient. Data has a far-reaching effect on the farming business, so it is difficult to pinpoint every party involved and what belongs to each. Data also brings high velocity, volume, and variety to agricultural operations, from farm cultivation to marketing. However, a large amount of information is required to provide the knowledge that drives day-to-day operational decisions and farming tasks and that updates business models for expansion and diversification. Agricultural data is also unstructured in nature. Its heterogeneity stems from the variety of ways in which it is collected. The information collected from farmers, fields, or farms integrates data on seed selection, harvests, planting patterns, plant varieties, spraying, fertilizer, yields, materials, soil types, in-season imagery, weather, and various practices. The three general categories of data generation are given below.
• Process-mediated (PM)
• Machine generated (MG)
• Human sourced (HS)
10.1.1.1 Process-mediated
This data, also called traditional business data, results from agricultural processes. It is necessary for recording and monitoring business events, such as purchasing inputs, feeding, seeding, applying fertilizer, and taking an order. PM data is highly structured and includes reference tables, transactions, and relationships, along with metadata that defines its context. Traditional business data makes up the large majority of the data that Information Technology (IT) processes and manages; it is available to both operational and business information systems and is stored and structured in relational database systems. Only a small, well-defined subset of the data from the other sources passes through the intervening business process layer to become PM data. Those sources are more flexible than traditional process-mediated data. In fact, the business processes that create PM data reduce the flexibility and timeliness of the underlying information in order to ensure the consistency and quality of the resulting PM data. PM data is therefore highly organized and structured in nature and easy to analyze.
10.1.1.2 Machine generated
MG data is derived from the increasing number of smart machines and sensors used to record and measure farming processes, a development currently boosted by the so-called Internet of Things (IoT), which is applied here to smart farming. MG data ranges from simple sensor records to complex computer logs and is typically well structured. As sensors proliferate and data volumes grow, MG data is becoming an increasingly important component of the farming information that is stored and processed. Its structure is suitable for computer processing, but its speed and size are beyond traditional approaches. The potential of Unmanned Aerial Vehicles (UAVs) for smart farming is well recognized. Drones equipped with modern instruments, such as infrared cameras, and Global Positioning System (GPS) technology are transforming agriculture by supporting better risk management and decision-making. In livestock farming, smart dairy farms replace labor with robots in activities such as cleaning the barn, feeding the cows, and milking. On arable farms, precision technology is employed to manage information about every plant in the field. With these new technologies, data no longer appears only in traditional tables but also in other formats, such as images or sounds. At the same time, various advanced data analysis approaches have been developed that trigger the use of data in images or other formats. MG data is collected by remote sensing or satellite techniques, by drones, and by the other smart machines and sensors used to record and measure the various farming processes.
10.1.1.3 Human sourced
HS data is the record of human experience published in articles, books, journals, and works of art in the form of graphics, photographs, text, video, and audio. Nowadays, HS information is mostly digitized and stored everywhere from personal computers to social networks. HS data is generally loosely structured rather than well structured, and it is often unorganized, messy, and ungoverned. In the context of big data and smart farming, data from social media is used to study consumer sentiment, product feedback, and consumer behavior. Several social media platforms, such as Twitter, Instagram, and Facebook, are used to obtain this data.
10.1.2 Data management
Data management, or control, processes ensure that the objectives of the business process are achieved even if disturbances occur. A controller measures the behavior of the system and corrects it if the measurements do not comply with the system objectives. Data management is typically organized as a feedback loop consisting of a sensor, a norm, a discriminator, a decision maker, and an effector. Consequently, the basic functions for managing data are:
• Sensing and monitoring
• Analysis and decision-making
• Intervention
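The feedback loop described above (sense, compare with a norm, decide, intervene) can be sketched in a few lines. The example is purely illustrative: the soil moisture norm of 30%, the tolerance, the proportional gain, and the function names are assumptions and do not come from the chapter.

```python
def control_step(measurement, norm, tolerance, gain):
    """One pass through the loop for a single sensed value.

    Returns the size of the corrective intervention (0.0 means no action).
    """
    deviation = measurement - norm      # discriminator: compare sensor value with the norm
    if abs(deviation) <= tolerance:     # decision maker: within the norm, leave the process alone
        return 0.0
    return -gain * deviation            # effector: correction that opposes the deviation


def run_loop(readings, norm, tolerance=2.0, gain=0.5):
    """Apply the control step to a sequence of sensor readings."""
    return [control_step(m, norm, tolerance, gain) for m in readings]


# Hypothetical soil moisture readings (%) checked against a norm of 30%.
print(run_loop([30.0, 24.0, 36.0], norm=30.0))  # [0.0, 3.0, -3.0]
```

A positive result here stands for adding water and a negative one for holding it back; a real farm controller would translate the value into a concrete action such as valve opening time.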
10.1.2.1 Sensing and monitoring
Sensing and monitoring are essential for managing data in agriculture. This function measures the actual performance of the farm processes. The measurement can be performed manually by a human observer or automatically by sensing technologies, such as sensors or satellites. In addition, external data can be acquired to complement the direct observations.
10.1.2.2 Analysis and decision-making
This function compares the measurements with the norms that specify the desired performance (in terms of quality, quantity, and lead time), signals deviations, and decides on the suitable intervention for removing the signaled disturbances.
10.1.2.3 Intervention
Intervention is the third essential function for managing data in agriculture. Here, the selected intervention is implemented to correct the performance of the farm processes.
The data chain refers to the sequence of activities from data capture to decision-making and data marketing. These activities are essential for managing data in farm management. The data management process is depicted in Fig. 10–1. As an integral part of the business process, the data chain contains a technical layer, which captures the original data and converts it into information, and a business layer, which makes decisions and derives value from the business intelligence and data services provided. The two layers are interwoven at every stage, and together they form the basis of what has come to be known as the data value chain.
10.1.2.4 Data capture
The initial stage of data management in agriculture is data capture, in which data is collected from sensors, UAVs, open data, genotype information, biometric sensing, and reciprocal data. The key issues in data capture are quality, formats, and availability.
FIGURE 10–1 Data management process chain.
10.1.2.5 Data storage
In the data storage phase, the data is stored in cloud-based platforms, hybrid storage systems, the Hadoop Distributed File System, and cloud-based data warehouses. The key issues in storing the data are quick and safe access to the data and the costs.
10.1.2.6 Data transfer
In this phase, the data is transferred using cloud-based platforms, wireless networks, and linked open data. The key issues in transferring the data are safety and agreements on liabilities and responsibilities.
10.1.2.7 Data transformation
Data transformation is performed using machine learning approaches, visualization, normalization, and anonymization. The key issues that arise when transforming the data are the heterogeneity of the data sources and the automation of data cleansing and preparation.
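Two of the transformations named above, normalization and anonymization, can be illustrated with a short sketch. Min-max scaling and salted hashing are chosen here only as common examples; the chapter does not prescribe specific techniques, and the function names, the salt, and the sample values are assumptions.

```python
import hashlib


def min_max_normalize(values):
    """Rescale values linearly to the range 0..1 (all zeros if the values are constant)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pseudonymize(identifier, salt):
    """Replace an identifier with a short salted SHA-256 digest."""
    digest = hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()
    return digest[:12]


# Hypothetical yields (t/ha) and a hypothetical farm identifier.
print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
print(pseudonymize("farm-17", salt="example-salt"))
```

Hashing an identifier is strictly pseudonymization: the same farm always maps to the same token, so records can still be linked, and further measures would be needed before data could be considered anonymous.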
10.1.2.8 Data marketing
Agricultural marketing covers the services involved in moving an agricultural product from the farm to the consumer. These services involve the planning, organizing, directing, and handling of agricultural produce in such a way as to satisfy farmers, intermediaries, and consumers. Numerous interconnected activities are involved in doing this, such as planning production, growing and harvesting, grading, packing and packaging, transport, storage, agro- and food processing, provision of market information, distribution, advertising, and sale. Effectively, the term encompasses the entire range of supply chain operations for agricultural products, whether conducted through ad hoc sales or through a more integrated chain, such as one involving contract farming.
10.1.3 Irrigation management using data mining
Data mining is the process of analyzing enormous sets of data to discover previously unknown, useful relationships and patterns that may be applicable to current agricultural problems. The "mined" information is typically represented as a model of the semantic structure of the dataset, and the model may be applied to new data for prediction or early warning. Water is an essential factor for plant development. Applying too much or too little water has a negative impact on plant growth, and inadequate or poorly designed irrigation is the source of several problems. Underirrigation increases the risk of salinization, whereas overirrigation may spread amoeba cysts, pathogens such as Pseudomonas, parasite eggs and larvae (including eelworms), and pollutants, such as biocides and drug residues, in crops. Irrigation is the artificial exploitation and distribution of water at project level, aiming at the application of water at field level to agricultural crops in dry areas or in periods of scarce rainfall in order to assure or improve crop production. Irrigation is an important factor in the success of crops such as sugarcane. First, the feasibility of irrigation should be checked against the local climate conditions, the availability of water, the soil types, and the economic aspects. The decision to irrigate depends not only on the distribution and quantity of rainfall in the region but also on the financial capacity of the producer, the economic feasibility of the investment, and the response of the crop to irrigation. If the economic and technical conditions justify the cultivation of irrigated crops, the most suitable irrigation system is selected.
Irrigation management is the set of technical decisions involving the characteristics of the climate, crop, soil, water, and irrigation system. The aim of an irrigation regime is to improve crop productivity and to reduce yield losses from water shortage. In most agricultural systems, crop transpiration increases the soil water deficit. Although current drip irrigation practice can replenish the daily soil water deficit, it is difficult to determine the deficit perceived by the plant and to replenish the depleted soil region. Moreover, it may not be economical to replenish the water deficit in the soil, which has led farmers and scientists to use the plant itself as an integrator and sensor of soil water deficits. Irrigation efficiency depends on the type of irrigation used, and irrigation scheduling is employed to determine the amount of water to be applied to the crop. Irrigation scheduling is a key factor in intensive agriculture: underirrigation results in reduced crop yield and quality, whereas overirrigation increases the nutrient requirements of the crop and its vulnerability to diseases, the energy costs of water pumping, water loss, and environmental pollution from the leaching of nutrients applied to the crop by conventional fertilization or fertigation. Drip irrigation is often still managed on the basis of the grower's experience, and this inadequate management is one of the reasons for nitrate leaching in greenhouse tomato production in Almeria, Spain. The main aim of an efficient irrigation program is to supply the crop with sufficient water while reducing the water wasted through deep percolation and runoff. Different methods of irrigation scheduling have been introduced, each with advantages and disadvantages. Innovative methods based on the direct monitoring of plant water relations have also been proposed for irrigation scheduling. Irrigation also has many other applications in crop production and beyond, which include:
• Protecting plants against frost
• Suppressing weed growth in grain fields
• Preventing soil consolidation
• Dust suppression
• Disposal of sewage, and use in mining
Adequate irrigation management, combined with other cultivation methods, allows the producer to reach high productivity levels and to save energy and water, besides helping to preserve the environment. Well-conducted management defines the right moment for starting irrigation and identifies the amount of water the crop needs, and hence the duration of water application or the displacement speed of the irrigation equipment. Soil is the natural reservoir of water for plants. The water stored in the soil and available to plants lies between the permanent wilting point and the field capacity. Some of the important parameters considered in irrigation management are listed below.
• Field capacity
• Permanent wilting point
• Soil density
10.1.3.1 Field capacity
Field capacity (Cc) is the upper limit of available water: it is the moisture of the soil after the water contained in the macropores has drained away under gravity. This moisture condition favors higher absorption of water and nutrients by plants. In general, the field capacity is determined in the laboratory by the retention curve method. The moisture at field capacity is the equilibrium moisture at a tension of about 6–33 kPa, depending on the texture, structure, and organic matter content of the soil.
10.1.3.2 Permanent wilting point
Permanent wilting point (Pm) is the lower limit of available water. At this moisture level, plants can no longer absorb water unless the water in the soil is replenished. The permanent wilting point is also determined in the laboratory using the retention curve method, and the moisture at the wilting point is the equilibrium moisture at a tension of about 1,500 kPa. To draw the retention curve, the soil moisture values are determined after the samples have been subjected to various tensions in a Richards extractor. For practical irrigation purposes, the field capacity is usually obtained at a tension of about 10 kPa (0.10 atm) in sandy soils and 33 kPa (0.33 atm) in clay soils. The moisture at the permanent wilting point corresponds to a tension of about 1,500 kPa (15 atm) [2].
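Because the water available to plants lies between the permanent wilting point and the field capacity, the two limits can be combined into a storage depth. The sketch below uses the standard relation (Cc − Pm) × soil bulk density × root depth, with moisture expressed on a dry-mass basis; this formula is not stated in the chapter, and the function names and sample values are assumptions made for illustration.

```python
def available_water_mm(field_capacity, wilting_point, bulk_density, root_depth_mm):
    """Depth of plant-available water (mm) stored in the root zone.

    field_capacity, wilting_point: moisture as mass fractions (g water per g dry soil).
    bulk_density: dry soil density relative to water (numerically g/cm^3).
    root_depth_mm: depth of the soil layer explored by the roots, in mm.
    """
    if not 0 <= wilting_point <= field_capacity:
        raise ValueError("expected 0 <= wilting point <= field capacity")
    return (field_capacity - wilting_point) * bulk_density * root_depth_mm


def irrigation_need_mm(current_moisture, field_capacity, bulk_density, root_depth_mm):
    """Water depth (mm) needed to bring the root zone back to field capacity."""
    deficit = max(field_capacity - current_moisture, 0.0)
    return deficit * bulk_density * root_depth_mm


# Hypothetical soil: Cc = 30%, Pm = 15%, bulk density 1.3 g/cm^3, 400 mm root depth.
print(available_water_mm(0.30, 0.15, 1.3, 400))   # about 78 mm
print(irrigation_need_mm(0.22, 0.30, 1.3, 400))   # about 41.6 mm
```

In scheduling practice, irrigation is usually triggered before the full available depth is depleted, using a crop-specific allowable depletion fraction.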
10.1.3.3 Soil density
Soil density is the ratio between the mass and the volume of a dry soil sample. It is determined with an Uhland sampler, whose cylinder is inserted at the middle depth of the soil layer explored by the plant roots. Once the cylinder is removed, the sample is prepared and dried in an oven for 24 hours.
Several intelligent irrigation systems using data mining have been developed to define crop needs according to vegetative and climate cycles. In smart agriculture, data mining is used mainly to plan soil and water use, to optimize and reduce the use of natural resources, to limit the use of pollutants (e.g., pesticides, herbicides), to monitor crop health, and to increase the quality of production. Data mining also plays a very important role in ensuring better irrigation management and in computing water consumption from inputs such as crop factors, climatic elements, and economic objectives. Most irrigation systems are built with the primary public health objective of improving human nutrition, but their success in achieving this objective is sometimes reduced by negative impacts on health. Diseases borne by water-related vectors, especially snails and mosquitoes, can increase because the irrigated environment supports these vectors. In addition, the water provided for irrigation is often used for several other purposes, such as cooking, drinking, recreation, and washing.
10.1.3.4 Fuzzy neural network for irrigation management Fuzzy representation is a way to describe the uncertainty associated with concepts, and genetic algorithms help to identify better solutions for data in a particular form. Fuzzy logic makes it possible to manage uncertainty in an intuitive and natural way. The theoretical basis of fuzzy logic was established in 1965 by Professor Zadeh of the University of California, Berkeley. It is a rigorous mathematical theory, adapted to the treatment of all that is subjective and/or uncertain, and it provides a mathematical formalism for modeling human skills. Thus imprecise natural phenomena that are difficult to model in classical
logic will be better understood. Fuzzy systems belong to the class of knowledge-based systems, whose reasoning relies on rules and on membership functions defined over fuzzy sets. The role of fuzzy set theory is growing in the field of data mining. In a data mining process, discovered patterns, learned concepts, or models of interest are often vague. Most approaches to analyzing large datasets assume that the data are accurate and that we are dealing with exact measurements. However, in most real-world scenarios, values are not totally accurate; there is always some degree of uncertainty. Accuracy, efficiency, and completeness are mandatory in data mining models, yet as systems grow more complex there is an ever-increasing demand to keep solutions understandable and simple. In the development of new information processing systems, fuzzy set theory is an efficient way of resolving the tension between fuzziness and accuracy, forming a bridge between high accuracy and the great complexity of data. By formalizing and using fuzzy values in a data mining process, fuzzy set theory can therefore produce more robust, more comprehensible, and less complex models. Peng et al. [3] combined fuzzy logic with a wireless sensor network to build a water-saving irrigation system. The system includes four parts: the cluster of sensor nodes, which collects soil moisture data; the coordinator node, which contains a fuzzy controller that takes two input variables (the soil moisture error and the rate of change of the error) and outputs the watering time; the irrigation controller, which carries out the automatic watering; and the irrigation pipe network. The results showed that the system calculates the required amount of water precisely and quickly. A fuzzy neural network algorithm, named FITRA, has also been introduced to manage irrigation.
The algorithm controls the irrigation system based on sensor data. Several soil moisture sensors are used to minimize water consumption and increase production, and the system adapts automatically to changing environmental conditions. The irrigation system computes evapotranspiration (ET) and net irrigation using a fuzzy inference system. The proposed algorithm accomplishes the following tasks: estimation of ET from meteorological parameters (humidity, radiation, temperature, wind), monitoring of soil moisture, estimation of the quantity of irrigation required according to the reference ET (ET0), calculation of irrigation time, and irrigation scheduling. This system is suitable for drip and sprinkler watering methods. The results showed that the fuzzy model is a fast and efficient tool for calculating ET and the necessary amount of water, which indicates its relevance for water conservation and irrigation management.
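A two-input fuzzy controller of the kind described for the coordinator node (soil moisture error and its rate of change in, watering time out) can be sketched as follows. This is a minimal illustration only: the triangular membership functions, the value ranges, the rule table, and the weighted-average (zero-order Sugeno) defuzzification are all assumptions made here, not details reported for the system of Peng et al. [3] or for FITRA.

```python
def tri(x, a, b, c):
    """Triangular membership function: 0 at a and c, peak of 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets (assumed ranges, not taken from the cited work).
ERROR_SETS = {  # moisture deficit in percentage points, clamped to [0, 20]
    "small": (-10, 0, 10),
    "medium": (0, 10, 20),
    "large": (10, 20, 30),
}
RATE_SETS = {  # change of the deficit per hour, clamped to [-2, 2]
    "falling": (-4, -2, 0),
    "steady": (-2, 0, 2),
    "rising": (0, 2, 4),
}
# Assumed rule table: (error set, rate set) -> watering time in minutes.
RULES = {
    ("small", "falling"): 0, ("small", "steady"): 0, ("small", "rising"): 5,
    ("medium", "falling"): 5, ("medium", "steady"): 15, ("medium", "rising"): 20,
    ("large", "falling"): 20, ("large", "steady"): 30, ("large", "rising"): 40,
}


def watering_time(error, rate):
    """Return the watering time (minutes) for a moisture error and its rate."""
    error = min(max(error, 0.0), 20.0)   # keep inputs inside the modeled range
    rate = min(max(rate, -2.0), 2.0)
    weighted_sum = 0.0
    total_strength = 0.0
    for (e_name, r_name), minutes in RULES.items():
        # Firing strength of a rule: AND of both memberships, taken as the minimum.
        strength = min(tri(error, *ERROR_SETS[e_name]),
                       tri(rate, *RATE_SETS[r_name]))
        weighted_sum += strength * minutes
        total_strength += strength
    # Weighted average of the rule outputs.
    return weighted_sum / total_strength


print(watering_time(15.0, 0.0))
```

Because the membership functions overlap, an input between two sets fires two rules at once and the output is interpolated between their watering times, which is what lets such a controller respond smoothly rather than switching at hard thresholds.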
10.1.3.5 Forecast the crop yield One of the major concerns of farmers is crop yield prediction, together with the influence of environmental parameters on productivity, which in turn affects market policy. Crop yield prediction is very important for global food production. Policy makers rely on accurate predictions to make timely export and import decisions that strengthen national food security. Accurate forecasting of crop yield during the cropping season has a major impact on government policies. Crop yield estimation is utilized
for several purposes, such as delivery estimation, crop insurance, harvest planning, and storing crop production for cash flow. In addition, crop yield estimation monitors crop growth during the crop period, and the leaf area index, harvest index, and maturity date are also computed. Seed companies need to predict the performance of new hybrids in several environments in order to breed better varieties. Growers and farmers also benefit from yield prediction when making informed management and financial decisions. However, crop yield prediction is very challenging because of several complex factors. For instance, genotype information is usually represented by high-dimensional marker data, with many thousands to millions of markers for every individual plant. The effects of the genetic markers must be estimated, and they interact with several environmental conditions and field management practices. Agricultural production is also affected significantly by various environmental factors: the weather influences crop growth and causes large intraseasonal yield variability. Additionally, crop agronomic management, which includes fertilizer application, planting, tillage, and irrigation, is used to offset the yield loss caused by the weather. As a result, yield forecasting plays an important role in optimizing crop yield and in computing crop-area insurance contracts. The issues faced by yield prediction are addressed by data mining methods. Some of the data mining methods utilized in crop yield forecasting are as follows.
• Decision trees (DT)
• Support vector machine (SVM)
• Neural networks (NN)
• Random forests (RF)
• K-nearest neighbors (KNN)
• Markov model (MM)
• Bayesian networks
10.1.3.5.1 DT method for predicting climate parameters DTs are also known as regression or classification trees. They present the model in a tree structure, in which the classification rules are applied through tests organized in the tree. DTs are easy to understand and assimilate because of their visual presentation. The DT method is utilized for predicting the influence of several climatic parameters. Veenadhari et al. [4] designed a software tool, named Crop Advisor, for predicting the influence of climatic parameters, such as potential ET, wet day frequency, minimum and maximum temperature, cloud cover, and precipitation, on crop yields. The C4.5 algorithm is used to identify the most influential climatic parameters for the selected crops. Other parameters that affect crop yield are not considered in this tool because their application varies between individual fields in time and space. In addition, an intelligent model has been introduced for recommending the best crop sequence and production. However, with real-time soil sampling, the farmer failed to meet current fertilizer requirements at minimal fertilizer cost.
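Tree learners of this family rank candidate parameters by how much a split on each one reduces the uncertainty of the target. The sketch below computes ID3-style information gain (C4.5 refines the same idea with a gain ratio) on a tiny, entirely hypothetical table of categorized climate records; the attribute names, categories, and yield labels are invented for illustration and do not come from Crop Advisor or any cited dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in target entropy obtained by splitting rows on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical records: categorized climate parameters and a yield class.
records = [
    {"rainfall": "high", "temperature": "mild", "yield": "high"},
    {"rainfall": "high", "temperature": "hot", "yield": "high"},
    {"rainfall": "low", "temperature": "mild", "yield": "low"},
    {"rainfall": "low", "temperature": "hot", "yield": "low"},
    {"rainfall": "high", "temperature": "mild", "yield": "high"},
    {"rainfall": "low", "temperature": "hot", "yield": "low"},
]

gains = {a: information_gain(records, a, "yield")
         for a in ("rainfall", "temperature")}
most_influential = max(gains, key=gains.get)
print(most_influential, gains)
```

The attribute with the largest gain becomes the root test of the tree, and the same ranking is repeated on each branch, which is why the resulting tree can be read as an ordering of parameters by influence.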
The DT method is also utilized for predicting the influence of climatic parameters on soybean productivity. Here, the ID3 algorithm is introduced based on two assumptions, in which relative humidity, along with precipitation and temperature, is considered for finding the influential parameter on the yield. However, the method failed to predict the quantity of the crop. The soil moisture in Cojocna is predicted with nine algorithms: NN, SVM, KNN, rule induction, logistic regression, fast large margin, linear regression, DT, and RF. This prediction helps the farmer to take the right decision and avoid crop damage, with improved accuracy.
10.1.3.5.2 Artificial neural networks for wheat yield forecasting Deep learning is a subfield of machine learning based on learning multiple levels of representation. It employs algorithms that model high-level abstractions in the data, typically artificial neural networks (ANNs), that is, NNs with hidden layers. Deep learning has received considerable attention in agriculture for several tasks, including the detection of crop diseases. To limit productivity losses caused by farmers’ illiteracy or lack of experience, a system has been introduced to help farmers select the right crops for their land. This prediction system uses an ANN that considers seven parameters for determining productivity: phosphate, pH, nitrogen, potassium, precipitation, temperature, and depth. In addition, the system reports the expected productivity of the desired crop and the necessary fertilizers.
10.1.3.6 Estimation of the precise amount of water and suggestion of the necessary fertilizers For the last two years, N fixation by crops has been measured using several data mining techniques. The classical approach computes fixation by subtracting the N contained in a nonfixing crop from the N contained in the legume crop, under the assumption that both crops obtain the same quantity of N from fertilizer and soil. Nowadays, the sensitive and rapid acetylene reduction method is also used. However, this method is difficult to apply for estimating N fixation over a growing season because of uncertain conversion ratios of C2H2 reduced to N2 fixed, large diurnal and longer term changes in the rate of enzyme activity, and difficulties in recovering and assaying total root nodule systems. The differential dilution in the plant of 15N-labeled fertilizer by soil and fixed nitrogen has therefore been introduced, which overcomes these difficulties and offers the best estimate of seasonal N fixation.
10.1.3.7 Prediction of the irrigation events Irrigation water demand is highly variable and depends on the farmers’ decisions about when to irrigate. These decisions affect the performance of the irrigation networks. An accurate daily prediction of the occurrence of irrigation events at farm scale is a key factor in improving the management of irrigation districts and, consequently, the sustainability of irrigated agriculture. Several prototype systems have been introduced for variable-rate irrigation application, but they failed to include adequate decision support
systems. In addition, decision and control systems have been developed to increase the practical functionality of real-time monitoring and precision irrigation. Previous irrigation decision support systems rely on direct crop stress measurement, soil moisture measurement, a modeling approach, or a combination of these methods. The modeling approaches use a soil water balance and climatic data to predict the availability of water to the crop. Their inherent weakness is the quality of the input data, which may or may not incorporate any real-time site-specific measurements. Rainfall data for the water balance component are also often taken from regional data. Several models have been developed for predicting irrigation events and advance times; most of them are based on numerical solutions of overland flow models, such as zero inertia and kinematic wave models, or on a detailed volume balance. Moreover, a few analytical equations have been derived to describe the advance phase and the other phases. Analytical solutions are useful for solving practical issues that are overlooked by practitioners.
10.1.3.8 Minimization of irrigation cost A water, energy, and cost optimization algorithm has been introduced for reducing the irrigation cost together with energy cost and consumption. This approach integrates the water and energy-based sectoring operation for multiple supply sources (WEBSOM) algorithm and the water and energy optimization by critical points control for multiple supply sources algorithm. The WEBSOM approach minimizes both energy consumption and pressure deficits at hydrants by operating pressurized multisource irrigation networks in sectors. In addition, this framework determines the number of sectors that operate in each month and the optimal pressure head at the pumping stations. The same procedure is followed for WEBSOM. In general, high pressure requirements at hydrants result in maximal energy cost, which in turn increases the total cost of the project. Several mathematical programming methods have therefore been introduced for the dimensioning of pressurized irrigation networks to reduce the total cost of the network. Among them, Labye’s discontinuous method, the Lagrange multipliers method, linear programming, dynamic programming, and other heuristic and semitheoretical methods have been developed for minimizing the irrigation cost.
10.1.3.9 Accurate suggestion of plants for the soil Different crops need different types of soil, different types and amounts of nutrients, and different types and amounts of water. The amount of water required by the plant also depends on the growing season and the climate where it is grown. Uptake of water and mineral nutrients from the soil is essential for plant life. The plant root system achieves these fundamental functions through its highly responsive and plastic morphology, which allows the plant to adjust to and exploit the wide spectrum of physical and chemical soil properties. Root system architecture (RSA), the geometry of different parts of the root system, is the overall output of growth and development of individual cells and tissues within the root. RSA feeds back into plant growth and development, at the very least by determining the rate
of water and nutrient uptake. By selecting the right crop for the given soil conditions and climate, one can optimize yields and reduce the water required for irrigation. The types of soil used for growing crops are listed below.
• Alluvial soils
• Black soils
• Laterite soils
• Mountain soils
• Red and yellow soils
• Other soils
10.1.3.9.1 Alluvial soils This kind of soil is common in Northern India, especially in delta regions. It is deposited by rivers and is rich in some nutrients, such as potash and humus, but lacks phosphorus and nitrogen. In addition, this soil is sandier and drains more quickly than other soils. Wheat, rice, sugarcane, jute, and cotton all grow well in these soils.
10.1.3.9.2 Black soils These soils get their color from several salts or from humus. They contain a large amount of clay but are still sandy in hillier regions. In addition, they contain moderate amounts of phosphorus and little nitrogen. This kind of soil is used for growing wheat, rice, cotton, and sugarcane, as well as millet, groundnut, and oilseeds.
10.1.3.9.3 Laterite soils These soils are found in areas with heavy rainfall, particularly near the coasts of India. They are acidic and rich in iron, which gives the soil its red appearance. They are used for growing more tropical crops, such as rubber, cashew, coffee, coconut, and tea.
10.1.3.9.4 Mountain soils Mountain soils are found in the Himalayas and contain significant amounts of organic matter. They are somewhat acidic but suitable for growing coffee, tea, spices, and several kinds of tropical fruits.
10.1.3.9.5 Red and yellow soils These soils get their names from the presence of iron oxide. They are somewhat acidic and sandy, and low in phosphorus and nitrogen. Wheat, rice, millet, ragi, potato, sugarcane, and groundnut grow well in red and yellow soils.
10.1.3.9.6 Other soils There are several other kinds of soil in India, but none of them is appropriate for growing crops. Alkaline and saline soils are too low in nutrients and too high in salt for productive agriculture. Marsh soils are unfit because of their high acidity.
10.1.3.10 Measurement of growth rate Growth is the irreversible increase in mass, volume, or weight of a living organ or cell. It involves both cell enlargement and cell division. Plant growth is observed as an increase in length or stem diameter, volume of tissue, fresh and dry weight, number of cells, leaf weight, and so on. Growth is divided into two types.
• Determinate growth
• Indeterminate growth
10.1.3.10.1 Determinate growth The plant grows to a certain size and then stops growing because of the presence of a terminal inflorescence. After that, it eventually senesces and dies.
10.1.3.10.2 Indeterminate growth In indeterminate growth, the main stem continues to elongate indefinitely without being limited by a terminal inflorescence or other reproductive structure. Growth is a quantitative phenomenon that is measured with respect to time, based on the increase in length, area, or volume. Growth is measured in terms of the parameters given as follows.
• Fresh weight
• Dry weight
• Length
• Area
10.1.3.11 Growth rate analysis Growth analysis is the mathematical expression of environmental effects on the development and growth of crop plants. Growth rates are expressed in two forms, linear and compound. The linear rate is computed by fitting a straight line to the index numbers, taking the slope of the line, and converting it to a percentage by dividing by the average index of the first three years. The compound rate is estimated by fitting a straight line to the logarithms of the index numbers; the slope of this line gives the constant percentage rate at which the fitted line changes. Because of the standardizing procedure used for linear rates, they are higher than the compound rates for all series that show growth. For all-India crop aggregates, the linear percentage rates exceed the compound rates by 22.8% for food grains, 22.9% for all crops, and 21.6% for nonfood grains. Growth analysis is also a useful tool for studying the complex interactions between plant growth and the environment. Some of the common parameters used in growth analysis are listed below.
• Crop growth rate (CGR)
• Absolute growth rate (AGR)
• Relative growth rate (RGR)
• Growth index
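The linear and compound rates described above can be sketched with an ordinary least-squares line fit. The code below is a minimal illustration under stated assumptions: years are numbered 0, 1, 2, ..., natural logarithms are used for the compound rate, the compound percentage is taken as exp(slope) − 1, and the index series is synthetic (5% growth per year), not data from the chapter.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def linear_growth_rate(index):
    """Slope of the fitted line as a percentage of the first three years' mean."""
    years = list(range(len(index)))
    slope, _ = fit_line(years, index)
    base = sum(index[:3]) / 3
    return 100.0 * slope / base


def compound_growth_rate(index):
    """Constant yearly percentage rate from a line fitted to log(index)."""
    years = list(range(len(index)))
    slope, _ = fit_line(years, [math.log(v) for v in index])
    return 100.0 * (math.exp(slope) - 1.0)


# Synthetic index series growing by 5% per year (illustrative only).
series = [100.0 * 1.05 ** t for t in range(10)]
linear = linear_growth_rate(series)
compound = compound_growth_rate(series)
print(round(linear, 2), round(compound, 2))
```

On a steadily growing series the linear rate comes out above the compound rate, because the slope is divided by the low early-years average; this matches the direction of the difference reported for the all-India aggregates.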
10.1.3.11.1 Crop growth rate This method was suggested by Watson in 1952. It indicates the change in dry matter accumulation over a period of time. The CGR is closely related to the interception of solar radiation. The CGR expression is given by

$$\mathrm{CGR} = \frac{\varpi_2 - \varpi_1}{\alpha\,(\tau_2 - \tau_1)} \qquad (10.1)$$

where ϖ1 and ϖ2 indicate the whole plant dry weight at times τ1 and τ2, and α denotes the ground area.
10.1.3.11.2 Absolute growth rate AGR measures the absolute change in biomass between two points in time. It is generally employed for a single plant or a single plant organ.

$$\mathrm{AGR} = \frac{g_2 - g_1}{\tau_2 - \tau_1}\ \mathrm{cm\ day^{-1}} \qquad (10.2)$$

where g1 and g2 represent the plant height at times τ1 and τ2.
10.1.3.11.3 Relative growth rate RGR was first modeled by Blackman in 1919 and quantifies the speed of plant growth. It is the most commonly used measure of crop growth. The RGR equation is given by

$$\mathrm{RGR} = \frac{\log_a \varpi_2 - \log_a \varpi_1}{\tau_2 - \tau_1}\ \mathrm{cm\ day^{-1}} \qquad (10.3)$$
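Eqs. (10.1) to (10.3) translate directly into three small functions. The sketch below is illustrative: the function and argument names are chosen here, the logarithm in Eq. (10.3) is assumed to be the natural logarithm because the text does not state the base, and the sample measurements are invented.

```python
import math


def crop_growth_rate(w1, w2, t1, t2, ground_area):
    """Eq. (10.1): dry weight gain per unit ground area per unit time."""
    return (w2 - w1) / (ground_area * (t2 - t1))


def absolute_growth_rate(g1, g2, t1, t2):
    """Eq. (10.2): change in plant height per unit time."""
    return (g2 - g1) / (t2 - t1)


def relative_growth_rate(w1, w2, t1, t2):
    """Eq. (10.3), assuming natural logarithms of the dry weights."""
    return (math.log(w2) - math.log(w1)) / (t2 - t1)


# Invented measurements taken on day 20 and day 30 after sowing.
cgr = crop_growth_rate(w1=10.0, w2=30.0, t1=20, t2=30, ground_area=0.5)
agr = absolute_growth_rate(g1=12.0, g2=20.0, t1=20, t2=30)
rgr = relative_growth_rate(w1=10.0, w2=30.0, t1=20, t2=30)
print(cgr, agr, rgr)
```

CGR and AGR are differences over time, so they scale with plant size, whereas RGR is a difference of logarithms and therefore expresses growth relative to the weight already present.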
10.1.3.11.4 Growth index The growth index is defined as the measure of long-term changes in the plant community condition based on satellite measurement of peak annual normalized difference vegetation index.
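The text gives no formula for the growth index itself, so the sketch below only computes the quantity it is based on, the peak annual NDVI. It uses the standard definition NDVI = (NIR − Red) / (NIR + Red); the reflectance values and the dictionary layout are hypothetical.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def peak_annual_ndvi(observations):
    """Map each year to the highest NDVI among its (nir, red) observations."""
    return {year: max(ndvi(nir, red) for nir, red in pairs)
            for year, pairs in sorted(observations.items())}


# Hypothetical satellite reflectances for one site, two observations per year.
site = {
    2019: [(0.5, 0.1), (0.6, 0.2)],
    2020: [(0.4, 0.2), (0.45, 0.15)],
}
peaks = peak_annual_ndvi(site)
print(peaks)
```

Comparing the peak values from year to year gives the long-term trend in vegetation condition that the growth index is meant to capture.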
10.1.3.12 Water body prediction for better crop field Contamination of groundwater and surface waters (SWs) by pesticides has been detected at various sites in Europe. The potential for water body contamination is high in several areas in which rice is cultivated under flooded conditions. Irrigation increases the likelihood of pesticide transport to SW through drainage or runoff, and to groundwater through leaching where aquifers are not confined by impermeable soil layers. Fuzzy logic is a commonly employed method for water body prediction aimed at improving the crop field. In addition, NNs are employed for predicting the water body to achieve a better crop field.
10.2 Disease prediction using data mining Plants are affected by various diseases during their growth, and detecting these diseases is a major goal of many studies. The emergence of various crop-related diseases affects agricultural productivity. To address this issue and raise farmers’ awareness, effective disease prediction using data mining methods has been introduced. Moreover, detecting symptoms and monitoring plant health early are very important for reducing the spread of disease, which helps farmers with effective management practices and increases productivity. These methods combine data mining techniques and image preprocessing to overcome the limits of human observation and to reduce costs. As the world moves toward advanced technologies, it is a paramount goal for agriculture to follow the same trend. Much research has been completed in the field of agriculture. The collected data provide information about assorted environmental factors. However, observing the environmental factors alone is not sufficient to raise crop yield; a number of other factors reduce the yield to a great extent. Thus disease diagnosis in crops plays a major role in preserving agricultural productivity. Some of the disease prediction tasks in agriculture data are as follows:
• Crop disease prediction
• Rice plant disease prediction
• Leaf disease prediction
• Plant disease prediction
10.2.1 Crop disease prediction Every country has its own agricultural production challenges because of changes in major resources, such as weather, soil, light, and water. Crop disease is a major issue for modern agriculture and always causes huge losses in crop production. It is estimated that 16% of the harvest of the eight most important crops is lost every year because of plant diseases that manifest during preharvest and postharvest treatments. Agricultural emergency management is used to reduce such losses by optimizing the frequency and timing of the application of fungicides, pesticides, and other preventative measures, and to ensure environmental and ecological safety by reducing excessive chemical application. Hence, automatic and precise diagnosis and prediction of crop diseases plays a crucial role in ensuring maximal yield and quality of crops. As crop diseases bring large losses each year in both developing and developed countries, predicting crop disease severity to facilitate agricultural emergency management is a worldwide issue. Sugarcane is a crop that is affected by a number of pathogens or diseases, such as red rot, grassy shoot, smut, and red stripe. Crop disease detection plays an important role in the development of crops. At present, farmers mostly detect diseases with the naked eye. This requires continuous monitoring and observation, which is time-consuming and expensive. In some regions, farmers need to travel to several places to get guidance
from experts. Therefore automatic prediction is very important for detecting various crop diseases, as it provides real-time benefits to farmers and saves money, time, and the life of the crop.
10.2.2 Rice plant disease prediction Rice is one of the main food crops and is important for social stability, agricultural security, and national development. Its scientific name is Oryza sativa, and it is very important all over the world. Rice is grown across Asia, including the Philippines, where it is the major food source for several million Filipinos. Rice diseases occur frequently nowadays and cause serious losses in rice production. In general, diseases diminish the ability of the rice plant to produce food. The diseases of rice plants are readily recognized by their symptoms, such as the visual features on the leaves of a rice plant. The diseases that occur in rice plants include Rice Blast, Bakanae, Bacterial Leaf Streak, Stem Rot, Bacterial Blight, Tungro, Brown Spot, Leaf Scald, and False Smut. Detecting rice diseases during cultivation is very challenging. Rice plants are damaged by biotic and abiotic factors. The biotic factors are categorized as bacterial, fungal, viral, and animal, whereas the abiotic factors include cold, drought, deficiency, heat, and so on. Traditionally, farmers observe the abnormalities in rice plants and manage the diseased plants on the basis of personal experience and knowledge. Some farmers consult experts, such as agronomists and experienced farmers, for advice. Rice plants are affected by lighting conditions, nutrients, climatic conditions, humidity, water management, fertilizer, and farming conditions. Manual detection of rice diseases is time-consuming and has low recognition accuracy; hence, improper diagnosis and misuse of pesticides may occur. Rice Leaf Blast (RLB) disease is one of the biotic stresses that contribute to the reduction of rice productivity worldwide. RLB is mainly caused by the fungus Pyricularia oryzae Cavara, which attacks rice plants and diminishes rice quality. 
This disease can attack all stages of rice growth, from the seedling and vegetative stages to the harvesting stage. Therefore detecting and classifying rice plant diseases at an early stage, before they become severe, is essential. Detecting rice leaf disease requires knowledge of plant diseases and additional time. Incorrect identification of plant disease leads to losses of money, time, and product quality. Hence, farmers must monitor the rice plants closely for successful cultivation. Detection of rice plant disease by physical inspection needs expert advice to reduce errors. Plants are classified on the basis of size, color, and so on. The classification of rice disease is therefore essential for assessing agricultural products, meeting quality standards, and increasing market value. An SVM was developed for predicting wheat stripe rust in China, using two datasets: meteorological data and disease data. In addition, a convolutional neural network (CNN) was introduced for mapping paddy rice based on satellite data. The CNN classifier is compared with SVM and RF for predicting rice plant disease.
10.2.3 Leaf disease prediction Leaf diseases are caused by pathogens and severely affect crop yield. An infected leaf is identified by its symptoms; for example, the infection pattern may be a leaf rotted by bacteria. Blast is a fungal disease caused by the organism Magnaporthe oryzae. The leaf diseases of wheat include leaf rust, also called brown rust because its rust pustules are brown, and stripe rust, also called yellow rust because its rust pustules are yellow. These diseases are usually controlled with bactericides, fungicides, and resistant varieties. Other diseases that occur in leaves are leaf spot, leaf blights, rusts, and powdery mildew. Another important leaf disease is little leaf of brinjal. It is caused by a virus and, as the name suggests, mostly affects newly developing leaves, which remain small. In severe conditions, the entire plant may be stunted, with small leaves throughout the plant. CNN models have been used to detect diseases from images of tomato leaves. The models used are AlexNet and SqueezeNet, and the system is designed for use on mobile devices. Cucumber leaf diseases have been detected using a CNN-based system, which uses square crop and square deformation strategies as preprocessing steps for better prediction. A hidden Markov model (HMM) was introduced to detect grape diseases. This approach analyzes input data, such as relative humidity, temperature, and leaf wetness, and then classifies the grape diseases as anthracnose, rust, bacterial leaf spot, powdery mildew, downy mildew, and bacterial leaf cancer. A complementary strategy based on image retrieval of tomato leaves is developed in [5] to reinforce disease diagnosis. 
This framework uses the Color Structure Descriptor, which captures color and spatial information, and the Nearest Neighbor algorithm for the retrieval task. The latter retrieves images similar to a query image. In the paper, six tomato diseases were examined: early blight, chlorosis, sooty molds, powdery mildew (PM), necrosis, and whiteflies. Based on the experimental results, the proposed approach can partially provide specialized information for diseases such as chlorosis, early blight, and sooty molds. On the other hand, it performs poorly for powdery mildew and whiteflies.
10.2.4 Plant disease prediction Plant pathogens mostly develop because of environmental factors. They affect plant leaves, roots, or the crop itself. Plant disease diagnosis is crucial for crop management and production. It can be achieved through visual observation of alterations in plant leaves by scouting specialists, but this requires a high degree of experience and specialization. Visual identification of plant disease is laborious, less accurate, and feasible only in limited areas, whereas automatic detection takes less effort and less time and is more accurate. Some general plant diseases are brown and yellow spots and early and late scorch; others are fungal, viral, and bacterial diseases. Image processing is the technique used for measuring the affected area of disease and
to determine the difference in the color of the affected area. An intelligent system based on multiple sensors has been introduced for detecting diseases or plant stresses. It uses two classifiers, Quadratic Discrimination and NNs, and employs multispectral and hyperspectral information together with GPS. In addition, site-specific spraying is applied to prevent fungal diseases [6]. Deep CNN models have been used to identify plant diseases from leaf images. These models are AlexNet, AlexNetOWTBn, GoogleNet, Overfeat, and VGG. Several crops and diseases have been integrated in the model; the system covers 25 plants, with 58 different classes of healthy and diseased plants. A robotic disease detection system has been presented for finding different plant diseases. Several algorithms based on principal component analysis and the coefficient of variation are used to detect Tomato Spotted Wilt Virus and PM, which threaten greenhouse bell peppers. The system has difficulty reaching the required detection pose. In Ref. [7], an HMM is introduced for the early detection of grape diseases. This model analyzes input data, such as relative humidity, temperature, and leaf wetness, and classifies grape diseases, such as anthracnose, PM, rust, downy mildew, bacterial leaf spot, and bacterial leaf cancer. The results showed that the HMM is highly accurate compared with the statistical method used for detection.
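How an HMM can turn a sequence of weather readings into a sequence of likely plant states can be sketched with the Viterbi algorithm. Everything specific below is an assumption made for illustration: the two hidden states, the discretization of weather into "dry" and "wet" observations, and all probabilities are invented, and the model in Ref. [7] distinguishes several grape diseases rather than a single infected state.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t,
    #               state that precedes s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            row[s] = max((best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                         for p in states)
        best.append(row)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical model: hidden plant state, observed daily leaf wetness class.
STATES = ("healthy", "infected")
START = {"healthy": 0.8, "infected": 0.2}
TRANSITION = {
    "healthy": {"healthy": 0.7, "infected": 0.3},
    "infected": {"healthy": 0.1, "infected": 0.9},
}
EMISSION = {
    "healthy": {"dry": 0.8, "wet": 0.2},
    "infected": {"dry": 0.3, "wet": 0.7},
}

decoded = viterbi(["dry", "wet", "wet", "wet"], STATES, START, TRANSITION, EMISSION)
print(decoded)
```

With these invented numbers, a run of wet days after a dry day is decoded as a switch from the healthy to the infected state, which is the kind of early warning the weather-driven approach aims to provide.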
10.3 Pest monitoring using data mining

Monitoring for pests is a fundamental first step in creating a proper integrated pest management (IPM) program. Pests are monitored with a variety of tools, such as pheromone traps, light traps, colored sticky traps, pitfall traps, and suction traps. The trap capture data serve several purposes: (i) ecological studies (Crummay and Atkinson; Hirao), (ii) tracking insect migration, (iii) timing of pest arrivals into agroecosystems, (iv) initiating field scouting and sampling procedures, (v) timing of pesticide applications, (vi) setting the starting date or biofix for phenology models, and (vii) predicting later generations based on the size of earlier generations.

The effect of pests and diseases on crops depends on various factors, such as the amount of food supply, weather conditions, the presence or absence of natural enemies, and the availability of intermediary carriers. Climate and weather conditions are very significant in identifying the actual epidemiology of an outbreak of diseases or pests. Hence, it is important to know the impact of weather on the pests that cause diseases and low productivity, and to take appropriate measures to control them. Apart from causing direct damage to the crop, pests are well known to cause indirect damage by transmitting viral diseases such as Peanut Yellow Spot Virus and Groundnut Bud Necrosis Virus. The important weather parameters that influence pest and disease distribution are wind speed, rainfall, temperature, humidity, sunshine hours, and leaf wetness. It is therefore essential to develop a weather-based predictive model to control the effects of pests on groundnut crops. In the present situation, where pests have become resistant to pesticides and pesticides are costly, there is a need to develop a prewarning model that alerts farmers to the diseases caused by pest attacks. This is essential for obtaining high productivity. Thanks to technological advances, enormous amounts of agricultural data are collected and stored in databases, and by applying appropriate data mining techniques to the underlying data, effective predictive models can be constructed [8].

Precision agriculture practices can significantly reduce the amount of nutrients and other crop inputs used while boosting yields. Farmers thus obtain a return on their investment by saving on fertilizer costs. Applying the right amount of inputs in the right place and at the right time benefits crops, soils, and groundwater, and thus the entire crop cycle. Sustainable agriculture seeks to assure a continued supply of food within the ecological, economic, and social limits required to sustain production in the long term. Precision agriculture therefore seeks to use high-tech systems in pursuit of this goal.
In the present study, the wireless mote-based Agrisens distributed sensing system was used to collect micro-level weather parameters (temperature, rainfall, humidity, and sunshine). Data mining techniques were applied to these data to uncover the hidden correlations and patterns between pests or diseases and the weather. Finally, one week of data was used to develop a predictive model for thrips, on the basis of which a Decision Support System (DSS) for multiseason data can be developed [8].
10.3.1 Pest control methods

Chemical pest control is still the predominant type of pest control today. Some pesticides target specific insects, rodents, weeds, or fungi, whereas others are broad and manage a wide range of unwanted organisms. There is a move to reduce the use of pesticides in favor of more environmentally friendly methods of pest control. Although pesticides, that is, chemicals that kill or manage the population of pests, have been used for many years, several alternative pest control methods have been developed.

One common alternative is biological control, in which natural predators of the pest are introduced to prey on or parasitize the pest. Biological control as a management tool dates back over 1,000 years, when ancient Chinese citrus growers used ants to control caterpillar larvae infesting their trees. It is one of the safest methods of control, since it is not toxic, pathogenic, or injurious to humans. With this method, farmers obtain natural predators of the pest and release them into their fields so that the predators can manage the pest population. Although biological control can be very beneficial, it can also cause problems for the environment. Sometimes, biological control organisms begin to take over an environment and harm nonpest organisms. Once biological control organisms are introduced into the environment, they are almost impossible to remove. A further disadvantage is that, if the organisms act differently than expected, they may do more damage than the pests they are supposed to eliminate.

Natural chemical control is another alternative method of pest management; it uses chemical compounds found in the environment to manage pests. The most commonly used natural chemicals are pheromones and hormones, which are specific to the pest species being targeted and have limited influence on other species [7].

In addition to these specific alternative methods, a complex management system known as IPM has been developed. IPM is an efficient and environmentally sensitive approach to pest control that relies on a mixture of common pest control practices. IPM programs use current, comprehensive information on the life cycles of pests and their growth and interaction with the environment. This information, in combination with available pest control methods, is used to manage pest damage in the most economical way and with the least feasible risk to people, property, and the environment. IPM is not a single pest control method but, rather, a series of pest management evaluations, decisions, and controls. IPM first sets an action threshold, the point at which pest populations or environmental conditions indicate that pest control action must be taken. The level at which pests become an economic threat is critical for guiding future pest control decisions.
Not all insects, weeds, and other living organisms require control. Many organisms are harmless, and some are even beneficial for crops. IPM programs monitor for pests and identify them accurately, so that suitable control decisions can be made in conjunction with the action threshold. As a first line of pest control, IPM programs manage the crop so as to prevent pests from becoming a threat.
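The action-threshold idea described above can be sketched in a few lines: trap counts are compared with a threshold, and only the weeks that reach it are flagged for control action. The function name, the weekly counts, and the threshold value are all hypothetical; real thresholds are set per pest, crop, and region.

```python
# Hypothetical sketch of an IPM action-threshold check on weekly trap counts.

def weeks_requiring_action(trap_counts, action_threshold):
    """Return the (1-based) weeks whose trap count reaches the threshold."""
    return [week for week, count in enumerate(trap_counts, start=1)
            if count >= action_threshold]

weekly_counts = [2, 5, 9, 14, 7]   # made-up captures per trap per week
threshold = 8                      # made-up action threshold
print(weeks_requiring_action(weekly_counts, threshold))
```

With these made-up numbers, only weeks 3 and 4 are flagged, which mirrors the point that not every observed pest population requires control.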
10.3.2 NNs algorithm for pest monitoring

Boniecki et al. [9] developed a neural classifier to identify six pests in apple orchards: apple blossom weevil, apple leaf sucker, apple moth, codling moth, apple clearwing, and apple aphid. The classifier is based on 23 parameters, comprising 16 color characteristics and 7 form factors. Good results were achieved with the multilayer perceptron NN topology. Different deep CNN models have been used to detect weeds in turfgrass from images taken with a digital camera. These models are VGGNet, GoogLeNet, and DetectNet. The results showed that deep CNNs are highly suitable for weed detection. A CNN combined with a crop-line detection algorithm has been introduced to identify weeds in beet, spinach, and bean fields, using high-resolution images of the vegetable fields taken by drones at about 20 m. The best accuracy was reached in the beet field. The paper mentions difficulties in detecting weeds, especially at an early stage of plant growth or when weeds grow near the crop.
10.3.3 RF algorithm for pest monitoring

An intelligent system was developed and applied to a case study of pest population dynamics. It describes a new decision support system for agriculture, called AgroDSS. The decision-making aspect of the proposed system relies on data mining approaches, which can extract useful information from big data. The implemented tools include supervised learning (predictive modeling), unsupervised learning (clustering), and time series analysis. The pest data, collected by Trap View, allow pests in vineyards and orchards to be monitored through insect traps. To distinguish crops from weeds, a new system was introduced that detects sugar beet plants and extracts features to obtain an accurate crop and weed estimate by combining RF classification with neighboring information exploited through a Markov random field. The proposed approach was implemented and evaluated on a real farm robot in three different sugar beet fields. By automatically separating the two categories of plants, a robot can perform the necessary spraying or remove the weeds.

10.3.4 Naive Bayes algorithm for pest monitoring

Viani et al. [10] introduced fuzzy logic to predict the dose of pesticides to be applied. It takes into account weather data (soil moisture and soil temperature), the risk of infection calculated from the environmental conditions and phenological development, and the stage of development of the plant. Fuzzy logic is also used for the management of weeds, pests, and diseases. Data mining techniques have been used to understand the relationship between disease (BNV) or parasites and meteorological data in India's groundnut crops. Gaussian Naive Bayes and Rapid Association Rule Mining are used for classification, association, and correlation analysis. A multivariate regression model is then developed to obtain an empirical prediction model.
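To make the Gaussian Naive Bayes step more concrete, the following is a minimal sketch of the classifier written with the Python standard library only. It is not the implementation used in the cited study; the two weather features, the risk labels, and all numbers are invented for illustration.

```python
import math
from collections import defaultdict

def fit(samples, labels):
    """Estimate, per class, the prior and the mean/variance of each feature."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small term avoids division by zero
        model[y] = (len(rows) / len(samples), stats)
    return model

def predict(model, x):
    """Return the class with the highest log posterior under the Gaussian model."""
    def log_posterior(prior, stats):
        total = math.log(prior)
        for value, (mean, var) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=lambda y: log_posterior(*model[y]))

# Made-up training data: (temperature in deg C, relative humidity in %)
X = [(24, 60), (25, 62), (26, 58), (31, 85), (32, 88), (30, 90)]
y = ["low", "low", "low", "high", "high", "high"]   # hypothetical pest-risk labels
model = fit(X, y)
print(predict(model, (31, 86)))
print(predict(model, (25, 61)))
```

The "naive" part is the assumption that the weather features are independent within each class, which is why the per-feature log densities are simply added.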
Some cases of use of data mining techniques for the prediction of pests in crops.
10.4 Summary

The agricultural sector has become an important application area for IT, using the global positioning system together with technological advances in data aggregation and sensors. Agricultural companies now harvest a growing amount of data. The large amount of information about crop and soil properties enables higher operational efficiency, provided that suitable techniques are applied to find the information in the datasets. This chapter has therefore discussed data mining techniques in agriculture. Several agricultural problems that arise in crop control systems were also discussed, such as pest and disease management, the planning of inputs like pesticides and water, and yield prediction.
References

[1] Kukar M, Vračar P, Košir D, Pevec D, Bosnić Z. AgroDSS: a decision support system for agriculture and farming. Comput Elect Agric 2019;161:260–71.
[2] Matei O, Rusu T, Petrovan A, Mihuţ G. A data mining system for real time soil moisture prediction. Proced Eng 2017;181:837–44.
[3] Peng X, Wang P. An improved multi-objective genetic algorithm in optimization and its application to high efficiency and low NOx emissions combustion, In: Proceedings of Asia-Pacific Power and Energy Engineering Conference, March 2009, pp. 1–4.
[4] Veenadhari S, Misra B, Singh CD. Machine learning approach for forecasting crop yield based on climatic parameters, In: Proceedings of International Conference on Computer Communication and Informatics, January 2014, pp. 1–5.
[5] Baquero D, Molina J, Gil R, Bojacá C, Franco H, Gómez F. An image retrieval system for tomato disease assessment, In: Proceedings of XIX Symposium on Image, Signal Processing and Artificial Vision, September 2014, pp. 1–5.
[6] Ferentinos KP. Deep learning models for plant disease detection and diagnosis. Comput Elect Agric 2018;145:311–18.
[7] Patil SS, Thorat SA. Early detection of grapes diseases using machine learning and IoT, In: Proceedings of Second International Conference on Cognitive Computing and Information Processing (CCIP), 2016, pp. 1–5.
[8] Blagojević M, Blagojević M, Ličina V. Web-based intelligent system for predicting apricot yields using artificial neural networks. Sci Horticult 2016;213:125–31.
[9] Boniecki P, Koszela K, Piekarska-Boniecka H, Weres J, Zaborowicz M, Kujawa S, et al. Neural identification of selected apple pests. Comput Elect Agric 2015;110:9–16.
[10] Viani F, Robol F, Bertolli M, Polo A, Massa A, Ahmadi H, et al. A wireless monitoring system for phytosanitary treatment in smart farming applications, In: Proceedings of IEEE International Symposium on Antennas and Propagation (APSURSI), June 2016, pp. 2001–2002.
11 Advanced data mining for defense and security applications

Pramod Pandurang Jadhav
DEPARTMENT OF COMPUTER ENGINEERING, G H RAISONI INSTITUTE OF ENGINEERING AND TECHNOLOGY (GHRIET), PUNE, INDIA
11.1 Military data: properties and analysis

Knowledge discovery in databases has gained significant attention in recent decades, especially in strategic fields, where access to data is often restricted. This chapter describes the application of data mining tools to military databases. Extensive knowledge can be found in the large amount of operational data only if the data can be analyzed jointly. Moreover, the large volume of stored data makes it impracticable for specialists to analyze the data using traditional methods. When the data cannot be analyzed accurately, they become an accumulation without utility. Accordingly, the possibilities of data analysis models, combined with the restrictions of military applications, motivate the use of mining tools to search for new knowledge, for example to improve the antimissile system. Data mining applications thus extend to military systems, where information is gathered and useful implicit data certainly exist.
11.1.1 Data source

The data used for defense and security applications are usually collected from various sources. Some of the data used for mining in the security field are listed as follows:

• Radar data
• Airborne data
• Military communication signal data
• Weapon data
Artificial Intelligence in Data Mining. DOI: https://doi.org/10.1016/B978-0-12-820601-0.00009-4 © 2021 Elsevier Inc. All rights reserved.

11.1.1.1 Radar data

Radar can measure precipitation continuously in space and time within a radius of 200 km. The radar measurement of precipitation is indirect: the raw reflectivity data are transformed into rainfall units using the “Z-R relationship.” Most radar systems are operated with digital processors that not only transform the data into rainfall data but also integrate the data over space and time scales. Moreover, radar data, like other information obtained from remote sources, are typically subject to error because of meteorological and operational variability.
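The Z-R transformation can be sketched as follows. A Z-R relationship is a power law of the form Z = aR^b between reflectivity Z and rain rate R; the chapter does not state which coefficients are used, so the widely quoted Marshall-Palmer values a = 200 and b = 1.6 are assumed here purely as defaults, and the function name is hypothetical.

```python
import math

def dbz_to_rain_rate(dbz, a=200.0, b=1.6):
    """Invert Z = a * R**b for the rain rate R in mm/h.

    dbz is reflectivity in dBZ; Z (mm^6/m^3) is recovered as 10**(dbz / 10).
    The default a and b are assumed Marshall-Palmer coefficients.
    """
    z = 10 ** (dbz / 10)
    return (z / a) ** (1 / b)

for dbz in (20, 30, 40):
    print(f"{dbz} dBZ -> {dbz_to_rain_rate(dbz):.2f} mm/h")
```

Because the coefficients vary with the type of precipitation, they are exposed as parameters rather than hard-coded, which also reflects the meteorological variability mentioned above.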
11.1.1.2 Airborne data

Airborne data exhibit a degree of consistency that depends on the conductivity distribution of the surface, the sampling pattern of the survey, and the physics of the acquisition system. Conventional techniques treat airborne samples independently and fail to capitalize on the spatial coherency of the data. In the holistic model, the concept of a calibration model is adopted, but the parameters of the conductivity model are inverted simultaneously using the airborne data rather than geophysical data.
11.1.1.3 Military communication signal data

As the military adopts network-centric approaches, technology and operations become increasingly significant for monitoring and assessing the performance of teams. There are various challenges in effectively identifying, analyzing, tracking, and reporting team performance in real time in an operational network environment. Most military communication is spoken, and hence an automatic speech recognition (ASR) system is used to convert the speech input into text for processing by the communication analysis system. The communication analysis technology has been tested on ASR input using a number of datasets based on spoken communication.
11.1.1.4 Weapon data

In urban environments, sniper attacks have become a major source of casualties in asymmetric warfare. A countersniper system helps to identify the location of the shooter and the class of the weapon, which reduces the potential danger to civilians and soldiers. The countersniper system uses various physical phenomena related to the weapon, such as visual, electromagnetic, or acoustic signals. Across the wide range of measurement devices, acoustic signals such as the shockwave, surface vibration, and muzzle blast appear to offer the most accurate and convenient way to identify sniper shots.
11.1.2 Data protection strategies

Some of the strategies used to protect information in the defense and security domains are listed as follows:

• Data obfuscation
• Data anonymization
• Data privacy preservation
• Data encryption
11.1.2.1 Data obfuscation

Data obfuscation is the mechanism of hiding private data through the use of ambiguous, false, or misleading data, with the purpose of confusing the adversary. The data obfuscation model acts as a noisy channel between the untrusted observer and the user's secret or private data. Increasing the noise in the channel increases the privacy of the user. The focus here is on the user-centric approach, in which the individual user perturbs the secret data before releasing them. The privacy of a whole database is not the concern; rather, the privacy issues in releasing a sensitive data sample are considered. Consider a mobile user who is concerned about leaking location information through queries. In such a case, data obfuscation is the procedure of randomizing the exact location so that the location-based server receives only a perturbed location. In the data obfuscation model, user privacy and utility are at odds, because the service that the user receives is a function of what the user shares with the service provider. Designing an obfuscation mechanism that protects user privacy at minimal utility cost is a great challenge. Moreover, guaranteeing the privacy of the user with respect to a privacy metric is another complex issue. The three major properties associated with the data obfuscation model are listed as follows:

• Specification
• Shift
• Reversibility
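The location example above can be illustrated with a very small sketch in which the user adds bounded random noise to the true coordinates before sending them to the location-based server. The function name, the coordinates, and the noise bound are hypothetical, and uniform noise is only one simple choice; it carries no formal privacy guarantee.

```python
import random

def obfuscate_location(lat, lon, max_offset_deg, rng):
    """Shift each coordinate by uniform noise of at most max_offset_deg."""
    return (lat + rng.uniform(-max_offset_deg, max_offset_deg),
            lon + rng.uniform(-max_offset_deg, max_offset_deg))

rng = random.Random(0)                 # seeded only to make the sketch repeatable
true_location = (18.5204, 73.8567)     # made-up coordinates
reported = obfuscate_location(*true_location, max_offset_deg=0.01, rng=rng)
print(reported)
```

The single parameter `max_offset_deg` makes the privacy and utility tension visible: a larger bound hides the user better but makes the answers of a location-based service less relevant.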
11.1.2.2 Data anonymization

Data anonymization is an important privacy measure when sharing and releasing sensitive datasets. The Health Insurance Portability and Accountability Act privacy rule provides concrete measures for preventing reidentification. Fuzziness is the degree to which the mechanism blurs the data in order to balance usability and semantics against risk reduction. K-anonymity is a well-known and well-understood privacy criterion that is defined on the quasi-identifiers. These attributes are useful for analysis, but they carry the risk of reidentification. A dataset is said to be k-anonymous when each data item cannot be distinguished from at least k − 1 other data items with regard to the quasi-identifiers. K-anonymity is introduced to counter linkage attacks, which can lead to identity disclosure when accessible data are combined with the background knowledge of the attacker.

Anonymization is a technique with which enterprises can increase data security in the public cloud while analyzing the data. Data anonymization is the procedure of modifying data that are published or used so that the original data cannot be identified. With the data anonymization process, confidential data can be obscured so that data privacy is maintained. The anonymized data can then be stored and accessed in the cloud more securely, and the results can finally be matched with the original data securely. Some of the techniques used in data anonymization are as follows:
• Hiding model
• Hashing approach
• Permutation
• Truncation
• Apriori algorithm
• Top-down approach named partition algorithm
• Local recoding anonymization
• Hierarchy-based generalization method
• Privacy constrained clustering-based transaction anonymization model
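The k-anonymity criterion defined above can be checked mechanically: group the records by their quasi-identifier values and verify that every group has at least k members. The following sketch assumes records are plain dictionaries; the attribute names and the generalized values are invented for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Made-up records whose age and zip code have already been generalized.
records = [
    {"age": "20-29", "zip": "411**", "unit": "A"},
    {"age": "20-29", "zip": "411**", "unit": "B"},
    {"age": "30-39", "zip": "412**", "unit": "A"},
    {"age": "30-39", "zip": "412**", "unit": "C"},
]
print(is_k_anonymous(records, ["age", "zip"], k=2))   # each group has 2 records
print(is_k_anonymous(records, ["age", "zip"], k=3))   # no group has 3 records
```

A check like this only verifies the property; producing a k-anonymous table in the first place requires generalization or suppression, for which the techniques listed above are used.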
11.1.2.3 Data privacy protection

Data mining methods are used extensively to deduce previously unknown, potentially useful, and implicit information from large datasets using intelligent and statistical methods. However, the deduced conclusions or patterns may uncover data and can sometimes compromise privacy and confidentiality obligations. Privacy protection is therefore an important factor in data mining, because data mining applications can come at the expense of user privacy. The analysis of a privacy-preserving data mining method considers the effects of both the data mining algorithm and the privacy-preserving step on the mining result. Moreover, privacy must be preserved in data mining with respect to the following three aspects:

• Classifiers
• Clusters
• Association rules

Data mining is the method of extracting patterns from large volumes of data by integrating artificial intelligence and statistical methods with the database management system, and it is considered a major tool in modern business for transforming data into business intelligence. It is the nontrivial extraction of previously unknown and implicit information from databases or datasets. The data mining process involves collecting and compiling anonymous information from different sources for analysis, which results in data aggregation. When the data miner can identify individuals from the analysis of data that are supposed to be anonymous, the privacy of the user is affected. The information can, however, be modified further so that it remains anonymous and individuals cannot easily be identified from it. Privacy preservation is an important concept in data mining and has practical importance for government agencies in protecting the public from attacks.
Data mining methods consider data distribution in two forms:

• Distributed data distribution
• Centralized data distribution

Data privacy is a critical concern when mining distributed data, which can be partitioned horizontally or vertically. Vertical partitioning distributes the mining over various sites, where each site uses a portion of the attributes to calculate its result, and the distributed results are integrated at a central trusted party. In vertical partitioning, the privacy of the data is protected, but at a cost in classification accuracy. In horizontal partitioning, the data are distributed over various sites, each site computes on its own data, and a central party integrates the results. In this case, both data privacy and better classification accuracy can be achieved. Various privacy preservation methods have been developed, most of which are based on data anonymization. Some of the privacy preservation methods are listed below:

• L-diverse method
• Taxonomy tree
• Randomization model
• Cryptographic techniques
• K-anonymity method
• Data distribution model
• Condensation approach
• T-closeness
• Multidimensional sensitivity-based anonymization
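As one concrete instance from this list, the simplest ("distinct") form of L-diversity can be checked by counting the distinct sensitive values in each group of records that share the same quasi-identifier values. This is a minimal sketch under that assumption; the record layout, attribute names, and values are hypothetical.

```python
from collections import defaultdict

def is_l_diverse(records, quasi_identifiers, sensitive, min_distinct):
    """True if every equivalence class has at least min_distinct sensitive values."""
    classes = defaultdict(set)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return all(len(values) >= min_distinct for values in classes.values())

# Made-up table: the first class is homogeneous in the sensitive attribute.
table = [
    {"age": "20-29", "zip": "411**", "condition": "flu"},
    {"age": "20-29", "zip": "411**", "condition": "flu"},
    {"age": "30-39", "zip": "412**", "condition": "flu"},
    {"age": "30-39", "zip": "412**", "condition": "asthma"},
]
print(is_l_diverse(table, ["age", "zip"], "condition", min_distinct=2))
```

The table is 2-anonymous, yet the check fails for two distinct values because everyone in the first class shares the same condition, which is exactly the homogeneity problem that L-diversity is meant to expose.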
11.1.2.3.1 K-anonymity

Anonymization is the practice of altering data before they are given to data analytics so that reidentification is not possible. However, the data can still be reidentified by mapping the anonymized data to external data sources. Moreover, k-anonymity is prone to attacks such as the background knowledge attack and the homogeneity attack. Algorithms used to enforce k-anonymity include Mondrian and Incognito.

11.1.2.3.2 L-diversity

The L-diversity method was introduced to address the homogeneity attack. In the L-diversity model, each equivalence class contains at least L well-represented values for the sensitive attribute. L-diversity is not always easy to implement because of the variety of the data. Moreover, L-diversity is prone to the skewness attack: when the overall data distribution is skewed into very few equivalence classes, protection against attribute disclosure cannot be ensured. L-diversity is also vulnerable to similarity attacks.

11.1.2.3.3 T-closeness

The T-closeness measure is an improved version of L-diversity. An equivalence class is said to have T-closeness when the distance between the distribution of the sensitive attribute in the class and its distribution in the whole table is not more than a threshold value T. T-closeness is computed with respect to the sensitive attribute.

11.1.2.3.4 Randomization technique

Randomization is the process of adding noise to the data, generally drawn from a probability distribution. It is applied in sentiment analysis, surveys, and so on. The randomization model does not require knowledge of other similar records in the data, and it can be applied during preprocessing and at data collection time. There is no anonymization overhead in the randomization model. However, applying randomization to a large database is not practical because of data utility and time complexity.

11.1.2.3.5 Data distribution technique

In the data distribution model, the data are distributed over various sites. The data can be distributed in two different ways:

• Horizontal data distribution
• Vertical data distribution

Horizontal distribution: If different records with the same set of attributes are distributed over various sites, the distribution is called horizontal. It can be used for aggregate operations or functions without sharing the actual data.

Vertical distribution: Here, the person-specific data are distributed over various sites under the custody of different organizations. The data holder encrypts the data before they are released to data analytics. However, encrypting large-scale information with traditional encryption methods is highly complex, and such a method can be applied at the time of the data collection process.

Fig. 11–1 represents the schematic diagram of privacy preserved data mining.
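The remark that horizontally distributed data support aggregate operations without sharing the actual data can be sketched as follows: each site reports only a sum and a count, and the central party combines these summaries into a global mean. The site contents and function names are hypothetical, and a production system would add secure aggregation, since summaries from very small sites can still leak information.

```python
def local_summary(values):
    """Each site shares only a sum and a count, never its raw records."""
    return sum(values), len(values)

def global_mean(summaries):
    """The central party combines the per-site summaries into one mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count

# Made-up measurements held at three sites with the same attribute.
site_a = [4.0, 6.0]
site_b = [10.0]
site_c = [2.0, 8.0, 6.0]
summaries = [local_summary(site) for site in (site_a, site_b, site_c)]
print(global_mean(summaries))
```

The result equals the mean that would be computed over the pooled records, which shows that the aggregate loses nothing even though no site reveals an individual value.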
FIGURE 11–1 Privacy preserved data mining.

11.1.2.4 Data encryption

Data encryption is used to protect data and is commonly applied to meet various compliance requirements. Most modern databases, such as Microsoft SQL Server, MySQL, and Oracle, include functions for data encryption and decryption. In addition, these databases contain functions for hashing data. Encryption is the process of converting plaintext into ciphertext using a key, and decryption is the process of converting the ciphertext back into plaintext. Cryptographic methods should be efficient and secure, require little memory, be low in cost, and be easy to implement. Encryption and hashing are related, but their operations are different. Hashing is a one-way function that takes the data and produces a cryptographic fingerprint by which the data can be identified. Encryption, by contrast, is reversible: a key is used to lock and unlock the data.
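The difference between one-way hashing and reversible encryption can be shown with the Python standard library. The hash uses SHA-256 from `hashlib`. Because the standard library has no general-purpose cipher, the reversible step below is a toy XOR with a random key as long as the message, chosen only to show that the same key locks and unlocks the data; it is an assumption for illustration and not what a database engine uses.

```python
import hashlib
import secrets

def fingerprint(data: bytes) -> str:
    """One-way: return the SHA-256 digest of the data as hex."""
    return hashlib.sha256(data).hexdigest()

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy reversible step: XOR the data with a key of the same length."""
    if len(key) != len(data):
        raise ValueError("key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

message = b"status report"
key = secrets.token_bytes(len(message))   # random key, used once
ciphertext = xor_bytes(message, key)      # "lock" with the key
recovered = xor_bytes(ciphertext, key)    # "unlock" with the same key
print(recovered == message)
print(fingerprint(message))
```

Applying the key twice returns the original message, whereas nothing in the sketch can turn the fingerprint back into the message; it can only be recomputed and compared.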
11.2 Applying data mining for military application

This section describes some applications of data mining in military systems, in the defense and security fields.
11.2.1 Data mining application in navy flight and maintenance data to affect flight repair

Naval aircraft maintenance is driven by avionics-related deficiencies. Military avionics systems rely heavily on built-in-test (BIT) to troubleshoot discrepancies during unscheduled maintenance. Many existing studies analyze BIT codes and schedule maintenance so as to increase availability under operational constraints [1]. This section describes aircraft diagnosis using BIT codes at the organizational level (O-level). The aircraft system itself, however, does not hold information such as the operating environment, historical maintenance data, or previous repair history. The Integrated Diagnostics and Automated Test Systems (IDATS) team in the Naval Air Systems Command (NAVAIR) examines the use of data mining to mitigate ambiguity in the maintenance of naval avionics systems and thereby reduce cost and ineffective maintenance practices. The approach draws on both the BIT codes and the aircraft historical data recorded in the aircraft memory unit to identify trends in the data. The IDATS team uses a commercial off-the-shelf data mining software package called “Think Analytics” to identify meaningful trends in the maintenance datasets and BIT codes. “Think Analytics” is a real-time data mining tool that offers the required data mining functionality. The trends identified with the software package are validated by system Subject Matter Experts and users to ensure that they are novel, nontrivial, and accurate. The knowledge obtained from data mining is used to find out how the system operates and communicates with other systems, so that BIT can be used to increase maintenance efficiency.
Aircraft data from the F/A-18 type/model/series were selected as the test set. A large amount of flight information is available for the F/A-18, and some known avionics diagnostic ambiguities exist in the subcomponents of the radar system. One major obstacle to achieving diagnostic maturation in such a complex system is that the data required for analysis reside in multiple, disparate locations. The data must be integrated with the relevant information before the maintenance data can be analyzed. However, fusing the maintenance data with BIT data during the maintenance process is complex. The V-22 automated maintenance environment (AME) is a system that integrates diagnostic data with the maintenance action through the work order. On the aircraft platform [2], the AME is used to mature the diagnostic capabilities and can be used to fuse the maintenance data with the diagnostic data in order to substantially improve the maintenance process. The data used for the aircraft system are obtained from the following sources:

• Naval Aviation Logistics Data Analysis (NALDA)
• Boeing F/A-18 flight data repository
• Excel spreadsheet reports that correlate the recorded flight information, such as flight characteristics and BIT code sequences

NALDA provides the records of maintenance activities, which include information such as:

• Recorded malfunction code
• Actions used to repair the system
• Weapons replaceable assembly (WRA) part number
• Job control number
• WRA serial number
• Action date of repair
• Completion date of repair
The Boeing repository contains data from F/A-18 flights from 2014 to the present. The fields of interest in the flight records are BIT logic inspection codes, BIT codes, aircraft altitude, aircraft Mach speed, and the time of BIT occurrence. The flight files consume approximately 60 GB of warehouse space. The flight files are parsed to isolate the data required for data mining and reasoning. The weekly reports and the historical data are small relative to the flight data files. The IDATS database has the capacity to store more than 2 TB of files, which offers sufficient space to acquire data from various aircraft platforms and to perform navy-wide analysis of aircraft. Once the data are collected, they must be cleaned and fused before mining. The data are cleansed to remove unwanted fields and incorrect or null values. Maintenance records that lack the part number, serial number, or maintenance action date are removed from the dataset. In addition, the data are filtered using the work unit code, which restricts the data to a specific aircraft platform.
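The cleansing and filtering steps can be illustrated with a minimal sketch. It assumes that each maintenance record is a dictionary and that the platform is identified by a prefix of the work unit code; the field names, the prefix rule, and the sample values are hypothetical and are not taken from the NALDA schema.

```python
# Hypothetical field names; the real NALDA record layout is not given in the text.
REQUIRED_FIELDS = ("part_number", "serial_number", "action_date")


def clean_maintenance_records(records, work_unit_prefix):
    """Drop incomplete records and keep only those for one platform."""
    cleaned = []
    for rec in records:
        # Remove records in which any required field is missing or empty.
        if any(rec.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue
        # Assumption: the platform is identified by a work unit code prefix.
        if not str(rec.get("work_unit_code", "")).startswith(work_unit_prefix):
            continue
        cleaned.append(rec)
    return cleaned


# Illustrative records only.
records = [
    {"part_number": "PN-1", "serial_number": "SN-9",
     "action_date": "2015-03-01", "work_unit_code": "74A10"},
    {"part_number": None, "serial_number": "SN-4",
     "action_date": "2015-03-02", "work_unit_code": "74A10"},
    {"part_number": "PN-2", "serial_number": "SN-5",
     "action_date": "", "work_unit_code": "74A20"},
    {"part_number": "PN-3", "serial_number": "SN-6",
     "action_date": "2015-03-05", "work_unit_code": "13B00"},
]
cleaned = clean_maintenance_records(records, work_unit_prefix="74")
print(cleaned)
```

Only the first record survives: the second and third are incomplete, and the fourth belongs to a different work unit code.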
Classification and association rules are applied to the data sources whose attributes take nominal values. Here, association is the process of identifying attribute values that two different entities have in common. The “Think Analytics” tool provides algorithms for building association rules from the input dataset. In addition to the input data source, further inputs are required to model the association rules. The variables of interest are:
• Rule support
• Rule confidence
• Rule lift
Support, confidence, and lift are the characteristics used to describe an association rule.
Rule support: The size of the target group, that is, the set of observations that satisfy both the condition and the outcome of the rule, expressed as a proportion or percentage of all observations.
Rule confidence: Also called rule predictability, it indicates how accurately the rule is expected to predict on future datasets. It is calculated as the support of the target group divided by the support of the group of observations that satisfy the condition of the rule.
Rule lift: It specifies the strength of the association between the groups, building on the rule confidence. It is computed as the rule confidence divided by the proportion of all observations that satisfy the outcome of the rule.
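The three measures can be made concrete with a short sketch that uses the standard definitions of support, confidence, and lift. The record fields (`bit_code`, `action`) and their values are hypothetical, and the code does not reproduce the “Think Analytics” implementation.

```python
def matches(record, condition):
    """True if the record has every attribute value listed in the condition."""
    return all(record.get(key) == value for key, value in condition.items())


def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(records)
    n_antecedent = sum(matches(r, antecedent) for r in records)
    n_consequent = sum(matches(r, consequent) for r in records)
    n_both = sum(matches(r, antecedent) and matches(r, consequent) for r in records)
    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    lift = confidence / (n_consequent / n) if n_consequent else 0.0
    return support, confidence, lift


# Illustrative data: does BIT code B12 tend to lead to a "replace" action?
records = [
    {"bit_code": "B12", "action": "replace"},
    {"bit_code": "B12", "action": "replace"},
    {"bit_code": "B12", "action": "repair"},
    {"bit_code": "B07", "action": "replace"},
]
support, confidence, lift = rule_metrics(
    records, {"bit_code": "B12"}, {"action": "replace"}
)
print(support, confidence, lift)
```

A lift above 1 indicates that the condition makes the outcome more likely than it is overall, whereas a lift below 1, as in this toy dataset, indicates the opposite.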
11.2.2 Data mining methods used in online military training
The wide usage of the internet has brought irreversible changes to many fields of activity, and the military is one in which knowledge and information are disseminated and shared in real time. These changes led to the development of instruments known as Learning Management Systems (LMS). Online education has thus emerged over the last decades as a means of distance teaching and learning. In most cases, the low cost of data processing and data storage has allowed academic institutions to acquire large quantities of data, which raises an important issue [1]. The features that make data analysis difficult with classical methods are:
• Quantity
• Heterogeneity
• Data gathering speed
Hence, the data in an online learning system must be processed using automatic methods. By bringing together methods from fields such as artificial intelligence and statistics, data mining paves the way for knowledge extraction and discovery from educational databases. Data mining methods are used successfully in online education because they can identify patterns and systematic relationships between the variables of structured databases.
The detected patterns must, however, be validated on new subsets of data in order to generate potentially useful, previously unknown, and implicit information from the data. This information is then interpreted to obtain knowledge that universities can use in taking managerial decisions.
The development of military phenomena and processes is tightly connected with technological evolution. A professional army implies the use of advanced instruments as well as training technologies. In the global environment, security forces face vulnerabilities and threats, so it is a high priority to train military personnel for theaters of operation, and online learning systems on a global platform support that training. This idea was presented at the international forum on technology-assisted training for security, defense, and emergency services, which offered opportunities to the educational field and described the importance of using e-learning technologies to build knowledge portals and create educational networks. The LMS allows educational content to be modeled and designed to meet the requirements of the military field. The LMS application used to train the military at the national defense university allows courses to be disseminated in various formats and allows material to be disseminated as training and learning support using standardized instruments. Here, the knowledge discovery process relies on the database stored in the LMS, which holds the user data. Using prior knowledge of the educational system, the relevant dataset is selected for analyzing the learning process. Data mining techniques then extract patterns from the data, and the knowledge derived from those patterns is subsequently fused with the knowledge base of the system. Fig. 11–2 represents the phases of the knowledge discovery in database (KDD) process for analyzing the data in online military training. The architecture of the KDD system for military training is composed of several subsystems that work together to provide quality services to the students. The subsystems involved in the architecture are:
• Evaluation subsystem
• Learning administration system
• Database management system
• Interfaces characteristics subsystem
The data stored in the LMS platform must be monitored effectively for the educational resources to be administered efficiently. The main purpose of analyzing the data is to find patterns or models that assign resources efficiently based on the student profile and that identify the problems students face in training activities.
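The selection, mining, and validation phases of the KDD process can be sketched on a toy example. The sketch assumes that the LMS log holds one record per student with a login count and a pass or fail outcome, and that the "pattern" is a single login threshold; the record fields, the course names, and the threshold model are all hypothetical simplifications.

```python
def select(records, course):
    """Selection phase: keep complete records for one course."""
    return [r for r in records if r["course"] == course and r["logins"] is not None]


def accuracy(records, threshold):
    """Share of records where 'logins >= threshold' agrees with the outcome."""
    hits = sum((r["logins"] >= threshold) == r["passed"] for r in records)
    return hits / len(records)


def mine_threshold(train):
    """Mining phase: choose the login threshold that best fits the training data."""
    candidates = sorted({r["logins"] for r in train})
    return max(candidates, key=lambda t: accuracy(train, t))


# Illustrative LMS records only.
history = [
    {"course": "tactics", "logins": 2, "passed": False},
    {"course": "tactics", "logins": 3, "passed": False},
    {"course": "tactics", "logins": 8, "passed": True},
    {"course": "tactics", "logins": 10, "passed": True},
    {"course": "logistics", "logins": 1, "passed": True},
    {"course": "tactics", "logins": None, "passed": False},
]
new_cohort = [
    {"course": "tactics", "logins": 1, "passed": False},
    {"course": "tactics", "logins": 9, "passed": True},
    {"course": "tactics", "logins": 7, "passed": True},
]

threshold = mine_threshold(select(history, "tactics"))
# Validation phase: test the detected pattern on data it has not seen.
validation_accuracy = accuracy(select(new_cohort, "tactics"), threshold)
print(threshold, validation_accuracy)
```

The pattern fits the historical records perfectly but misclassifies one student in the new cohort, which is the kind of discrepancy the validation step is meant to expose.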
FIGURE 11–2 Phase of the KDD process for analyzing the data in online military training. KDD, Knowledge discovery in database.
11.2.3 A data mining approach to enhance military demand forecasting
Accurately identifying and forecasting the demand for supply items is a major step in planning a military operation, as it supports better decision making. In supply chain management, such demand forecasts are referred to as advanced demand information (ADI). In military applications, the ADI for supply items is modeled from the rate of effort of military platforms, such as aircraft and ships.
• An accurate ADI model is required to support better decision making at the initial stage of a military operation.
• The planning system allows the logistics planner to specify the force elements, including the personnel and platforms to be used in the operation, the routes and locations, and the schedule of activities for military tasks, such as supply chains and resource allocation.
• Apart from equipment and troop movements, the logistics operation relies on the ADI to procure and distribute supplies accurately for the military operation.
• The planning system uses the rate of effort model (ROEM) to generate the ADI.
The ADI generated by the planning system supports various decision-support considerations, such as weak or risk points, logistic sustainability, and plan feasibility. Fig. 11–3 represents the military demand modeling process. Modeling the ADI for stock items in a military operation requires enormous effort: it is time-consuming, costly, and data-intensive because of the size of the military inventory. Generating the ROEM automatically is difficult, as the historical data do not contain the platform details. Hence, an ADI model with improved accuracy is needed for military demand modeling. ADI-based autocorrelation prediction is used to increase the accuracy of stock demand forecasts. Its basic premise is that the demands for specific supply items are cross-correlated in the context of a military operation. This correlation in demand arises from a range of factors, such as complementary relationships, dependence, part-of relationships, and operational circumstances [3].
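The premise that demands for supply items are cross-correlated can be illustrated with a small sketch. It assumes that demand is recorded as one series per item over the same periods and uses the Pearson correlation coefficient as the measure; the item names, the demand values, and the 0.8 threshold are hypothetical, and the cited work may use a different correlation measure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # A constant series carries no correlation information.
    return cov / sqrt(var_x * var_y)


def correlated_pairs(demand, threshold=0.8):
    """List item pairs whose demand series are strongly correlated."""
    items = sorted(demand)
    pairs = []
    for i, first in enumerate(items):
        for second in items[i + 1:]:
            r = pearson(demand[first], demand[second])
            if abs(r) >= threshold:
                pairs.append((first, second, r))
    return pairs


# Illustrative demand per period for three supply items.
demand = {
    "tyre": [4, 6, 8, 10],
    "brake_pad": [2, 3, 4, 5],
    "ration": [7, 7, 1, 9],
}
pairs = correlated_pairs(demand)
print(pairs)
```

A strongly correlated pair suggests that a forecast for one item can inform the forecast for the other, for example when the two parts are replaced together.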
FIGURE 11–3 Military demand modeling process.
11.2.4 Architecture of knowledge discovery engine for military commanders using massive runs of simulations
Two data mining approaches are applied to analyze the Albert simulation data:
• Rule discovery algorithm
• Bayesian network
11.2.4.1 Rule discovery algorithm
The rule discovery algorithm generates rules from the data and uses them to create a descriptive model. A rule is a conjunction of conditions, each of which refers to a single attribute. Each condition restricts an integer or real-valued attribute to a range between a lower bound and an upper bound. The procedure is repeated for each class in order to generate the interesting rules. The search starts from the general rule, which covers the entire instance space, and proceeds depth-first. It continues until the recall of the specialized rules falls below a specified recall value. The final result of the search is a tree of rules: the root is the general rule, and the leaves are rules whose recall is lower than the specified recall value. The most interesting rules in the tree are stored and ranked. A rule is specialized by applying a specialization operator, which either raises the lower bound or lowers the upper bound of a condition, and a rule may result from several specialization operators. For each condition there are many ways to lower the upper bound and to raise the lower bound, so an exhaustive search is not conducted. Instead, the number of specialization operators considered is bounded by a parameter called the maximum number of specialization operators, and the candidate operators are evaluated using their F-measure values [4]. The rule discovery algorithm has one further parameter, called “generality,” which controls the generality of the rules. It takes values from 0 to infinity and is used as the β value in the F-score formula. Values less than 1 are biased toward more specific rules, whereas values larger than 1 are biased toward more general rules. The default value of this parameter is 1.
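The effect of the generality parameter can be seen in a minimal sketch of interval rules scored with the F-measure. It assumes the standard F-score with β as the weighting parameter and a single numeric attribute; the attribute name `speed`, the class labels, and the data are hypothetical, and the depth-first search itself is not reproduced.

```python
def covers(rule, instance):
    """A rule is a dict {attribute: (lower, upper)}; all conditions must hold."""
    return all(lo <= instance[attr] <= hi for attr, (lo, hi) in rule.items())


def specialize(rule, attr, raise_lower_to=None, lower_upper_to=None):
    """Specialization operator: raise the lower bound or lower the upper bound."""
    lo, hi = rule[attr]
    if raise_lower_to is not None:
        lo = max(lo, raise_lower_to)
    if lower_upper_to is not None:
        hi = min(hi, lower_upper_to)
    new_rule = dict(rule)
    new_rule[attr] = (lo, hi)
    return new_rule


def f_beta(rule, instances, labels, target, beta=1.0):
    """F-score of the rule for the target class; beta plays the role of generality."""
    tp = fp = fn = 0
    for instance, label in zip(instances, labels):
        hit = covers(rule, instance)
        if hit and label == target:
            tp += 1
        elif hit:
            fp += 1
        elif label == target:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative simulation outcomes.
instances = [{"speed": s} for s in (1, 2, 3, 4, 5, 6)]
labels = ["loss", "loss", "win", "win", "win", "win"]
general = {"speed": (1, 6)}                               # covers every instance
specific = specialize(general, "speed", raise_lower_to=4)  # speed in [4, 6]

for beta in (0.5, 1.0, 3.0):
    print(beta, f_beta(general, instances, labels, "win", beta),
          f_beta(specific, instances, labels, "win", beta))
```

With β below 1 the specialized rule scores higher because precision dominates, and with a large β the general rule scores higher because recall dominates, which matches the bias described for the generality parameter.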
11.2.4.2 Bayesian network
The Bayesian network provides a quantitative, belief-based approach to decision support. The main reason for building a Bayesian network model is to address a different audience, rather than to extend the characteristic rules. Here, metamodels of the Albert simulation are created to support decision making, and the network is modeled within a decision context. The results are based on the simulated data. In the Albert context, the dataset is complete and the structure of the network is known. The structure specifies the relationships between the data variables and is specified by the modeler. The Albert dataset is complete because a large number of simulation runs are performed to generate a complete set of results. Domain knowledge is used to postulate a structure in which the input variables are the evidence nodes. The arcs between the input and output nodes are quantified by the probability distributions of the output nodes conditioned on the evidence nodes and on other output nodes. The conditional probabilities are ascertained using the tools within Netica. The metamodel offers a quantitative summary of the simulation runs for the decision of interest. Moreover, the metamodel allows the analyst to work backward through the data, from a postulated output to the inputs that generate it.
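Because the simulated dataset is complete and the structure is fixed, the conditional probabilities can be estimated by simple counting, and backward reasoning follows from Bayes' rule. The sketch below illustrates this for one input node and one output node; it is a plain-Python stand-in and does not use Netica, and the variable names (`sensor_range`, `outcome`) and the simulated runs are hypothetical.

```python
from collections import Counter


def conditional_table(runs, parent, child):
    """Estimate P(child | parent) by counting over complete simulation runs."""
    joint = Counter((run[parent], run[child]) for run in runs)
    parent_totals = Counter(run[parent] for run in runs)
    return {(p, c): count / parent_totals[p] for (p, c), count in joint.items()}


def posterior(runs, parent, child, observed_child):
    """Work backward: P(parent | child = observed_child) via Bayes' rule."""
    table = conditional_table(runs, parent, child)
    prior = Counter(run[parent] for run in runs)
    total = len(runs)
    unnormalized = {
        p: (prior[p] / total) * table.get((p, observed_child), 0.0) for p in prior
    }
    z = sum(unnormalized.values())
    return {p: v / z for p, v in unnormalized.items()} if z else {}


# Illustrative runs: one input (evidence) node and one output node.
runs = (
    [{"sensor_range": "long", "outcome": "win"}] * 3
    + [{"sensor_range": "long", "outcome": "loss"}]
    + [{"sensor_range": "short", "outcome": "win"}]
    + [{"sensor_range": "short", "outcome": "loss"}] * 3
)
cpt = conditional_table(runs, "sensor_range", "outcome")
belief = posterior(runs, "sensor_range", "outcome", "win")
print(cpt[("long", "win")], belief["long"])
```

The posterior answers the backward question: given that a run was won, how likely is it that the long sensor range was in use.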
11.2.5 Adaptive immune genetic algorithm for weapon system optimization in a military big data environment
Military applications generate large volumes of data because of the many types of sensors used on the battlefield. Big data analytics has been actively used in various fields in recent years and has attracted growing attention in national defense departments as a way to analyze military data more efficiently. For instance, the Department of Defense in the United States runs a number of big data programs, such as XDATA, in order to increase its defense capabilities. Military courses of action are prepared on the basis of intelligence groundwork that collects a huge volume of information from the surrounding battlefield environment. A major challenge for military commanders is the steadily increasing use of sensors, such as tactical sensors, global positioning systems for personnel and vehicles, and nano network sensors. Data scientists serving the armed forces are required to develop big data software tools that exploit the useful data for mission planning. The information most urgently required by military commanders includes:
• Location of enemy targets
• Accurate position of their own units
• Location of supporting units
• Other miscellaneous assets
With this information, military commanders can conduct operations more effectively and use the available resources and firepower more optimally. Weapon system portfolio optimization is investigated using the valuable information extracted from sensor data. The aim of the portfolio problem is to find an assignment of the different weapon units that maximizes the damage to hostile targets while satisfying a set of constraints. Most researchers have used branch and bound and implicit enumeration models to solve portfolio problems. The military application considers sensor data available from various sources, such as unmanned aerial vehicles and reconnaissance equipment. Maximizing the efficiency of a military mission requires knowledge discovery and data processing mechanisms. With the resulting knowledge, military commanders are able to apply combinations of various weapon systems on the battlefield. The weapon system portfolio problem is commonly referred to as the weapon target assignment problem and is classified as a constrained combinatorial problem. Conventional enumeration methods are not feasible for solving the portfolio problem [5]. The heuristic methods used to solve portfolio problems include:
• Simulated annealing
• Neural network
• Genetic algorithm (GA)
• Auction algorithm
• Ant colony optimization
Heuristic methods offer benefits over conventional enumeration methods in solution quality and computation time. GA is a stochastic global search algorithm that mimics the processes of natural evolution and selection. GA has been applied successfully to various problems because of its fast convergence rate. However, its improvement can slow down when the population converges prematurely, so the artificial immune system (AIS) is integrated with GA to reduce premature convergence. AIS simulates the antibody diversity of the immune system. It uses a concentration-based selection mechanism to inhibit or promote antibodies in order to maintain the diversity of the population. Adaptive parameters are also integrated with the GA to increase its performance. The resulting hybrid approach is used to solve problems with high-dimensional, complex search spaces.
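The combination of a GA with concentration-based selection can be sketched on a small weapon target assignment instance. The sketch assumes that an individual assigns each weapon to one target, that fitness is the expected value of destroyed targets, and that an individual's concentration is the fraction of identical individuals in the population. The selection weights, the adaptive mutation rule, and all numeric values are illustrative assumptions rather than the formulation of the cited work.

```python
import random


def expected_damage(assign, kill_prob, value):
    """Expected destroyed value when weapon w fires at target assign[w]."""
    survive = [1.0] * len(value)
    for weapon, target in enumerate(assign):
        survive[target] *= 1.0 - kill_prob[weapon][target]
    return sum(v * (1.0 - s) for v, s in zip(value, survive))


def concentration(individual, population):
    """Fraction of the population identical to this individual."""
    return sum(other == individual for other in population) / len(population)


def immune_ga(kill_prob, value, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    n_weapons, n_targets = len(kill_prob), len(value)

    def fitness(a):
        return expected_damage(a, kill_prob, value)

    pop = [[rng.randrange(n_targets) for _ in range(n_weapons)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        conc = [concentration(a, pop) for a in pop]
        # Concentration-based selection: over-represented individuals are inhibited.
        weights = [fitness(a) * (1.0 - c) + 1e-9 for a, c in zip(pop, conc)]
        # Adaptive mutation (assumption): mutate more when diversity is low.
        mut_rate = 0.05 + 0.3 * max(conc)
        children = [best[:]]  # Elitism keeps the best assignment found so far.
        while len(children) < pop_size:
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, n_weapons) if n_weapons > 1 else 0
            child = p1[:cut] + p2[cut:]  # One-point crossover.
            child = [rng.randrange(n_targets) if rng.random() < mut_rate else g
                     for g in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)


# Illustrative instance: 4 weapons, 3 targets.
kill_prob = [[0.9, 0.3, 0.2], [0.4, 0.8, 0.3], [0.2, 0.5, 0.7], [0.6, 0.6, 0.6]]
value = [10.0, 6.0, 8.0]
best, score = immune_ga(kill_prob, value)
print(best, score)
```

Because the best individual is copied into every new generation, the reported score never decreases from one generation to the next; the sketch omits any further constraints that a real portfolio problem would impose.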
11.2.6 Application of data mining analysis to assess military ground vehicle components
A massive volume of information is available in financial, maintenance, and logistics databases. Selecting the relevant data sources and then querying, analyzing, and transforming the raw data into actionable business intelligence poses a major challenge when assessing the components of army vehicles. Under contract to the Tank-automotive and Armaments Command (TACOM) Integrated Logistics Support Center, the Camber Corporation studied the effects of operations on TACOM-managed systems. The project, named “Delayed desert damage and degradation” (4D), focused on the effects of the extremely harsh environment. It also considered the effects of increased vehicle usage and of vehicle armor protection on the vehicle structure. The Camber 4D team developed a data mining and data analysis process in order to:
• Rapidly collect the data from various sources
• Display the information in a usable format
• Interpret the data more effectively
An additional step in analyzing potential 4D issues is normalization of the data, which accounts for usage levels and fleet size when comparing demands. The normalization factor allows the user to isolate the environmental issues that are relevant to component demand and to eliminate from consideration components that are simply high-demand. The use of field information sources is a critical issue in the data analysis and data mining process, because issues relating to equipment performance and condition must be investigated and validated. The initial data analysis and mining identifies accurate candidate items; however, confirming a candidate item requires further investigation before particular determinations can be made regarding equipment and performance issues. Various field sources and user feedback are required for assessing the ground vehicle components [6].
11.2.6.1 Data mining and analysis process
The major steps in the data analysis process for military ground vehicle components are as follows.
1. A pivot table is applied to the raw dataset to provide functional groupings of cost and demand data by nomenclature, location, fiscal year, and national item identification number.
2. For cost-associated data, a price filter of nearly $50 million is applied. It removes various high-demand, small parts, such as fittings, filters, fuses, fasteners, and equivalent items, that would obscure more valuable analysis. It is still desirable to keep parts in the cost-associated data together with their unit price, because in most cases low-price, high-volume parts, such as bearings and seals, contribute to the total cost.
3. An exclusion list is used to remove certain categories of vehicle parts. This is an extensive listing of accessories and common parts, such as body panels, armor, antennas, cables, bulk items, wiring, gun mounts, camouflage nets, and seats.
4. The basic vehicle factors are used to compute cost and demand. The cost factor can later be adjusted based on vehicle density and usage.
5. The initial phase of data analysis includes generating a report that shows the ranked parts list with the demand and cost histories of the parts.
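The steps above can be sketched as a small pipeline: group by item, apply the price filter and the exclusion list, normalize by vehicle density, and rank. The field names, the categories, the density value, and in particular the price threshold are hypothetical placeholders; the threshold is a parameter of the sketch and is not meant to reproduce the figure quoted in step 2.

```python
from collections import defaultdict


def rank_parts(rows, min_unit_price, excluded_categories, vehicle_density):
    """Group demand rows by item, filter them, normalize, and rank by cost."""
    totals = defaultdict(lambda: {"name": "", "demand": 0, "cost": 0.0})
    for row in rows:
        if row["unit_price"] < min_unit_price:      # price filter
            continue
        if row["category"] in excluded_categories:  # exclusion list
            continue
        entry = totals[row["niin"]]                  # pivot on the item identifier
        entry["name"] = row["name"]
        entry["demand"] += row["quantity"]
        entry["cost"] += row["quantity"] * row["unit_price"]
    ranked = [
        {
            "niin": niin,
            "name": entry["name"],
            # Normalize by fleet size so that platforms can be compared.
            "demand_per_vehicle": entry["demand"] / vehicle_density,
            "cost_per_vehicle": entry["cost"] / vehicle_density,
        }
        for niin, entry in totals.items()
    ]
    ranked.sort(key=lambda r: r["cost_per_vehicle"], reverse=True)
    return ranked


# Illustrative demand rows only.
rows = [
    {"niin": "001", "name": "transmission", "category": "drivetrain",
     "unit_price": 9000.0, "quantity": 2},
    {"niin": "001", "name": "transmission", "category": "drivetrain",
     "unit_price": 9000.0, "quantity": 3},
    {"niin": "002", "name": "fuse", "category": "electrical",
     "unit_price": 1.5, "quantity": 400},
    {"niin": "003", "name": "camouflage net", "category": "accessory",
     "unit_price": 800.0, "quantity": 10},
    {"niin": "004", "name": "alternator", "category": "electrical",
     "unit_price": 600.0, "quantity": 20},
]
ranked = rank_parts(rows, min_unit_price=100.0,
                    excluded_categories={"accessory"}, vehicle_density=10)
for part in ranked:
    print(part)
```

The ranked list corresponds to the report in step 5; a fuller version would also carry the fiscal year and location so that demand and cost histories can be shown per part.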
11.2.7 Data mining model in proactive defense of cyber threats
Rapidly changing technologies and operational practices are driving both public and private sector enterprises toward highly interconnected, technologically convergent information networks. As stove-piped databases and data processing solutions give way to integrated and unified systems, the potential impact of well-planned data theft, network intrusion, and denial-of-service attacks increases. Government and commercial organizations therefore need network defenses that respond rapidly to new attack strategies and tactics. Recognizing these challenges and trends, cyber security practitioners and researchers are focusing their efforts on proactive cyber defense techniques, in which future attack models are anticipated and incorporated into the design of the defense system. Despite this attention, much work remains to place the objective of proactive defense on a rigorous, quantitative foundation. The fundamental problems are associated with the predictability and dynamics of the coevolutionary arms race between defenders and attackers. Recent work has demonstrated that previous attacker actions and defender responses offer predictive information regarding the future behavior of attackers, although the measurables may not have enough predictive power to support strong predictions [7]. Even if the prediction and predictability issues are resolved, it remains an open question how to integrate predictive analysis into the design of a cyber defense system.
Here, the protection of enterprise-scale networks against intrusion and other attacks explicitly leverages the coevolutionary relationship between defenders and attackers in order to develop new methods for proactive network defense. Each method formulates the task as behavior classification, in which malicious and innocent network activities are distinguished, and each assumes only limited prior information regarding the attack. The data are modeled as a bipartite graph of network activity instances and the attributes or features that characterize them. The bipartite graph is used to define a machine learning model that classifies a given instance as either malicious or innocent based on its behavioral features. It also enables data concerning previous attacks to be transferred for use against novel attacks. Note that the previous attacks are drawn from a distribution of instances that is not identical to that of the new malicious behaviors; the transfer learning model offers a simple and effective way to extrapolate the behavior of attackers. The attacker-defender coevolution is also modeled as a hybrid dynamical system (HDS) in order to defend against network attacks. The continuous system of the HDS generates attack instances that correspond to the active mode of attack: the HDS takes the attack mode as input and generates synthetic data for that mode of malicious activity. The synthetic data are then combined with the observed attacks to train a learning-based classifier that is effective against attacks.
11.2.7.1 Synthetic attack generation
Here, malicious and normal network activity are distinguished, and the effectiveness of the HDS model is demonstrated. The model of attacker-defender coevolution is used to predict future attacker behavior: it generates predicted attack data as synthetic data, which are then used to train the classifier. A stochastic HDS (S-HDS) is used to model the interaction between defender and attacker in the network activities. The schematic diagram of the S-HDS is shown in Fig. 11–4.
FIGURE 11–4 Schematic diagram of the stochastic-HDS feedback structure.
The S-HDS model is a feedback interconnection of a discrete stochastic process, such as a Markov chain, with continuous dynamics. By integrating the continuous and the discrete dynamics, the computationally tractable model offers an expressive and scalable modeling environment. In general, the S-HDS model is used in recent applications to represent dynamical processes that evolve over a broad range of time scales.
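The interplay of discrete and continuous dynamics can be illustrated with a simplified simulation that generates synthetic attack instances. The sketch assumes a two-mode Markov chain for the attack mode and a single scalar feature that drifts toward a mode-specific center with Gaussian noise. The mode names, transition probabilities, centers, and dynamics are hypothetical, and the feedback from the continuous state to the mode switching is omitted for brevity.

```python
import random


def simulate_shds(transition, centers, steps, seed=0, pull=0.5, noise=0.1):
    """Generate (mode, feature) samples from a simplified stochastic hybrid system."""
    rng = random.Random(seed)
    mode = sorted(transition)[0]
    x = centers[mode]
    samples = []
    for _ in range(steps):
        # Discrete part: a Markov chain selects the next attack mode.
        next_modes = list(transition[mode])
        probs = [transition[mode][m] for m in next_modes]
        mode = rng.choices(next_modes, weights=probs, k=1)[0]
        # Continuous part: the feature drifts toward the center of the active mode.
        x += pull * (centers[mode] - x) + rng.gauss(0.0, noise)
        samples.append((mode, x))
    return samples


# Illustrative two-mode attacker model.
transition = {
    "exploit": {"exploit": 0.7, "scan": 0.3},
    "scan": {"exploit": 0.2, "scan": 0.8},
}
centers = {"exploit": 5.0, "scan": 0.0}
synthetic = simulate_shds(transition, centers, steps=50)
print(synthetic[:5])
```

Samples of this kind would be labeled as malicious and added to the observed attack instances before the classifier is trained.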
11.2.8 Modeling adaptive defense against cybercrimes with real-time data mining
The World Wide Web and the internet have the remarkable quality of bringing people around the world closer together. Over the past decades, the scale of network usage has increased. Just as a larger metropolitan population brings higher crime rates in the physical world, the increasing usage of the internet brings higher crime rates in the cyber world. The growth of cybercrime in velocity, volume, and variety has to be dealt with by the existing methods of information security [8]. Some common types of cybercrime are:
• Denial of service attack
• Online identity theft
• Electronic financial frauds
• Email bombing and spamming
• Phishing
• Hacking
• Cyber stalking
11.2.8.1 Information security technologies
Various technologies are available today to safeguard data against threats, including:
• Intrusion detection system
• Firewalls
• Encryption system
• Intrusion prevention system
• Content filters
• Access control system
• One-time password generators
• Antivirus
• Antispam
11.2.8.2 Real-time data mining
Data mining is a science that draws on various disciplines, such as mathematics, data management, machine learning, algorithms, analytics, and artificial intelligence, to identify hidden patterns within large datasets. Most operational systems find patterns using historical data. With a real-time data mining model, the hidden patterns can be revealed in real time. This development enables different applications, such as feedback systems, real-time intelligent control, and decision support systems. The components of a real-time data mining engine include:
• Communication server
• Cybercrime pattern recognizer
• Events monitor
• Data warehouse and data stores
• Real-time extraction, transformation and loading
• Management server
• Core switch
• Threats prevention and response algorithm generator
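The difference between mining historical data and revealing patterns as events arrive can be illustrated with a sliding-window events monitor. The sketch assumes that events arrive in time order as (timestamp, source) pairs and uses a simple count threshold as a stand-in for the cybercrime pattern recognizer; the class name, window length, threshold, and events are hypothetical.

```python
from collections import defaultdict, deque


class EventsMonitor:
    """Flag a source that produces too many events within a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.events = defaultdict(deque)  # one queue of timestamps per source

    def observe(self, timestamp, source):
        """Record one event and return True if the source now looks suspicious."""
        queue = self.events[source]
        queue.append(timestamp)
        # Discard events that have fallen out of the window.
        while queue and queue[0] <= timestamp - self.window:
            queue.popleft()
        return len(queue) >= self.threshold


# Illustrative stream of events.
monitor = EventsMonitor(window_seconds=10, threshold=3)
stream = [(0, "a"), (4, "a"), (8, "a"), (9, "b"), (30, "a")]
alerts = [monitor.observe(t, source) for t, source in stream]
print(alerts)
```

Each event is processed once and old events are discarded, so the decision is available as soon as the event arrives rather than after a batch analysis of stored data.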
11.2.8.3 Summary
This chapter covers data mining techniques and the security applications of data mining models. Various types of data, such as airborne data, military data, radar data, and weapon data, are described with respect to their properties and analysis. Data protection strategies are briefly discussed, and the data mining methods used in military applications are elaborated.
References
[1] Susnea E. Data mining techniques used in on-line military training. In: 7th international scientific conference eLearning and software for education, April 2011. pp. 1–5.
[2] Meseroll RJ, Kirkos CJ, Shannon RA. Data mining navy flight and maintenance data to affect repair. In: IEEE Autotestcon, September 2007. pp. 476–481.
[3] Thiagarajan R, Rahman M, Gossink D, Calbert G. A data mining approach to improve military demand forecasting. J Artif Intell Soft Comput Res 2014;4(3):205–14.
[4] Barry P, Zhang J, McDonald M. Architecting a knowledge discovery engine for military commanders utilizing massive runs of simulations. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, August 2003. pp. 699–704.
[5] Yang S, Yang M, Wang S, Huang K. Adaptive immune genetic algorithm for weapon system portfolio optimization in military big data environment. Clust Comput 2016;19(3):1359–72.
[6] Ortland RJ, Bissonnette LA, Miller DR. Application of data mining and analysis to assess military ground vehicle components. In: Ground vehicle systems engineering technology symposium, 2010. pp. 1–18.
[7] Colbaugh R, Glass K. Proactive defense for evolving cyber threats. In: IEEE international conference on intelligence and security informatics, July 2011. pp. 125–130.
[8] Bhatti BM, Sami N. Building adaptive defense against cybercrimes using real-time data mining. In: IEEE first international conference on anti-cybercrime (ICACC), November 2015. pp. 1–5.
Index Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively. A Absolute growth rate (AGR), 213 Activation functions, 9293, 93f, 97 Adenine (A), 161 ADI. See Advanced demand information (ADI) Advanced demand information (ADI) military demand forecasting, 232234, 234f Advanced Very High-Resolution Radiometer (AVHRR), 178, 179t Adverse effects of drugs, 153 Agglomerative methods, 57 Agglomerative techniques, descriptive instances of, 58 AGR. See Absolute growth rate (AGR) Agriculture data irrigation management, 204213 crop yield forecasting, 207209 field capacity (Cc), 205206 fuzzy neural network for, 206207 growth rate analysis, 212213 growth rate measurement, 212 minimization of irrigation cost, 210 N measurement, 209 permanent wilting point (Pm), 206 prediction of irrigation events, 209210 soil density, 206 types of soil, 210211 water body prediction for crop filed, 213 management, 201204 analysis and decision-making, 202 data capture, 202 data marketing, 204 data storage, 203 data transfer, 203 data transformation, 204 intervention, 202, 203f sensing and monitoring, 202 properties and analysis, 199213
representation, 200201 human sourced (HS), 201 machine generated (MG), 201 process-mediated (PM), 200201 AgroDSS, 220 AI. See Artificial intelligence (AI) Airborne data, 224 AIS, 29t. See also Artificial immune system (AIS) Albert simulation data, 234235 Bayesian network, 235 rule discovery algorithm, 234235 AlexNet, 216217 AlexNetOWTBn, 217 Alluvial soils, 211 AME. See Automated maintenance environment (AME) American Health ways, 159 ANN. See Artificial neural networks (ANN) Ant colony optimization (ACO) algorithm, 3536, 73, 75f Applications, data mining, 1215 Approximate reasoning, 135 Apriori algorithm, 28, 29t for frequent disease, 169170 Apriori ARM (association rule mining) algorithm, 25, 3334 AprioriTiD, 29t ArcMap, 182183 Army vehicles components of, 237238 Artificial immune system (AIS), 237 Artificial intelligence (AI), 133138 applications, 133, 139 artificial neural networks (ANN), 134, 134f and data mining, modeling theory based on, 139154 cemented paste backfill (CPB) methods, 152153
243
244
Index
Artificial intelligence (AI) (Continued) engine idle speed system and control optimization, 146147 environmental systems, 140143 flood defense structures, 148149 human teaching tactics, 146 intelligent manufacturing system, 149152 modeling sediment transport, 147148 offset lithographic printing process, 144146 pharmacovigilance, 153154 solar radiation data, 139 strategies for tutoring systems, 146 water quality modeling, 143144 versus data mining, 138139 expert system (ES), 136137 fuzzy logic (FL), 135 genetic algorithm (GA), 136 hybrid systems (HS), 137138 Artificial neural network (ANN), 18, 37, 109, 125126, 134, 134f advantages, 126 applications, 126 for wheat yield forecasting, 209 Artificial neurons, 17, 3839 ASR. See Automatic speech recognition system (ASR) Associate rule mining, 2530 based on optimization method, 3133 different kinds of association rules, 2830 using genetic algorithm, 3334 Association mining methods, 163 Association rule-based algorithm, 170 Association rule mining (ARM), 2526, 27f, 29t, 31 using particle swarm optimization, 34 Association rules, 157, 164165 Australian Bureau of Statistics, 168 Automated maintenance environment (AME), 230 Automatic speech recognition system (ASR), 224 AVHRR. See Advanced Very High-Resolution Radiometer (AVHRR) Avionics diagnostics, 230 B Backpropagation algorithm, 128130, 129f, 168169
Back propagation NN, 11 Backpropagation through the structure (BTS), 105 Backward chaining, 142 Band interleaved by line (BIL), 186–187, 187f Band Interleaved by Pixel (BIP), 186, 187f Band Sequential, 187, 188f Basic Local Alignment Search Tool, 162 Bayesian belief networks (BBNs), 12 Bayesian decision theory, 111 Bayesian networks (BNs) model, 166–167, 235 for psychiatric diseases, 170 Bees swarm optimization (BSO) algorithm, 35 Belief propagation (BP) decoding, 90–91 Big data mining for remote-sensing data, 193 for social welfare application, 193–197 CBSIR, 196 cloud detection in satellite images, 196–197 forecasting heavy rainfall using MCS, 193–194 forest fire detection system, 195–196 forest wildfire prediction, 194 IDBSCAN, 195 land cover change detection, 195 Big data programs genetic algorithm for weapon system, 236–237 BIL. See Band interleaved by line (BIL) Bioinformatics, 161 Bioinspired algorithms in partitional clustering, 86 Biological control, 218–219 Biological networks, 165 Biological neurons, 109 Biomedical data mining, 160 BIP. See Band Interleaved by Pixel (BIP) BIT. See Built-in-test (BIT) Black soils, 211 Block-level maps, 184 BNs. See Bayesian networks (BNs) Boeing F/A-18 system, 230 Boolean association rule, 30 Boolean networks, 166, 167f Bootstrap techniques, 163 Boundary element model, 143 Breadth-first search (level-wise search), 169 Built-in-test (BIT), 229–230 Business layer, 150
C Capabilities layer, 149 CAP3 algorithms, 164 Case-based reasoning (CBR), 140–141 process of, 141f Cat swarm optimization (CSO), 83 CBR. See Case-based reasoning (CBR) CBSIR. See Content-based satellite image retrieval (CBSIR) Cemented paste backfill (CPB) methods constitutive modeling of, 152–153 Central Dogma, 165 Centroid clustering, 59, 60t CGR. See Crop growth rate (CGR) Chaotic bacterial foraging optimization (CBFO), 171 ChEMBL, 173 Chemical pest control, 218–219 Cheminformatics methods, 173 Classification and regression tree (CART) analysis, 8–9, 62–63 Classification technique, 9, 157 Cloud-based platform, 203 Cloud detection in satellite images, 196–197 Clustal W, 162 Clustering, 9–10, 157 gene expression patterns, 166 methods, 43f, 162 Clustering-based heuristic methodologies, 68–86 Cn3D, 163 CNN. See Convolutional Neural Network (CNN) Color Structure Descriptor, 216 Components, of data mining model, 5–6 Computer-assisted surveillance, 158–159 Consed/PHRAP algorithms, 164 Content-based satellite image retrieval (CBSIR), 196 Content layer, 143 Context-Based Association Rules, 30 Contextual information, 121 Continuous ARM, 29t Contrast set learning, 30 Control maps, 147
Convolutional network, 93 Convolutional Neural Network (CNN), 37–38, 119–121, 120f, 215–216 application, 121 crops lines algorithm, 219 flowchart of, 119f models, 217 Corporate surveillance, 15 Correlation rule, 30 CPB methods. See Cemented paste backfill (CPB) methods Criminal investigation, 15 Crop disease prediction, 214–215 Crop growth rate (CGR), 213 Crop yield forecasting, 207–209 artificial neural networks (ANNs), 209 decision trees (DT) method, 208–209 water body prediction for, 213 Crossover operator, 136 Cuckoo search algorithm (CSA), 83 Cumulative sum (CUSUM) method, 195 Customer relationship management (CRM), 14 Customer segmentation, 15 CUSUM method. See Cumulative sum (CUSUM) method Cybercrimes types of, 240–241 information security technologies, 240 real-time data mining, 240–241 Cyber defense proactive techniques of, 238–240 Cytosine (C), 161 D DALI, 163 Data anonymization, 225–226 DataBase File, 184 Database management systems, 156 Databases adapted for data mining, 2 Data capture, 202, 203f Data classification, 87–88 Data clustering, 41–55 definition, 41 fundamentals of cluster analysis, 42 heuristic methods for
clustering-based heuristic methodologies, 68–86 and formulation of exact solutions, 66–68 mode seeking and mixture-resolving algorithms, 55–64 Gaussian mixture models, 55–56 hierarchical clustering, 56–61 hierarchical divisive algorithms, 61–64 needs of unsupervised learning, 42–43 partitional clustering, 43–55 Data distribution model, 228 Data-driven approach, 148 Data-driven methods, 148 Data-driven data mining, 24–25 Data encryption, 228–229 Data integration, 3–4, 223–224, 230 Data management of agriculture data, 201–204 Data marketing, 203f, 204 Data mining approach artificial intelligence (AI) versus, 138–139 for modeling sediment transport, 147–148 Data mining engine, 6 Data mining methods, 188–193 agriculture data, 199–213 irrigation management, 204–213 management, 201–204 representation, 200–201 diagram of privacy, 228, 228f disease prediction, 214–217 crop disease, 214–215 leaf disease, 216 plant disease, 216–217 rice plant disease, 215 military application, 229–241 advanced demand information (ADI), 232–234 Albert simulation data, 234–235 components of army vehicles, 237–238 genetic algorithm (GA) for weapon system, 236–237 naval aircraft maintenance system, 229–231 online military training, 231–232
proactive techniques of cyber defense, 238–240 types of cybercrimes, 240–241 military data, 223–229 data protection strategies, 224–229 data source, 223–224 pests monitoring, 217–220 chemical pest control, 218–219 NNs algorithm, 219 RF algorithm, 220 spatial data mining, 188–189, 190f spatiotemporal data mining, 192–193, 192f temporal data mining, 190–191 Data mining techniques, 12 advantages of, 155–156 for biological application, 160–167 computational modeling of biological networks, 165–167 DNA sequence analysis, 161 gene association analysis (GAA), 163 gene expression analysis, 162–163 genome, 160 genome analysis, 164 macromolecule structure analysis, 163 metabolic pathways, 160 microarray analysis, 164–165 pathway analysis, 164 protein sequence analysis, 161–162 proteome, 160 regulatory pathways, 161 definition of, 155 descriptive tasks, 157 deviation detection approaches, 155–156 for disease diagnosis, 167–171 Apriori algorithm, 169–170 Bayesian network modeling for psychiatric diseases, 170 fuzzy k-nearest neighbor approach for Parkinson's disease diagnosis, 170–171 neural network for heart disease diagnosis, 168–169 disproportionality assessment approaches, 155–156 of drug discovery, 171–176, 171f hit identification, 173–174
hit to lead, 174–175 late-stage drug discovery and clinical trials, 176 lead optimization, 175 target identification, 171–173 target validation, 173–174 for healthcare, 156–160, 158f applications, 157–160 healthcare information system, 155–156 Intelligent Data Analysis and, 156 link analysis approaches, 155–156 mapping of, 157 predictive task, 157 revolution of, 156 Data modeling tools, 156 Data obfuscation, 225 Data privacy protection, 226–228 Data processing, 113 Data protection, 224–229 data anonymization, 225–226 data encryption, 228–229 data obfuscation, 225 data privacy protection, 226–228 Data representation of agriculture data, 200–201 of RS data, 181–187 raster data type, 185–187 vector data type, 182–185 Data resolution types of, 180–181 radiometric resolution, 180–181 spatial resolution, 180 spectral resolution, 180 temporal resolution, 181 Data science, 91–92 Data sources, 109 Data storage, 203, 203f Data transfer, 203, 203f Data transformation, 203f, 204 Data warehouse server, 6 Deception detection, 14 Decision-making processes, 111 Decision Support System (DSS), 218 Decision tree (DT), 11–12, 153, 168 method, 208–209
Deep-ANN, 37 Deep autoencoder, 107 Deep belief network (DBN), 100–101, 102f Deep Boltzmann machine (DBM), 100–101, 103 Deep convolutional neural network (deep CNN), 17, 95–98, 96f, 97f Deep generative adversarial network (deep GAN), 100–103 Deep learning in data mining, 37–39 convolutional neural network (CNN), 38 recurrent neural networks (RNNs), 38–39 Deep learning methods, data classification, 16, 90–108 background and evolution, 89–90 deep autoencoder, 107 deep convolutional neural network (deep CNN), 95–98, 96f, 97f deep generative adversarial network (deep GAN), 100–103 deep long-short-term memory (deep LSTM), 105 deep neural network (deep NN), 93–95, 94f, 102–103 deep recurrent neural network (deep RNN), 98–100, 99f deep recursive neural network (deep RvNN), 105 deep reinforcement learning (DRL), 103–105, 104f fully connected neural network (FCNN), 92–93, 92f hierarchical deep learning for text (HDLTex), 106, 106f random multimodel deep learning (RDML), 108 steps involved in data mining process, 89 Delaunay graph (DG), 55 “Delayed desert damage and degradation” (4D), 237 DEMs. See Digital Elevation Models (DEMs) Density-based techniques, 56 Deoxyribonucleic acid (DNA) data, 188 Design process for mining data, 2–5, 3f DetectNet, 219 Determinate growth, 212 Digital Elevation Models (DEMs), 185–186 Digital Line Graph (DLG), 184
Digital Raster Graphics (DRGs), 185–186 Direct Internet Message Encapsulation, 184 Disadvantages of data mining, 7–8 Disease prediction, 214–217 crop disease, 214–215 leaf disease, 216 plant disease, 216–217 rice plant disease, 215 Distribution-free method, 43 DLG. See Digital Line Graph (DLG) DNA data. See Deoxyribonucleic acid (DNA) data DNA sequence analysis, 161 Domain-driven data mining, 25 Double gray bar, 144 DRGs. See Digital Raster Graphics (DRGs) DrugBank, 173 Drug discovery, 171f, 173 data mining techniques of, 171–176 DSS. See Decision Support System (DSS) DT. See Decision tree (DT) Dynamic database, 136–137 E Earth Resources Technology Satellites (ERTS-1), 178–179 EcoCyc/MetaCyc, 164 Edman degradation chemistry, 161–162 Educational data mining (EDM), 13 Electronic control unit (ECU), 146 Electronic laboratory notebook (ELN), 175 Electronic Medical Record, 177–178 ELN. See Electronic laboratory notebook (ELN) Engine idle speed system and control optimization, AI for, 146–147 Environmental degradation, 199–200 Environmental parameters, 207–208 Environmental systems, modeling of artificial intelligence in, 140–143 Environment friendly methods of pest control, 218–219 Enzyme, catalytic activity of, 165 Erl Wood Cheminformatics node repository, 175 ERTS-1. See Earth Resources Technology Satellites (ERTS-1) ES. See Expert system (ES)
ESRI shapefile vector file format, 182–184, 183f ET. See Evapotranspiration (ET) Euclidean distance, 144 Eulerian-Lagrangian model, 143 European Public Health Alliance, 168 Evapotranspiration (ET), 207 Evolutionary computations, 18 Evolutionary programming, 12 Expectations of data mining, 19 Expert system (ES), 18, 136–137 components of, 137 control mechanism, 137 knowledge base, 137 Exploratory data analysis (EDA), 10 Expressed cDNA sequence tag sequencing techniques, 162 Extended ARM, 29t Extended Kalman filter (EKF), 125 eXtensible Markup Language (XML), 173 F FDA. See Food and Drug Administration (FDA) Federated agent, 143 Feedback loop, 17 Feedforward neural network (feedforward NN), 114–115 applications, 115 multilayered feedforward neural network, 115 Feedforward NN classifier, 114, 115f, 116f Field capacity (Cc), 205–206 Financial banking, 15 Finite difference method, 143 Finite element method, 148 Finite element model, 143 Finite volume method, 148 Firefly algorithm (FA), 84 FITRA, 207 FKNN approach. See Fuzzy k-nearest neighbor (FKNN) approach FL. See Fuzzy logic (FL) Food and Drug Administration (FDA) empirical Bayes method for, 153 Forest fire detection system on spatial data mining, 195–196
Forest wildfire prediction spatial reinforcement learning for, 194 Forward chaining, 142 FP (frequent pattern)-growth algorithm, 28 FP-tree, 29t Fraud detection, 14 Fraud reduction, 160 FS. See Fuzzy system (FS) F-score formula, 235 Fully connected neural network (FCNN), 92–93, 92f Fuzzy c-means (FCM), 48–49 Fuzzy k-nearest neighbor (FKNN) approach for PD diagnosis, 170–171 Fuzzy logic (FL), 18, 135, 195–196 predictions of pesticides, 220 Fuzzy neural network (FNN), 126 irrigation management for, 206–207 Fuzzy sets, 12 Fuzzy sets theory, 135, 206–207 Fuzzy system (FS), 18, 135 components of, 135f G GA. See Genetic algorithm (GA) GAA. See Gene association analysis (GAA) Gated recurrent unit (GRU), 100 Gaussian mixture models, 55–56 EM for, 56 Gaussian Naive Bayes, 220 Gene association analysis (GAA), 163 Gene expression, 85 Gene expression analysis, 162–163 GeneMark software tools, 164 Generalized disjunctive association rule (d-rules) mining, 29t Generative adversarial network (GAN), 100–101, 103 Generative model-based representations, 191 GeneSpring, 164–165 Genetic algorithm (GA), 130, 136, 169 life cycle of population, 136f for weapon system, 236–237 Genetic algorithm, associate rule mining using, 33–34
Genetic network, 165 Genetic operators, 136 GenMAPP pathway building tool, 164 Genome, 160 analysis, 164 Geographical Information System (GIS) RS data, 177 structure of, 181–182 Geostationary Positioning System (GPS), 201 GIS. See Geographical Information System (GIS) Glimmer software tools, 164 Global Positioning System (GPS), 192–193 Global representation, 191 Gomory-Hu algorithm, 53–54 GoogleNet, 217, 219 GPS. See Geostationary Positioning System (GPS); Global Positioning System (GPS) Gradient descent optimization model, 109–110 Graphical user interface (GUI), 6 Graph-theoretic clustering, 52–53 Gravitational Search Algorithm (GSA), 84–85 Growth index, 213 Growth rate analysis, 212–213 absolute growth rate (AGR), 213 crop growth rate (CGR), 213 growth index, 213 relative growth rate (RGR), 213 Growth rate measurement, 212 determinate growth, 212 indeterminate growth, 212 Guanine (G), 161 Guided local search (GLS), 71–73 H Hadoop Distributed File System, 203 Harmony search algorithm, 85 HDS. See Hybrid dynamical system (HDS) Healthcare, data mining for, 156–160, 158f Healthcare information system, 155–156 Health data analysis, 156–157 The Health Insurance Portability and Accountability Act, 225 Health system, enhancing, 13 Heart disease diagnosis neural network for, 168–169
Heavy rainfall forecasting using MCS, 193–194 Heuristic approximation methods, 166–167 Heuristic method, 65–86 clustering-based heuristic methodologies, 68–86 constructive strategy, 68 decomposition strategy, 68 dividing strategy, 67 and formulation of exact solutions, 66–68 inductive strategy, 67 local search strategy, 68 reduction strategy, 67 Hidden Markov Model (HMM), 162, 191, 216–217 Hierarchical clustering, 56–61, 162 Hierarchical deep learning for text (HDLTex), 106, 106f Hierarchical divisive algorithms, 61–64 High-content screening (HCS), 173–174 High dimensionality, 8 High-order pattern discovery, 30 High-risk patients, identification of, 159 High-throughput screening exploration environment (HiTSEE) chemistry, 173–175 Hill climbing, 68 HIS. See Hospital Information System (HIS) Hit identification, 173–174 HMM. See Hidden Markov Model (HMM) HMMER, 162 Hospital infection control, 158–159 Hospital Information System (HIS), 169 Hospital management, 13 Hospital resource management, 160 HS. See Human sourced (HS); Hybrid system (HS) Human genome, 161 Human interaction, 7 Human sourced (HS), 201 Hybrid dynamical system (HDS) attacker defender coevolution, 239–240, 239f Hybridization, 74 Hybrid metaheuristic, 74 Hybrid technique, 74
I IDAMAP2, 156 IDATS. See Integrated diagnostics and Automated Test System (IDATS) IDBSCAN. See Improved Density Based Spatial Clustering of Application of Noise (IDBSCAN) IFOV. See Instantaneous field of view (IFOV) If-then principle, 135 IGV browser, 172 Improved Density Based Spatial Clustering of Application of Noise (IDBSCAN), 195 Indeterminate growth, 212 Indigo packages, 175 Industrialized engineering, 14 Inference Problem, 31 Information processing techniques, 91–92 Information security technologies, 240 Information Technology (IT) model, 21, 200–201 Inking system, 145f Instantaneous field of view (IFOV), 180 Insurance abuse, 160 Integrated diagnostics and Automated Test System (IDATS), 229–230 Integrated pest management (IPM) program, 217–219 Intelligence layer, 150 Intelligence methods for data mining task, 21–22 ant colony optimization (ACO) algorithm, 35–36 associate rule mining, 25–30 different kinds of association rules, 28–30 associate rule mining using genetic algorithm, 33–34 association rule mining, 31 association rule mining using particle swarm optimization, 34 bees swarm optimization (BSO) algorithm, 35 deep learning in data mining, 37–39 convolutional neural network (CNN), 38 recurrent neural networks (RNNs), 38–39 intelligent methods for associate rule mining, 31–33 associate rule mining based on the optimization method, 31–33, 32f
penguins search optimization algorithm for association rules mining Pe-ARM, 36 procedure for intelligent data mining, 22–25, 23f data-driven data mining, 24–25 interest-driven data mining, 23–24 Intelligent Data Analysis, 156 Intelligent data mining, procedure for, 22–25, 23f data-driven data mining, 24–25 interest-driven data mining, 23–24 Intelligent machines, 138 Intelligent manufacturing platform technology, 151 Intelligent manufacturing system artificial intelligence (AI) in, 149–152 technology, 151f Intelligent methods for associate rule mining, 31–33 associate rule mining based on optimization method, 31–33, 32f Intelligent support function, 150 Intelligent techniques of data mining, 15–19 Interest-driven data mining, 23–24 Internet of Things (IoT), 201 Intrusion detection, 14 Invasive weed optimization algorithm (IWO), 84 IoT. See Internet of Things (IoT) IPM program. See Integrated pest management (IPM) program Irrigation management agriculture data for, 204–213 crop yield forecasting, 207–209 field capacity (Cc), 205–206 fuzzy neural network for, 206–207 growth rate analysis, 212–213 growth rate measurement, 212 minimization of irrigation cost, 210 N measurement, 209 permanent wilting point (Pm), 206 prediction of irrigation events, 209–210 soil density, 206 types of soil, 210–211 water body prediction for crop field, 213 ISODATA algorithm, 44 IT. See Information Technology (IT)
K K-anonymity, 225, 227 KBES. See Knowledge-based expert system (KBES) KDD. See Knowledge discovery in database (KDD) KEGG, 173 KEGG database, 164 Kernel density, 168 k-means, 164–165 algorithm, 44, 45f, 169 clustering, algorithmic steps for, 44 k-medoids clustering, 48 k-modes techniques, 48 K-nearest neighbors (KNN), 21–22, 209 KNime4Bio project, 172 KNIME UGM 2012, 173 KNN. See K-nearest neighbors (KNN) Knowledge base, 6 Knowledge-based expert system (KBES), 136–137 architecture of, 137f Knowledge-based retrieval approaches, 141 Knowledge discovery, 223 engine for military commanders, 234–235 process, 232 Knowledge Discovery Assistant (KDA), 22 Knowledge Discovery in Databases (KDDs), 1, 24, 31, 232, 233f Knowledge discovery process, 138 Knowledge Query and Manipulation Language, 142 Konstanz Information Miner (KNIME), 176 clustering algorithms, 172–173 extensions, 172 framework, 175 integration, 171 nodes, 173–175 UGM 2012, 173 K-optimal pattern discovery, 30 L Land cover change detection, 195 Landsat-1, 177–179 Landsat-4, 179 Landsat-7, 179–181 Landsat Enhanced Thematic Mapper Plus, 180 Landsat Multi-Spectral Scanner, 178–179, 179t
Landsat Thematic Mapper, 179, 179t Laterite soils, 211 Late-stage drug discovery and clinical trials, 176 L-diversity, 227 Lead optimization, 175 Leaf disease prediction, 216 Learning Management System (LMS), 231–232 Learning management system (LMS), 21 Levenberg-Marquardt (LM) algorithm, 130–131 Line vector file format, 184 LMS. See Learning Management System (LMS) Local search versus global search, 68 Long-short-term memory (LSTM), 100, 105, 122 M Machine generated (MG), 201 Machine learning, 52–53, 74, 86 Macromolecule structure analysis, 163 MAFIA algorithm, 169 Market basket analysis, 13 Markov Decision Process, 194 Markov model (MM), 220 MAS. See Multiagent systems (MAS) Massive datasets, 7–8 Massively parallel signature sequencing techniques, 162 Master unit, 145 Mathematical programming, 74 Maximum number of Specialization operators, 235 Maximum-Score Diversity Selection (MSDS), 174–175 MCA. See Medical control agency (MCA) MCS. See Mesoscale Convective Systems (MCS) Medical control agency (MCA), 153 Medical Literature Analysis and Retrieval System Online, 156 Membership function, 135 Memetic algorithm-based data clustering, 85 Merits of data mining, 6–7 Mesoscale Convective Systems (MCS) forecasting heavy rainfall using, 193–194 Message layer, 143 Message-passing interface (MPI), 77–78
Metabolic network, 165 Metabolic pathways, 160 Mfold, 163 MG. See Machine generated (MG) Microarray, 162 analysis, 164–165 Microsoft Structured Query Language, 228–229 Military application, 229–241 Albert simulation data, 234–235 components of army vehicles, 237–238 genetic algorithm (GA) for weapon system, 236–237 naval aircraft maintenance system, 229–231 online military training, 231–232 proactive techniques of cyber defense, 238–240 types of cybercrimes, 240–241 Military communication signal data, 224 Military data, 223–229 data protection strategies, 224–229 data anonymization, 225–226 data encryption, 228–229 data obfuscation, 225 data privacy protection, 226–228 data source, 223–224 airborne data, 224 military communication signal data, 224 radar data, 223–224 weapon data, 224 Mixture model-enabled techniques, 163 MM. See Markov model (MM) Model-based approach, 148–149 Model trees (MTs), 147–148 Moderate Resolution Imaging Spectroradiometer (MODIS), 181 Mode seeking and mixture-resolving algorithms, 55–64 MODIS. See Moderate Resolution Imaging Spectroradiometer (MODIS) Modular neural network, 122–125 advantages of, 124–125 Monothetic divisive methods, 61–62 Mountain soils, 211 Move acceleration model, 77–78 MSDS. See Maximum-Score Diversity Selection (MSDS)
MTs. See Model trees (MTs) Multiagent blackboard, 143 Multiagent systems (MAS), 142–143 Multidimensional association rule, 30 Multilayered feedforward neural network, 115 Multilayer perceptron, 117–119 Multilayer perceptron (MLP), 17, 93, 115–116, 118f Multiobjective Evolutionary Algorithm (MOEA), 31 Multiobjective optimization method, 31 Multiple sequence alignment tools, 162 Multi-Relation Association Rules (MRAR), 30 Multispectral Scanning System, 178–179 Mutation operator, 136 N Naive Bayes (NB), 21–22 algorithm, 220 techniques, 168 NALDA. See Naval aviation logistics data analysis (NALDA) NASA. See National Aeronautics and Space Administration (NASA) National Aeronautics and Space Administration (NASA), 192–193 National Cancer Institute, 192–193 National Geospatial-Intelligence Agency, 192–193 National Institute of Justice, 192–193 National Oceanic and Atmospheric Administration, 178 Natural chemical control, 218–219 Nature-inspired techniques, 82 NAVAIR. See Naval air system command (NAVAIR) Naval aircraft maintenance system data mining methods in, 229–231 Naval air system command (NAVAIR), 229 Naval aviation logistics data analysis (NALDA), 230 Nearest Neighbor (NN) algorithm, 216–217 pest monitoring for, 219 Nearest neighbor method, 140 Network-centric approaches, 224 Network flow theory-based clustering, 53–54
Neural network (NN), 8–9, 11, 109–113, 112f, 164–165 artificial neural network (ANN), 125–126 advantages, 126 applications, 126 background and evolution of, 110–111 characteristics of, 113 convolution neural network (CNN), 119–121, 119f, 120f application, 121 flowchart of, 119f for data classification, 111–112 advantages, 112 applications, 112 feedforward neural network (feedforward NN), 114–115 applications, 115 fuzzy neural network (FNN), 126 for heart disease diagnosis, 168–169 modular neural network, 122–125 advantages of modular neural networks, 124–125 multilayer perceptron, 117–119 probabilistic neural network (PNN), 127, 127f radial basis function neural network (RBFNN), 115–117, 116f recurrent neural networks (RNN), 121–122, 121f, 123f techniques, 168 training algorithms for data classification, 128–131 backpropagation algorithm, 128–130, 129f genetic algorithm (GA), 130 Levenberg-Marquardt (LM) algorithm, 130–131 training of, 127–128 working of, 111 Next-generation sequencers (NGS), 172 NGS. See Next-generation sequencers (NGS) NGS package, 172 N measurement of crop yields, 209 NN algorithm. See Nearest Neighbor (NN) algorithm Nonlinear parametric function, 147
Nonparametric unsupervised learning, 43 Nosocomial infection, 158–159 Nucleotides, 161 O Offset lithographic printing process, 144–146 O-level. See Organizational level (O-level) One-way clustering methods, 162 Online military training data mining methods in, 231–232 Online Transactional Processing, 156 Optimization method, 18 associate rule mining based on, 31–33, 32f Optimum search method, 136 Oracle data mining (ODM), 5 Organizational level (O-level), 229 Outliers, 7 Overfeat, 217 Overfitting, 7 P Pairwise alignment tool, 162 Parallel fuzzy c-means algorithm (PFCM), 79 Parallel k-means (PKM) clustering technique, 77–78, 81 Parallel metaheuristic data clustering model, 79t Parallel moves model, 76 Parallel multistart model, 76–77 Parametric unsupervised learning, 42–43 Pareto optimality, 31 Pareto-optimal set, 31 Parkinson's disease (PD) diagnosis, fuzzy k-nearest neighbor approach for, 170–171 Particle swarm optimization (PSO), 71–72, 171 association rule mining using, 34 Partitional clustering, bioinspired algorithms in, 86 Partitioning-based clustering techniques, 43–44 Pathway analysis, 164 Pattern evaluation, module for, 6 Pattern identification, 231–232, 240–241 Penguins search optimization algorithm for association rules mining Pe-ARM, 36
Permanent wilting point (Pm), 206 Perturbations, 171–172 Pests monitoring, 217–220 chemical pest control, 218–219 NNs algorithm, 219 RF algorithm, 220 Pharmaceutical industry, 13 Pharmacovigilance, 153–154 Phenyl isothiocyanate, 161–162 Physical layer, 150 Physical model, 149 Piecewise representation, 191 Plaid models, 163 Plant disease prediction, 216–217 PM. See Powdery mildew (PM); Process-mediated (PM) Polythetic divisive techniques, 63 Prediction, 8 Predictive modeling technique, 159 Printing process, 145–146 Privacy preservation methods, 227 Probabilistic neural network (PNN), 127, 127f Processing of data, 8 Process-mediated (PM), 200–201 Process of data mining, 8–10 Process of mining data, 2–5, 3f Product life cycle manufacturing technology, 152 Proportional reporting ratio (PRR), 153 Protein network, 165 Protein-protein interactions, 165 Protein sequence analysis, 161–162 Protein structure databases, 163 Proteome, 160 PRR. See Proportional reporting ratio (PRR) Psychiatric diseases, BN model for, 170 PubChem BioAssay database, 173 PubChem Substance database, 173 P-value, 235 Q Quantitative association rule, 30 Quantitative Sequence Activity Models, 172 Quasi-identifier, 225 Query processing, 156
R Radar data, 223–224 Radial basis function neural network (RBFNN), 115–117, 116f Radiometric resolution, 180–181 Random forests (RF) algorithm for pest monitoring, 220 Randomization technique, 227–228 Random multimodel deep learning (RDML), 107f, 108 Rapid ARM, 29t Rapid Association Rule Mining, 220 Raster data type, 185–187 advantages of, 187 Band interleaved by line (BIL), 186–187, 187f Band Interleaved by Pixel (BIP), 186, 187f Band Sequential, 187, 188f Rate of effort model (ROEM), 234 RBS. See Rule-based system (RBS) RDKit packages, 175 Reaction Generation package, 175 Real-time data mining, 240–241 Recurrent neural networks (RNNs), 38–39, 121–122, 121f, 123f Recurrent NN (RNN), 17 Red Green Blue (RGB) values, 144 Red soils, 211 Regression, 8–9 Regulatory network, 165 Regulatory pathways, 161 Reinforcement learning (RL), 142 Relative growth rate (RGR), 213 Remote-sensing (RS) data, 84, 177–197 big data mining, 193 for social welfare application, 193–197 data mining, 188–193 data representation, 181–187 data resolution characteristics, 180–181 satellite sensors, 177–180 Research analysis, 15 Resource Description Framework, 173 Resources layer, 149–150 Restricted Boltzmann machines (RBMs), 17, 93–94 Return Beam Vidicon, 178–179
RF algorithm. See Random forests (RF) algorithm RGR. See Relative growth rate (RGR) Rice plant disease prediction, 215 RL. See Reinforcement learning (RL) R-language, 5 Robotic disease detection system, 217 ROEM. See Rate of effort model (ROEM) Root system architecture (RSA), 210–211 Rough sets, 12 Rough set theory (RST), 18–19 RSA. See Root system architecture (RSA) Rule-based system (RBS), 141–142 Rule discovery algorithm, 234–235 Rule induction, 12 Rule learning, 30 S SAE model, 16 Satellite data remote sensing (RS) data. See Remote sensing (RS) data Satellite sensors, 177–180 Advanced Very High-Resolution Radiometer (AVHRR), 178, 179t Landsat Enhanced Thematic Mapper Plus, 180 Landsat Multi-Spectral Scanner, 178–179, 179t Landsat Thematic Mapper, 179, 179t Scientific data, 236 Seasat-1, 177 Security issues, 8 Selection operator, 136 Self-organizing map, 164–165 Semantic Web, 173 Sensing and monitoring of agriculture data, 202 Sensing layer, 150 Sequence alignment tool, 162 Sequime package, 172 Serial analysis of gene expression (SAGE) tag sequencing techniques, 162 Service platform layer, 150 S-HDS model. See Stochastic-HDS (S-HDS) model Shuffled frog-leaping (SFL) algorithm, 85 Sigmoid activation functions, 114 Simulated annealing (SA), 71–72
Single-dimensional association rule, 28–29 Single program multiple data (SPMD) model, 79 Soil density, 206 types of, 210–211 alluvial soils, 211 black soils, 211 laterite soils, 211 mountain soils, 211 red and yellow soils, 211 Solar radiation data, 139 South Africa, statistics of, 168 Spatial data, 182 Spatial data mining, 188–189, 190f Spatially spreading processes (SSPs) forest wildfire problem in, 194 Spatial resolution, 180 Spatiotemporal data mining, 192–193, 192f Spectral clustering, 54 Spectral resolution, 180 Spontaneous Reporting System, 153 Spotfire software packages, 164–165 SqueezeNet, 216 SSPs. See Spatially spreading processes (SSPs) Static database, 136–137 Stochastic-HDS (S-HDS) model, 239–240, 239f Structure-activity relationships (SAR), 173–175 STULONG dataset, 169 Subsampling techniques, 163 Supporting technology, 152 Support vector machine (SVM), 21–22, 164–165, 209, 215 Surface waters (SWs), 213 Surveillance system, 158–159 SVM. See Support vector machine (SVM) SWs. See Surface waters (SWs) T Tabu search, 69–70, 71t TACOM. See Tank-automotive and Armaments Command (TACOM) Tank-automotive and Armaments Command (TACOM), 237 Target identification, 171–173 Target validation, 173–174
T-closeness, 227 Techniques, data mining, 10–12 Temporal data mining, 190–191 representations of, 190–191 Temporal resolution, 181 Terra MODIS data for land cover change detection, 195 Texas Medicaid Fraud and Abuse detection system, 160 Thematic data, 182 Theoretical model, 146 The stochastic model, 135 Think Analytics, 229, 231 Thymine (T), 161 Time domain-based representations, 191 Time stamping method, 143 TLID vector file format. See Topologically Integrated Geographic Encoding and Referencing/Line (TLID) vector file format Tools in data mining, 5 Topologically Integrated Geographic Encoding and Referencing (TIGER)/Line vector file format, 184 Training samples, 87–88, 109 Transformation-based representations, 191 Trap capture data, 217–218 Treatment effectiveness, predicting, 13 Tutorial system, strategies for, 146 2001 Massachusetts Institute of Technology, 156 Two-way clustering methods, 162 U UAVs. See Unmanned Aerial Vehicles (UAVs) Ubiquitous network layer, 150 Ubiquitous network technology, 151 UI. See User interface (UI) United Data Management (UDM), 77–78 Units, 134 Unmanned Aerial Vehicle RS, 177 Unmanned Aerial Vehicles (UAVs), 201 data capture by, 202 Unsupervised learning methods for data clustering mode seeking and mixture-resolving algorithms, 55–64
US Census Bureau territories, 184 hierarchical view of, 185f US Department of Transportation, 192–193 User-centric approach, 225 USGS, 184 V Variable neighborhood search (VNS), 71–72 Variational autoencoder (VAE), 100–101, 103 Vector data type, 182–185 advantages of, 185 data description, 184 ESRI shapefile vector file format, 182–184, 183f software functionality, 184 Topologically Integrated Geographic Encoding and Referencing (TIGER), 184 VGG, 217 VGGNet, 219 vHTS. See Virtual high-throughput screening (vHTS) Virtual high-throughput screening (vHTS), 173 Virtual intelligent capacities, 150 Virtual layer, 150 Visual data mining, 15 W Water and energy-based sectoring operation (WEBSOM), 210
Water body prediction for crop field, 213 Water sector, 147 Weapon data, 224 Weapon target assignment problem, 236 Web Ontology Language, 173 WEBSOM. See Water and energy-based sectoring operation (WEBSOM) Weighted class learning, 30 Well-conducted management, 205 Wheat yield forecasting artificial neural networks (ANNs) for, 209 WHO. See World Health Organization (WHO) Widrow-Hoff rule, 110–111 World Health Organization (WHO) database information component (IC) for, 153 World Wide Web, 240 X XDATA, 236 XML. See eXtensible Markup Language (XML) Y Yangtze River Basin, in China, 193–194 Yellow soils, 211 Z Z-R relationship, 223–224