Data Analysis and Information Processing
Edited by: Jovan Pehcevski
Arcler Press
www.arclerpress.com
Data Analysis and Information Processing
Jovan Pehcevski
Arcler Press
224 Shoreacres Road
Burlington, ON L7L 2H2
Canada
www.arclerpress.com
Email: [email protected]

e-book Edition 2023
ISBN: 978-1-77469-579-1 (e-book)

This book contains information obtained from highly regarded resources. Reprinted material sources are indicated. Copyright for individual articles remains with the authors as indicated and published under the Creative Commons License. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data; the views articulated in the chapters are those of the individual contributors and not necessarily those of the editors or publishers. The editors and publishers are not responsible for the accuracy of the information in the published chapters or for the consequences of their use. The publisher assumes no responsibility for any damage or grievance to persons or property arising from the use of any materials, instructions, methods, or ideas in the book. The editors and the publisher have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission has not been obtained. If any copyright holder has not been acknowledged, please write to us so that we may rectify the omission.

Notice: Registered trademarks of products or corporate names are used only for explanation and identification, without intent to infringe.
© 2023 Arcler Press ISBN: 978-1-77469-526-5 (Hardcover)
Arcler Press publishes a wide variety of books and eBooks. For more information about Arcler Press and its products, visit our website at www.arclerpress.com
DECLARATION
Some chapters in this book are open-access, copyright-free published research works, released under a Creative Commons License and indicated with their citations. We are thankful to the publishers and authors of this content; without them this book would not have been possible.
ABOUT THE EDITOR
Jovan currently works as a presales Technology Consultant at Dell Technologies. He is a results-oriented technology leader with demonstrated subject matter expertise in planning, architecting, and managing ICT solutions that reflect business objectives and achieve operational excellence. Jovan has broad and deep technical knowledge of data center and big data technologies, combined with a consultative selling approach and exceptional client-facing presentation skills. Before joining Dell Technologies in 2017, Jovan spent nearly a decade as a researcher, university professor, and IT business consultant. In these capacities, he served as a trusted advisor to a multitude of customers in the financial services, healthcare, retail, and academic sectors. He holds a PhD in Computer Science from RMIT University in Australia and worked as a postdoctoral visiting scientist at the renowned INRIA research institute in France. He is a proud father of two, an aspiring tennis player, and an avid Science Fiction/Fantasy book reader.
TABLE OF CONTENTS
List of Contributors
List of Abbreviations
Preface

Section 1: Data Analytics Methods
Chapter 1: Data Analytics in Mental Healthcare (Abstract; Introduction; Literature Review; Mental Illness and its Type; Effects of Mental Health on User Behavior; How Data Science Helps to Predict Mental Illness?; Conclusions; Acknowledgments; References)
Chapter 2: Case Study on Data Analytics and Machine Learning Accuracy (Abstract; Introduction; Research Methodology; Cyber-Threat Dataset Selection; ML Algorithms Selection; Accuracy of Machine Learning; Conclusion; Acknowledgements; References)
Chapter 3: Data Modeling and Data Analytics: A Survey from a Big Data Perspective (Abstract; Introduction; Data Modeling; Data Analytics; Discussion; Related Work; Conclusions; Acknowledgements; References)
Chapter 4: Big Data Analytics for Business Intelligence in Accounting and Audit (Abstract; Introduction; Machine Learning; Data Analytics; Data Visualization; Conclusion; Acknowledgements; References)
Chapter 5: Big Data Analytics in Immunology: A Knowledge-Based Approach (Abstract; Introduction; Materials and Methods; Results and Discussion; Conclusions; References)

Section 2: Big Data Methods
Chapter 6: Integrated Real-Time Big Data Stream Sentiment Analysis Service (Abstract; Introduction; Related Works; Architecture of Big Data Stream Analytics Framework; Sentiment Model; Experiments; Conclusions; Acknowledgements; References)
Chapter 7: The Influence of Big Data Analytics in the Industry (Abstract; Introduction; Status Quo Overview; Big-Data Analysis; Conclusions; References)
Chapter 8: Big Data Usage in the Marketing Information System (Abstract; Introduction; The Use of Information on the Decision-Making Process in Marketing; Big Data; Use of Big Data in the Marketing Information System; Limitations; Final Considerations; References)
Chapter 9: Big Data for Organizations: A Review (Abstract; Introduction; Big Data for Organizations; Big Data in Organizations and Information Systems; Conclusion; Acknowledgements; References)
Chapter 10: Application Research of Big Data Technology in Audit Field (Abstract; Introduction; Overview of Big Data Technology; Requirements on Auditing in the Era of Big Data; Application of Big Data Technology in Audit Field; Risk Analysis of Big Data Audit; Conclusion; References)

Section 3: Data Mining Methods
Chapter 11: A Short Review of Classification Algorithms Accuracy for Data Prediction in Data Mining Applications (Abstract; Introduction; Methods in Literature; Results and Discussion; Conclusions and Future Work; References)
Chapter 12: Different Data Mining Approaches Based Medical Text Data (Abstract; Introduction; Medical Text Data; Medical Text Data Mining; Discussion; Acknowledgments; References)
Chapter 13: Data Mining in Electronic Commerce: Benefits and Challenges (Abstract; Introduction; Data Mining; Some Common Data Mining Tools; Data Mining in E-Commerce; Benefits of Data Mining in E-Commerce; Challenges of Data Mining in E-Commerce; Summary and Conclusion; References)
Chapter 14: Research on Realization of Petrophysical Data Mining Based on Big Data Technology (Abstract; Introduction; Analysis of Big Data Mining of Petrophysical Data; Mining Based on K-Means Clustering Analysis; Conclusions; Acknowledgements; References)

Section 4: Information Processing Methods
Chapter 15: Application of Spatial Digital Information Fusion Technology in Information Processing of National Traditional Sports (Abstract; Introduction; Related Work; Space Digital Fusion Technology; Information Processing of National Traditional Sports Based on Spatial Digital Information Fusion; Conclusion; References)
Chapter 16: Effects of Quality and Quantity of Information Processing on Design Coordination Performance (Abstract; Introduction; Methods; Data Analysis; Discussion; Conclusion; References)
Chapter 17: Neural Network Optimization Method and its Application in Information Processing (Abstract; Introduction; Neural Network Optimization Method and its Research in Information Processing; Neural Network Optimization Method and its Experimental Research in Information Processing; Neural Network Optimization Method and its Experimental Research Analysis in Information Processing; Conclusions; Acknowledgments; References)
Chapter 18: Information Processing Features Can Detect Behavioral Regimes of Dynamical Systems (Abstract; Introduction; Methods; Results; Discussion; Acknowledgments; References)

Index
LIST OF CONTRIBUTORS

Ayesha Kamran Ul Haq (National University of Computer and Emerging Sciences, Islamabad, Pakistan)
Amira Khattak (Prince Sultan University, Riyadh, Saudi Arabia)
Noreen Jamil (National University of Computer and Emerging Sciences, Islamabad, Pakistan)
M. Asif Naeem (National University of Computer and Emerging Sciences, Islamabad, Pakistan; Auckland University of Technology, Auckland, New Zealand)
Farhaan Mirza (Auckland University of Technology, Auckland, New Zealand)
Abdullah Z. Alruhaymi (Department of Electrical Engineering and Computer Science, Howard University, Washington D.C., USA)
Charles J. Kim (Department of Electrical Engineering and Computer Science, Howard University, Washington D.C., USA)
André Ribeiro (INESC-ID/Instituto Superior Técnico, Lisbon, Portugal)
Afonso Silva (INESC-ID/Instituto Superior Técnico, Lisbon, Portugal)
Alberto Rodrigues da Silva (INESC-ID/Instituto Superior Técnico, Lisbon, Portugal)
Mui Kim Chu (Singapore Institute of Technology, 10 Dover Drive, Singapore)
Kevin Ow Yong (Singapore Institute of Technology, 10 Dover Drive, Singapore)
Guang Lan Zhang (Department of Computer Science, Metropolitan College, Boston University, Boston, MA 02215, USA)
Jing Sun (Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA)
Lou Chitkushev (Department of Computer Science, Metropolitan College, Boston University, Boston, MA 02215, USA)
Vladimir Brusic (Department of Computer Science, Metropolitan College, Boston University, Boston, MA 02215, USA)
Sun Sunnie Chung (Department of Electrical Engineering and Computer Science, Cleveland State University, Cleveland, USA)
Danielle Aring (Department of Electrical Engineering and Computer Science, Cleveland State University, Cleveland, USA)
Haya Smaya (Mechanical Engineering Faculty, Institute of Technology, MATE Hungarian University of Agriculture and Life Science, Gödöllő, Hungary)
Alexandre Borba Salvador (Faculdade de Administração, Economia e Ciências Contábeis, Universidade de São Paulo, São Paulo, Brazil)
Ana Akemi Ikeda (Faculdade de Administração, Economia e Ciências Contábeis, Universidade de São Paulo, São Paulo, Brazil)
Pwint Phyu Khine (School of Information and Communication Engineering, University of Science and Technology Beijing (USTB), Beijing, China)
Wang Zhao Shun (School of Information and Communication Engineering, University of Science and Technology Beijing (USTB), Beijing, China; Beijing Key Laboratory of Knowledge Engineering for Material Science, Beijing, China)
Guanfang Qiao (WUYIGE Certified Public Accountants LLP, Wuhan, China)
Ibrahim Ba’abbad (Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, KSA)
Thamer Althubiti (Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, KSA)
Abdulmohsen Alharbi (Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, KSA)
Khalid Alfarsi (Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, KSA)
Saim Rasheed (Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, KSA)
Wenke Xiao (School of Medical Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China)
Lijia Jing (School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China)
Yaxin Xu (School of Medical Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China)
Shichao Zheng (School of Medical Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China)
Yanxiong Gan (School of Medical Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China)
Chuanbiao Wen (School of Medical Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China)
Mustapha Ismail (Management Information Systems Department, Cyprus International University, Haspolat, Lefkoşa via Mersin, Turkey)
Mohammed Mansur Ibrahim (Management Information Systems Department, Cyprus International University, Haspolat, Lefkoşa via Mersin, Turkey)
Zayyan Mahmoud Sanusi (Management Information Systems Department, Cyprus International University, Haspolat, Lefkoşa via Mersin, Turkey)
Muesser Nat (Management Information Systems Department, Cyprus International University, Haspolat, Lefkoşa via Mersin, Turkey)
Yu Ding (School of Computer Science, Yangtze University, Jingzhou, China; Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education, Wuhan, China)
Rui Deng (Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education, Wuhan, China; School of Geophysics and Oil Resource, Yangtze University, Wuhan, China)
Chao Zhu (The Internet and Information Center, Yangtze University, Jingzhou, China)
Xiang Fu (School of Physical Education, Guangdong Polytechnic Normal University, Guangzhou 510000, China)
Ye Zhang (School of Physical Education, Guangdong Polytechnic Normal University, Guangzhou 510000, China)
Ling Qin (School of Physical Education, Guangdong Polytechnic Normal University, Guangzhou 510000, China)
R. Zhang (Department of Quantity Survey, School of Construction Management and Real Estate, Chongqing University, Chongqing, China)
A. M. M. Liu (Department of Real Estate and Construction, Faculty of Architecture, The University of Hong Kong, Hong Kong, China)
I. Y. S. Chan (Department of Real Estate and Construction, Faculty of Architecture, The University of Hong Kong, Hong Kong, China)
Pin Wang (School of Mechanical and Electrical Engineering, Shenzhen Polytechnic, Shenzhen 518055, Guangdong, China)
Peng Wang (Garden Center, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, Guangdong, China)
En Fan (Department of Computer Science and Engineering, Shaoxing University, Shaoxing 312000, Zhejiang, China)
Rick Quax (Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands)
Gregor Chliamovitch (Department of Computer Science, University of Geneva, Geneva, Switzerland)
Alexandre Dupuis (Department of Computer Science, University of Geneva, Geneva, Switzerland)
Jean-Luc Falcone (Department of Computer Science, University of Geneva, Geneva, Switzerland)
Bastien Chopard (Department of Computer Science, University of Geneva, Geneva, Switzerland)
Alfons G. Hoekstra (Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands; ITMO University, Saint Petersburg, Russia)
Peter M. A. Sloot (Computational Science Lab, University of Amsterdam, Amsterdam, Netherlands; ITMO University, Saint Petersburg, Russia; Complexity Institute, Nanyang Technological University, Singapore)
LIST OF ABBREVIATIONS

EMC: The U.S.A. EMC company
NLP: Natural language processing
ANN: Artificial neural network
BP: Backpropagation
ROC: Receiver operating characteristic
TPR: True positive rate
FPR: False positive rate
AUC: Area under the curve
RF: Random forest
BN: Bayesian network
EHR: Electronic health record
ICD: International classification of diseases
SNOMED CT: The systematized nomenclature of human and veterinary medicine clinical terms
CPT: Current procedural terminology
DRG: Diagnosis-related groups
MeSH: Medical subject headings
LOINC: Logical observation identifiers names and codes
UMLS: Unified medical language system
MDDB: Main drug database
SVM: Support vector machine
RNN: Recurrent neural network
ID3: Iterative Dichotomiser 3
KCHS: Korean community health survey
ADR: Adverse drug reactions
ECG: Electrocardiographic
FNN: Factorization machine-supported neural network
PREFACE
Over the past few decades, the development of information systems in larger enterprises was accompanied by the development of data storage technology. Initially, the information systems of individual departments were developed independently of each other, so that, for example, the finance department had a separate information system from the human resources department. So-called 'information islands' were created, among which no flow of information was established. If a company had offices in more than one country, until recently it was common practice for each country to have a separate information system, which was necessary due to differences in legislation, local customs, and the problem of remote customer support. Such systems often had different data structures. The problem arose with reporting, as there was no easy way to aggregate data from diverse information systems to get a picture of the state of the entire enterprise.

The main task of information engineering was to merge separate information systems into one logical unit from which unified data can be obtained. The first step in the unification process is to create a company model, which consists of the following steps: defining data models, defining process models, identifying participants, and determining the flow of information between participants and systems (the data flow diagram).

The problem of unavailability of information is bigger in practice than it may seem. Certain types of businesses, especially non-profit-oriented ones, can operate in this way. However, a large company that sells its main product on the market for a month at the wrong price, due to inaccurate information obtained by management from a poor information system, will surely find itself in trouble. An organization's dependence on quality information from its business systems grows with its size and the geographical dislocation of its offices. Full automation of all business processes is now the practical standard in some industries. Examples are airline reservation systems, or car manufacturer systems where production of a car model with accessories can be started directly from the showroom according to the customer's wishes. The fashion industry, for example, must effectively follow fashion trends (analyzing sales of different clothing models by region) in order to respond quickly to changes in consumer habits.

This edition covers different topics from data analysis and information processing, including data analytics methods, big data methods, data mining methods, and information processing methods. Section 1 focuses on data analytics methods, describing data analytics in mental healthcare, a case study on data analytics and machine learning accuracy, a survey of data modeling and data analytics from a big data perspective, big data analytics for business intelligence in accounting and audit, and a knowledge-based approach for big data analytics in immunology. Section 2 focuses on big data methods, describing an integrated real-time big data stream sentiment analysis service, the influence of big data analytics in the industry, big data usage in the marketing information system, a review of big data for organizations, and application research of big data technology in the audit field. Section 3 focuses on data mining methods, describing a short review of classification algorithms accuracy for data prediction in data mining applications, different data mining approaches based on medical text data, the benefits and challenges of data mining in electronic commerce, and research on realization of petrophysical data mining based on big data technology. Section 4 focuses on information processing methods, describing the application of spatial digital information fusion technology in information processing of national traditional sports, the effects of quality and quantity of information processing on design coordination performance, a neural network optimization method and its application in information processing, and information processing features that can detect behavioral regimes of dynamical systems.
SECTION 1: DATA ANALYTICS METHODS
Chapter 1
Data Analytics in Mental Healthcare
Ayesha Kamran Ul Haq 1, Amira Khattak 2, Noreen Jamil 1, M. Asif Naeem 1,3, and Farhaan Mirza 3

1 National University of Computer and Emerging Sciences, Islamabad, Pakistan
2 Prince Sultan University, Riyadh, Saudi Arabia
3 Auckland University of Technology, Auckland, New Zealand
ABSTRACT

Worldwide, about 700 million people are estimated to suffer from mental illnesses. In recent years, due to the extensive growth rate in mental disorders, it is essential to better understand the inadequate outcomes of mental health problems. Mental health research is challenging given the perceived limitations of ethical principles such as the protection of autonomy, consent, threat, and damage. In this survey, we aimed to investigate studies where big data approaches were used in mental illness and treatment. Firstly, different types of mental illness, for instance, bipolar disorder, depression, and personality disorders, are discussed. The effects of mental health on user behavior, such as suicide and drug addiction, are highlighted. A description of the methodologies and tools is presented to predict the mental condition of the patient under the supervision of artificial intelligence and machine learning.

Citation: Ayesha Kamran Ul Haq, Amira Khattak, Noreen Jamil, M. Asif Naeem, Farhaan Mirza, "Data Analytics in Mental Healthcare", Scientific Programming, vol. 2020, Article ID 2024160, 9 pages, 2020. https://doi.org/10.1155/2020/2024160.

Copyright: © 2020 by Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION

Recently the term "big data" has become exceedingly popular all over the world. Over the last few years, big data has started to set foot in the healthcare system. In this context, scientists have been working on improving public health strategies, medical research, and the care provided to patients by analyzing big datasets related to their health. Data comes from different sources, such as providers (pharmacy and patient history) and nonproviders (cell phones and internet searches). One of the outstanding possibilities of large-scale data utilization is evident in the healthcare industry. Healthcare organizations have a large quantity of information available to them, and a big portion of it is unstructured and clinically relevant. The use of big data is expected to grow in the medical field, and it will continue to pose lucrative opportunities for solutions that can help in saving patients' lives.

Big data needs to be interpreted correctly in order to predict future data so that the final result can be estimated. To solve this problem, researchers are working on AI algorithms that can analyze huge quantities of raw data and extract useful information from them. A variety of AI algorithms are used to predict patient disease by observing past data, and a variety of wearable sensors have been developed to capture both physical and social interactions in practice.

The mental health of a person is reflected in the degree of affective disorder, which can result in major depression and different anxiety disorders. Many conditions are recognized as mental disorders, including anxiety disorder, depressive disorder, mood disorder, and personality disorder. There are many mobile apps and smart devices, such as smartwatches and smart bands, that extend healthcare facilities into mobile mental healthcare systems. Personalized psychiatry also plays an important role in predicting bipolar disorder and improving diagnosis and optimized treatment. However, many of these smart techniques are not pursued due to a lack of resources, especially in underdeveloping countries. For example, in Pakistan, 0.1% of the government health budget is spent on the mental health system. There is a need for an affordable solution to detect depression in Pakistan so that everyone can afford to pay attention to it.

Researchers are working on many machine learning algorithms to analyze raw data and deduce meaningful information. It is now impossible to manage data in healthcare with traditional database management tools, as data volumes have reached terabytes and petabytes. In this survey, we analyze different issues related to mental healthcare and the usage of big data. We analyze different mental disorders such as bipolar disorder, opioid use disorder, personality disorder, different anxiety disorders, and depression. Social media is one of the biggest and most powerful resources for data collection, as 9 out of 10 people use social networking sites nowadays. Twitter is the main focus of interest for most researchers, as people write 500,000 tweets per minute on average. Twitter is used for sentiment analysis and opinion mining in the business field in order to check the popularity of a product by observing customer tweets. We have a lot of structured and unstructured data; in order to reach any decision, the data must be processed and stored in a consistent structure. We analyze and compare the working of different storage models under different conditions, such as MongoDB and Hadoop, which are two different approaches to storing large amounts of data. Hadoop builds on cloud computing and helps to accomplish different operations on distributed data in a systematic manner.

The rest of this survey discusses mental health problems and big data in four further sections. The second section describes related work regarding mental healthcare and the latest research on it. The third section describes different types of mental illness and their solutions within data science. The fourth section describes the different illegal issues faced by mental patients and the early detection of these types of activities. The fifth section describes different approaches of data science towards mental healthcare systems, such as different training and testing methods of health data for early prediction, including supervised and unsupervised learning methods and artificial neural networks (ANN).
LITERATURE REVIEW

There are many mental disorders, such as bipolar disorder, depression, and different forms of anxiety. Bauer et al. [1] conducted a paper-based survey in which 1222 patients from 17 countries participated to detect bipolar disorder in adults. The survey was translated into 12 different languages, with the limitation that it did not contain any question about technology usage in older adults. According to Bauer et al. [1], digital treatment is not suitable for older adults with bipolar disorder.

Researchers are also working on a method of tremendous interest: assessing the personality of a person just by looking at the way he or she uses a mobile phone. De Montjoye [2] collected a dataset from a US research university and created a framework that analyzed phone calls and text messages to assess the personality of the user. Participants who made 300 calls or texts per year failed to complete the personality measures; the authors chose an optimal sample size of 69, with mean age = 30.4, S.D. = 6.1, and 1 missing value. Similarly, Bleidorn and Hopwood [3] adopted a comprehensive machine learning approach to test the personality of the user using social media and digital records. The nine main recommendations provided by the researchers for how to amalgamate machine learning techniques enhance big five personality assessments, and focusing on the finer details of user behavior helps to interpret and validate the results.

Digital mental health has been revolutionized and its innovations are growing at a high rate. The National Health Service (NHS) has recognized its importance in mental healthcare and is looking for innovations that provide services at low cost. Hill et al. [4] presented a study of challenges and considerations in innovations in digital mental healthcare. They also suggested collaboration between clinicians, industry workers, and service users so that these challenges can be overcome and successful e-therapies and digital apps can be developed.

There are many mobile apps and smart devices, such as smartwatches, smart bands, and shirts, that extend healthcare facilities in the mobile healthcare system. A variety of wearable sensors have been developed to deal with both physical and social interactions in practice. Combining artificial intelligence with healthcare systems extends healthcare facilities to the next level. Dimitrov [5] conducted a systematic survey on the mobile internet of things, covering devices that allow businesses to emerge, spread productivity improvements, lower costs, and improve the customer experience. Similarly, Monteith et al. [6] performed a paper-based survey on clinical data mining to analyze different data sources for psychiatry data and identified optimized precedence opportunities for psychiatry.

One machine learning approach, the artificial neural network (ANN), is based on a three-layer architecture. Kellmeyer [7] introduced a way to secure big brain data from clinical and consumer-directed neurotechnological devices using ANNs, but this model needs to be trained on a huge amount of data to get accurate results. Jiang et al. [8] designed and developed a wearable device with multisensing capabilities, including audio sensing, behavior monitoring, and environmental and physiological sensing, that evaluated speech information and automatically deleted raw data. Tested students were split into two groups according to their scores, and participants were required to wear the device to ensure the authenticity of the data. One of the major challenges of enabling IoT in such devices, however, is safe communication. Yang et al. [9] developed an IoT-enabled wearable device for mental wellbeing, with external equipment to record speech data. This portable device can recognize motion, pressure, and the physiological status of a person.

Many technologies produce tracking data, such as smartphones, credit cards, websites, social media, and sensors, each offering benefits. Monteith and Glenn [10] elaborated on several kinds of data generated by human-made algorithms: searching for disease symptoms, visiting disease websites, sending and receiving healthcare e-mail, and sharing health information on social media. Based on the perceived data, the system makes automated decisions without the involvement of the user in order to maintain security.

Considering all the above issues, there is a need for proper treatment of a disordered person. The mood of the patient is one of the parameters used to detect his or her mental health, and public mood is hugely reflected in social media as almost everyone uses social media in this modern era. Goyal [11] introduced a procedure in which tweets are filtered for specific keywords from saved databases regarding the food price crisis. The data is trained using two algorithms, K-nearest neighbor and Naïve Bayes, for unsupervised and supervised learning, respectively. Cloud storage is the best option to store huge amounts of unstructured data: Kumar and Bala [12] proposed functionalities of Hadoop for automatic processing and storage of big data, while Dhaka and Johari [13] presented an implementation of the big data tool MongoDB for analyzing statistics related to world mental healthcare, in which the data is further analyzed using genetic algorithms for different mental disorders and deployed again in MongoDB for extracting the final data. All of the above methods are of little use without user involvement, however. De Beurs et al. [14] introduced expert-driven methods, intervention mapping, and scrum methods which may help to increase the involvement of users; this approach tried to develop user-focused design strategies for the growth of web-based mental healthcare under finite resources. Turner et al. [15] elaborated in their article that the volume of available big data doubles in size every two years and is increasingly used in automated decision-making. Passos et al. [16] believed that the long-established connection between doctor and patient will change with the establishment of big data and machine learning models: an ML algorithm can allow an affected person to observe his fitness from time to time and can alert the doctor about his current condition if it becomes worse. Early consultation with the doctor could prevent a bigger loss for the patient.

If a psychiatric disease is not predicted or handled early, the patient can be pushed towards harmful activities such as suicide, as most suicide attempts are related to mental disorders. Kessler et al. [17] proposed a meta-analysis that focused on suicide incidence within 1 year of self-harm using machine learning algorithms. They analyzed past reports of suicide patients and concluded that prediction was very difficult due to the short duration of psychiatric hospitalizations. Although a number of AI algorithms are used to estimate patient disease by observing past data, the focus of these studies was suicide prediction by setting up a threshold, and defining a threshold is a very crucial point, sometimes even impossible. Cleland et al. [18] reviewed many studies but were unable to discover principles to clarify the threshold. The authors used a random-effects model to generate a meta-analytic ROC. On the basis of the correlation results, it is stated that depression prevalence is a mediating factor between economic deprivation and antidepressant prescribing.

Another side effect of mental disease is drug addiction, and early prediction of drug use is possible by analyzing user data. Opioids are a severe type of drug. Hasan et al. [19] explored the Massachusetts All Payer Claims Database (MA APCD) dataset and examined how naïve users develop opioid use disorder; popular machine learning algorithms were tested to predict the risk of such a dependency in patients. Perdue et al. [20] predicted the ratio of drug abusers by comparing Google Trends data with Monitoring the Future (MTF) data in a well-structured study. It is concluded that Google Trends and MTF data together provide support for detecting drug abuse.
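To make the kind of supervised text classification used in several of these studies more concrete (for example, the keyword-filtered tweet classification of Goyal [11]), the following minimal Python sketch trains a Naïve Bayes and a K-nearest-neighbor classifier on a handful of toy tweets with scikit-learn. The tweets, labels, and parameter values are illustrative assumptions only and do not reproduce any of the cited studies; both algorithms are shown here in a plain supervised setting for simplicity.

```python
# Minimal sketch of supervised tweet classification with Naive Bayes and k-NN.
# The toy tweets, labels, and parameters are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

tweets = [
    "feeling hopeless and exhausted again",
    "great day out with friends, feeling good",
    "can't sleep, everything feels pointless",
    "really enjoying the new job so far",
]
labels = ["negative", "positive", "negative", "positive"]  # hypothetical annotations

# TF-IDF features feed both classifiers through small pipelines.
nb_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
knn_model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))

nb_model.fit(tweets, labels)
knn_model.fit(tweets, labels)

new_tweet = ["feeling good about tomorrow"]
print("Naive Bayes prediction:", nb_model.predict(new_tweet)[0])
print("k-NN prediction:       ", knn_model.predict(new_tweet)[0])
```

In a real study the labeled set would be far larger and the model would be evaluated on held-out tweets rather than inspected on a single example.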
MENTAL ILLNESS AND ITS TYPE

Depression and Bipolar Disorder

Bipolar disorder is also known as the worst form of depression. In Table 1, Bauer et al. [1] conducted a survey to check for bipolar disorder in adults. Data was collected from 187 older adults and 1021 younger adults, with missing observations excluded. The survey contained 39 questions and took 20 minutes to complete. Older adults with bipolar disorder used the internet less regularly than the younger ones. As most healthcare services are now available only online and most digital tools and devices have evolved, the survey has the limitation that it did not contain any question about technology usage in older adults. There is a need for proper treatment of a disordered person, and the mood of the patient is one of the parameters used to detect his or her mental health. Table 1 also describes another approach, personality assessment using machine learning, that focuses on other aspects like systematic fulfillment and argues to enhance the validity of the machine learning (ML) approach. Technological advancement in the medical field will promote personalized treatments. A lot of work has been done in the field of depression detection using social networks.

Table 1: Types of mental illness and the role of big data
Authors | Discipline(s) reviewed | Keywords used to identify papers for review | Methodology | Number of papers reviewed | Primary findings
Bauer et al. [1] | Bipolar disorder | Bipolar disorder, mental illness, health literacy | Paper-based survey | 68 | 47% of older adults used the internet versus 87% of younger adults having bipolar disorder
Dhaka and Johari [13] | Mental disorder | Mental health, disorders, MongoDB tool | Genetic algorithm and MongoDB | 19 | Analyzing and storing a large amount of data on MongoDB
Hill et al. [4] | Mental disorder | Mental health, collaborative computing, e-therapies | (i) Online CBT platform; (ii) collaborative computing | 33 | Developing a smartphone application for mental disorders and for improving e-therapies
Kumar and Bala [12] | Depression detection through social media | Big data, Hadoop, sentiment analysis, social networks, Twitter | Sentiment analysis with data saved on Hadoop | 14 | Analyzing Twitter users' views on a particular business product
Kellmeyer [7] | Big brain data | Brain data, neurotechnology, big data, privacy, security, machine learning | (i) Machine learning; (ii) consumer-directed neurotechnological devices; (iii) combining expert knowledge with a bottom-up process | 77 | (i) Maximizing medical knowledge; (ii) enhancing the security of devices and sheltering the privacy of personal brain data
De Montjoye [2] | Mobile phone and user personality | Personality prediction, big data, big five personality prediction, carrier's log, CDR | (i) Entropy: detecting different categories; (ii) interevent time: frequency of calls or texts between two users; (iii) AR coefficients: converting the list of calls and texts into a time series | 31 | Analyzing phone calls and text messages under a five-factor model
Furnham [21] | Personality disorder | Dark side, big five, facet analysis, dependence, dutifulness | Hogan 'dark side' measure (HDS) concept of dependent personality disorder (DPD) | 34 | All of the personality disorders are strongly negatively associated with agreeableness (a kind, sympathetic, and cooperative personality type)
Bleidorn and Hopwood [3] | Personality assessment | Machine learning, personality assessment, big five, construct validation, big data | (i) Machine learning; (ii) prediction models; (iii) K-fold validation | 65 | Focusing on other aspects like systematic fulfillment and arguing to enhance the validity of the machine learning (ML) approach
The main goal of personalized psychiatry is to predict bipolar disorder and improve diagnosis and optimized treatment. To achieve these goals, it is necessary to combine the clinical variables of a patient; Figure 1 describes the integration of all these variables. It is now impossible to manage data in mental healthcare with traditional database management tools, as data is now in terabytes and petabytes. There is therefore a strong need to introduce big data analytics tools and techniques to deal with such big data in order to improve the quality of treatment, so that the overall cost of treatment can be reduced throughout the world.
Figure 1: Goals of personalized treatment in bipolar disorder [22].
MongoDB is one of the tools for handling big data. The data is further analyzed using genetic algorithms for different mental disorders and deployed again in MongoDB for extracting the final data. This approach of mining data and extracting useful information reduces the overall cost of treatment and provides good support for clinical decisions. It helps doctors to give more accurate treatment for several mental disorders in less time and at lower cost, using information extracted by the big data tool MongoDB and the genetic algorithm. Some of the techniques in Table 1 handle and store huge amounts of data using the MongoDB tool, and researchers are working to predict a patient's mental condition before a severe mental stage is reached. Some devices therefore implement a complete detection process that tracks the present condition of the user by analyzing his or her daily routine. There remains a need for affordable solutions that detect the disabling stage of a mental patient more precisely and quickly.
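The following brief sketch illustrates the general pattern of storing and summarizing mental health records in MongoDB with PyMongo. It assumes a locally running MongoDB server, and the database, collection, and field names (mental_health.screenings, phq9_score, and so on) are hypothetical; it is not the pipeline used by Dhaka and Johari [13].

```python
# Sketch: storing screening records in MongoDB and summarizing them per disorder.
# Assumes a local MongoDB server; database/collection/field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
screenings = client["mental_health"]["screenings"]

screenings.insert_many([
    {"patient_id": "P001", "disorder": "bipolar",    "phq9_score": 14, "age": 63},
    {"patient_id": "P002", "disorder": "depression", "phq9_score": 19, "age": 29},
    {"patient_id": "P003", "disorder": "depression", "phq9_score": 9,  "age": 41},
])

# Average screening score and patient count per disorder.
pipeline = [
    {"$group": {
        "_id": "$disorder",
        "avg_score": {"$avg": "$phq9_score"},
        "patients": {"$sum": 1},
    }}
]
for row in screenings.aggregate(pipeline):
    print(row["_id"], round(row["avg_score"], 1), row["patients"])
```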
Personality Disorder

Dutifulness is a type of personality disorder in which patients are overstressed about a disease that is not actually very serious. People with this type of disorder tend to work hard to impress others. A survey was conducted to find the relationship between normal and dutiful personalities. Other researchers are working on a method of tremendous interest: checking the personality of a person just by looking at the way he or she uses a mobile phone. This approach provides cost-effective and questionnaire-free personality detection through mobile phone data, performing personality assessment without conducting any digital survey on social media. Performing all nine main aspects of construct validation in real time is not easy for researchers. This examination, like several others, has limitations: it is just a sample, which has implications for generalization when it is used in a near-real-time scenario.
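As a concrete illustration of this questionnaire-free idea, the sketch below computes two of the feature families reported for this line of work in Table 1, contact entropy and inter-event times, from a hypothetical call log using pandas. The log and its column names are assumptions made for illustration, not the original study's data.

```python
# Sketch: simple call-log features of the kind used for questionnaire-free
# personality assessment (contact entropy, median inter-event time).
# The call log and its column names are hypothetical.
import numpy as np
import pandas as pd

log = pd.DataFrame({
    "user":    ["u1", "u1", "u1", "u1", "u2", "u2"],
    "contact": ["a",  "b",  "a",  "c",  "a",  "a"],
    "timestamp": pd.to_datetime([
        "2020-01-01 09:00", "2020-01-01 12:30", "2020-01-02 09:15",
        "2020-01-03 20:00", "2020-01-01 10:00", "2020-01-05 10:00",
    ]),
})

def contact_entropy(contacts):
    """Shannon entropy of how calls are spread across contacts."""
    p = contacts.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

def median_interevent_hours(timestamps):
    """Median gap between consecutive calls, in hours."""
    gaps = timestamps.sort_values().diff().dropna()
    return float(gaps.dt.total_seconds().median() / 3600) if len(gaps) else float("nan")

features = log.groupby("user").agg(
    entropy=("contact", contact_entropy),
    median_gap_h=("timestamp", median_interevent_hours),
)
print(features)
```

Feature vectors of this kind would then be fed to an ordinary classifier or regression model against big five questionnaire scores.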
EFFECTS OF MENTAL HEALTH ON USER BEHAVIOR

Mental illness brings an upswing in feelings of helplessness, danger, fear, and sadness. People do not understand their current situation, and this can push psychiatric patients towards illegal activities. Table 2 describes some of the issues that appear because of mental disorders, such as suicide, drug abuse, and opioid use.
Table 2: Side effects of mental illness and their solution through data science

Authors | Side effects of mental disorder | Tools/techniques | Primary findings
Kessler et al. [17] | Suicide and mental illness | Machine learning algorithm | Predicting suicide risk at hospitalization
Cleland et al. [18] | Antidepressant usage | Clustering analysis based on behavior and disease | Identifying the correlation between antidepressant usage and deprivation
Perdue et al. [20] | Drug abuse | (i) Google search history; (ii) Monitoring the Future (MTF) | Providing real-time data that may allow us to predict drug abuse tendency and respond more quickly
Hasan et al. [19] | Opioid use disorder | (i) Feature engineering; (ii) logistic regression; (iii) random forest; (iv) gradient boosting | Suppressing the increasing rate of opioid addiction using machine learning algorithms
Suicide

Suicide is very common in underdeveloped countries. According to researchers, someone dies by suicide every 40 seconds around the world, and there are some areas of the world where mental disorder and suicide statistics are relatively higher than in other areas. Psychiatrists say that 90% of people who died by suicide faced a mental disorder. Electronic medical records and big data can be used to predict suicide risk with machine learning algorithms. Such algorithms can be used to predict suicides in depressed persons; it is hard to estimate how accurately they perform, but they may help a consultant to pretreat patients based on early prediction.

Various studies depict the fact that a range of factors, such as a high level of antidepressant prescribing, contribute to such prevalence of illness. Some people start antidepressant medicine to overcome mental affliction. In Table 2, Cleland et al. [18] explored three main factors, i.e., economic deprivation, depression prevalence, and antidepressant prescribing, and their correlations. Several statistical tools could be used for creation of the analysis pipeline, such as Jupyter Notebook, Pandas, NumPy, Matplotlib, Seaborn, and ipyleaflet. Correlations are analyzed using Pearson's correlation coefficients and p values. The analysis shows a strong correlation between economic deprivation and antidepressant prescribing, whereas it shows a weak correlation between economic deprivation and depression prevalence.
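A small illustrative sketch of how such correlations can be computed with pandas and SciPy is shown below; the area-level numbers are invented for illustration and do not reproduce the results of Cleland et al. [18].

```python
# Sketch: Pearson correlations between deprivation, depression prevalence,
# and antidepressant prescribing across areas. Values are invented.
import pandas as pd
from scipy.stats import pearsonr

areas = pd.DataFrame({
    "deprivation_index":        [12.1, 25.4, 31.0, 18.7, 40.2, 22.9],
    "depression_prevalence":    [0.09, 0.11, 0.12, 0.10, 0.13, 0.11],
    "antidepressant_rx_per_1k": [55.0, 80.0, 95.0, 68.0, 120.0, 77.0],
})

pairs = [
    ("deprivation_index", "antidepressant_rx_per_1k"),
    ("deprivation_index", "depression_prevalence"),
    ("depression_prevalence", "antidepressant_rx_per_1k"),
]
for x, y in pairs:
    r, p = pearsonr(areas[x], areas[y])
    print(f"{x} vs {y}: r = {r:.2f}, p = {p:.3f}")
```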
Drug Abuse

People take drugs voluntarily, but most of them become addicted in order to get rid of all their problems and feel relaxed. Adderall, Salvia divinorum, Snus, synthetic marijuana, and bath salts are examples of novel drugs. Opioids are a category of drug that includes the illegitimate drug heroin. Hasan et al. [19] compared four machine learning algorithms, logistic regression, random forest, decision tree, and gradient boosting, to predict the risk of opioid use disorder. Random forest is one of the best classification methods among machine learning algorithms, and it is found that in such situations random forest models outperform the other three algorithms, especially for determining the important features. Another approach predicts drug abusers using the search history of the user. Perdue et al. [20] predicted the ratio of drug abusers by comparing Google Trends data with Monitoring the Future (MTF) data in a well-structured study, concluding that Google Trends and MTF data together provide support for detecting drug abuse. Google Trends appears to be a particularly useful data source regarding novel drugs, because Google is the first place many users, especially adults, go for information on topics with which they are unfamiliar. Google Trends tends not to predict heroin abuse; the reason may be that heroin is uniquely more dangerous than other drugs. According to Granka [23], internet searches can be understood as behavioral measures of an individual's interest in an issue. Unfortunately, this technique is not always convenient, as drug abuse researchers are unable to predict drug abuse successfully because of sparse data.
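To illustrate the kind of model comparison reported for Hasan et al. [19], the sketch below benchmarks the same four classifier families on synthetic data with scikit-learn; the features, data, and resulting scores are stand-ins rather than the MA APCD claims data or the published results.

```python
# Sketch: comparing four classifiers for opioid-use-disorder risk prediction,
# mirroring the algorithm line-up in Hasan et al. [19]. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for engineered claims features (prescription counts, dosages, ...).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>20s}: mean ROC AUC = {auc:.3f}")
```

Cross-validated ROC AUC is a common yardstick for such risk models because the classes in real claims data are highly imbalanced.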
HOW DATA SCIENCE HELPS TO PREDICT MENTAL ILLNESS?

Currently, numerous mobile clinical devices are embedded in patients' personal body networks and medical devices. They receive and transmit massive amounts of heterogeneous fitness records to healthcare
statistics structures for patient evaluation. In this context, machine learning and data mining strategies have become extremely important in many real-life problems, and many of these techniques have been developed for processing health data on mobile devices. There is a lot of data in the world of medicine, coming from different sources such as pharmacies, patient histories, and non-providers (cell phones and internet searches). Big data needs to be interpreted in order to predict future outcomes, test hypotheses, and draw conclusions. Psychiatrists should be able to evaluate results from research studies and commercial analytical products that are based on big data.
Artificial Intelligence and Big Data

Big data collected from wearable tracking devices and electronic records accumulates into extensive stores of information. Smart mobile apps support fitness and health education, heart attack prediction, ECG measurement, emotion detection, symptom tracking, and disease management, and they can improve the connection between patients and doctors. Once a patient's data from different sources is organized into a proper structure, artificial intelligence (AI) algorithms can be applied. AI recognizes patterns, finds similarities between them, and makes predictive recommendations based on what happened to others in the same condition. Techniques used for healthcare data processing can be broadly categorized into two classes: non-artificial-intelligence systems and artificial-intelligence systems. Although non-AI techniques are less complex, they suffer from a lack of convergence, which gives less accurate results than AI techniques; for this reason, AI methods are generally preferable to non-AI techniques. In Table 3, Dimitrov [5] combined artificial intelligence with IoT technology in existing healthcare apps so that the connection between doctors and patients remains balanced. Disease prediction is also possible through machine learning. Figure 2 shows the hierarchical structure of AI, ML, and neural networks.
Table 3: Data analytics and predicting mental health (authors; tool/technology; methodology; purpose of finding; strength; weakness)

Dimitrov [5]. Tool/technology: (i) sensing technology; (ii) artificial intelligence. Methodology: emergence of the medical internet of things (mIoT) in existing mobile apps. Purpose of finding: providing benefits to the customers, (i) avoiding chronic and diet-related illness, (ii) improving lifestyles in real-time decision-making. Strength: (i) achieving improved mental health, (ii) improving cognitive function. Weakness: adding up the garbage data to the sensors.

Monteith et al. [6]. Tool/technology: survey-based approach. Methodology: clinical data mining. Purpose of finding: analyzing different data sources to get psychiatry data. Strength: optimized precedence opportunities for psychiatry. Weakness: N/A.

Kellmeyer [7]. Tool/technology: neurotechnology. Methodology: (i) machine learning, (ii) consumer-directed neurotechnological devices, (iii) combining expert knowledge with a bottom-up process. Purpose of finding: enhancing the security of devices and sheltering the privacy of personal brain data. Strength: maximizing medical knowledge. Weakness: the model needs a huge amount of training data, as brain disease is rarely captured.

Yang et al. [9]. Tool/technology: long-term monitoring wearable device with internet of things. Methodology: well-being questionnaires with a group of students. Purpose of finding: developing app-based devices linked to android phones and servers for data visualization and environment sensing. Strength: perfectly working on long-term data. Weakness: offline data transfer instead of real time.

Monteith and Glenn [10]. Tool/technology: automated decision-making. Methodology: hybrid algorithm that combines the statistical focus and data mining. Purpose of finding: tracking day-to-day behavior of the user by automatic decision-making. Strength: automatically detecting human decisions without any input. Weakness: how to ignore irrelevant information is a key headache.

De Beurs et al. [14]. Tool/technology: online intervention. Methodology: expert-driven method, intervention mapping, Scrum. Purpose of finding: standardizing the level of user involvement in the web-based healthcare system. Strength: increasing user involvement under limited resources. Weakness: deciding the threshold for user involvement is problematic.

Kumar and Bala [12]. Tool/technology: Hadoop. Methodology: doing sentiment analysis and saving data on Hadoop. Purpose of finding: analyzing twitter users' views on a particular business product. Strength: checking out the popularity of a particular service. Weakness: usage of two programming languages needs experts.

Goyal [11]. Tool/technology: KNN and Naïve Bayes classifier. Methodology: text mining and a hybrid approach combining KNN and Naïve Bayes. Purpose of finding: opinion mining of tweets related to the food price crisis. Strength: cost-effective way to predict prices. Weakness: data needs to be cleaned before training.
Figure 2: AI and ML [24].
One machine learning algorithm, the artificial neural network (ANN), is based on a three-layer architecture. Kellmeyer [7] introduced a way to secure big brain data from neurotechnological devices using an ANN. Such an algorithm works on a huge amount of training data to predict accurate results, but patients' brain diseases are rare, so training models on small datasets may produce imprecise results. Machine learning models are data hungry: to obtain accurate results as output, more training data with distinct features is needed as input. These new methods cannot yet be applied to clinical data because of limited economic resources.
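A minimal sketch of a three-layer network (input, one hidden layer, output) is shown below using scikit-learn. The synthetic data stands in for clinical data, which is assumed not to be available here; the layer size and other settings are illustrative choices, not the configuration used in the cited work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for scarce clinical data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 16 units gives the classic three-layer architecture.
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
ann.fit(X_train, y_train)
print("Test accuracy:", ann.score(X_test, y_test))
```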
Prediction through Smart Devices

Various monitoring wearable devices (Table 3) are available that continuously capture the finer details of behavior and provide important cues about fear and autism. This information helps to recognize mental issues of the users of those devices. Participants were monitored continuously for a month. High-level computation performed on voice requires high-complexity data as well as high computational power, which puts huge pressure on a small chip; in order to overcome power issues, a relatively low frequency was chosen. Yang et al. [9] built an audio well-being device and conducted a survey in which participants had to speak for more than 10 minutes in a quiet room. The first step was to check the validity of the sample by giving some questionnaires (including STAI, NEO-FFI, and AQ) to the participants. In order to determine whether they were suitable for the experiment, a test was conducted based on the AQ questions, and a classification algorithm was applied to the AQ data. This type of device has one advantage: it
worked well on long-term data rather than short-term data, but it used offline data transfer instead of real-time transfer. Although it has different sensors, garbage data being added to the sensors is an obvious risk. This is an application that offers on-hand record management using mobile/tablet technology once security and privacy are confirmed. To increase the reliability of IoT devices, there is a need to increase the sample size with different age groups in a real-time environment to check the validity of the experiment. There are many technologies that produce tracking data, such as smartphones, credit cards, social media, and sensors, and this paper discusses some of the existing work that tackles such data. In Table 3, one of the approaches is a human-made algorithm; searching for disease symptoms, visiting disease websites, sending and receiving healthcare e-mail, and sharing health information on social media all generate this kind of data. These are some examples of activities that play key roles in producing medical data.
Role of Social Media to Predict Mental Illness

The constant mood of a patient is one of the parameters for detecting his or her mental health. According to Lenhart et al. [25], almost four out of five internet users use social media. In Table 3, researchers used twitter data to get online user reviews that help a seeker check the popularity of a particular service or product; in order to collect people's opinions on Airtel, they analyzed tweets about it. Keyword filtering is done by content and by location, where location filters work on a specific bounding region. First, special characters, URLs, spam, and short words are removed from the tweets. Second, the remaining words are tokenized and a TF-IDF score is calculated for all the keywords. After the data is cleaned, the K-nearest-neighbor and Naïve Bayes classification algorithms are applied to the text in order to extract features. The hybrid classification system provides 76.31% accuracy, whereas Naïve Bayes alone reaches 66.66%. In the end, an automated system is designed for opinion mining. Another point of consideration is that Twitter data is unstructured, and handling such a huge amount of unstructured data is a tedious task; due to the lack of a schema, it is difficult to handle and store, and storage is needed for a significant amount of data during processing. Cloud storage is the best option for such material. The entire program is designed in Python so that it can
catch all possible outcomes. Hadoop works on cloud computing, which helps accomplish different operations on distributed data in a systematic manner. The success rate of the above approach was around 70%, but the authors implemented it in two programming languages, Python for extracting tweets and Java for training the data, which requires expert programmers in each language. Such an approach can help doctors give more accurate treatment for several mental disorders in less time and at lower cost. In effect, it provides pre-detection of depression, which may save the patient from reaching the worst stage of mental illness.
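A minimal sketch of the cleaning, TF-IDF, and KNN/Naïve Bayes steps described above is given below in Python with scikit-learn. The example tweets and sentiment labels are made up for illustration; the original pipeline's exact filters, data collection, and Hadoop storage are not reproduced here.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

# Made-up tweets and labels (1 = positive, 0 = negative).
tweets = ["Great network coverage today!",
          "Worst service ever, so slow http://t.co/x",
          "Happy with the new plan",
          "Support never replies :("]
labels = [1, 0, 1, 0]

def clean(text):
    text = re.sub(r"http\S+", "", text)        # remove URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # remove special characters
    return " ".join(w for w in text.lower().split() if len(w) > 2)  # drop short words

vec = TfidfVectorizer()                         # tokenize and compute TF-IDF scores
X = vec.fit_transform([clean(t) for t in tweets])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
nb = MultinomialNB().fit(X, labels)
print(knn.predict(X), nb.predict(X))
```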
Key Challenges to Big Data Approach

(i) Big data has many ethical issues related to privacy, reuse without permission, and involvement of rival organizations.
(ii) To work in diverse areas, big data requires collaboration with experts in the relevant fields, including physicians, biologists, and developers, which is a crucial part of it. Data mining algorithms can be used to observe or predict data more precisely than traditional clinical trials.
(iii) People may feel hesitant to describe everything to their doctors. One solution for estimating severe mental illness ahead of time is automated decision-making without human input, as shown in Table 3. It collects data from our everyday behavior in the digital economy. The key role of digital provenance must be considered in order to understand the difficulties that technology may create for people with mental illness.
(iv) There are many security issues when sensitive information is discussed online, as data may be revealed, so a new approach that provides privacy protection as well as decision-making from big data through new technologies needs to be introduced.
(v) Also, if online data is used to predict user personality, then keeping the data secured and protected from hackers is a big challenge. A lot of cheap solutions exist, but they are not reliable, especially from a user's perspective.
(vi) A major challenge for enabling IoT in devices is communication; all of the above methods are useless without user involvement. The user is one of the main parts of the experiment, especially if the user's personal or live data is required. Although many web-based inventions related to mental health are being released, active participation by end users remains limited. In Table 3, an expert-driven method is introduced that is based on intervention mapping and scrum methods; it may help to increase the involvement of users, but getting all users actively involved in the web-based healthcare system is problematic.
(vii) When deciding on the level of user involvement, there is a need to balance user input against the accessibility
of resources. This requires an active role from technology companies and efficient use of time. Further research should provide direction on how to select the best and most optimized user-focused design strategies for the development of web-based mental health services under limited resources.
CONCLUSIONS

Big data are being used for mental health research in many parts of the world and for many different purposes. Data science is a rapidly evolving field that offers many valuable applications to mental health research, examples of which we have outlined in this perspective. We discussed different types of mental disorders and reasonable, affordable, and practical solutions to enhance mental healthcare facilities. Currently, the digital mental health revolution is advancing beyond the pace of scientific evaluation, and it is very clear that clinical communities need to catch up. Various smart healthcare systems and devices have been developed that reduce the death rate of mental patients and, through early prediction, keep patients from engaging in illegal activities. This paper examines different prediction methods. Various machine learning algorithms are popular for training on data in order to predict future data; random forest, Naïve Bayes, and k-means clustering are popular ML algorithms. Social media is one of the best sources of data gathering, as the mood of the user also reveals his or her psychological behavior. In this survey, various advances in data science and their impact on smart healthcare systems are the points of consideration. It is concluded that there is a need for a cost-effective way to predict mental condition instead of relying on costly devices. Twitter data, both saved and live tweets, is accessible through an application program interface (API). In the future, connecting the twitter API with Python and then applying sentiment analysis to the 'posts,' 'liked pages,' 'followed pages,' and 'comments' of a twitter user will provide a cost-effective way to detect depression in target patients.
ACKNOWLEDGMENTS The authors are thankful to Prince Sultan University for the financial support towards the publication of this paper.
REFERENCES
1. R. Bauer, T. Glenn, S. Strejilevich et al., "Internet use by older adults with bipolar disorder: international survey results," International Journal of Bipolar Disorders, vol. 6, no. 1, p. 20, 2018.
2. Y.-A. De Montjoye, J. Quoidbach, F. Robic, and A. Pentland, "Predicting personality using novel mobile phone-based metrics," in Proceedings of the International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, pp. 48–55, Berlin, Heidelberg, April 2013.
3. W. Bauer and C. J. Hopwood, "Using machine learning to advance personality assessment and theory," Personality and Social Psychology Review, vol. 23, no. 2, pp. 190–203, 2019.
4. C. Hill, J. L. Martin, S. Thomson, N. Scott-Ram, H. Penfold, and C. Creswell, "Navigating the challenges of digital health innovation: considerations and solutions in developing online and smartphone-application-based interventions for mental health disorders," British Journal of Psychiatry, vol. 211, no. 2, pp. 65–69, 2017.
5. D. V. Dimitrov, "Medical internet of things and big data in healthcare," Healthcare Informatics Research, vol. 22, no. 3, pp. 156–163, 2016.
6. S. Monteith, T. Glenn, J. Geddes, and M. Bauer, "Big data are coming to psychiatry: a general introduction," International Journal of Bipolar Disorders, vol. 3, no. 1, p. 21, 2015.
7. P. Kellmeyer, "Big brain data: on the responsible use of brain data from clinical and consumer-directed neurotechnological devices," Neuroethics, vol. 11, pp. 1–16, 2018.
8. L. Jiang, B. Gao, J. Gu et al., "Wearable long-term social sensing for mental wellbeing," IEEE Sensors Journal, vol. 19, no. 19, 2019.
9. S. Yang, B. Gao, L. Jiang et al., "IoT structured long-term wearable social sensing for mental wellbeing," IEEE Internet of Things Journal, vol. 6, no. 2, pp. 3652–3662, 2018.
10. S. Monteith and T. Glenn, "Automated decision-making and big data: concerns for people with mental illness," Current Psychiatry Reports, vol. 18, no. 12, p. 112, 2016.
11. S. Goyal, "Sentimental analysis of twitter data using text mining and hybrid classification approach," International Journal of Advance Research, Ideas and Innovations in Technology, vol. 2, no. 5, 2016.
12. M. Kumar and A. Bala, "Analyzing twitter sentiments through big data," in Proceedings of the 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pp. 2628–2631, New Delhi, India, March 2016.
13. P. Dhaka and R. Johari, "Big data application: study and archival of mental health data, using MongoDB," in Proceedings of the 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp. 3228–3232, Chennai, India, March 2016.
14. D. De Beurs, I. Van Bruinessen, J. Noordman, R. Friele, and S. Van Dulmen, "Active involvement of end users when developing web-based mental health interventions," Frontiers in Psychiatry, vol. 8, p. 72, 2017.
15. V. Turner, J. F. Gantz, D. Reinsel, and S. Minton, "The digital universe of opportunities: rich data and the increasing value of the internet of things," IDC Analyze the Future, vol. 16, 2014.
16. I. C. Passos, P. Ballester, J. V. Pinto, B. Mwangi, and F. Kapczinski, "Big data and machine learning meet the health sciences," in Personalized Psychiatry, pp. 1–13, Springer, Cham, Switzerland, 2019.
17. R. C. Kessler, S. L. Bernecker, R. M. Bossarte et al., "The role of big data analytics in predicting suicide," in Personalized Psychiatry, pp. 77–98, Springer, Cham, Switzerland, 2019.
18. B. Cleland, J. Wallace, R. Bond et al., "Insights into antidepressant prescribing using open health data," Big Data Research, vol. 12, pp. 41–48, 2018.
19. M. M. Hasan, M. Noor-E-Alam, M. R. Patel, A. S. Modestino, G. Young, and L. D. Sanchez, "A novel big data analytics framework to predict the risk of opioid use disorder," 2019.
20. R. T. Perdue, J. Hawdon, and K. M. Thames, "Can big data predict the rise of novel drug abuse?" Journal of Drug Issues, vol. 48, no. 4, pp. 508–518, 2018.
21. A. Furnham, "A big five facet analysis of sub-clinical dependent personality disorder (dutifulness)," Psychiatry Research, vol. 270, pp. 622–626, 2018.
22. E. Salagre, E. Vieta, and I. Grande, "Personalized treatment in bipolar disorder," in Personalized Psychiatry, pp. 423–436, Academic Press, Cambridge, MA, USA, 2020.
23. L. Granka, "Inferring the public agenda from implicit query data," in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, July 2009.
24. V. Sinha, 2019, https://www.quora.com/What-are-the-main-differences-between-artificial-intelligence-and-machine-learning-Is-machine-learning-a-part-of-artificial-intelligence.
25. A. Lenhart, K. Purcell, A. Smith, and K. Zickuhr, "Social media & mobile internet use among teens and young adults," Pew Internet & American Life Project, Washington, DC, USA, 2010.
Chapter 2
Case Study on Data Analytics and Machine Learning Accuracy
Abdullah Z. Alruhaymi, Charles J. Kim Department of Electrical Engineering and Computer Science, Howard University, Washington D.C, USA.
ABSTRACT
The information gained from data analysis is vital for optimizing processes and systems and for more straightforward problem-solving. The first step of data analytics therefore deals with identifying data requirements, mainly how the data should be grouped or labeled. For example, for data about cybersecurity in organizations, grouping can be done into categories such as denial of service (DoS), unauthorized access from a local or remote machine, and surveillance and other probing. Next, after identifying the groups, the researcher or whoever is carrying out the data analytics goes into the field and collects the data. The data collected is then organized in an orderly fashion to enable easy analysis. We aim to study different articles and compare the performance of each algorithm in order to choose the most suitable classifiers.
Citation: Alruhaymi, A. and Kim, C. (2021), "Case Study on Data Analytics and Machine Learning Accuracy". Journal of Data Analysis and Information Processing, 9, 249-270. doi: 10.4236/jdaip.2021.94015.
Copyright: © 2021 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
Keywords: Data Analytics, Machine Learning, Accuracy, Cybersecurity, Performance
INTRODUCTION
Data analytics is a branch of data science that involves the extraction of insights from data to gain a better understanding of it. It entails all the techniques, data tools, and processes involved in identifying trends and measurements that would otherwise be lost in the enormous amount of information available and constantly being generated in the world today. Grouping the dataset into categories is an essential step of the analysis. We then clean up the data by removing any instances of duplication and errors made during its collection. In this step, complete and incomplete data are also identified, and the best technique for handling incomplete data is chosen. Missing values yield an incomplete dataset, which degrades machine learning (ML) algorithms' performance and causes inaccuracy and misinterpretation.
Machine learning has emerged as a problem-solver for many existing problems, and advances in this field support the artificial intelligence (AI) in many applications we use daily in real life. Statistical models and other technologies, however, have been unsuccessful in handling categorical data, dealing with missing values, and coping with very large numbers of data points [1]. All these reasons raise the importance of machine learning technology. Moreover, ML plays a vital role in many applications, e.g., cyber detection, data mining, natural language processing, and even disease diagnostics and medicine; in all these domains, ML offers possible solutions.
Since ML algorithms train on part of a dataset and test on the rest, and since missingness is rarely entirely random, missing elements, especially in the training dataset, can lead to an insufficient capture of the entire population of the complete dataset, which in turn lowers performance on the test dataset. However, if the missing elements are replaced by reasonably close values, the performance for the imputed dataset can be restored nearly to the level of the intact, complete dataset. Therefore, this research investigates the performance variation under different numbers of missing elements and under two missingness mechanisms, missing completely at random (MCAR) and missing at random (MAR).
Therefore, the objectives of this research dissertation are:
1) Investigation of the data analytic algorithms' performance under the full dataset and the incomplete dataset.
2) Imputation of the missing elements in the incomplete dataset by multiple imputation approaches to make imputed datasets.
3) Evaluation of the algorithms' performance with the imputed datasets.
The general distinction among most ML applications is deep learning, and the data itself is the fuel of the process. Data analytics has evolved dramatically over the last two decades, hence more research in this area is inevitable. Given its importance, the outlook for the analytics field and ML is promising. Much research has been conducted on ML accuracy measurements and data analysis, but little has been done on incomplete data, imputed data, and the comparison of their outcomes. We aim to highlight the results drawn from different dataset versions and the missingness observed in the dataset. To create an incomplete dataset, we impose two types of missingness, MCAR and MAR, on the complete dataset, as sketched below. MCAR missingness is created by inputting N/A into some variable entries to create an impression of data missing by design. MAR missingness is also generated by inputting N/A into some cells of variables in the dataset, in a way that relates these incomplete variables to other variables within the dataset that are complete; this brings about the MAR missingness. These incomplete datasets are then imputed using multiple imputation by chained equations (MICE) to make them complete again by removing the two types of missingness. The imputed datasets are then used to train and test machine learning algorithms. Lastly, the performance of the algorithms with the imputed datasets is compared to the performance of the same algorithms with the initially complete dataset.
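A minimal sketch of this workflow is given below: inject MCAR missingness into a complete numeric dataset, impute it, and compare classifier accuracy. The dissertation uses the MICE package in R; scikit-learn's IterativeImputer is used here only as a comparable chained-equations style stand-in, and the synthetic data and 10% missing rate are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=0)

# MCAR: every training cell has the same 10% chance of being deleted.
rng = np.random.default_rng(0)
X_mcar = X_tr.copy()
X_mcar[rng.random(X_mcar.shape) < 0.10] = np.nan

# Chained-equations style imputation of the missing cells.
X_imputed = IterativeImputer(random_state=0).fit_transform(X_mcar)

for name, data in [("complete", X_tr), ("imputed", X_imputed)]:
    clf = RandomForestClassifier(random_state=0).fit(data, y_tr)
    print(name, "accuracy:", clf.score(X_te, y_te))
```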
RESEARCH METHODOLOGY
This article is one chapter of the whole dissertation work, and to achieve the objectives of the dissertation, the following tasks are performed:
1) cyber-threat dataset selection
2) ML algorithms selection
3) investigation and literature review on the missingness mechanisms of MCAR and MAR
4) investigation of imputation approaches
5) application of multiple imputation approaches
6) evaluation of the algorithms under different missing levels (10%, 20%, and 30%) and two missingness mechanisms.
The first two items are discussed below, and the other items are detailed in the succeeding chapters. The proposed workflow of the research is shown in Figure 1. The methodology of this paper is to analyze many online articles and compare their reported accuracies for later use with the selected dataset, and from the many algorithms used we select those we consider best suited to cybersecurity dataset analysis. Four major machine learning algorithms will be utilized to train and test on portions of the imputed dataset to measure their performance in comparison with the complete dataset.
Figure 1: Dissertation workflow.
These will include decision tree, random forest, support vector machines, and naïve Bayes, which are discussed in depth later in the dissertation. To create the imputed dataset, the multiple imputation by chained equations (MICE) method will be used, as we consider it a robust method for handling
missingness and therefore appropriate for this research. This method fills in the missing data values by iterating prediction models: in each iteration, a missing value is imputed by using the already complete variables to predict it.
CYBER-THREAT DATASET SELECTION
The dataset selected for this research is the KDDsubset dataset from the cybersecurity domain, a sample from KDDCUP'99 that consists of 494,021 records (or instances). As shown in Figure 2, attack types represent more than 80% of the cyber dataset, and denial of service (DoS) is the most dangerous kind. The data has 42 columns (features); the last feature (the 42nd) is the label, marked as either normal or attack, with 22 different attack sub-types, as shown in Figure 3 below. Each instance is labelled with only one specific attack, and the figure shows that the classes have different counts, with smurf the most frequent attack. The simulated attacks fall into one of the following four categories: denial of service (DoS), Probe, U2R, or R2L. The last column is the connection type and is either attack or normal. The dataset is publicly available and widely used in academic research; researchers often use the KDDsubset as a sample of the whole KDDCUP'99 dataset, which consists of nearly five million records, because it covers all attack types and is much easier to analyze experimentally. The 41 features are divided into four categories: basic, host, traffic, and content. Feature number 2, for example, named protocol_type, takes only 3 values; the most used protocol type is ICMP, and most of its records are of the attack type, as shown in Figure 4 below.
Figure 2: KDDsubset count of attack and normal.
Figure 3: Count of attack sub-types.
Figure 4: The ICMP protocol type accounts for most of the attacks.
The main problem of the KDDsubset is that it may contain redundant records, which is not ideal when building a data analysis algorithm, as it will make the model biased. Cybersecurity experts have developed advanced techniques to detect cyber-attacks using the full DARPA 1998 dataset for intrusion detection; improved versions of this are the 10% KDDCUP'99, NSL-KDD Cup, and Gure KDDCUP databases. The KDDCUP'99 dataset was used in the Third International
Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment [2]. The KDDCUP'99 dataset contains around 4,900,000 single connection vectors, each of which includes 41 attributes and a category label such as attack or normal [3], with precisely one specified attack type from four main types of attacks:
1) Denial of service (DoS): the use of excess resources denies legitimate requests from legal users of the system.
2) Remote to local (R2L): an attacker with no account gains a legal user account on the victim's machine by sending packets over the network.
3) User to root (U2R): an attacker tries to access restricted privileges of the machine.
4) Probe: attacks that automatically scan a network of computers to gather information or find unknown vulnerabilities.
All 41 features are also grouped into four listed types:
1) Basic features: these characteristics are derived from packet headers without analyzing the payload.
2) Content features: these analyze the actual TCP packet payload using domain knowledge, and they include features such as the number of unsuccessful login attempts.
3) Time-based traffic features: these features are created to capture properties accruing over a 2-second temporal window.
4) Host-based traffic features: these make use of a window calculated over the number of connections rather than time, so host-based attributes are designed to analyze attacks with a timeframe longer than 2 seconds [4].
Most of the features are continuous variables, for which MICE uses multiple regression to account for the uncertainty of the missing data: a standard error is added to the linear regression to obtain a stochastic regression, and a related MICE method called predictive mean matching was used. However, some variables (is_guest_login, flag, land, etc.) are binary
or unordered categorical (discrete) variables; for these, MICE uses a logistic regression algorithm, which squashes the predicted values between 0 and 1 using the sigmoid function:
σ(x) = 1 / (1 + e^(−x))    (1)
Below is an illustrative table of the KDDCUP'99 data with its 41 features, from the source (https://kdd.ics.uci.edu/databases/kddcup99/task.html). As shown in Table 1, we remove two features, numbers 20 and 21, because their values in the data are all zeros. The dataset used for testing the proposed regression model is the KDDsubset network intrusion cyber database; since this dataset is quite large and causes time delays and slow execution of the R code on limited hardware, the data is cleaned first.

Table 1: Attributes of the cyber dataset (41 features in total)

Nr. Name: Description
1. duration: Duration of connection
2. protocol_type: Connection protocol (tcp, udp, icmp)
3. service: Dst port mapped to service
4. flag: Normal or error status flag of connection
5. src_bytes: Number of data bytes from src to dst
6. dst_bytes: Bytes from dst to src
7. land: 1 if connection is from/to the same host/port; else 0
8. wrong_fragment: Number of "wrong" fragments (values 0, 1, 3)
9. urgent: Number of urgent packets
10. hot: Number of "hot" indicators
11. number_failed_logins: Number of failed login attempts
12. logged_in: 1 if successfully logged in; else 0
13. num_compromised: Number of "compromised" conditions
14. root_shell: 1 if root shell is obtained; else 0
15. su_attempted: 1 if "su root" command attempted; else 0
16. num_root: Number of "root" accesses
17. num_file_creations: Number of file creation operations
18. num_shells: Number of shell prompts
19. num_access_files: Number of operations on access control files
20. num_outbound_cmds: Number of outbound commands in an ftp session
21. is_hot_login: 1 if login belongs to "hot" list; else 0
22. is_guest_login: 1 if login is "guest" login; else 0
23. count: Number of connections to same host as current connection in the past two seconds
24. srv_count: Number of connections to same service as current connection in the past two seconds
25. serror_rate: % of connections that have "SYN" errors
26. srv_serror_rate: % of connections that have "SYN" errors
27. rerror_rate: % of connections that have "REJ" errors
28. srv_rerror_rate: % of connections that have "REJ" errors
29. same_srv_rate: % of connections to the same service
30. diff_srv_rate: % of connections to different services
31. srv_diff_host_rate: % of connections to different hosts
32. dst_host_count: Count of connections having same dst host
33. dst_host_srv_count: Count of connections having same dst host and using same service
34. dst_host_same_srv_rate: % of connections having same dst host and using the same service
35. dst_host_diff_srv_rate: % of different services on current host
36. dst_host_same_src_port_rate: % of connections to current host having same src port
37. dst_host_srv_diff_host_rate: % of connections to same service coming from different hosts
38. dst_host_serror_rate: % of connections to current host that have an S0 error
39. dst_host_srv_serror_rate: % of connections to current host and specified service that have an S0 error
40. dst_host_rerror_rate: % of connections to current host that have an RST error
41. dst_host_srv_rerror_rate: % of connections to current host and specified service that have an RST error
42. connection_type: N or A
N = normal, A = attack, c = continuous, d = discrete. Features numbered 2, 3, 4, 7, 12, 14, and 15 are discrete types, and the others are continuous.
The cleaned dataset used here has 145,585 connections and 40 numerical features; three features are categorical and are converted to numeric. For the 39 features describing the various connections, the dataset is scaled and normalized so that the mean is zero and the standard deviation is equal to one, but the dataset is skewed to the right because of the categorical variables, as shown in Figure 5 below. To keep the data compact, we exclude the label variable and add it back later for testing by the machine learning algorithms. The dataset is utilized to test and evaluate intrusion detection with both normal and malicious connections labeled as either attack or normal. After an extensive literature review on the same dataset and on how the training and testing data are divided, we found that the best method is to let the training data be 66% and the testing data 34%; a minimal preprocessing sketch is given below. The cleaned dataset thus contains 94,631 connections as training data and 50,954 connections as testing data. The missingness mechanisms applied to the training data are MCAR and MAR; the assumptions are made cell-wise, so the total number of cells is 3,690,609 and sampling from it is random. The labels were excluded from the training data when generating the two kinds of missingness, and the labels are returned when the data is fed to the classifier models. The test data is taken from the original clean data with the labels left unchanged, so the same test data is used for all experiments with the machine learning algorithms. Accuracies for the cleaned data are recorded before any treatment is applied, so that the missing data and the MICE-imputed data can be compared with this baseline accuracy. We then feed the missing data and the imputed data to the classifiers and analyze the performance of the four best-chosen classifiers on the dataset. The classifiers selected from the literature were considered the best in the evaluation of performance.
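A minimal preprocessing sketch under the assumptions above: encode the three categorical features, scale the numeric features to zero mean and unit standard deviation, and make a 66%/34% train/test split. The file name and column names are assumptions for illustration; the dissertation's own cleaning was done in R.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("kdd_subset_clean.csv")  # assumed cleaned KDDsubset file

y = (df["connection_type"] == "A").astype(int)   # 1 = attack, 0 = normal
X = df.drop(columns=["connection_type"])

# Convert the categorical features to numeric codes.
for col in ["protocol_type", "service", "flag"]:
    X[col] = X[col].astype("category").cat.codes

# Zero mean and unit standard deviation for every feature.
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0, stratify=y)
print(len(X_train), "training rows,", len(X_test), "test rows")
```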
ML ALGORITHMS SELECTION
A substantial amount of data is available to organizations from diverse logs, application logs, intrusion detection systems, and other sources. Year over year, the data volume increases significantly; data is produced by every person every second, and the number of devices connected to the internet is three times the world population.
Figure 5: KDDsubset normal distribution.
A large part of this works within the framework of the internet of things (IoT), and it results in more than 600 ZB of data each year. These facts show the significant growth of data in terms of type, size, and speed. This vast amount of data is attributed to several factors, the most important being the digitization processes carried out by companies and institutions in recent years and the widespread use of social media, messaging applications, and the internet of things. This growth in various technology fields has made the internet a tempting target for misuse and anomalous intrusion, and many researchers are thus engaged in analyzing the KDDCUP'99 dataset for detecting intrusions. Analyzing the KDDsubset to test for accuracy and misclassification, we found that of the 24 articles reviewed, 13 dealt directly with this dataset, while the others dealt with modified versions such as NSL-KDD and GureKDD. The themes related directly to the KDDsubset were summarized in an Excel sheet and are presented below. Not all algorithms fail when there is missing data; some algorithms, such as classification and regression trees, treat the missing value as a unique and different value when building the predictive model.
Summary review of 10 articles, to find the most popular algorithms to apply in the analysis.
Article 1, summarized in Table 2 below. No accuracy was posted for this article, which uses fuzzy inference. It uses an effective set of fuzzy rules for the inference approach, identified automatically by a fuzzy rule learning strategy, and appears effective for detecting intrusions in a computer system. The rules were then given to a Sugeno fuzzy system, which classified the test data.
Article 2, summarized in Table 3 below. Accuracy for this study was low, using a classifier called rule-based enhanced genetic (RBEGEN), an enhancement of the genetic algorithm.
Article 3, summarized in Table 4 below. In this study, 11 algorithms were used, and high accuracies were obtained with decision tree and random forest.
Article 4, summarized in Table 5 below.
Table 2: Fuzzy inference system
Article 1: Intrusion detection system using fuzzy inference system. Classification technique: Sugeno fuzzy inference system for generation of fuzzy rules, with best-first feature selection. Results: no accuracy or other metrics were reported.
Table 3: Enhanced algorithm
Article 2: Intrusion detection over networking KDD dataset using an enhanced mining algorithm. Classification technique: rule-based enhanced genetic (RBEGEN). Results: accuracy 86.30%; the only other reported figure is 83.21%; the remaining cells are empty.
Table 4: Feature extraction
Article 3: Detecting anomaly-based network intrusion using feature extraction and classification techniques. Results per classifier (accuracy; TT in sec; prediction time; TP; TN; FP; FN; error rate; precision; recall):
Decision Tree: 95.09%; 1.032; 0.003; TP 4649, TN 2702, FP 279; further reported values 100, 4.9, 94.34%, 97.34%
MLP: 92.46%; 20.59; 0.004; TP 4729; further reported values 2419, 562, 29, 7.54, 89.38, 99.56%
KNN: 92.78%; 82.956; 13.24; 4726; 2446; 535; 23; 7.22; 89.83%; 99.52%
Linear SVM: 92.59%; 78.343; 2.11; 4723; 2434; 547; 26; 7.41; 89.62%; 99.45%
Passive Aggressive: 90.34%; 0.275; 0.001; 4701; 2282; 699; 48; 9.66; 89.62%; 99.45%
RBF SVM: 91.67%; 99.47; 2.547; 4726; 2960; 621; 23; 8.33; 89.39%; 99.52%
Random Forest: 93.62%; 1.189; 0.027; 4677; 2560; 621; 23; 6.38; 91.74%; 98.48%
AdaBoost: 93.52%; 29.556; 0.225; 4676; 2553; 428; 73; 6.48; 91.61%; 98.46%
Gaussian NB: 94.35%; 244; 0.006; 4642; 2651; 330; 107; 5.65; 93.36%; 97.75%
Multinomial NB: 91.71%; 0.429; 0.001; 4732; 2357; 624; 17; 8.29; 88.35%; 99.64%
Quadratic Discriminant Analysis: 93.23%; 1.305; 0.0019; 4677; 2530; 451; 72; 6.77; 91.20%; 84.87%
Table 5: Outlier detection
Article 4: Feature classification and outlier detection to increase accuracy in intrusion detection systems. Results per classifier (accuracy; TT in seconds before and after dimensionality reduction):
C45: 99.94%; 199.33 before, 23.14 after
KNN: 99.90%; 0.37 before, 0.23 after
Naïve Bayes: 96.16%; 5.63 before, 1.36 after
Random forest: 99.94%; 554.63 before, 205.97 after
SVM: 99.94%; 699.07 before, 186.53 after
Three datasets were used in this study, one of them the KDDCUP'99, to compare accuracy and execution time before and after dimensionality reduction.
Article 5, summarized in Table 6 below. Five algorithms were used.
Article 6, summarized in Table 7 below. Singular value decomposition (SVD) is an eigenvalue method used to reduce a high-dimensional dataset into fewer dimensions while retaining important information; an improved version of the algorithm (ISVD) is also used.
Article 7, summarized in Table 8 below. In article 7, two classifiers were used, and J48 gave high accuracy results.
Table 6: Tree-based data mining
Article 5: Intrusion detection with tree-based data mining classification techniques using the KDD dataset. Results per classifier (accuracy; error rate):
Hoeffding Tree: 97.05%; 2.9499
J48: 98.04%; 1.9584
Random Forest: 98.08%; 1.1918
Random Tree: 98.03%; 1.9629
REP Tree: 98.02%; 1.9738
Table 7: Data reduction
Article 6: Using an imputed data detection method in an intrusion detection system. Results:
SVD: accuracy 43.73%; TT 45.154 s; prediction 10.289; other reported values 43.67%, 56.20%, 53.33%, 43.8, 0.5, 92.86
ISVD: accuracy 94.34%; TT 189.232 s; prediction 66.72; DR 95.82; FAR 0.55
Table 8: Comparative analysis
Article 7: Comparative analysis of classification algorithms on the KDD'99 dataset. Results per classifier (accuracy; TT in sec; precision; recall):
J48: 99.80%; 1.8; 99.80%; 99.80%
Naïve Bayes: 84.10%; 47; 97.20%; 77.20%
Articles 8 and 9, summarized in Table 9 below. Ten classifiers were used to take measurement metrics for the dataset with and without preprocessing, and the results are better with the preprocessed dataset.
Accuracy percentages for the above articles: as shown in Figure 6 below, the percent accuracy for each classifier is highlighted. Table 10 summarizes a paper that groups the major attack types and separates the 10% KDDCUP'99 into five files according to attack type (DoS, Probe, R2L, U2R, and normal). Based on the attack type, DoS and Probe attacks involve several records, while R2L and U2R are embedded in the data portions of packets and usually involve only a single instance. The above articles were summarized in terms of measurement metrics and contained nearly 61 classifier results with the best applicable algorithms.
Table 9: Problems in dataset
Article 8: Problems of KDD Cup 99 Dataset existed and data preprocessing. Results per classifier (accuracy; TT in sec; prediction time):
Naïve Bayes: 96.31%; 6.16; 381.45
Bayes Net: 99.65%; 43.08; 57.38
Liblinear: 99.00%; 3557.76; 6.1
MLP: 92.92%; 22,010.2; 25.4
IBK: 100.00%; 0.39; 79,304.62
Vote: 56.84%; 0.17; 3
OneR: 98.14%; 5.99; 6.38
J48: 99.98%; 99.43; 5.8
Random Forest: 100.00%; 122.29; 4.23
Random Tree: 100.00%; 14.79; 19.09
Article 9: Problems of KDD Cup 99 Dataset existed and data preprocessing. Results per classifier (accuracy; TT in sec; prediction time):
Naïve Bayes: 90.45%; 3.21; 116.06
Bayes Net: 99.13%; 26.65; 35.04
Liblinear: 98.95%; 708.36; 4.27
MLP: 99.66%; 11,245.8; 53.31
IBK: 100.00%; 0.19; 48,255.78
Vote: 56.84%; 0.13; 3.45
OneR: 98.98%; 3.76; 5.35
J48: 99.97%; 99.56; 5.26
Random Forest: 99.99%; 10.89; 5.63
Random Tree: 100.00%; 10.45; 4.26
Table 10: Dataset grouping
Article 10: Application of data mining to network intrusion detection: classifier selection model. Per-class results (class name: AA; FP rate), with overall TP and TT (sec) per classifier:
Bayes Net: Dos 94.60%, 0.20%; Probe 88.80%, 0.12%; U2R 30.30%, 0.30%; R2L 5.20%, 0.60%; TP 90.62%, TT 628
Naïve Bayes: Dos 79.20%, 1.70%; Probe 94.80%, 13.30%; U2R 12.20%, 0.90%; R2L 0.10%, 0.30%; TP 78.32%, TT 557
J48: Dos 96.80%, 1.00%; Probe 75.20%, 0.20%; U2R 12.20%, 0.10%; R2L 0.10%, 0.50%; TP 92.06%, TT 1585
NB Tree: Dos 97.40%, 1.20%; Probe 73.30%, 1.10%; U2R 1.20%, 0.10%; R2L 0.10%, 0.50%; TP 92.28%, TT 295.88
Decision Table: Dos 97%, 10.70%; Probe 57.60%, 40%; U2R 32.80%, 0.30%; R2L 0.30%, 0.10%; TP 91.66%, TT 6624
Jrip: Dos 97.40%, 0.30%; Probe 83.80%, 0.10%; U2R 12.80%, 0.10%; R2L 0.10%, 0.40%; TP 0.923, TT 207.47
OneR: Dos 94.20%, 6.80%; Probe 12.90%, 0.10%; U2R 10.70%, 2.00%; R2L 10.70%, 0.10%; TP 0.8931, TT 375
MLP: Dos 96.90%, 1.47%; Probe 74.30%, 0.10%; U2R 20.10%, 0.10%; R2L 0.30%, 0.50%; TP 0.9203, TT 350.15
As a result of the literature review and observations, the following four most influential and widely used algorithms were selected for the dissertation study: decision tree, random forest, support vector machine, and naïve Bayes.
1) Decision tree: [5] This classifier works on a flowchart-like tree structure that is divided into subparts by identifying split points. It uses entropy and information gain to build the decision tree, selecting the features that increase the information gain (IG) and reduce the entropy. Classification and Regression Trees, abbreviated CART, is a useful tool that works for classification or regression in predictive modeling problems.
To build a decision tree model, we follow these steps:
a) Calculate the entropy of the target column before splitting.
b) Select a feature together with the target column and calculate the information gain (IG) and the entropy.
c) Select the feature with the largest information gain.
d) Set the selected feature as the root of the tree, which then splits the rest of the features.
e) Repeat steps b) to d) until each leaf has a decision target.
Figure 6: Accuracies for different algorithms depicted.
In short, entropy measures the homogeneity of a sample of data. The value is between zero and one: zero when the data is completely homogeneous and one when the classes are evenly mixed (completely non-homogeneous). For class proportions p_i,

Entropy = −Σ_i p_i log2(p_i)    (2)

Information gain measures the reduction in entropy, or surprise, obtained by splitting a dataset according to a given value of a random variable. A larger information gain suggests a lower-entropy group or groups of samples, and hence less surprise. For a split of a parent dataset into left and right child nodes,

IG(Dp, f) = I(Dp) − (Nleft/N)·I(Dleft) − (Nright/N)·I(Dright)    (3)

where:
f: feature split on
Dp: dataset of the parent node
Dleft: dataset of the left child node
Dright: dataset of the right child node
I: impurity criterion (Gini index or entropy)
N: total number of samples
Nleft: number of samples at the left child node
Nright: number of samples at the right child node [6].
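A minimal sketch of Equations (2) and (3) is shown below: the entropy of a labeled sample and the information gain of a binary split, using base-2 logarithms. The toy labels are made up for illustration.

```python
import numpy as np

def entropy(labels):
    # Impurity of a set of class labels: -sum(p_i * log2(p_i)).
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    # IG = I(parent) - (N_left/N) * I(left) - (N_right/N) * I(right)
    n = len(parent)
    return entropy(parent) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

parent = [0, 0, 1, 1, 1, 0, 1, 0]          # toy labels (attack = 1, normal = 0)
left, right = [0, 0, 0, 0], [1, 1, 1, 1]   # a perfect split
print(entropy(parent), information_gain(parent, left, right))  # 1.0, 1.0
```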
Figure 7 below indicates how the decision tree works.
2) Random Forest: [7] The random forest algorithm is a supervised classification algorithm like the decision tree, but instead of one tree this classifier uses multiple trees and merges them to obtain better accuracy and prediction. In random forests, each tree in the ensemble is built from a sample drawn with replacement from the training set, a procedure called bagging (bootstrapping), and this improves the stability of the model. Figure 8 below shows the mechanism of this algorithm.
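A minimal sketch of this idea is given below with scikit-learn: many trees, each trained on a bootstrap sample, with their votes combined. The synthetic data and number of trees are illustrative assumptions, not the settings used in the dissertation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True is the bagging step: each tree sees a sample drawn
# with replacement from the training set.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X, y)
print("Number of trees in the ensemble:", len(forest.estimators_))
```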
Figure 7: Decision Tree algorithm.
3) Support vector machines (SVM): [8] SVM is memory efficient and uses a subset of the training points in the decision function. It is a set of supervised machine learning procedures used for classification, regression, and outlier detection, and different kernel functions can be specified for the decision function; SVMs belong to both the regression and classification categories of supervised learning algorithms. This classifier does not suffer from the usual limitations of data dimensionality and limited samples. It offers kernel functions such as linear, polynomial, and sigmoid, and the user can select any one of them to classify the data.
Kernel Functions
Figure 8: A Random Forest model consisting of three decision trees.
One of the most important features of SVM is that it utilizes the kernel trick to extend the decision functions to linear and non-linear cases.
Linear kernel: the simplest kernel function, a dot product between any two observations:

K(x_i, x_j) = x_i · x_j    (4)

Polynomial kernel: a more complex function that can be used to distinguish non-linear inputs. It can be represented as:

K(x_i, x_j) = (x_i · x_j + 1)^p    (5)

where p is the polynomial degree.
Radial basis function (RBF, Gaussian): a kernel that helps in non-linear cases, as it computes a similarity that depends on the distance from the origin or from some point:

K(x_i, x_j) = exp(−γ ||x_i − x_j||^2)    (6)
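A minimal sketch of these kernels in use is shown below: the same SVM fitted with linear, polynomial, and RBF kernels via scikit-learn. The synthetic data and default parameters are illustrative assumptions rather than the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# One classifier per kernel from Equations (4)-(6).
for kernel, params in [("linear", {}), ("poly", {"degree": 3}), ("rbf", {"gamma": "scale"})]:
    clf = SVC(kernel=kernel, **params).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```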
Figure 9 below shows how SVM works.
4) Naïve Bayes: [10] Naïve Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. This classifier provides a simple approach, with precise semantics, to represent and learn probabilistic knowledge. Naïve Bayes works on the principle of conditional probability as given by Bayes' theorem and formulates linear regression using a probability distribution. The posterior probability of the model parameters is conditional upon the training inputs and outputs, and the aim is to determine the posterior distribution of the model parameters. Bayes' rule is shown in Equation (7) below; applied to model parameters w given training inputs X and outputs y, it gives the posterior in Equation (8), which is proportional to the likelihood times the prior, Equation (9):

P(A | B) = P(B | A) P(A) / P(B)    (7)

P(w | X, y) = P(y | X, w) P(w) / P(y | X)    (8)

P(w | X, y) ∝ P(y | X, w) P(w)    (9)
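A minimal sketch of a Naïve Bayes classifier applying Bayes' rule with the independence assumption is shown below. GaussianNB is an illustrative choice here (most KDD features are continuous), not necessarily the exact variant used in the dissertation, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

nb = GaussianNB().fit(X, y)
# predict_proba returns the posterior P(class | x) for each class.
print(nb.predict_proba(X[:3]))
```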
Figure 9: Support Vector Machine algorithm [9].
ACCURACY OF MACHINE LEARNING
The aim is not only to state hypotheses and prove them, but also to answer the following scientific questions. How do the classifiers perform in the presence of missing data, and which one works better? How does missingness impact machine learning performance? Does the accuracy increase or decrease if we impute the data correctly with the correct number of imputations? Can the restoration of accuracy lost to missing data be a metric for evaluating performance?
Accuracy is the proportion of instances a model predicts correctly. It is an essential measurement tool that indicates how close measured values are to the actual values, while precision indicates how close the measured values are to each other. Even when a model achieves high accuracy, it may still not be accurate enough, and other factors must be checked, because the model might give good results yet be biased toward some frequent records; some classification algorithms will also take many hours or more to produce the required results. Certainly, classification accuracy alone can be misleading if we have an unequal number of observations in each class or more than two classes in the dataset. If specific libraries are imported to solve this problem by balancing the classes, called oversampling or down-sampling, the library will create balanced samples to feed to the ML algorithm for classification. The best evaluation tool for such cases is the confusion matrix, shown in Figure 10 below; a confusion matrix is a technique for summarizing the performance of a classification algorithm.
Accuracy = (correctly predicted instances / total testing instances) × 100%; equivalently, accuracy is the percentage of correctly classified instances, Acc = (TP + TN) / (TP + TN + FP + FN), where TP, FN, FP, and TN represent the number of true positives, false negatives, false positives, and true negatives, respectively. Other standard performance measures can also be used:
Sensitivity (recall) = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
True-positive rate (TPR) = TP / (TP + FN)
False-positive rate (FPR) = FP / (FP + TN)
True-negative rate (TNR) = TN / (TN + FP)
False-negative rate (FNR) = FN / (FN + TP)
(T for true and F for false, N for negative and P for positive.) For good classifiers, TPR and TNR should both be near 100%, and similarly for the precision and accuracy parameters; on the contrary, FPR and FNR should both be as close to 0% as possible.
Detection rate (DR) = number of instances detected / total number of instances estimated.
In other words, TPR = TP / (all positives) and FPR = FP / (all negatives). These quantities can be computed directly from the confusion matrix, as sketched after Figure 10.
Figure 10: Confusion Matrix for attack (A) and normal (N).
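A minimal sketch of computing these metrics from a confusion matrix for a binary attack (1) / normal (0) problem is given below; the label vectors are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # TPR / recall
specificity = tn / (tn + fp)   # TNR
precision   = tp / (tp + fp)
fpr         = fp / (fp + tn)
print(accuracy, sensitivity, specificity, precision, fpr)
```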
On Missing Data
The more accurate the measurement, the better. Missing values are imputed with best guesses; it used to be acceptable simply to drop the records with missing values when the data was large enough and the number of missing values small, but since the arrival of multivariate approaches this is no longer necessary. Regarding accuracy with missing data: because all the rows and columns have numerical values, when we make the data incomplete with R code the empty cells are replaced with NA, and the classifiers may not work with NA; instead, we can substitute the mean, median, or mode of each variable, or just a constant number such as -9999, and running the code with a constant number works, as sketched below.
We conclude that the outcome of accuracy may decrease, or maybe some algorithms will not respond.
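A minimal sketch of these simple substitutions is shown below using pandas: replacing NA cells with the column mean, the median, or a constant such as -9999. The small frame and its values are made up for illustration; the dissertation performs the equivalent steps in R.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"src_bytes": [181, np.nan, 239, 145],
                   "count": [9, 19, np.nan, 6]})

filled_mean     = df.fillna(df.mean())      # per-column mean
filled_median   = df.fillna(df.median())    # per-column median
filled_constant = df.fillna(-9999)          # constant placeholder
print(filled_constant)
```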
On Imputed Data
We assume that, with reasonable multivariate imputation procedures, the accuracy will be close to the baseline accuracy of the original dataset before it was made incomplete; we then impute under both missingness mechanisms. The results will be shown in chapter 5.
CONCLUSION
This paper provides a survey of different machine learning techniques for measuring the accuracy of different ML classifiers used to detect intrusions in the KDD subset dataset. Many algorithms have shown promising results because they identify the attributes accurately. The best algorithms were chosen to test our dataset, and the results are posted in a different chapter. The performance of four machine learning algorithms has been analyzed using complete and incomplete versions of the KDDCUP'99 dataset. From this investigation it can be concluded that the accuracy of these algorithms is greatly affected when a dataset containing missing data is used to train them, and in that state they cannot be relied upon to solve real-world problems. However, after using multiple imputation by chained equations (MICE) to remove the missingness from the dataset, the accuracy of the four algorithms increases dramatically and is almost equal to that obtained on the original complete dataset. This is clearly indicated by the confusion matrix in section 5, where the TNR and the TPR are both close to 100% while the FNR and FPR are both close to zero. This paper has clearly shown that the performance of machine learning algorithms decreases greatly when a dataset contains missing data, but that this performance can be recovered by using MICE to remove the missingness. Some classifiers have better accuracy than others, so we should be careful to choose a suitable algorithm for each independent case. We conclude from the survey and our observations that the chosen classifiers work best with cybersecurity systems, while others do not and may be more helpful in different domains. A survey of many articles provides a valuable opportunity for analyzing attack detection and supports improved decision-making about which model is best to use.
ACKNOWLEDGEMENTS
The authors would like to thank the anonymous reviewers for their valuable suggestions and notes; thanks are also extended to Scientific Research Publishing and the Journal of Data Analysis and Information Processing.
Chapter 3
Data Modeling and Data Analytics: A Survey from a Big Data Perspective
André Ribeiro, Afonso Silva, Alberto Rodrigues da Silva INESC-ID/Instituto Superior Técnico, Lisbon, Portugal
ABSTRACT
In recent years we have been witnessing a tremendous growth in the volume and availability of data. This results primarily from the emergence of a multitude of sources (e.g. computers, mobile devices, sensors or social networks) that continuously produce structured, semi-structured or unstructured data. Database Management Systems and Data Warehouses are no longer the only technologies used to store and analyze datasets, namely because the volume and complex structure of today's data degrade their performance and scalability. Big Data is one of the recent challenges, since it implies new requirements in terms of data storage, processing and visualization. Despite that, properly analyzing Big Data can bring great advantages, because it allows the discovery of patterns and correlations in datasets. Users can use this processed information to gain deeper insights and obtain business advantages.
Citation: Ribeiro, A., Silva, A. and da Silva, A. (2015), "Data Modeling and Data Analytics: A Survey from a Big Data Perspective". Journal of Software Engineering and Applications, 8, 617-634. doi: 10.4236/jsea.2015.812058.
Copyright: © 2015 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
Thus, data modeling and data analytics have evolved in such a way that we are able to process huge amounts of data without compromising performance and availability, but instead by "relaxing" the usual ACID properties. This paper provides a broad view and discussion of the current state of this subject, with a particular focus on data modeling and data analytics, describing and clarifying the main differences between the three main approaches with respect to these aspects, namely: operational databases, decision support databases and Big Data technologies.
Keywords: Data Modeling, Data Analytics, Modeling Language, Big Data
INTRODUCTION
We have been witnessing an exponential growth in the volume of data produced and stored. This can be explained by the evolution of technology, which has resulted in the proliferation of data with different formats from the most varied domains (e.g. health care, banking, government or logistics) and sources (e.g. sensors, social networks or mobile devices). We have witnessed a paradigm shift from simple books to sophisticated databases that keep being populated every second at an immensely fast rate. The Internet and social media also contribute greatly to this situation [1]. Facebook, for example, has an average of 4.75 billion pieces of content shared among friends every day [2].
Traditional Relational Database Management Systems (RDBMSs) and Data Warehouses (DWs) are designed to handle a certain amount of data, typically structured, which is completely different from the reality we are facing nowadays. Business is generating enormous quantities of data that are too big to be processed and analyzed by traditional RDBMS and DW technologies, which are struggling to meet their performance and scalability requirements. Therefore, in recent years, a new approach that aims to mitigate these limitations has emerged. Companies like Facebook, Google, Yahoo and Amazon are the pioneers in creating solutions to deal with these "Big Data" scenarios, namely by resorting to technologies like Hadoop [3] [4] and MapReduce [5].
Big Data is a generic term used to refer to massive and complex datasets, which are made of a variety of data structures (structured, semi-structured and unstructured data) from a multitude of sources [6]. Big Data can be characterized by three Vs: volume (amount of data), velocity (speed of data in and out) and variety (kinds of data types and sources) [7]. Other Vs have since been added for variability, veracity and value [8].
Adopting Big Data-based technologies not only mitigates the problems presented above, but also opens new perspectives that allow extracting value from Big Data. Big Data-based technologies are being applied with success in multiple scenarios [1] [9] [10], such as: (1) e-commerce and marketing, where counting the clicks that crowds make on the web allows identifying trends that improve campaigns and evaluating a user's personal profile, so that the content shown is the one he will most likely enjoy; (2) government and public health, allowing the detection and tracking of disease outbreaks via social media, or the detection of frauds; (3) transportation, industry and surveillance, with real-time improved estimated times of arrival and smart use of resources.
This paper provides a broad view of the current state of this area based on two dimensions or perspectives: Data Modeling and Data Analytics. Table 1 summarizes the focus of this paper, namely by identifying three representative approaches considered to explain the evolution of Data Modeling and Data Analytics. These approaches are: Operational databases, Decision Support databases and Big Data technologies.
This research work has been conducted in the scope of the DataStorm project [11], led by our research group, which focuses on addressing the current problems in the design, implementation and operation of Big Data-based applications. More specifically, the goal of our team in this project is to identify the main concepts and patterns that characterize such applications, in order to define and apply suitable domain-specific languages (DSLs). These DSLs will then be used in a Model-Driven Engineering (MDE) [12]-[14] approach aiming to ease the design, implementation and operation of such data-intensive applications. To ease the explanation and better support the discussion throughout the paper, we use a very simple case study based on a fictitious academic management system (AMS) described below:
The outline of this paper is as follows: Section 2 describes Data Modeling and some representative types of data models used in operational databases, decision support databases and Big Data technologies. Section 3 details the type of operations performed in terms of Data Analytics for these three approaches. Section 4 compares and discusses each approach in terms of the Data Modeling and Data Analytics perspectives. Section 5 discusses our
research in comparison with the related work. Finally, Section 6 concludes the paper by summarizing its key points and identifying future work.
DATA MODELING
This section gives an in-depth look at the most popular data models used to define and support Operational Databases, Data Warehouses and Big Data technologies.
Table 1: Approaches and perspectives of the survey
Databases are widely used for both personal and enterprise purposes, namely due to their strong ACID (atomicity, consistency, isolation and durability) guarantees and the maturity level of the Database Management Systems (DBMSs) that support them [15]. The data modeling process may involve the definition of three data models (or schemas) defined at different abstraction levels, namely Conceptual, Logical and Physical data models [15] [16]. Figure 1 shows part of the three data models for the AMS case study. All these models define three entities (Person, Student and Professor) and their main relationships (teach and supervise associations).
Conceptual Data Model. A conceptual data model is used to define, at a very high and platform-independent level of abstraction, the entities or concepts which represent the data of the problem domain, and their relationships. It leaves further details about the entities (such as their attributes, types or primary keys) for the next steps. This model is typically used to explore domain concepts with the stakeholders and can be omitted or used instead of the logical data model.
Logical Data Model. A logical data model is a refinement of the previous conceptual model. It details the domain entities and their relationships, while still standing at a platform-independent level. It depicts all the attributes that characterize each entity (possibly also including its unique identifier, the primary key) and all the relationships between the entities (possibly
including the keys identifying those relationships, the foreign keys). Despite being independent of any DBMS, this model can easily be mapped on to a physical data model thanks to the details it provides. Physical Data Model. A physical data model visually represents the structure of the data as implemented by a given class of DBMS. Therefore, entities are represented as tables, attributes are represented as table columns and have a given data type that can vary according to the chosen DBMS, and the relationships between each table are identified through foreign keys. Unlike the previous models, this model tends to be platform-specific, because it reflects the database schema and, consequently, some platform-specific aspects (e.g. database-specific data types or query language extensions). Summarizing, the complexity and detail increase from a conceptual to a physical data model. First, it is important to perceive at a higher level of abstraction, the data entities and their relationships using a Conceptual Data Model. Then, the focus is on detailing those entities without worrying about implementation details using a Logical Data Model. Finally, a Physical Data Model allows to represent how data is supported by a given DBMS [15] [16] .
Operational Databases
Databases received a great boost with the popularity of the Relational Model [17], proposed by E. F. Codd in 1970. The Relational Model overcame the problems of its predecessor data models (namely the Hierarchical Model and the Navigational Model [18]). The Relational Model led to the emergence of Relational Database Management Systems (RDBMSs), which are the most used and popular DBMSs, as well as to the definition of the Structured Query Language (SQL) [19] as the standard language for defining and manipulating data in RDBMSs. RDBMSs are widely used for maintaining the data of daily operations. Considering the data modeling of operational databases, there are two main models: the Relational and the Entity-Relationship (ER) models.
Relational Model. The Relational Model is based on the mathematical concept of a relation. A relation is defined as a set (in mathematical terminology) and is represented as a table, which is a matrix of columns and rows, holding information about the domain entities and the relationships among them. Each column of the table corresponds to an entity attribute and specifies the attribute's name and its type (known as its domain). Each row of the table (known as a tuple) corresponds to a single element of the represented domain entity.
Figure 1: Example of three data models (at different abstraction levels) for the Academic Management System.
In the Relational Model each row is unique, and therefore a table has an attribute or set of attributes known as the primary key, used to univocally identify those rows. Tables are related to each other by sharing one or more common attributes. These attributes correspond to a primary key in the referenced (parent) table and are known as foreign keys in the referencing (child) table. In one-to-many relationships, the referenced table corresponds to the entity on the "one" side of the relationship and the referencing table corresponds to the entity on the "many" side. In many-to-many relationships,
an additional association table is used, which associates the entities involved through their respective primary keys. The Relational Model also features the concept of View, which is like a table whose rows are not explicitly stored in the database but are computed as needed from a view definition; that is, a view is defined as a query on one or more base tables or other views [17].
Entity-Relationship (ER) Model. The Entity-Relationship (ER) Model [20], proposed by Chen in 1976, appeared as an alternative to the Relational Model in order to bring more expressiveness and semantics into the database design from the user's point of view. The ER model is a semantic data model, i.e. it aims to represent the meaning of the data involved in some specific domain. This model was originally defined by three main concepts: entities, relationships and attributes. An entity corresponds to an object in the real world that is distinguishable from all other objects and is characterized by a set of attributes. Each attribute has a range of possible values, known as its domain, and each entity has its own value for each attribute. Similarly to the Relational Model, the set of attributes that identifies an entity is known as its primary key. Entities can be thought of as nouns and correspond to the tables of the Relational Model. In turn, a relationship is an association established among two or more entities. A relationship can be thought of as a verb and includes the roles of each participating entity, with multiplicity constraints and their cardinality. For instance, a relationship can be one-to-one (1:1), one-to-many (1:M) or many-to-many (M:N). In an ER diagram, entities are usually represented as rectangles, attributes as circles connected to entities or relationships through a line, and relationships as diamonds connected to the intervening entities through a line. The Enhanced ER Model [21] provided additional concepts to represent more complex requirements, such as generalization, specialization, aggregation and composition. Other popular variants of ER diagram notations are Crow's foot, Bachman, Barker's, IDEF1X and the UML Profile for Data Modeling [22].
Decision Support Databases
The evolution from relational databases to decision support databases, hereinafter indistinctly referred to as "Data Warehouses" (DWs), occurred with the need to store not only operational but also historical data, and the need to analyze that data in complex dashboards and reports. Even though a
DW seems to be a relational database, it is different in the sense that DWs are more suitable for supporting query and analysis operations (fast reads) than transaction processing operations (fast reads and writes). DWs contain historical data that come from transactional data, but they might also include other data sources [23]. DWs are mainly used for OLAP (online analytical processing) operations. OLAP is the approach used to provide report data from the DW through multi-dimensional queries, and it requires the creation of a multi-dimensional database [24]. Usually, DWs include a framework that allows extracting data from multiple data sources and transforming it before loading it into the repository, known as an ETL (Extract, Transform, Load) framework [23]. Data modeling in a DW consists of defining fact tables with several dimension tables, suggesting star or snowflake schema data models [23]. A star schema has a central fact table linked with dimension tables. Usually, a fact table has a large number of attributes (in many cases in a denormalized way), with many foreign keys that are the primary keys of the dimension tables. The dimension tables represent characteristics that describe the fact table. When star schemas become too complex to be queried efficiently, they are transformed into multi-dimensional arrays of data called OLAP cubes (for more information on how this transformation is performed the reader can consult the references [24] [25]). A star schema is transformed into a cube by putting the fact table on the front face that we are facing and the dimensions on the other faces of the cube [24]. For this reason, cubes can be equivalent to star schemas in content, but they are accessed with languages that are more platform-specific than SQL and have more analytic capabilities (e.g. MDX or XMLA). A cube with three dimensions is conceptually easier to visualize and understand, but the OLAP cube model supports more than three dimensions, in which case it is called a hypercube. Figure 2 shows two examples of star schemas for the AMS case study. The star schema on the left represents the data model for the Student fact, while the data model on the right represents the Professor fact. Both of them have a central fact table that contains specific attributes of the entity under analysis as well as foreign keys to the dimension tables. For example, a Student has a place of origin (DIM_PLACEOFORIGIN) that is described by a city and associated with a country (DIM_COUNTRY) that has a name and an ISO code. On the other hand, Figure 3 shows a cube model with three
dimensions for the Student. These dimensions are represented by sides of the cube (Student, Country and Date). This cube is useful to execute queries such as: the students by country enrolled for the first time in a given year. A challenge that DWs face is the growth of data, since it affects the number of dimensions and levels in either the star schema or the cube hierarchies. The increasing number of dimensions over time makes the management of such systems often impracticable; this problem becomes even more serious when dealing with Big Data scenarios, where data is continuously being generated [23].
Figure 2: Example of two star schema models for the Academic Management System.
Figure 3: Example of a cube model for the Academic Management System.
Big Data Technologies
The volume of data has been increasing exponentially over the last years, namely due to the simultaneous growth in the number of sources (e.g. users, systems or sensors) that are continuously producing data. These data sources produce huge amounts of data with variable representations, which often makes their management by traditional RDBMSs and DWs impracticable. Therefore, there is a need to devise new data models and technologies that can handle such Big Data. NoSQL (Not Only SQL) [26] is one of the most popular approaches to deal with this problem. It consists of a group of non-relational DBMSs that consequently do not represent databases using tables and usually do not use SQL for data manipulation. NoSQL systems allow managing and storing large-scale denormalized datasets, and are designed to scale horizontally. They achieve this by compromising consistency in favor of availability and partition tolerance, in line with Brewer's CAP theorem [27]. Therefore, NoSQL systems are "eventually consistent", i.e. they assume that writes on the data are eventually propagated over time, but there are limited guarantees that different users will read the same value at the same time. NoSQL provides BASE guarantees (Basically Available, Soft state and Eventually consistent) instead of the traditional ACID guarantees, in order to greatly improve performance and scalability [28].
NoSQL databases can be classified into four categories [29]: (1) Key-value stores, (2) Document-oriented databases, (3) Wide-column stores, and (4) Graph databases.
Key-value Stores. A Key-value store represents data as a collection (known as a dictionary or map) of key-value pairs. Every key consists of a unique alphanumeric identifier that works like an index, which is used to access a corresponding value. Values can be simple text strings or more complex structures like arrays. The Key-value model can be extended to an ordered model whose keys are stored in lexicographical order. Being such a simple data model makes Key-value stores ideally suited to retrieving information in a very fast, available and scalable way. For instance, Amazon makes extensive use of a Key-value store system, named Dynamo, to manage the products in its shopping cart [30]. Amazon's Dynamo and Voldemort, which is used by LinkedIn, are two examples of systems that apply this data model with success. An example of a key-value store for both students and professors of the Academic Management System is shown in Figure 4.
Document-oriented Databases. Document-oriented databases (or document stores) were originally created to store traditional documents, like a notepad text file or a Microsoft Word document. However, their concept of a document goes beyond that, and a document can be any kind of domain object [26]. Documents contain data encoded in a standard format like XML, YAML, JSON or BSON (Binary JSON) and are univocally identified in the database by a unique key. Documents contain semi-structured data represented as name-value pairs, which can vary from row to row and can nest other documents. Unlike key-value stores, these systems support secondary indexes and allow full searching either by keys or by values. Document databases are well suited for storing and managing huge collections of textual documents (e.g. text files or email messages), as well as semi-structured or denormalized data that would require an extensive use of "nulls" in an RDBMS [30]. MongoDB and CouchDB are two of the most popular Document-oriented database systems. Figure 5 illustrates two collections of documents for both students and professors of the Academic Management System.
Figure 4: Example of a key-value store for the Academic Management System.
Figure 5: Example of a documents-oriented database for the Academic Management System.
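To make these two representations more tangible, the short Python sketch below mimics a key-value store and a small document collection for the AMS example using plain in-memory structures; the keys, field names and values are invented for illustration and are not taken from the figures.

import json

# key-value view: each record is an opaque value reached through a unique key
kv_store = {
    "student:1001": json.dumps({"name": "Alice", "program": "Computer Science"}),
    "professor:2001": json.dumps({"name": "Bob", "department": "Informatics"}),
}
print(json.loads(kv_store["student:1001"])["name"])

# document view: semi-structured documents whose fields may vary and can nest
students = [
    {"_id": 1001, "name": "Alice", "enrollments": [{"program": "Computer Science", "year": 2015}]},
    {"_id": 1002, "name": "Carol"},   # no enrollments field: the schema is flexible
]
print([s["name"] for s in students if "enrollments" in s])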
Wide-column Stores. Wide-column stores (also known as column-family stores, extensible record stores or column-oriented databases) represent and manage data as sections of columns rather than rows (as in an RDBMS). Each section is composed of key-value pairs, where the keys are rows and the values are sets of columns, known as column families. Each row is identified by a primary key and can have column families different from those of other rows. Each column family also acts as a primary key of the set
of columns it contains. In turn, each column of a column family consists of a name-value pair. Column families can even be grouped into super column families [29]. This data model was highly inspired by Google's BigTable [31]. Wide-column stores are suited for scenarios like: (1) distributed data storage; (2) large-scale and batch-oriented data processing, using the famous MapReduce method for tasks like sorting, parsing, querying or conversion; and (3) exploratory and predictive analytics. Cassandra and Hadoop HBase are two popular frameworks of such data management systems [29]. Figure 6 depicts an example of a wide-column store for the entity "person" of the Academic Management System.
Graph Databases. Graph databases represent data as a network of nodes (representing the domain entities) that are connected by edges (representing the relationships among them) and are characterized by properties expressed as key-value pairs. Graph databases are quite useful when the focus is on exploring the relationships within the data, such as traversing social networks, detecting patterns or inferring recommendations. Due to their visual representation, they are more user-friendly than the aforementioned types of NoSQL databases. Neo4j and Allegro Graph are two examples of such systems.
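As a toy illustration of the graph view of the AMS data, the snippet below keeps nodes and labelled edges in plain Python dictionaries and traverses one relationship; it is not meant to reflect how a real graph DBMS such as Neo4j stores or queries data, and all names are invented.

# nodes and edges with properties, kept in ordinary dictionaries and tuples
nodes = {
    "p1": {"label": "Professor", "name": "Bob"},
    "s1": {"label": "Student", "name": "Alice"},
    "s2": {"label": "Student", "name": "Carol"},
}
edges = [("p1", "supervises", "s1"), ("s1", "knows", "s2")]

def neighbours(node_id, relation):
    # follow outgoing edges of the given type
    return [dst for src, rel, dst in edges if src == node_id and rel == relation]

print([nodes[n]["name"] for n in neighbours("p1", "supervises")])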
DATA ANALYTICS
This section presents and discusses the types of operations that can be performed over the data models described in the previous section and also establishes comparisons between them. A complementary discussion is provided in Section 4.
Operational Databases
Systems using operational databases are designed to handle a high number of transactions that usually perform changes to the operational data, i.e. the data an organization needs to assure its everyday normal operation. These systems are called Online Transaction Processing (OLTP) systems, and they are the reason why RDBMSs are so essential nowadays. RDBMSs have increasingly been optimized to perform well in OLTP systems, namely by providing reliable and efficient data processing [16]. The set of operations supported by RDBMSs is derived from the relational algebra and calculus underlying the Relational Model [15]. As mentioned before, SQL is the standard language to perform these operations. SQL can be divided into two parts involving different types of operations:
the Data Definition Language (SQL-DDL) and the Data Manipulation Language (SQL-DML). SQL-DDL allows performing the creation (CREATE), modification (ALTER) and deletion (DROP) of the various database objects.
Figure 6: Example of a wide-column store for the Academic Management System.
First, it allows managing schemas, which are named collections of all the database objects that are related to one another. Then, inside a schema, it is possible to manage tables, specifying their columns and types, primary keys, foreign keys and constraints. It is also possible to manage views, domains and indexes. An index is a structure that speeds up the process of accessing one or more columns of a given table, possibly improving the performance of queries [15] [16]. For example, considering the Academic Management System, a system manager could create a table for storing information about students by executing the following SQL-DDL command:
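A minimal sketch of what such a statement might look like is shown next, issued here through Python's sqlite3 module for convenience; the table and column names are assumptions made for the AMS example rather than the authors' original definition.

import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory database is enough for a sketch
conn.executescript("""
CREATE TABLE student (
    student_id       INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    gender           TEXT,
    country          TEXT,
    academic_program TEXT,
    enrollment_year  INTEGER
);
""")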
On the other hand, SQL-DML is the language that enables users to manipulate database objects and, in particular, to extract valuable information from the database. The most commonly used and most complex operation is the SELECT operation, which allows users to query data from the various tables of a database. It is a powerful operation because it is capable of performing in a single query the equivalent of the relational algebra's selection, projection and join operations. The SELECT operation returns as output a table with the results. With the SELECT operation it is simultaneously possible to: define which tables the user wants to query (through the FROM clause), which rows satisfy a particular condition (through the WHERE clause) and which columns should appear in the result (through the SELECT clause), order the result (in ascending or descending order) by one or more columns (through the ORDER BY clause), group rows with the same column values (through the GROUP BY clause) and filter those groups based on some condition (through the HAVING clause). The SELECT operation also allows using aggregation functions, which perform arithmetic computation or aggregation of data (e.g. counting or summing the values of one or more columns). Many times there is the need to combine columns of more than one table in the result. To do that, the user can use the JOIN operation in the query. This operation computes a subset of the Cartesian product of the involved tables, i.e. it returns the row pairs where the matching columns in each table have the same value. The most common queries that use joins involve tables that have one-to-many relationships. If the user wants to include in the result the rows that did not satisfy the join condition, then he can use the outer join operations (left, right and full outer join). Besides specifying queries, DML allows modifying the data stored in a database. Namely, it allows adding new rows to a table (through the INSERT statement), modifying the content of a given table's rows (through the UPDATE statement) and deleting rows from a table (through the DELETE statement) [16]. SQL-DML also allows combining the results of two or more queries into a single
result table by applying the Union, Intersect and Except operations, based on set theory [15]. For example, considering the Academic Management System, a system manager could get a list of all students who are from G8 countries by entering the following SQL-DML query:
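As before, only a sketch of what an equivalent query might look like is given here, again run through Python's sqlite3 and against an assumed, simplified student table; the G8 country list and the sample rows are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
INSERT INTO student VALUES (1, 'Alice', 'Portugal'), (2, 'Dave', 'Canada');
""")

g8 = ("Canada", "France", "Germany", "Italy", "Japan", "Russia", "United Kingdom", "United States")
placeholders = ", ".join("?" for _ in g8)
rows = conn.execute(
    f"SELECT name, country FROM student WHERE country IN ({placeholders}) ORDER BY name;",
    g8,
).fetchall()
print(rows)   # [('Dave', 'Canada')]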
Decision Support Databases
The most common data model used in DWs is the OLAP cube, which offers a set of operations to analyze the cube model [23]. Since data is conceptualized as a cube with hierarchical dimensions, its operations have familiar names related to manipulating a cube, such as slice, dice, drill and pivot. Figure 7 depicts these operations considering the Student facts of the AMS case study (see Figure 2). The slice operation begins by selecting one of the dimensions (or faces) of the cube. This dimension is the one we want to consult, and the cube is then "sliced" to a specific depth of interest. The slice operation leaves us with a more restricted selection of the cube, namely the dimension we wanted (front face) and the layer of that dimension (the sliced section). In the example of Figure 7 (top-left), the cube was sliced to consider only data of the year 2004. Dice is the operation that allows restricting the front face of the cube by reducing its size to a smaller targeted domain. This means that the user produces a smaller "front face" than the one he had at the start. Figure 7 (top-right) shows that the set of students has decreased after the dice operation. Drill is the operation that allows navigating through different levels of the dimensions, ranging from the most detailed ones (drill down) to the most summarized ones (drill up). Figure 7 (bottom-left) shows a drill down so the user can see the cities the students from Portugal come from. The pivot operation allows changing the dimension that is being faced (the current front face) to one that is adjacent to it by rotating the cube. By doing this, the user obtains another perspective on the data, which requires the queries to have a different structure but can be more beneficial
for specific queries. For instance, he can slice and dice the cube away to get the results he needed, but sometimes with a pivot most of those operations can be avoided by perceiving a common structure on future queries and pivoting the cube in the correct fashion [23] [24] . Figure 7 (bottom-right) shows a pivot operation where years are arranged vertically and countries horizontally. The usual operations issued over the OLAP cube are about just querying historical events stored in it. So, a common dimension is a dimension associated to time.
Figure 7: Representation of cube operations for the Academic Management System: slice (top-left), dice (top-right), drill up/down (bottom-left) and pivot (bottom-right).
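On a small scale, these cube operations can be approximated with ordinary data-frame tooling. The Python/pandas sketch below builds a tiny enrollment cube and mimics slice, dice, drill-down and pivot; the data and column names are invented, and the example is not a substitute for a real OLAP engine.

import pandas as pd

enrollments = pd.DataFrame([
    {"student": "Alice", "country": "Portugal", "city": "Lisbon", "year": 2004},
    {"student": "Bruno", "country": "Portugal", "city": "Porto",  "year": 2005},
    {"student": "Carla", "country": "Spain",    "city": "Madrid", "year": 2004},
])

# a tiny "cube": countries x years, with the number of students as the measure
cube = enrollments.pivot_table(index="country", columns="year",
                               values="student", aggfunc="count", fill_value=0)

slice_2004 = cube[2004]                          # slice: fix the Date dimension at 2004
dice = cube.loc[["Portugal"], [2004, 2005]]      # dice: restrict both dimensions
drill_down = (enrollments[enrollments["country"] == "Portugal"]
              .groupby(["city", "year"])["student"].count())   # drill down to the city level
pivoted = cube.T                                 # pivot: years become rows, countries columns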
The most popular language for manipulating OLAP cubes is MDX (Multidimensional Expressions) [32], which is a query language for OLAP databases that supports all the operations mentioned above. MDX is exclusively used to analyze and read data, since it was not designed with SQL-DML in mind. The star schema and the OLAP cube are designed a priori with a specific purpose in mind and cannot accept queries that differ much from the ones they were designed to respond to. The benefit of this is that queries are much simpler and faster, and by using a cube it is even quicker to detect patterns, find trends and navigate around the data while "slicing and dicing" it [23] [25]. Again considering the Academic Management System example, the following query represents an MDX select statement. The SELECT clause sets the query axes as the name and the gender of the Student dimension and the year 2015 of the Date dimension. The FROM clause indicates the data source, here being the Students cube, and the WHERE clause defines
the slicer axis as the “Computer Science” value of the Academic Program dimension. This query returns the students (by names and gender) that have enrolled in Computer Science in the year 2015.
Big Data Technologies
Big Data Analytics consists of the process of discovering and extracting potentially useful information hidden in huge amounts of data (e.g. discovering unknown patterns and correlations). Big Data Analytics can be separated into the following categories: (1) batch-oriented processing; (2) stream processing; (3) OLTP; and (4) interactive ad-hoc queries and analysis. Batch-oriented processing is a paradigm where a large volume of data is first stored and only then analyzed, as opposed to stream processing. This paradigm is commonly used to perform large-scale recurring tasks in parallel, like parsing, sorting or counting. The most popular batch-oriented processing model is MapReduce [5], and more specifically its open-source implementation in Hadoop1. MapReduce is based on the divide and conquer (D&C) paradigm to break down complex Big Data problems into small sub-problems and process them in parallel. MapReduce, as its name hints, comprises two major functions: Map and Reduce. First, data is divided into small chunks and distributed over a network of nodes. Then, the Map function, which performs operations like filtering or sorting, is applied simultaneously to each chunk of data, generating intermediate results. After that, those intermediate results are aggregated through the Reduce function in order to compute the final result. Figure 8 illustrates an example of the application of MapReduce to calculate the number of students enrolled in a given academic program by year. This model schedules computation resources close to the data location, which avoids the communication overhead of data transmission. It is simple and widely applied in bioinformatics, web mining and machine learning. Also related to Hadoop's environment, Pig2 and
Hive3 are two frameworks used to express Big Data analysis tasks as MapReduce programs. Pig is suitable for data flow tasks and can produce sequences of MapReduce programs, whereas Hive is more suitable for data summarization, queries and analysis. Both of them use their own SQL-like languages, Pig Latin and Hive QL, respectively [33]. These languages support both CRUD and ETL operations. Stream processing is a paradigm where data arrives continuously in a stream, in real time, and is analyzed as soon as possible in order to derive approximate results. It relies on the assumption that the potential value of data depends on its freshness. Due to its volume, only a portion of the stream is stored in memory [33]. The stream processing paradigm is used in online applications that need real-time precision (e.g. dashboards of production lines in a factory, or calculation of costs depending on usage and available resources). It is supported by Data Stream Management Systems (DSMS) that allow performing SQL-like queries (e.g. select, join, group, count) within a given window of data. This window establishes either a period of time (based on time) or a number of events (based on length) [34]. Storm and S4 are two examples of such systems.
Figure 8: Example of Map Reduce applied to the Academic Management System.
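The essence of the Map and Reduce steps described above can be sketched in a few lines of Python; the enrollment records below are invented, and a real MapReduce job would distribute these steps over many nodes instead of running them in a single process.

from collections import defaultdict

# toy input: one record per enrollment (academic program, year)
records = [
    ("Computer Science", 2014),
    ("Computer Science", 2015),
    ("Biology", 2015),
    ("Computer Science", 2015),
]

def map_phase(record):
    program, year = record
    return ((program, year), 1)          # emit a key-value pair for each record

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:             # group by key and sum the values
        totals[key] += value
    return dict(totals)

counts = reduce_phase(map(map_phase, records))
print(counts)   # {('Computer Science', 2014): 1, ('Computer Science', 2015): 2, ('Biology', 2015): 1}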
OLTP, as we have seen before, is mainly used in the traditional RDBMS. However, these systems cannot assure an acceptable performance when the volume of data and requests is huge, like in Facebook or Twitter. Therefore, it was necessary to adopt NoSQL databases that allow achieving very high
performances in systems with such large loads. Systems like Cassandra4, HBase5 or MongoDB6 are effective solutions currently in use. All of them provide their own query languages, with CRUD operations equivalent to the ones provided by SQL. For example, in Cassandra it is possible to create column families using CQL, in HBase it is possible to delete a column using Java, and in MongoDB it is possible to insert a document into a collection using JavaScript. A query for a MongoDB database equivalent to the SQL-DML query presented previously is sketched at the end of this subsection. Finally, interactive ad-hoc queries and analysis consists of a paradigm that allows querying different large-scale data sources and query interfaces with a very low latency. These types of systems argue that queries should not need more than a few seconds to execute, even at Big Data scale, so that users are able to react to changes if needed. The most popular of these systems is Drill7. Drill works as a query layer that transforms a query written in a human-readable syntax (e.g. SQL) into a logical plan (a query written in a platform-independent way). Then, the logical plan is transformed into a physical plan (a query written in a platform-specific way) that is executed in the desired data sources (e.g. Cassandra, HBase or MongoDB) [35].
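A MongoDB counterpart of the earlier G8 query might look roughly like the following; it is written with the Python client (pymongo) rather than the JavaScript shell, assumes a locally running MongoDB server with an "ams" database and a "students" collection, and uses invented field names.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["ams"]

g8 = ["Canada", "France", "Germany", "Italy", "Japan", "Russia", "United Kingdom", "United States"]
# project only the fields of interest and drop the internal _id
for doc in db.students.find({"country": {"$in": g8}}, {"name": 1, "country": 1, "_id": 0}):
    print(doc)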
DISCUSSION
In this section we compare and discuss the approaches presented in the previous sections in terms of the two perspectives that guide this survey: Data Modeling and Data Analytics. Each perspective defines a set of features used to compare the Operational Database, DW and Big Data approaches among themselves. Regarding the Data Modeling perspective, Table 2 considers the following features of analysis: (1) the data model; (2) the abstraction level at which the data model resides, according to the abstraction levels (Conceptual, Logical and Physical) of the database design process; (3) the concepts or constructs that compose the data model;
Table 2: Comparison of the approaches from the Data Modeling Perspective
(4) the concrete languages used to produce the data models and that apply the previous concepts; (5) the modeling tools that allow specifying diagrams using those languages; and (6) the database tools that support the data model. Table 2 presents the values of each feature for each approach. It is possible to verify that the majority of the data models are at the logical and physical levels, with the exception of the ER model and the OLAP cube model, which are more abstract and defined at the conceptual and logical levels. It is also possible to verify that Big Data has more data models than the other approaches, which can explain the work and proposals that have been conducted over the last years, as well as the absence of a de facto data model. In terms of concepts, again Big Data-related data models have a greater variety of concepts than the other approaches, ranging from key-value pairs or documents to nodes and edges. Concerning concrete languages, it is concluded that every data
model presented in this survey is supported by an SQL-DDL-like language. However, we found that only the operational databases and DWs have concrete languages to express their data models in a graphical way, like Chen's notation for the ER model, the UML Data Profile for the Relational model or CWM [36] for multidimensional DW models. Also related to that point, there are no modeling tools to express Big Data models. Thus, defining such a modeling language and a respective supporting tool for Big Data models constitutes an interesting research direction that would fill this gap. Finally, all approaches have database tools that support development based on their data models, with the exception of the ER model, which is not directly used by DBMSs. On the other hand, in terms of the Data Analytics perspective, Table 3 considers six features of analysis: (1) the class of application domains, which characterizes the approach's suitability; (2) the common operations used in the approach, which can be reads and/or writes; (3) the operation types most typically used in the approach; (4) the concrete languages used to specify those operations; (5) the abstraction level of these concrete languages (Conceptual, Logical and Physical); and (6) the technology support of these languages and operations. Table 3 shows that Big Data is used in more classes of application domains than operational databases and DWs, which are used for OLTP and OLAP domains, respectively. It is also possible to observe that operational databases are commonly used for reads and writes of small operations (using transactions), because they need to handle fresh and critical data on a daily basis. On the other hand, DWs are mostly suited for read operations, since they perform analysis and data mining mostly with historical data. Big Data performs both reads and writes, but in a different way and at a different scale from the other approaches. Big Data applications are built to perform a huge amount of reads, and if a huge amount of writes is needed, as for OLTP, they sacrifice consistency (using "eventual consistency") in order to achieve great availability and horizontal scalability. Operational databases support their data manipulation operations (e.g. select, insert or delete) using SQL-DML, which has slight variations according to the technology used. DWs also use SQL-DML through the select statement, because their operations (e.g. slice, dice or drill down/up) are mostly reads. DWs additionally use SQL-based languages, like MDX and XMLA (XML for Analysis) [37], for specifying their operations. On the other hand, regarding Big Data technologies, there is a great variety of languages to manipulate data according to the different classes of application domains. All of these languages provide equivalent operations
to the ones offered by SQL-DML and add new constructs for supporting ETL, data stream processing (e.g. create stream, window) [34] and MapReduce operations. It is important to note that the concrete languages used in the different approaches reside at the logical and physical levels, because they are directly used by the supporting software tools.
RELATED WORK
As mentioned in Section 1, the main goal of this paper is to present and discuss the concepts surrounding data modeling and data analytics, and their evolution across three representative approaches: operational databases, decision support databases and Big Data technologies.
Table 3: Comparison of the approaches from the Data Analytics perspective
In our survey we have researched related works that also explore and compare these approaches from the data modeling or data analytics point of view. J.H. ter Bekke provides a comparative study of the Relational, Semantic, ER and Binary data models based on the results of an examination session [38]. In that session participants had to create a model of a case study similar to the Academic Management System used in this paper. The purpose was to discover relationships between the modeling approach in
use and the resulting quality. Therefore, that study only addresses the data modeling topic, and more specifically only considers data models associated with the database design process. Several works focus on highlighting the differences between operational databases and data warehouses. For example, R. Hou provides an analysis of operational databases and data warehouses, distinguishing them according to their related theory and technologies, and also establishing common points where combining both systems can bring benefits [39]. C. Thomsen and T.B. Pedersen compare open source ETL tools, OLAP clients and servers, and DBMSs, in order to build a Business Intelligence (BI) solution [40]. P. Vassiliadis and T. Sellis conducted a survey that focuses only on OLAP databases and compares various proposals for the logical models behind them. They group the various proposals into just two categories: commercial tools and academic efforts, which in turn are subcategorized into relational model extensions and cube-oriented approaches [41]. However, unlike our survey, they do not cover the subject of Big Data technologies. Several papers discuss the state of the art of the types of data stores, technologies and data analytics used in Big Data scenarios [29] [30] [33] [42]; however, they do not compare them with other approaches. Recently, P. Chandarana and M. Vijayalakshmi focused on Big Data analytics frameworks and provided a comparative study of their suitability [35]. Summarizing, none of the aforementioned works provides as broad an analysis as we do in this paper; namely, as far as we know, we did not find any paper that simultaneously compares operational databases, decision support databases and Big Data technologies. Instead, they focus on describing one or two of these approaches more thoroughly.
CONCLUSIONS
In recent years, the term Big Data has appeared to classify the huge datasets that are continuously being produced from various sources and that are represented in a variety of structures. Handling this kind of data presents new challenges, because traditional RDBMSs and DWs reveal serious limitations in terms of performance and scalability when dealing with such a volume and variety of data. Therefore, it is necessary to reinvent the ways in which data is represented and analyzed, in order to be able to extract value from it.
This paper presents a survey focused on these two perspectives, data modeling and data analytics, which are reviewed in terms of the three representative approaches of today: operational databases, decision support databases and Big Data technologies. First, concerning data modeling, this paper discusses the most common data models, namely: the relational model and the ER model for operational databases; the star schema model and the OLAP cube model for decision support databases; and key-value stores, document-oriented databases, wide-column stores and graph databases for Big Data-based technologies. Second, regarding data analytics, this paper discusses the common operations used in each approach. Namely, it observes that operational databases are more suitable for OLTP applications, decision support databases are more suited to OLAP applications, and Big Data technologies are more appropriate for scenarios like batch-oriented processing, stream processing, OLTP and interactive ad-hoc queries and analysis. Third, it compares these approaches in terms of the two perspectives and based on some features of analysis. From the data modeling perspective, features like the data model, its abstraction level, its concepts, the concrete languages used to describe it, as well as the modeling and database tools that support it, are considered. On the other hand, from the data analytics perspective, features like the class of application domains, the most common operations and the concrete languages used to specify those operations are taken into account. From this analysis, it is possible to verify that there are several data models for Big Data, but none of them is represented by any modeling language, nor supported by a respective modeling tool. This issue constitutes an open research area that can improve the development process of Big Data targeted applications, namely by applying a Model-Driven Engineering approach [12]-[14]. Finally, this paper also presents some related work in the data modeling and data analytics areas. As future work, we consider that this survey may be extended to capture additional aspects and comparison features that are not included in our analysis. It will also be interesting to survey concrete scenarios where Big Data technologies prove to be an asset [43]. Furthermore, this survey constitutes a starting point for our ongoing research goals in the context of the DataStorm and MDD Lingo initiatives. Specifically, we intend to extend existing domain-specific modeling languages, like XIS [44] and XIS-Mobile [45] [46], and their MDE-based framework to support both the data modeling and data analytics of data-intensive applications, such as those researched in the scope of the DataStorm initiative [47]-[50].
ACKNOWLEDGEMENTS This work was partially supported by national funds through FCT―Fundação para a Ciência e a Tecnologia, under the projects POSC/EIA/57642/2004, CMUP-EPB/TIC/0053/2013, UID/CEC/50021/2013 and Data Storm Research Line of Excellency funding (EXCL/EEI-ESS/0257/2012).
NOTES
1. https://hadoop.apache.org
2. https://pig.apache.org
3. https://hive.apache.org
4. http://cassandra.apache.org
5. https://hbase.apache.org
6. https://www.mongodb.org
7. https://drill.apache.org
REFERENCES
1. Mayer-Schonberger, V. and Cukier, K. (2014) Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt, New York.
2. Noyes, D. (2015) The Top 20 Valuable Facebook Statistics. https://zephoria.com/top-15-valuable-facebook-statistics
3. Shvachko, K., Hairong Kuang, K., Radia, S. and Chansler, R. (2010) The Hadoop Distributed File System. 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, 3-7 May 2010, 1-10. http://dx.doi.org/10.1109/msst.2010.5496972
4. White, T. (2012) Hadoop: The Definitive Guide. 3rd Edition, O'Reilly Media, Inc., Sebastopol.
5. Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data Processing on Large Clusters. Communications, 51, 107-113. http://dx.doi.org/10.1145/1327452.1327492
6. Hurwitz, J., Nugent, A., Halper, F. and Kaufman, M. (2013) Big Data for Dummies. John Wiley & Sons, Hoboken.
7. Beyer, M.A. and Laney, D. (2012) The Importance of "Big Data": A Definition. Gartner. https://www.gartner.com/doc/2057415
8. Duncan, A.D. (2014) Focus on the "Three Vs" of Big Data Analytics: Variability, Veracity and Value. Gartner. https://www.gartner.com/doc/2921417/focus-vs-big-data-analytics
9. Agrawal, D., Das, S. and El Abbadi, A. (2011) Big Data and Cloud Computing: Current State and Future Opportunities. Proceedings of the 14th International Conference on Extending Database Technology, Uppsala, 21-24 March, 530-533. http://dx.doi.org/10.1145/1951365.1951432
10. McAfee, A. and Brynjolfsson, E. (2012) Big Data: The Management Revolution. Harvard Business Review.
11. DataStorm Project Website. http://dmir.inesc-id.pt/project/DataStorm
12. Stahl, T., Voelter, M. and Czarnecki, K. (2006) Model-Driven Software Development: Technology, Engineering, Management. John Wiley & Sons, Inc., New York.
13. Schmidt, D.C. (2006) Guest Editor's Introduction: Model-Driven Engineering. IEEE Computer, 39, 25-31. http://dx.doi.org/10.1109/MC.2006.58
14. Silva, A.R. (2015) Model-Driven Engineering: A Survey Supported by the Unified Conceptual Model. Computer Languages, Systems & Structures, 43, 139-155.
15. Ramakrishnan, R. and Gehrke, J. (2012) Database Management Systems. 3rd Edition, McGraw-Hill, Inc., New York.
16. Connolly, T.M. and Begg, C.E. (2005) Database Systems: A Practical Approach to Design, Implementation, and Management. 4th Edition, Pearson Education, Harlow.
17. Codd, E.F. (1970) A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13, 377-387. http://dx.doi.org/10.1145/362384.362685
18. Bachman, C.W. (1969) Data Structure Diagrams. ACM SIGMIS Database, 1, 4-10. http://dx.doi.org/10.1145/1017466.1017467
19. Chamberlin, D.D. and Boyce, R.F. (1974) SEQUEL: A Structured English Query Language. In: Proceedings of the 1974 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control (SIGFIDET '74), ACM Press, Ann Arbor, 249-264.
20. Chen, P.P.S. (1976) The Entity-Relationship Model—Toward a Unified View of Data. ACM Transactions on Database Systems, 1, 9-36. http://dx.doi.org/10.1145/320434.320440
21. Tanaka, A.K., Navathe, S.B., Chakravarthy, S. and Karlapalem, K. (1991) ER-R, an Enhanced ER Model with Situation-Action Rules to Capture Application Semantics. Proceedings of the 10th International Conference on Entity-Relationship Approach, San Mateo, 23-25 October 1991, 59-75.
22. Merson, P. (2009) Data Model as an Architectural View. Technical Note CMU/SEI-2009-TN-024, Software Engineering Institute, Carnegie Mellon.
23. Kimball, R. and Ross, M. (2013) The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 3rd Edition, John Wiley & Sons, Inc., Indianapolis.
24. Zhang, D., Zhai, C., Han, J., Srivastava, A. and Oza, N. (2009) Topic Modeling for OLAP on Multidimensional Text Databases: Topic Cube and Its Applications. Statistical Analysis and Data Mining, 2, 378-395. http://dx.doi.org/10.1002/sam.10059
25. Gray, J., et al. (1997) Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery, 1, 29-53. http://dx.doi.org/10.1023/A:1009726021843
26. Cattell, R. (2011) Scalable SQL and NoSQL Data Stores. ACM SIGMOD Record, 39, 12-27. http://dx.doi.org/10.1145/1978915.1978919
27. Gilbert, S. and Lynch, N. (2002) Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News, 33, 51-59.
28. Vogels, W. (2009) Eventually Consistent. Communications of the ACM, 52, 40-44. http://dx.doi.org/10.1145/1435417.1435432
29. Grolinger, K., Higashino, W.A., Tiwari, A. and Capretz, M.A.M. (2013) Data Management in Cloud Environments: NoSQL and NewSQL Data Stores. Journal of Cloud Computing: Advances, Systems and Applications, 2, 22. http://dx.doi.org/10.1186/2192-113x-2-22
30. Moniruzzaman, A.B.M. and Hossain, S.A. (2013) NoSQL Database: New Era of Databases for Big Data Analytics—Classification, Characteristics and Comparison. International Journal of Database Theory and Application, 6, 1-14.
31. Chang, F., et al. (2006) Bigtable: A Distributed Storage System for Structured Data. Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06), Seattle, 6-8 November 2006, 205-218.
32. Spofford, G., Harinath, S., Webb, C. and Civardi, F. (2005) MDX Solutions: With Microsoft SQL Server Analysis Services 2005 and Hyperion Essbase. John Wiley & Sons, Inc., Indianapolis.
33. Hu, H., Wen, Y., Chua, T.S. and Li, X. (2014) Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. IEEE Access, 2, 652-687. http://dx.doi.org/10.1109/ACCESS.2014.2332453
34. Golab, L. and Ozsu, M.T. (2003) Issues in Data Stream Management. ACM SIGMOD Record, 32, 5-14. http://dx.doi.org/10.1145/776985.776986
35. Chandarana, P. and Vijayalakshmi, M. (2014) Big Data Analytics Frameworks. Proceedings of the International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA), Mumbai, 4-5 April 2014, 430-434. http://dx.doi.org/10.1109/cscita.2014.6839299
36. Poole, J., Chang, D., Tolbert, D. and Mellor, D. (2002) Common Warehouse Metamodel. John Wiley & Sons, Inc., New York.
37. XML for Analysis (XMLA) Specification.https://msdn.microsoft.com/ en-us/library/ms977626.aspx. 38. ter Bekke, J.H. (1997) Comparative Study of Four Data Modeling Approaches. Proceedings of the 2nd EMMSAD Workshop, Barcelona, 16-17 June 1997, 1-12. 39. Hou, R. (2011) Analysis and Research on the Difference between Data Warehouse and Database. Proceedings of the International Conference on Computer Science and Network Technology (ICCSNT), Harbin, 24-26 December 2011, 2636-2639. 40. Thomsen, C. and Pedersen, T.B. (2005) A Survey of Open Source Tools for Business Intelligence. Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK’05), Copenhagen, 22-26 August 2005, 74-84.http://dx.doi. org/10.1007/11546849_8 41. Vassiliadis, P. and Sellis, T. (1999) A Survey of Logical Models for OLAP Databases. ACM SIGMOD Record, 28, 64-69.http://dx.doi. org/10.1145/344816.344869 42. Chen, M., Mao, S. and Liu, Y. (2014) Big Data: A Survey. Mobile Networks and Applications, 19, 171-209.http://dx.doi.org/10.1007/9783-319-06245-7 43. Chen, H., Hsinchun, R., Chiang, R.H.L. and Storey, V.C. (2012) Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly, 36, 1165-1188. 44. Silva, A.R., Saraiva, J., Silva, R. and Martins, C. (2007) XIS-UML Profile for Extreme Modeling Interactive Systems. Proceedings of the 4th International Workshop on Model-Based Methodologies for Pervasive and Embedded Software (MOMPES’07), Braga, 31-31 March 2007, 55-66.http://dx.doi.org/10.1109/MOMPES.2007.19 45. Ribeiro, A. and Silva, A.R. (2014) XIS-Mobile: A DSL for Mobile Applications. Proceedings of the 29th Symposium on Applied Computing (SAC 2014), Gyeongju, 24-28 March 2014, 1316-1323. http://dx.doi.org/10.1145/2554850.2554926 46. Ribeiro, A. and Silva, A.R. (2014) Evaluation of XIS-Mobile, a Domain Specific Language for Mobile Application Development. Journal of Software Engineering and Applications, 7, 906-919.http:// dx.doi.org/10.4236/jsea.2014.711081
Data Modeling and Data Analytics: A Survey from a Big Data Perspective
85
47. Silva, M.J., Rijo, P. and Francisco, A. (2014). Evaluating the Impact of Anonymization on Large Interaction Network Datasets. In: Proceedings of the 1st International Workshop on Privacy and Security of Big Data, ACM Press, New York, 3-10.http://dx.doi. org/10.1145/2663715.2669610 48. Anjos, D., Carreira, P. and Francisco, A.P. (2014) Real-Time Integration of Building Energy Data. Proceedings of the IEEE International Congress on Big Data, Anchorage, 27 June-2 July 2014, 250-257. http://dx.doi.org/10.1109/BigData.Congress.2014.44 49. Machado, C.M., Rebholz-Schuhmann, D., Freitas, A.T. and Couto, F.M. (2015) The Semantic Web in Translational Medicine: Current Applications and Future Directions. Briefings in Bioinformatics, 16, 89-103.http://dx.doi.org/10.1093/bib/bbt079 50. Henriques, R. and Madeira, S.C. (2015) Towards Robust Performance Guarantees for Models Learned from High-Dimensional Data. In: Hassanien, A.E., Azar, A.T., Snasael, V., Kacprzyk, J. and Abawajy, J.H., Eds., Big Data in Complex Systems, Springer, Berlin, 71-104. http://dx.doi.org/10.1007/978-3-319-11056-1_3
Chapter 4
Big Data Analytics for Business Intelligence in Accounting and Audit
Mui Kim Chu and Kevin Ow Yong
Singapore Institute of Technology, 10 Dover Drive, Singapore
ABSTRACT

Big data analytics represents a promising area for the accounting and audit professions. We examine how machine learning applications, data analytics and data visualization software are changing the way auditors and accountants work with their clients. We find that audit firms are keen to use machine learning software tools to read contracts, analyze journal entries, and assist in fraud detection. In data analytics, predictive analytical tools are utilized by both accountants and auditors to make projections and estimates, and to enhance business intelligence (BI). In addition, data visualization tools are able to complement predictive analytics to help users uncover trends in the business process. Overall, we anticipate that the technological advances in these various fields will accelerate in the coming years. Thus, it is imperative that accountants and auditors embrace these technological advancements and harness these tools to their advantage.

Citation: Chu, M. and Yong, K. (2021), “Big Data Analytics for Business Intelligence in Accounting and Audit”. Open Journal of Social Sciences, 9, 42-52. doi: 10.4236/jss.2021.99004.

Copyright: © 2021 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
Keywords: Data Analytics, Machine Learning, Data Visualization, Audit Analytics
INTRODUCTION

Big data analytics has transformed the world that we live in. Due to technological advances, big data analytics enables new forms of business value and enterprise risk that will have an impact on the rules, standards and practices for the finance and accounting professions. Accounting and audit professionals are important players in harnessing the power of big data analytics, and they are poised to become even more vital to stakeholders in supporting data and insight-driven enterprises. Data analytics can enable auditors to focus on exception reporting more efficiently by identifying outliers in risky areas of the audit process (IAASB, 2018). The advent of inexpensive computational power and storage, as well as the progressive computerization of organizational systems, is creating a new environment in which accountants and auditors must adapt to harness the power of big data analytics. In other applications, data analytics can help auditors to improve the risk assessment process, substantive procedures and tests of controls (Lim et al., 2020). These software tools have the potential to provide further evidence to assist with audit judgements and provide greater insights for audit clients.

In machine learning applications, the expectation is that the algorithm will learn from the data provided, in a manner that is similar to how a human being learns from data. A classic application of machine learning tools is pattern recognition. Facial recognition machine learning software has been developed such that a machine-learning algorithm can look at pictures of men and women and distinguish features associated with male faces from those associated with female faces. Initially, the algorithm might misclassify some male faces as female faces. It is thus important for the programmer to write an algorithm that can be trained on labeled training data to look for specific patterns in male and female faces. Because machine learning requires large data sets in order to train the learning algorithms, the availability of a vast quantity of high-quality data will expedite the process by allowing the programmer to refine the machine learning algorithms to be able to identify pictures that contain a male face as opposed to a female face. Gradually, the algorithm will be able to classify some general characteristics of a man (e.g., spotting a beard, certain
differences in hair styles, broad faces) from those that belong to a woman (e.g., more feminine characteristics).

Similarly, it is envisaged that many routine accounting processes will be handled by machine learning algorithms or robotic process automation (RPA) tools in the near future. For example, it is possible that machine learning algorithms can receive an invoice, match it to a purchase order, determine the expense account to charge and the amount to be paid, and place it in a pool of payments for a human employee to review the documents and release them for payment to the respective vendors. Likewise, in auditing a client, a well-designed machine learning algorithm could make it easier to detect potential fraudulent transactions in a company’s financial statements, by training the algorithm to distinguish transactions that have characteristics associated with fraudulent activity from bona fide transactions. The evolution of machine learning is thus expected to have a dramatic impact on business, and the accounting profession will need to adapt and learn how to utilize such technologies when auditing the financial statements of their clients (Haq, Abatemarco, & Hoops, 2020).

Predictive analytics is a subset of data analytics. It helps the accountant or auditor understand the future, providing foresight by identifying patterns in historical data. One of the most common applications of predictive analytics in the field of accounting is the computation of a credit score to indicate the likelihood of timely future credit payments. This predictive analytics tool can be used to predict an accounts receivable balance at a certain date and to estimate a collection period for each customer.

Data visualization tools are becoming increasingly popular because of the way these tools help users obtain better insights, draw conclusions and handle large datasets (Skapoullis, 2018). For example, auditors have begun to use visualizations as a tool to look at multiple accounts over multiple years to detect misstatements. If an auditor is attempting to examine a company’s accounts payable (AP) balances over the last ten years compared to the industry average, a data visualization tool like Power BI or Tableau can quickly produce a graph that compares two measures against one dimension. The measures are the quantitative data, which are the company’s AP balances versus the industry averages. The dimension is a qualitative categorical variable. What distinguishes these data visualization tools from a
simple Excel graph is that this information (“sheet”) can be easily formatted and combined with other important information (“other sheets”) to create a dashboard, where numerous sheets are compiled to give the auditor a cohesive view of misstatement risk or anomalies in the company’s AP balances. As real-time data is streamed to update the dashboard, auditors can also examine the most current transactions that affect AP balances, enabling the auditor to perform a continuous audit. A dashboard that provides real-time alerts enables collaboration among the audit team on a continuous basis, coupled with real-time supervisory review. Analytical procedures and tests of transactions can be performed more continuously, and the auditor can investigate unusual fluctuations more promptly. The continuous review can also help to even out the workload of the audit team, as team members are kept abreast of the client’s business environment and financial performance throughout the financial year.

The next section discusses machine learning applications to aid the audit process. Section 3 describes predictive analytics and how accountants and auditors use these tools to generate actionable insights for companies. Section 4 discusses data visualization and its role in the accounting and audit profession. Section 5 concludes.
MACHINE LEARNING

Machine Learning in Audit Processes

Machine learning is a subset of artificial intelligence that automates analytical model building. Machine learning uses these models to perform data analysis in order to understand patterns and make predictions. The machines are programmed to use an iterative approach to learn from the analyzed data, making the learning an automated and continuous process. As the machine is exposed to a greater amount of data, more robust patterns are recognized, and this iterative process helps to refine the data analysis.

Machine learning and traditional statistical analysis are similar in many respects. However, while statistical analysis is based on probability theory and probability distributions, machine learning is designed to find the combination of mathematical equations that best predicts an outcome. Thus, machine learning is well suited for a broad range of problems that involve classification, linear regression, and cluster analysis.
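As a concrete, if simplified, illustration of automated analytical model building in an audit setting, the sketch below trains a classifier to flag potentially risky transactions. It is a minimal example only: the synthetic data, the feature names, the rule used to label the data, and the choice of scikit-learn's logistic regression are assumptions made for illustration, not a description of any firm's tooling.

```python
# Minimal sketch: training a classifier to flag risky transactions.
# All data, features, and the labeling rule are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
n = 1000

# Synthetic features: transaction amount, posting hour, days until approval.
X = np.column_stack([
    rng.lognormal(mean=8, sigma=1, size=n),   # transaction amount
    rng.integers(0, 24, size=n),              # posting hour
    rng.integers(0, 30, size=n),              # days until approval
])
# Synthetic label: "risky" if a large amount is posted outside office hours.
y = ((X[:, 0] > 5000) & ((X[:, 1] < 6) | (X[:, 1] > 20))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

In practice, the labels would come from historical audit findings rather than a hand-written rule, and transactions flagged by the model would be reviewed by the engagement team rather than acted on automatically.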
The predictive reliability of machine learning applications depends on the quality of the historical data that has been fed to the machine. New and unforeseen events may create invalid results if they are left unidentified or inappropriately weighted. As a result, human biases can influence the use of machine learning. Such biases can affect which data sets are chosen for training the AI application, the methods chosen for the process, and the interpretation of the output. Finally, although machine learning technology has great potential, its models are still currently limited by many factors, including data storage and retrieval, processing power, algorithmic modeling assumptions, and human errors and judgment.

Machine learning technology for auditing is a very promising area (Dickey et al., 2019). Several of the Big 4 audit firms have machine learning systems under development, and smaller audit firms are beginning to benefit from the improving viability of this technology. It is expected that auditing standards will adapt to take into account the use of machine learning in the audit process. Regulators and standard setters will also need to consider how they can incorporate the impact of this technology into their regulatory and decision-making processes. Likewise, educational programs will continue to evolve toward this new paradigm. We foresee that accounting programs with data analytics and machine learning specializations will become the norm rather than the exception.

Although there are certain limitations to the current capability of machine learning, it excels at performing repetitive tasks. Because an audit process requires a vast amount of data and has a significant number of task-related components, machine learning has the potential to increase both the speed and quality of audits. By harnessing machine-based performance of repetitive tasks, auditors will have more time to undertake review and analytical work.
Current Audit Use Cases

Audit firms are already testing and exploring the power of machine learning in audits. One example is Deloitte’s use of Argus, a machine learning tool that “learns” from every human interaction and leverages advanced machine learning techniques and natural language processing to automatically identify and extract key accounting information from any type of electronic document such as leases, derivatives contracts, and sales contracts. Argus is programmed with algorithms that allow it to identify key contract terms, as well as trends and outliers. It is highly possible for a well-designed machine
not just to read a lease contract, identify key terms, and determine whether it is a capital or operating lease, but also to interpret nonstandard leases involving significant judgments (e.g., those with unusual asset retirement obligations). This would allow auditors to review and assess larger samples (even up to 100% of the documents), spend more time on judgemental areas, and provide greater insights to audit clients, thus improving both the speed and quality of the audit process.

Another example of machine learning technology currently used by PricewaterhouseCoopers is Halo. Halo analyzes journal entries and can identify potentially problematic areas, such as entries with keywords of a questionable nature, entries from unauthorized sources, or an unusually high number of journal entry postings just under authorized limits. Because Halo, like Argus, allows auditors to test 100% of the journal entries and focus only on the outliers with the highest risk, both the speed and quality of the testing procedures are significantly improved.
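To make the journal-entry screening idea more tangible, the sketch below applies a few simple rules to a toy set of entries using pandas. The column names, keyword list, and approval threshold are assumptions chosen for illustration; this is not how Halo itself is implemented, and a real tool would combine such rules with learned models over the full population of entries.

```python
# Minimal sketch of rule-based journal-entry screening with pandas.
# Column names, keywords, and thresholds are illustrative assumptions.
import pandas as pd

entries = pd.DataFrame({
    "entry_id": [1, 2, 3, 4],
    "description": ["Accrual adjustment", "plug to balance", "Rent", "misc fix"],
    "posted_by": ["A. Lee", "temp_user7", "A. Lee", "B. Tan"],
    "amount": [12000.00, 4999.00, 8000.00, 4950.00],
})

AUTHORIZED_USERS = {"A. Lee", "B. Tan"}
SUSPECT_KEYWORDS = ["plug", "misc", "adjust to tie"]
APPROVAL_LIMIT = 5000.00

flags = pd.DataFrame({
    "keyword_hit": entries["description"].str.lower()
        .apply(lambda d: any(k in d for k in SUSPECT_KEYWORDS)),
    "unauthorized_source": ~entries["posted_by"].isin(AUTHORIZED_USERS),
    "just_under_limit": entries["amount"].between(
        0.95 * APPROVAL_LIMIT, APPROVAL_LIMIT, inclusive="left"),
})
entries["risk_score"] = flags.sum(axis=1)
print(entries.sort_values("risk_score", ascending=False))
```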
Potential Machine Learning Applications

Audit firms and academics are studying additional ways that machine learning can be used in financial statement audits, particularly in the risk assessment process. For example, machine learning technologies such as speech recognition could be used to examine and diagnose executive fraud interviews. The software can identify situations where interviewees give questionable answers, such as “sort of” or “maybe,” that suggest potentially deceptive behavior. Significant delays in responses, which might also indicate deliberate concealment of information, can also be picked up by such speech recognition technology. Facial recognition technologies can be applied to fraud interviews as well. AI software that uses facial recognition can help to identify facial patterns that suggest excess nervousness or deceit during interviews. Speech and facial recognition technology in fraud interviews could complement auditors by notifying them when higher-risk responses warrant further investigation.

One study assessed risk with machine learning by using a deep neural network (DNN) model to develop and test a drive-off scenario involving an oil and gas drilling rig. The results show a reasonable level of accuracy for DNN predictions and a partial suitability to overcome risk assessment challenges. Such a deep learning approach could be extended to auditing by training the model on past indicators of inherent
risk, for the purpose of assessing the risk of material misstatement. Data from various exogenous sources, such as forum posts, comments, conversations from social media, press releases, news, and management discussion notes, can be used to supplement traditional financial attributes in training the model to assess inherent risk levels (Paltrinieri et al., 2019).

Machine learning for risk assessment can also be applied to the assessment of going concern risk. By studying the traits of companies that have experienced financial distress, a Probability of Default (PD) model can be developed, with the aim of quantifying going concern risk on a timelier basis. The predictive model requires an indicator of financial distress and a set of indicators, drawn from environmental and financial performance scanning, to produce a PD that is dynamically updated according to firm performance (Martens et al., 2008); a simplified sketch of such a model appears at the end of this section.

The impact on businesses and the accounting profession will undoubtedly be significant in the near future. The major public accounting firms are focused on providing their customers with the expertise needed to deploy machine learning algorithms that accelerate and improve business decisions while lowering costs. In May 2018, PricewaterhouseCoopers announced a joint venture with eBrevia, a contract analytics software company, to develop machine learning algorithms for contract analysis. Those algorithms could be used to review documents related to lease accounting and revenue recognition standards as well as other business activities, such as mergers and acquisitions, financings, and divestitures. In the area of advisory services, Deloitte has advised retailers on how they can enhance customer experience by using machine learning to target products and services based on past buying patterns. While the major public accounting firms may have the financial resources to invest in machine learning, small public accounting firms can leverage these technological solutions and use pre-built machine learning algorithms to develop expertise through their own implementations at a smaller scale.
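To illustrate, in a highly simplified form, what such a probability-of-default style screen might look like, the sketch below fits a logistic regression on a few invented financial ratios. The ratios, the synthetic distress labels, and the model choice are assumptions for illustration; this is not the model estimated by Martens et al. (2008).

```python
# Minimal sketch of a probability-of-default (PD) style model for going
# concern screening. All ratios and labels are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500

leverage = rng.uniform(0.0, 2.0, n)         # debt / equity
liquidity = rng.uniform(0.2, 3.0, n)        # current ratio
profitability = rng.normal(0.05, 0.1, n)    # return on assets

# Synthetic distress label for demonstration purposes only.
distress = ((leverage > 1.2) & (liquidity < 1.0) & (profitability < 0)).astype(int)

X = np.column_stack([leverage, liquidity, profitability])
pd_model = LogisticRegression(max_iter=1000).fit(X, distress)

# A PD that can be re-estimated as new financials arrive.
new_firm = np.array([[1.5, 0.8, -0.02]])
print(f"Estimated probability of distress: {pd_model.predict_proba(new_firm)[0, 1]:.2f}")
```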
DATA ANALYTICS

Predictive Analysis in Accounting Processes

Traditionally, accounting has focused more on fact-based, historical reporting. While this perspective helps executives in analyzing historical results so they can adjust their strategic and operational plans going forward,
it does not necessarily help them better predict and more aggressively plan for the future. Finding the right solution to enable a detailed analysis of financial data is critical in moving from looking at historical financial data to finding predictors that enable forward-looking business intelligence (BI). A BI solution leverages patterns in an organization's data. Looking at consolidated data in an aggregate manner, rather than in a piecemeal, ad-hoc process across separate information systems, provides an opportunity to uncover hidden trends and is a useful capability for predictive analytics. For example, in customer relationship management (CRM) systems, improved forecasting helps in planning for capacity peaks and troughs that directly impact the customer experience, response time, and transaction volumes.

Many accountants are already using data analytics in their daily work. They compute sums, averages, and percent changes to report sales results, customer credit risk, cost per customer, and availability of inventory. Accountants are also generally familiar with diagnostic analytics because they perform variance analyses and use analytic dashboards to explain historical results. Attempts to predict financial performance by leveraging nonfinancial performance measures that may be good predictors are expected to gain much traction in the coming years. This presents a great opportunity for accountants to play a much more valuable role for management. Hence, accountants should further harness the power of data analytics to perform their roles effectively.

Predictive analytics and prescriptive analytics are important because they provide actionable insights for companies. Accountants need to increase their competence in these areas to provide value to their organizations. Predictive analytics integrates data from various sources (such as enterprise resource planning, point-of-sale, and customer relationship management systems) to predict future outcomes based on statistical relationships found in historical data using regression-based modeling. One of the most common applications of predictive analytics is the computation of a credit score to indicate the likelihood of timely future credit payments. Prescriptive analytics utilizes a combination of sophisticated optimization techniques (self-optimizing algorithms) to make recommendations on the most favorable courses of action to be taken.
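As a small illustration of regression-based predictive analytics, the sketch below fits a linear model that estimates how long a customer will take to pay an invoice. The data, the two features, and the linear form are invented assumptions; a production credit-scoring or collection-period model would draw on far richer ERP and CRM data.

```python
# Minimal sketch of regression-based predictive analytics: estimating the
# collection period for an open invoice. Data and features are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300

invoice_amount = rng.uniform(1_000, 50_000, n)
past_avg_days = rng.uniform(10, 90, n)        # customer's historical average
days_to_pay = past_avg_days + 0.0003 * invoice_amount + rng.normal(0, 5, n)

X = np.column_stack([invoice_amount, past_avg_days])
model = LinearRegression().fit(X, days_to_pay)

# Forecast the collection period for a new open invoice.
open_invoice = np.array([[25_000, 45]])
print(f"Expected days to collect: {model.predict(open_invoice)[0]:.0f}")
```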
The analytics skills that an accountant needs will differ depending on whether the accounting professional produces or consumes information. Analytics production includes sourcing relevant data and performing analyses, which is more suitable for junior-level accountants. Analytics consumption is using the insights gained from analytics in decision-making and is more relevant for senior-level roles. Accountants are not expected to retool as data scientists or computer engineers to harness analytics tools. Nevertheless, it is most important that the audit and accounting professions become more proficient consumers of analytics, both to enhance their current audit practice with available technologies and to support their client base in undertaking data analytics activities (Tschakert et al., 2016).
Data Analytics Applications in Audit Processes

Audit Data Analytics (ADAs) help auditors discover and analyze patterns, identify anomalies, and extract other useful information from audit data through analysis, modeling and visualization. Auditors can use ADAs to perform a variety of procedures to gather audit evidence, to help with the extraction of data, and to illustrate where audit data analytics can be used in a typical audit program (McQuilken, 2019).

Governance, risk and control, and compliance monitoring systems commonly used by larger companies include systems developed by Oracle, SAP and RSA Archer. Oracle and SAP have application-side business intelligence systems centred on business warehouses. Lavastorm, Alteryx and Microsoft’s SQL Server provide advanced tools for specialists such as business analysts and, increasingly, for non-specialists. All these platforms are currently the preserve of large systems integrators, larger and mid-tier firm consultancies and specialist data analysts. It seems likely, though, that over time these systems will move in-house or be provided as managed services. It also seems likely that companies such as CaseWare and Validis, which currently provide data analytics services to larger and mid-tier firms, will continue to enable those firms to offer data analytics services to their own clients. Some businesses already analyze their own data in a similar manner to auditors. As these business analyses become deeper, wider, and more sophisticated, with a focus on risk and performance, it seems likely that they will align at least in part with the risks assessed by external auditors.
Data analytics is rooted in software originally developed in the early 2000s for data mining in the banking and retail sectors, and for design and modelling in financial services and engineering. What is remarkable about these tools is the volume of data they can handle efficiently on an industrial scale and the speed with which calculations are performed, in a fraction of a second. The types of tasks such software can perform, and the connections it can make, dwarf what was previously possible. These technological improvements have facilitated the advances that we have seen in data analytics software (Davenport, 2016).
Current Audit Use Cases

By using data analytics procedures, accountants and auditors can produce high-quality statistical forecasts that help them understand and identify risks relating to the frequency and value of accounting transactions. Some of these procedures are simple; others involve complex models. Auditors using these models will exercise professional judgement to determine mathematical and statistical patterns, helping them identify exceptions for extended testing (Zabeti, 2019). Auditors commonly use data analytics procedures to examine:
• Receivables and payables ageing;
• Analysis of gross margins and sales, highlighting items with negative margins;
• Analysis of capital expenditure versus repairs and maintenance;
• Matching of orders and purchases;
• Testing of journal entries.

Although data analytics techniques may not entirely substitute for traditional audit procedures and techniques, they can be powerful enablers that allow auditors to perform procedures and analyses that were not traditionally possible. For example, a three-way match is one of the most basic procedures in an audit. Traditionally, auditors perform this procedure by way of sample testing, as it is neither realistic nor expected for auditors to vouch all transaction documents. Data analytics techniques now give auditors the ability to analyze all the transactions that have been recorded. Hence, auditors can filter and identify a specific class of transactions with unmatched items. Data analytics tools can also allow auditors to trace revenue transactions to debtors and the subsequent cash received, and also
analyze payments made after the period end. This technique relates subsequent payments to the delivery dates extracted from the underlying delivery documents to ascertain whether the payments relate to goods delivered before or after the period end, and to determine the amount of any unrecorded liability.
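The sketch below shows how a full-population three-way match might be expressed with pandas. The table layouts, column names, and amounts are assumptions chosen for brevity; a real engagement would work from extracted ERP tables and add tolerance thresholds, currency handling, and partial-delivery logic.

```python
# Minimal sketch of a full-population three-way match with pandas.
# Table layouts and column names are illustrative assumptions.
import pandas as pd

orders = pd.DataFrame({"po": ["P1", "P2", "P3"], "po_amount": [500, 800, 250]})
receipts = pd.DataFrame({"po": ["P1", "P2"], "received_amount": [500, 750]})
invoices = pd.DataFrame({"po": ["P1", "P2", "P4"], "inv_amount": [500, 800, 100]})

matched = (orders
           .merge(receipts, on="po", how="outer")
           .merge(invoices, on="po", how="outer"))

# Flag every purchase order where the three documents disagree or are missing.
matched["exception"] = (
    matched[["po_amount", "received_amount", "inv_amount"]].isna().any(axis=1)
    | (matched["po_amount"] != matched["received_amount"])
    | (matched["po_amount"] != matched["inv_amount"])
)
print(matched[matched["exception"]])
```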
DATA VISUALIZATION

Data Visualization in Accounting and Audit Processes

The auditing and accounting professions have allocated a large amount of resources to understanding the impact of different data visualization techniques on decision making and analytical procedures. As technology evolves, the size and volume of data continuously grow, and new ways to present information emerge, it is vital for accounting and auditing research to examine newer data visualization techniques (Alawadhi, 2015).

The main objective of data visualization is to help users obtain better insights, draw better conclusions and eventually create hypotheses. This is achieved by bringing the user’s perceptual abilities into the data analysis process, and applying their flexibility, creativity, and general knowledge to the large data sets available in today’s systems. Data visualization offers several main advantages. It presents data in a concise manner. It allows for faster data exploration in large data sets. Finally, data visualization tools are intuitive and do not require an understanding of complex mathematical or statistical algorithms.

New software is constantly being developed to help users work with the ever-increasing volume of data produced by businesses. More and more accounting firms and private businesses are using new BI tools such as Tableau, Power BI and QlikSense (Eaton & Baader, 2018). Auditors have begun to use visualizations as a tool to look at multiple accounts over multiple years to detect misstatements. These tools can be used in risk analysis, transaction and controls testing, analytical procedures, in support of judgements, and to provide insights. Many data analytics routines can now easily be performed by auditors with little or no management involvement, and the ability to perform these analyses independently is important. Many routines can be performed at a very detailed level: higher-level routines can be used for risk analysis to find a problem, while more detailed analysis can be used to provide audit evidence and/or insights.
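To ground the accounts payable example from the introduction, the sketch below produces a simple ten-year comparison of a company's AP balances against an industry average. The figures are invented, and matplotlib is used here only as a stand-in for the kind of sheet an auditor would build in Power BI or Tableau and combine with other sheets into a dashboard.

```python
# Minimal sketch: plotting a company's AP balances against an invented
# industry average over ten years. All figures are for illustration only.
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    "year": list(range(2013, 2023)),
    "company_ap": [1.2, 1.3, 1.5, 1.4, 1.9, 2.4, 2.3, 2.8, 3.5, 3.3],      # $m
    "industry_avg_ap": [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0],  # $m
})

ax = data.plot(x="year", y=["company_ap", "industry_avg_ap"], marker="o")
ax.set_ylabel("Accounts payable ($m)")
ax.set_title("Company AP vs. industry average")
plt.tight_layout()
plt.show()
```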
Another promising feature of data visualization tools relates to audit engagement communication. With these tools, information can be summarized and presented in a way that both captures attention and is sufficient. A reader of the report gets the required information with a simple glance at a visual presentation. An opinion may be much more powerful if it is accompanied by a visualization of facts rather than by statements describing the factors that support the opinion. Introducing visualization techniques can make reports easier to read and understand while focusing on the main figures of what an auditor is trying to report.

While analyzing data is the crux of an external audit, it is critical that auditors know how to work with data. Doing so ensures they will better understand their client and plan a quality audit. As the pace of innovation continues to increase, data visualization may become a necessary part of the job for many accountants and auditors. Accountants and auditors need to use vast amounts of data not only to report on the past but also to provide timely assurance and insights about the business’s future. They need to employ dynamic analytics and visualization tools to increase the impact of their opinions and recommendations. Thus, it is imperative that the accounting profession adopts and implements dynamic reporting and visualization techniques that deal with the big-data problem and produce results that enhance the ability to make an impact and gain influence.
CONCLUSION

The use of automation, big data and other technological advances such as machine learning will continue to grow in accounting and audit, producing important business intelligence tools that provide historical, current and predictive views of business operations in interactive data visualizations. Business intelligence systems allow accounting professionals to make better decisions by analyzing very large volumes of data from all lines of business, resulting in increased productivity and accuracy and better insights to make more informed decisions. The built-in, customizable dashboards allow for real-time reporting and analysis, where exceptions, trends and opportunities can be identified and transactional data drilled down for greater detail. Analytics, artificial intelligence, and direct linkages to clients’ transaction systems can allow audits to be a continuous rather than an annual process, and material misstatements and financial irregularities can be detected in real time as they occur, providing near real-time assurance.
Audit team members could spend less time performing repetitive low-level tasks in verifying transactional data and more time on high-value tasks, focusing their efforts on the interpretation of the results produced by machines. With an adequate understanding of the wider business and economic environment in which the client entity operates, including changes in technology or competition, auditors are better able to assess the reasonableness of the assumptions made by management, instead of just focusing on mechanical details. Such improvements will enhance the application of professional skepticism in framing the auditor’s judgments when performing risk assessment procedures and, consequently, in designing an audit strategy and approach that is responsive to the assessed risks of material misstatement. As audits become substantially more automated in the future, auditors could also provide valuable insights to clients, such as how the clients’ performance compares with similar companies on key metrics and benchmarks, providing value-added services in addition to the audit service. Eventually, it will be the investing public who benefit from higher quality, more insightful audits powered by machine learning and big data analysis across clients and industries.
ACKNOWLEDGEMENTS

Mui Kim Chu is a senior lecturer at Singapore Institute of Technology. Kevin Ow Yong is an associate professor at Singapore Institute of Technology. We wish to thank Khin Yuya Thet for her research assistance. All errors are our own.
REFERENCES

1. Alawadhi, A. (2015). The Application of Data Visualization in Auditing. Rutgers, The State University of New Jersey.
2. Davenport, T. H. (2016). The Power of Advanced Audit Analytics: Everywhere Analytics. Deloitte Development LLC. https://www2.deloitte.com/content/dam/Deloitte/us/Documents/deloitte-analytics/us-da-advanced-audit-analytics.pdf
3. Dickey, G., Blanke, S., & Seaton, L. (2019). Machine Learning in Auditing: Current and Future Applications. The CPA Journal, 89, 16-21.
4. Eaton, T., & Baader, M. (2018). Data Visualization Software: An Introduction to Tableau for CPAs. The CPA Journal, 88, 50-53.
5. Haq, I., Abatemarco, M., & Hoops, J. (2020). The Development of Machine Learning and its Implications for Public Accounting. The CPA Journal, 90, 6-9.
6. IAASB (2018). Exploring the Growing Use of Technology in the Audit, with a Focus on Data Analytics. International Auditing and Assurance Standards Board.
7. Lim, J. M., Lam, T., & Wang, Z. (2020). Using Data Analytics in a Financial Statement Audit. IS Chartered Accountant Journal.
8. Martens, D., Bruynseels, L., Baesens, B., Willekens, M., & Vanthienen, J. (2008). Predicting Going Concern Opinion with Data Mining. Decision Support Systems, 45, 765-777. https://doi.org/10.1016/j.dss.2008.01.003
9. McQuilken, D. (2019). 5 Steps to Get Started with Audit Data Analytics. AICPA. https://blog.aicpa.org/2019/05/5-steps-to-get-started-with-audit-data-analytics.html#sthash.NSlZVigi.dpbs
10. Paltrinieri, N., Comfort, L., & Reniers, G. (2019). Learning about Risk: Machine Learning for Risk Assessment. Safety Science, 118, 475-486. https://doi.org/10.1016/j.ssci.2019.06.001
11. Skapoullis, C. (2018). The Need for Data Visualisation. ICAEW. https://www.icaew.com/technical/business-and-management/strategy-risk-and-innovation/risk-management/internal-audit-resource-centre/the-need-for-data-visualisation
12. Tschakert, N., Kokina, J., Kozlowski, S., & Vasarhelyi, M. (2016). The Next Frontier in Data Analytics. Journal of Accountancy, 222, 58.
13. Zabeti, S. (2019). How Audit Data Analytics Is Changing Audit. Accru.
Chapter 5
Big Data Analytics in Immunology: A Knowledge-Based Approach
Guang Lan Zhang1, Jing Sun2, Lou Chitkushev1, and Vladimir Brusic1
1 Department of Computer Science, Metropolitan College, Boston University, Boston, MA 02215, USA
2 Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02115, USA
ABSTRACT

With the vast amount of immunological data available, immunology research is entering the big data era. These data vary in granularity, quality, and complexity and are stored in various formats, including publications, technical reports, and databases. The challenge is to make the transition from data to actionable knowledge and wisdom and bridge the knowledge gap and application gap. We report a knowledge-based approach based on a framework called KB-builder that facilitates data mining by enabling fast development and deployment of web-accessible immunological data knowledge warehouses. Immunological knowledge discovery relies heavily on both the availability of accurate, up-to-date, and well-organized data and the proper analytics tools. We propose the use of knowledge-based approaches by developing knowledgebases combining well-annotated data with specialized analytical tools and integrating them into analytical workflow. A set of well-defined workflow types with rich summarization and visualization capacity facilitates the transformation from data to critical information and knowledge. By using KB-builder, we enabled streamlining of normally time-consuming processes of database development. The knowledgebases built using KB-builder will speed up rational vaccine design by providing accurate and well-annotated data coupled with tailored computational analysis tools and workflow.

Citation: Guang Lan Zhang, Jing Sun, Lou Chitkushev, Vladimir Brusic, “Big Data Analytics in Immunology: A Knowledge-Based Approach”, BioMed Research International, vol. 2014, Article ID 437987, 9 pages, 2014. https://doi.org/10.1155/2014/437987.

Copyright: © 2014 by Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION

Data represent the lowest level of abstraction and do not have meaning by themselves. Information is data that has been processed so that it gives answers to simple questions, such as “what,” “where,” and “when.” Knowledge represents the application of data and information at a higher level of abstraction, a combination of rules, relationships, ideas, and experiences, and gives answers to “how” or “why” questions. Wisdom is achieved when the acquired knowledge is applied to offer solutions to practical problems. The data, information, knowledge, and wisdom (DIKW) hierarchy summarizes the relationships between these levels, with data at its base and wisdom at its apex and each level of the hierarchy being an essential precursor to the levels above (Figure 1(a)) [1, 2]. The acquisition cost is lowest for data acquisition and highest for knowledge and wisdom acquisition (Figure 1(b)).
Figure 1: The DIKW hierarchy. (a) The relative quantities of data, information, knowledge, and wisdom. (b) The relative acquisition cost of the different layers. (c) The gap between data and knowledge and (d) the gap between knowledge and wisdom.
In immunology, for example, a newly sequenced molecular sequence without functional annotation is a data point; information is gained by annotating the sequence to answer questions such as which viral strain it originates from; knowledge may be obtained by identifying immune epitopes in the viral sequence; and the design of a peptide-based vaccine using the epitopes represents the wisdom level. To make the transition from the vast amount of immunological data to actionable knowledge and wisdom and to bridge the knowledge and application gaps, we are confronted with several challenges. These include asking the “right questions,” handling unstructured data, data quality control (garbage in, garbage out), integrating data from various sources in various formats, and developing specialized analytics tools with the capacity to handle large volumes of data.

The human immune system is a complex system comprising the innate immune system and the adaptive immune system. There are two branches of adaptive immunity: humoral immunity, effected by antibodies, and cell-mediated immunity, effected by the T cells of the immune system. In humoral immunity, B cells produce antibodies for neutralization of extracellular pathogens and their antigens, which prevent the spread of infection. The
activation of B cells and their differentiation into antibody-secreting plasma cells is triggered by antigens and usually requires helper T cells [3]. B cells identify antigens through B-cell receptors, which recognize discrete sites on the surface of target antigens called B-cell epitopes [4]. Cellular immunity involves the activation of phagocytes, antigen-specific cytotoxic T-lymphocytes (CTLs), and the release of various cytokines in response to pathogens and their antigens. T cells identify foreign antigens through their T-cell receptors (TCRs), which interact with a peptide antigen in complex with a major histocompatibility complex (MHC) molecule in conjunction with CD4 or CD8 coreceptors [5, 6]. Peptides that induce immune responses, when presented by MHC on the cell surface for recognition by T cells, are called T-cell epitopes. CD8+ T cells control infection through direct cytolysis of infected cells and through production of soluble antiviral mediators. This function is mediated by linear peptide epitopes presented by MHC class I molecules. CD4+ T cells recognize epitopes presented by MHC class II molecules on the surface of infected cells and secrete lymphokines that stimulate B cells and cytotoxic T cells. The Immune Epitope Database (IEDB) [7] hosts nearly 20,000 T-cell epitopes as of Feb. 2014.

The recognition of a given antigenic peptide by an individual immune system depends on the ability of this peptide to bind one or more of the host’s human leukocyte antigens (HLA, the human MHC). The binding of antigenic peptides to HLA molecules is the most selective step in identifying T-cell epitopes. There is a great diversity of HLA genes with more than 10,000 known variants characterized as of Feb. 2014 [8]. To manage this diversity, the classification of HLA into supertypes was proposed to describe those HLA variants that have small differences in their peptide-binding grooves and share similar peptide-binding specificities [9, 10]. Peptides that can bind multiple HLA variants are termed “promiscuous peptides.” They are suitable for the design of epitope-based vaccines because they can interact with multiple HLA within human populations. The concept of reverse vaccinology supports identification of vaccine targets by large-scale bioinformatics screening of entire pathogenic genomes followed by experimental validation [11]. Using bioinformatics analysis to select a small set of key wet-lab experiments for vaccine design is becoming the norm. The complexity of identification of broadly protective vaccine targets arises from two principal sources: the diversity of pathogens and the diversity of the human immune system. The design of broadly protective peptide-based vaccines involves the identification and selection of vaccine
targets composed of conserved T-cell and B-cell epitopes that are broadly cross-reactive to viral subtypes and protective of a large host population (Figure 2).
Figure 2: The process of rational vaccine discovery using knowledge-based systems. The design of broadly protective peptide-based vaccines involves identification and selection of vaccine targets composed of conserved T-cell and B-cell epitopes that are broadly cross-reactive to pathogen subtypes and protective of a large host population.
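As a toy illustration of how promiscuous peptides of the kind described above might be short-listed, the sketch below filters candidate peptides by the number of HLA alleles they are predicted to bind. The peptides, alleles, binding calls, and the threshold of three alleles are all invented assumptions; in practice the binding calls would come from HLA-binding prediction tools applied across the relevant alleles and supertypes.

```python
# Minimal sketch of short-listing "promiscuous" peptides, i.e. peptides
# predicted to bind several HLA variants. Peptides, alleles, and binding
# calls below are invented; real calls would come from a prediction tool.
predicted_binders = {
    "LLMGTLGIV": {"HLA-A*02:01", "HLA-A*02:06", "HLA-A*68:02"},
    "KVAELVHFL": {"HLA-A*02:01"},
    "YLLPAIVHI": {"HLA-A*02:01", "HLA-A*02:03", "HLA-A*02:06", "HLA-B*08:01"},
}

MIN_ALLELES = 3  # promiscuity threshold (an arbitrary choice here)

promiscuous = {
    peptide: alleles
    for peptide, alleles in predicted_binders.items()
    if len(alleles) >= MIN_ALLELES
}

for peptide, alleles in promiscuous.items():
    print(f"{peptide} is predicted to bind {len(alleles)} alleles: {sorted(alleles)}")
```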
Fuelled by the breakthroughs in genomics and proteomics and advances in instrumentation, sample processing, and immunological assays, immunology research is entering the big data era. These data vary in granularity, quality, and complexity and are stored in various formats, including publications, technical reports, and databases. Next generation sequencing technologies are shifting the paradigm of genomics and allowing researchers to perform genome-wide studies [12]. It was estimated that the amount of publicly available genomic data will grow from petabytes (10^15 bytes) to exabytes (10^18 bytes) [13]. Mass spectrometry (MS) is the method for detection and quantitation of proteins. The technical advancements in proteomics support exponential growth of the numbers of characterized protein sequences. It is estimated that more than 2 million protein variants make up the posttranslated human proteome in any human individual [14]. Capitalizing on the recent advances in immune profiling methods, the Human Immunology Project Consortium (HIPC) is creating large data sets on human subjects undergoing influenza vaccination or who are infected with pathogens including influenza virus,
West Nile virus, herpes zoster, pneumococcus, and the malaria parasite [15]. Systems biology aims to study the interactions between relevant molecular components and their changes over time and to enable the development of predictive models. The advent of technological breakthroughs in the fields of genomics, proteomics, and other “omics” is catalyzing advances in systems immunology, a new field under the umbrella of systems biology [16]. The synergy between systems immunology and vaccinology enables rational vaccine design [17].

Big data describes the environment where massive data sources combine both structured and unstructured data so that the analysis cannot be performed using traditional database and analytical methods. Increasingly, data sources from literature and online sources are combined with the traditional types of data [18] for summarization of complex information, extraction of knowledge, decision support, and predictive analytics. With the increase of the data sources, both the knowledge and application gaps (Figures 1(c) and 1(d)) keep widening and the corresponding volumes of data and information are rapidly increasing. We describe a knowledge-based approach that helps reduce the knowledge and application gaps for applications in immunology and vaccinology.
MATERIALS AND METHODS

In the big data era, knowledge-based systems (KBSs) are emerging as knowledge discovery platforms. A KBS is an intelligent system that employs a computationally tractable knowledgebase or repository in order to reason upon data in a targeted domain and reproduce expert performance relative to such reasoning operations [19]. The goal of a KBS is to increase the reproducibility, scalability, and accessibility of complex reasoning tasks [20]. Some web-accessible immunological databases, such as the Cancer Immunity Peptide Database that hosts four static data tables containing four types of tumor antigens with defined T-cell epitopes, focus on cataloging the data and information and pay little attention to the integration of analysis tools [21, 22]. More recent web-accessible immunological databases, such as the Immune Epitope Database (IEDB) that catalogs experimentally characterized B-cell and T-cell epitopes and data on MHC binding and MHC ligand elution experiments, have started to integrate some data analysis tools [7, 23]. To bridge the gap between immunological information and knowledge, we need KBSs that tightly integrate data with analysis tools to enable comprehensive screening of immune epitopes from a comprehensive
landscape of a given disease (such as influenza, flaviviruses, or cancer), the analysis of crossreactivity and crossprotection following immunization or vaccination, and prediction of neutralizing immune responses. We developed a framework called KB-builder to facilitate data mining by enabling fast development and deployment of web-accessible immunological data knowledge warehouses. The framework consists of seven major functional modules (Figure 3), each facilitating a specific aspect of the knowledgebase construction process. The KB-builder framework is generic and can be applied to a variety of immunological sequence datasets. Its aim is to enable the development of a web-accessible knowledgebase and its corresponding analytics pipeline within a short period of time (typically within 1-2 weeks), given a set of annotated genetic or protein sequences.
Figure 3: The structure of KB-builder.
The design of a broadly protective peptide-based vaccine against viral pathogens involves the identification and selection of vaccine targets composed of conserved T-cell and B-cell epitopes that are broadly crossreactive to a wide range of viral subtypes and are protective in a large majority of host population (Figure 2). The KB-builder facilitates a systematic discovery of vaccine targets by enabling fast development of specialized bioinformatics KBS that tightly integrate the content (accurate, up-to-date, and well-organized antigen data) with tailored analysis tools. The input to KB-builder is data scattered across primary databases and scientific literature (Figure 3). Module 1 (data collection and processing module) performs automated data extraction and initial transformations. The raw antigen data (viral or tumor) consisting of protein or nucleotide
sequences, or both, and their related information are collected from various sources. The collected data are then reformatted and organized into a unified XML format.

Module 2 (data cleaning, enrichment, and annotation module) deals with data incompleteness, inconsistency, and ambiguities due to the lack of submission standards in the online primary databases. The semiautomated data cleaning is performed by domain experts to ensure data quality, completeness, and redundancy reduction. Semiautomated data enrichment and annotation are performed by the domain experts further enhancing data quality. The semiautomation involves automated comparison of new entries to the entries already processed within the KB and comparison of terms that are entered into locally implemented dictionaries. Terms that match the existing record annotations and dictionary terms are automatically processed. New terms and new annotations are inspected by a curator and if in error they are corrected, or if they represent novel annotations or terms they are added to the knowledgebase and to the local dictionaries.

Module 3 (the import module) performs automatic import of the XML file into the central repository.

Module 4 (the basic analysis toolset) facilitates fast integration of common analytical tools with the online antigen KB. All our knowledgebases have the basic keyword search tools for locating antigens and T-cell epitopes or HLA ligands. The advanced keyword search tool was included in FLAVIdB, FLUKB, and HPVdB, where users further restrict the search by selecting virus species, viral subtype, pathology, host organism, viral strain type, and several other filters. Other analytical tools include sequence similarity search enabled by basic local alignment search tool (BLAST) [24] and color-coded multiple sequence alignment (MSA) tool [25] on user-defined sequence sets as shown in Figure 4.

Module 5 (the specialized analysis toolset) facilitates fast integration of specialized analysis tools designed according to the specific purpose of the knowledgebase and the structural and functional properties of the source of the sequences. To facilitate efficient antigenicity analysis, in every knowledgebase and within each antigen entry, we embedded a tool that performs on-the-fly binding prediction to 15 frequent HLA class I and class II alleles. In TANTIGEN, an interactive visualization tool, mutation map, has been implemented to provide a global view of all mutations reported in a tumor antigen. Figure 5 shows a screenshot of mutation map of tumor antigen epidermal growth factor receptor (EGFR) in TANTIGEN. In TANTIGEN and HPVdB, a T-cell epitope visualization tool has been implemented to display epitopes in all isoforms of a tumor antigen or sequences of a HPV genotype. The B-cell visualization tool in FLAVIdB and FLUKB displays neutralizing
B-cell epitope positions on viral protein three-dimensional (3D) structures [26, 27]. To analyze viral sequence variability, given a MSA of a set of sequences, a tool was developed to calculate the Shannon entropy at each alignment position. To identify conserved T-cell epitopes that cover the majority of the viral population, we developed and integrated a block entropy analysis tool in FLAVIdB and FLUKB to analyze peptide conservation and variability. We also developed a novel sequence logo tool, BlockLogo, optimized for visualization of continuous and discontinuous motifs and fragments [28, 29]. When paired with the HLA binding prediction tool, BlockLogo is useful for rapid assessment of the immunological potential of selected regions in a MSA, such as alignments of viral sequences or tumor antigens.
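A per-column Shannon entropy calculation of the kind described above can be sketched in a few lines. The toy alignment below is invented and already of equal length; a real analysis would read a multiple sequence alignment of full-length viral proteins.

```python
# Minimal sketch: Shannon entropy at each column of a toy multiple
# sequence alignment. Real input would be an MSA of viral proteins.
from collections import Counter
from math import log2

msa = [
    "MKTIIALSYI",
    "MKTIIALSYV",
    "MKTVIALSYI",
    "MKTIIALSHI",
]

def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the residue frequencies in one column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())

for pos, column in enumerate(zip(*msa), start=1):
    print(f"position {pos}: entropy = {column_entropy(''.join(column)):.2f} bits")
```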
Figure 4: A screenshot of the result page generated by the color-coded MSA tool implemented in the FLAVIdB. The residues are color-coded by frequency: white (100%), cyan (second most frequent), yellow (third most frequent residues), gray (fourth most frequent residues), green (fifth most frequent residues), purple (sixth most frequent residues), and blue (everything less frequent than the sixth most frequent residues).
Figure 5: A screenshot of mutation map of tumor antigen epidermal growth factor receptor (EGFR) in TANTIGEN. The numbers are the amino acid positions in the antigen sequence and the top amino acid sequence is the reference sequence of EGFR. The highlighted amino acids in the reference sequences are positions where point mutations took place. Clicking on the amino acids below the point mutation positions links to the mutated sequence data table.
A workflow is an automated process that takes a request from the user, performs complex analysis by combining data and tools preselected for common questions, and produces a comprehensive report [30]. Module 6 (workflow for integrated analysis to answer meaningful questions) automates the consecutive execution of multiple analysis steps, which researchers would usually have to perform manually, to answer complex sequential questions. Two workflow types, the summary workflow and the query analyzer workflow, were implemented in FLAVIdB. Three workflow types, the vaccine target workflow, the cross-neutralization estimation workflow, and the B-cell epitope mapper workflow, were implemented in FLUKB. Module 7 (semiautomated update and maintenance of the databases) employs a semiautomated approach to maintain and update the databases.
RESULTS AND DISCUSSION
Using the KB-builder, we built several immunovaccinology knowledgebases including TANTIGEN: Tumor T-cell Antigen Database (http://cvc.dfci.harvard.edu/tadb/), FLAVIdB: Flavivirus Antigen Database [31], HPVdB: Human Papillomavirus T-cell Antigen Database [32], FLUKB: Flu Virus Antigen Database (http://research4.dfci.harvard.edu/cvc/flukb/), Epstein-Barr Virus T-cell Antigen Database (http://research4.dfci.harvard.edu/cvc/ebv/), and Merkel Cell Polyomavirus Antigen Database (http://cvc.dfci.harvard.edu/mcv/). These knowledgebases combine virus and tumor
antigenic data, specialized analysis tools, and workflows for automated complex analyses, focusing on applications in immunology and vaccinology. The Human Papillomavirus T-cell Antigen Database (HPVdB) contains 2781 curated antigen entries of antigenic proteins derived from 18 genotypes of high-risk HPV and 18 genotypes of low-risk HPV. It also catalogs 191 verified T-cell epitopes and 45 verified HLA ligands. The functions of the data mining tools integrated in HPVdB include antigen and epitope/ligand search, sequence comparison using BLAST search, multiple alignments of antigens, classification of HPV types based on cancer risk, T-cell epitope prediction, T-cell epitope/HLA ligand visualization, T-cell epitope/HLA ligand conservation analysis, and sequence variability analysis. The HPV regulatory proteins E6 and E7 are often studied for immune-based therapies as they are constitutively expressed in HPV-associated cancer cells. First, the prediction of A*0201 binding peptides (both 9-mers and 10-mers) of HPV16 E6 and E7 proteins was performed computationally. Based on the prediction results, 21 peptides were synthesized and ten of them were identified as binders using an A*0201 binding assay. The ten A*0201-binding peptides were further tested for immune recognition in peripheral blood mononuclear cells isolated from six A*0201-positive healthy donors using an interferon γ (IFN γ) ELISpot assay. Two peptides, E711–19 and E629–38, elicited spot-forming-unit numbers 4-5-fold over background in one donor. Finally, mass spectrometry was used to validate that peptide E711–19 is naturally presented on HPV16-transformed, A*0201-positive cells. Using the peptide conservation analysis tool embedded in HPVdB, we answered the question of how many HPV strains contain this epitope. The epitope E711–19 is conserved in 16 of 17 (94.12% conserved) HPV16 E7 complete sequences (Figure 6). A single substitution mutation, L15V, in HPV001854 (UniProt ID: C0KXQ5) resulted in immune escape. Among the 35 HPV16 cervical cancer samples we analyzed, only a single sample contained the HPV001854 sequence variant. The conserved HPV T-cell epitopes displayed by HPV-transformed tumors, such as E711–19, may be the basis of a therapeutic T-cell based cancer vaccine. This example shows the combination of bioinformatics analysis and experimental validation leading to identification of suitable vaccine targets [33, 34].
Figure 6: A screenshot of the conservation analysis result page of T-cell epitope E711–19 in HPVdB.
Flaviviruses, such as dengue and West Nile viruses, are NIAID Category A and B Priority Pathogens. We developed FLAVIdB that contains 12,858 entries of flavivirus antigen sequences, 184 verified T-cell epitopes, 201 verified B-cell epitopes, and 4 representative molecular structures of the dengue virus envelope protein [31]. The data mining system integrated in FLAVIdB includes tools for antigen and epitope/ligand search, sequence comparison using BLAST search, multiple alignments of antigens, variability and conservation analysis, T-cell epitope prediction, and characterization of neutralizing components of B-cell epitopes. A workflow is an automated process that takes a request from the user, performs complex analysis by combining data and tools preselected for common questions, and produces a comprehensive report to answer a specific research question. Two predefined analysis workflow types, summary workflow and query analyzer workflow, were implemented in FLAVIdB [31]. Broad coverage of the pathogen population is particularly important when designing T-cell epitope vaccines against viral pathogens. Using FLAVIdB we applied the block entropy analysis method to the proteomes of the four serotypes of dengue virus (DENV) and found 1,551 blocks of 9-mer peptides, which cover 99% of available sequences with five or fewer unique peptides [35]. Many of the blocks are located consecutively in the proteins, so connecting these blocks resulted in 78 conserved regions which can be covered with 457 subunit peptides. Of the 1551 blocks of 9-mer peptides, 110 blocks consisted of peptides all predicted to bind to MHC with similar affinity and the same HLA restriction. In total, we identified a pool
of 333 peptides as T-cell epitope candidates. This set could form the basis for a broadly neutralizing dengue virus vaccine. The results of block entropy analysis of dengue subtypes 1–4 from FLAVIdB are shown in Figure 7.
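The quantity plotted in Figure 7(b), the number of 9-mer variants needed at each starting position to cover 99% of the sequences, can be sketched as below. The function names and the gap-handling rule are assumptions made for illustration; this is not the authors' block entropy implementation.

```python
from collections import Counter

def block_coverage(msa, start, k=9, coverage=0.99):
    """Distinct k-mers at alignment position `start` needed to cover the
    given fraction of sequences (gap-containing peptides are skipped)."""
    peptides = [seq[start:start + k] for seq in msa]
    peptides = [p for p in peptides if "-" not in p and len(p) == k]
    if not peptides:
        return None, []
    needed, covered = [], 0
    threshold = coverage * len(peptides)
    for pep, count in Counter(peptides).most_common():
        needed.append(pep)
        covered += count
        if covered >= threshold:
            break
    return len(needed), needed

def blocks_needed_profile(msa, k=9):
    """Number of blocks required at every starting position, as in Fig. 7(b)."""
    length = len(msa[0])
    return [block_coverage(msa, s, k)[0] for s in range(length - k + 1)]
```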
Figure 7: Block entropy analysis of envelope proteins of dengue subtypes 1–4 in the FLAVIdB. (a) A screenshot of the input page of block entropy analysis in the FLAVIdB. (b) The number of blocks needed to cover 99% of the sequence variation. The x-axis shows the starting positions of the blocks and the y-axis the number of blocks required. Blocks with a gap fraction above 10% are not plotted.
Influenza virus is a NIAID Category C Priority Pathogen. We developed the FLUKB that currently contains 302,272 influenza viral protein sequence entries from 62,016 unique strains (57,274 type A, 4,470 type B, 180 type C, and 92 unknown types) of influenza virus. It also catalogued 349 unique T-cell epitopes, 708 unique MHC binding peptides, and 17 neutralizing antibodies against hemagglutinin (HA) proteins along with their 3D structures. Detailed information on the neutralizing antibodies, such as isolation information, experimentally validated neutralizing/escape influenza strains, and B-cell epitopes on the 3D structures, is also provided. Approximately 10% of B-cell epitopes are linear peptides, while 90% are formed from discontinuous amino acids that create surface patches resulting from 3D folding of proteins [36]. The characterization of an increasing number of broadly neutralizing antibodies specific for pathogen surface proteins, the growing number of known 3D structures of antigen-neutralizing antibody complexes, and the rapid growth of the number of viral variant sequences demand systematic bioinformatics analyses of B-cell epitopes and the cross-reactivity of neutralizing antibodies. We developed a generic method for the assessment of neutralizing properties of monoclonal antibodies. Previously,
dengue virus was used to demonstrate a generalized method [27]. This methodology has direct relevance to the characterization and the design of broadly neutralizing vaccines. Using the FLUKB, we employed the analytical methods to estimate the cross-reactivity of neutralizing antibodies (nAbs) against the surface glycoprotein HA of influenza virus strains, both newly emerging and existing ones [26]. We developed a novel way of describing discontinuous motifs as virtual peptides to represent B-cell epitopes and to estimate potential cross-reactivity and neutralizing coverage of these epitopes. Strains labelled as potentially cross-reactive are those that share 100% identity of B-cell epitopes with experimentally verified neutralized strains. Two workflow types were implemented in the FLUKB for cross-neutralization analysis: the cross-neutralization estimation workflow and the B-cell epitope mapper workflow. The cross-neutralization estimation workflow estimates the cross-neutralization coverage of a validated neutralizing antibody using all full-length sequences of HA hosted in the FLUKB, or using full-length HA sequences of a user-defined subset by restricting year ranges, subtypes, or geographical locations. Firstly, an MSA is generated using the full-length HA sequences. The resulting MSA provides a consistent alignment position numbering scheme for the downstream analyses. Secondly, for each nAb, the HA sequence from its 3D structure and from the experimentally validated strains is used to search for the strain with the highest similarity in FLUKB using BLAST. Thirdly, a B-cell epitope is identified from the validated antigen-antibody structures based on the calculation of accessible surface area and atom distance. Fourthly, using the MSA and the alignment position numbering, the residue positions of the B-cell epitope are mapped onto the HA sequences of validated strains to get B-cell epitope motifs. Discontinuous motifs are extracted from all the HA sequences in the MSA and compared to the B-cell epitope motif. According to the comparison results, the strains are classified as either neutralizing if identical to a neutralizing discontinuous motif, escape if identical to an escape discontinuous motif, or not validated if no identical match was found. The cross-neutralization coverage estimation of neutralizing antibody F10 on all HA sequences from FLUKB is shown in Figure 8.
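The motif extraction and exact-match classification steps of these workflows can be illustrated with the short sketch below. The epitope positions, motifs, and toy sequence are invented for the example, and the real workflows additionally derive the epitope positions from the 3D antigen-antibody structures.

```python
def discontinuous_motif(aligned_seq, epitope_positions):
    """Concatenate the residues at the MSA columns (0-based) that form a
    discontinuous B-cell epitope into a 'virtual peptide'."""
    return "".join(aligned_seq[i] for i in epitope_positions)

def classify_strain(aligned_seq, epitope_positions, neutralizing_motifs, escape_motifs):
    """Label a strain by exact-match comparison of its epitope motif against
    motifs from experimentally validated strains."""
    motif = discontinuous_motif(aligned_seq, epitope_positions)
    if motif in neutralizing_motifs:
        return "neutralizing"
    if motif in escape_motifs:
        return "escape"
    return "not validated"

# Toy example: positions, motifs, and the sequence are invented for illustration
positions = [0, 3, 4, 5]
neutralizing = {"MDGA"}
escape = {"MDGT"}
toy_aligned_ha = "MKADGA--QLV"
print(classify_strain(toy_aligned_ha, positions, neutralizing, escape))  # "neutralizing"
```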
Figure 8: (a) Sequence logo of neutralizing epitopes by neutralizing antibody F10 on influenza virus HA protein. (b) BlockLogo of the discontinuous residues in F10 neutralizing epitope. (c) The structure of influenza A HA protein with neutralizing antibody F10 (PDB ID:3FKU) and the conformational epitope shown in pink. (d) Discontinuous epitope on HA protein recognized by F10.
For a newly emerged strain, the B-cell epitope mapper workflow performs in silico prediction of its cross-neutralization based on existing nAbs and provides preliminary results for the design of downstream validation experiments. Firstly, a discontinuous peptide is extracted from its HA sequence according to positions on each known B-cell epitope. Secondly, sequence similarity comparison is conducted between the discontinuous motifs and all known B-cell epitopes from experimentally validated strains.
Strains whose motifs are identical to known neutralized or escape B-cell epitope motifs are proposed as neutralized or escape strains, respectively. The cross-neutralization estimation workflow provides an overview of cross-neutralization by existing neutralizing antibodies, while the B-cell epitope mapper workflow gives an estimate of the possible neutralizing effect of known neutralizing antibodies on new viral strains. This knowledge-based approach improves our understanding of antibody/antigen interactions, facilitates mapping of the known universe of target antigens, allows the prediction of cross-reactivity, and speeds up the design of broadly protective influenza vaccines.
CONCLUSIONS
Big data analytics applies advanced analytic methods to data sets that are very large and complex and that include diverse data types. These advanced analytics methods include predictive analytics, data mining, text mining, integrated statistics, visualization, and summarization tools. The data sets used in our case studies are complex, and the analytics is achieved through the definition of workflows. Data explosion in our case studies is fueled by the combinatorial complexity of the domain and the disparate data types. The cost of analysis and computation increases exponentially as we combine various types of data to answer research questions. We use the in silico identification of influenza T-cell epitopes restricted by HLA class I variants as an example. There are 300,000 influenza sequences to be analyzed for T-cell epitopes using MHC binding prediction tools based on artificial neural networks or support vector machines [37–40]. Based on the DNA typing for the entire US donor registry, there are 733 HLA-A, 921 HLA-B, and 429 HLA-C variants, a total of 2083 HLA variants, observed in the US population [41]. These alleles combine into more than 45,000 haplotypes (combinations of HLA-A, -B, and -C) [41]. Each of these haplotypes has different frequencies and distributions across different populations. The in silico analysis of MHC class I restricted T-cell epitopes includes MHC binding prediction of all overlapping peptides that are 9–11 amino acids long. This task alone involves a systematic analysis of 300,000 sequences that are on average 300 amino acids long. Therefore, the total number of in silico predictions is approximately 300,000 × 300 × 3 × 2083 (the number of sequences, times the average length of each sequence, times the three peptide lengths, times the number of observed HLA variants), or a total of about 5.6 × 10^11 calculations. Predictive models do not exist for all HLA alleles, so some analysis needs
to be performed by analysis of the similarity of HLA molecules and grouping them in clusters that share binding properties. For B-cell epitope analysis, the situation is similar, except that the methods involve the analysis of 3D structures of antibodies and the analysis of nearly 100,000 sequences of HA and neuraminidase (NA) and their cross-comparison for each neutralizing antibody. A rich set of visualization tools is needed to report population data and distributions across populations. For vaccine studies, these data need to be analyzed together with epidemiological data including transmissibility and severity of influenza viruses [42]. These functional properties can be assigned to each influenza strain and the analysis can be performed for their epidemic and pandemic potential. These numbers indicate that the analytics methods involve a large number of calculations that cannot be performed using brute-force approaches. Immunological knowledge discovery relies heavily on both the availability of accurate, up-to-date, and well-organized data and the proper analytics tools. We propose the use of knowledge-based approaches by developing knowledgebases combining well-annotated data with specialized analytical tools and integrating them into analytical workflows. A set of well-defined workflow types with rich summarization and visualization capacity facilitates the transformation from data to critical information and knowledge. By using KB-builder, we enabled streamlining of the normally time-consuming process of database development. The knowledgebases built using KB-builder will speed up rational vaccine design by providing accurate and well-annotated data coupled with tailored computational analysis tools and workflows.
REFERENCES
1. J. Rowley, "The wisdom hierarchy: representations of the DIKW hierarchy," Journal of Information Science, vol. 33, no. 2, pp. 163–180, 2007.
2. R. Ackoff, "From data to wisdom," Journal of Applied Systems Analysis, vol. 16, no. 1, pp. 3–9, 1989.
3. C. Janeway, Immunobiology: The Immune System in Health and Disease, Garland Science, New York, NY, USA, 6th edition, 2005.
4. M. H. V. van Regenmortel, "What is a B-cell epitope?" Methods in Molecular Biology, vol. 524, pp. 3–20, 2009.
5. S. C. Meuer, S. F. Schlossman, and E. L. Reinherz, "Clonal analysis of human cytotoxic T lymphocytes: T4+ and T8+ effector T cells recognize products of different major histocompatibility complex regions," Proceedings of the National Academy of Sciences of the United States of America, vol. 79, no. 14 I, pp. 4395–4399, 1982.
6. J. H. Wang and E. L. Reinherz, "Structural basis of T cell recognition of peptides bound to MHC molecules," Molecular Immunology, vol. 38, no. 14, pp. 1039–1049, 2002.
7. R. Vita, L. Zarebski, J. A. Greenbaum et al., "The immune epitope database 2.0," Nucleic Acids Research, vol. 38, supplement 1, pp. D854–D862, 2009.
8. J. Robinson, J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham, and S. G. E. Marsh, "The IMGT/HLA database," Nucleic Acids Research, vol. 41, no. 1, pp. D1222–D1227, 2013.
9. A. Sette and J. Sidney, "Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism," Immunogenetics, vol. 50, no. 3-4, pp. 201–212, 1999.
10. O. Lund, M. Nielsen, C. Kesmir et al., "Definition of supertypes for HLA molecules using clustering of specificity matrices," Immunogenetics, vol. 55, no. 12, pp. 797–810, 2004.
11. R. Rappuoli, "Reverse vaccinology," Current Opinion in Microbiology, vol. 3, no. 5, pp. 445–450, 2000.
12. D. C. Koboldt, K. M. Steinberg, D. E. Larson, R. K. Wilson, and E. R. Mardis, "The next-generation sequencing revolution and its impact on genomics," Cell, vol. 155, no. 1, pp. 27–38, 2013.
13. D. R. Zerbino, B. Paten, and D. Haussler, "Integrating genomes," Science, vol. 336, no. 6078, pp. 179–182, 2012.
14. M. Uhlen and F. Ponten, "Antibody-based proteomics for human tissue profiling," Molecular and Cellular Proteomics, vol. 4, no. 4, pp. 384–393, 2005.
15. V. Brusic, R. Gottardo, S. H. Kleinstein, and M. M. Davis, "Computational resources for high-dimensional immune analysis from the human immunology project consortium," Nature Biotechnology, vol. 32, no. 2, pp. 146–148, 2014.
16. A. Aderem, "Editorial overview: system immunology," Seminars in Immunology, vol. 25, no. 3, pp. 191–192, 2013.
17. S. Li, H. I. Nakaya, D. A. Kazmin, J. Z. Oh, and B. Pulendran, "Systems biological approaches to measure and understand vaccine immunity in humans," Seminars in Immunology, vol. 25, no. 3, pp. 209–218, 2013.
18. L. Olsen, U. J. Kudahl, O. Winther, and V. Brusic, "Literature classification for semi-automated updating of biological knowledgebases," BMC Genomics, vol. 14, supplement 5, article S14, 2013.
19. P. R. O. Payne, "Chapter 1: biomedical knowledge integration," PLoS Computational Biology, vol. 8, no. 12, Article ID e1002826, 2012.
20. S.-H. Liao, P.-H. Chu, and P.-Y. Hsiao, "Data mining techniques and applications—a decade review from 2000 to 2011," Expert Systems with Applications, vol. 39, no. 12, pp. 11303–11311, 2012.
21. N. Vigneron, V. Stroobant, B. J. van den Eynde, and P. van der Bruggen, "Database of T cell-defined human tumor antigens: the 2013 update," Cancer Immunity, vol. 13, article 15, 2013.
22. B. J. van den Eynde and P. van der Bruggen, "T cell defined tumor antigens," Current Opinion in Immunology, vol. 9, no. 5, pp. 684–693, 1997.
23. B. Peters, J. Sidney, P. Bourne et al., "The design and implementation of the immune epitope database and analysis resource," Immunogenetics, vol. 57, no. 5, pp. 326–336, 2005.
24. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
25. K. Katoh, K. Misawa, K. Kuma, and T. Miyata, "MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform," Nucleic Acids Research, vol. 30, no. 14, pp. 3059–3066, 2002.
26. J. Sun, U. J. Kudahl, C. Simon, Z. Cao, E. L. Reinherz, and V. Brusic, "Large-scale analysis of B-cell epitopes on influenza virus hemagglutinin—implications for cross-reactivity of neutralizing antibodies," Frontiers in Immunology, vol. 5, article 38, 2014.
27. J. Sun, G. L. Zhang, L. R. Olsen, E. L. Reinherz, and V. Brusic, "Landscape of neutralizing assessment of monoclonal antibodies against dengue virus," in Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics (BCB '13), p. 836, Washington, DC, USA, 2013.
28. G. E. Crooks, G. Hon, J. Chandonia, and S. E. Brenner, "WebLogo: a sequence logo generator," Genome Research, vol. 14, no. 6, pp. 1188–1190, 2004.
29. L. R. Olsen, U. J. Kudahl, C. Simon et al., "BlockLogo: visualization of peptide and sequence motif conservation," Journal of Immunological Methods, vol. 400-401, pp. 37–44, 2013.
30. J. Söllner, A. Heinzel, G. Summer et al., "Concept and application of a computational vaccinology workflow," Immunome Research, vol. 6, supplement 2, article S7, 2010.
31. L. R. Olsen, G. L. Zhang, E. L. Reinherz, and V. Brusic, "FLAVIdB: a data mining system for knowledge discovery in flaviviruses with direct applications in immunology and vaccinology," Immunome Research, vol. 7, no. 3, pp. 1–9, 2011.
32. G. L. Zhang, A. B. Riemer, D. B. Keskin, L. Chitkushev, E. L. Reinherz, and V. Brusic, "HPVdb: a data mining system for knowledge discovery in human papillomavirus with applications in T cell immunology and vaccinology," Database, vol. 2014, Article ID bau031, 2014.
33. A. B. Riemer, D. B. Keskin, G. Zhang et al., "A conserved E7-derived cytotoxic T lymphocyte epitope expressed on human papillomavirus 16-transformed HLA-A2+ epithelial cancers," Journal of Biological Chemistry, vol. 285, no. 38, pp. 29608–29622, 2010.
34. D. B. Keskin, B. Reinhold, S. Lee et al., "Direct identification of an HPV-16 tumor antigen from cervical cancer biopsy specimens," Frontiers in Immunology, vol. 2, article 75, 2011.
35. L. R. Olsen, G. L. Zhang, D. B. Keskin, E. L. Reinherz, and V. Brusic, "Conservation analysis of dengue virus T-cell epitope-based vaccine candidates using peptide block entropy," Frontiers in Immunology, vol. 2, article 69, 2011.
36. J. Huang and W. Honda, "CED: a conformational epitope database," BMC Immunology, vol. 7, article 7, 2006.
37. E. Karosiene, M. Rasmussen, T. Blicher, O. Lund, S. Buus, and M. Nielsen, "NetMHCIIpan-3.0, a common pan-specific MHC class II prediction method including all three human MHC class II isotypes, HLA-DR, HLA-DP and HLA-DQ," Immunogenetics, vol. 65, no. 10, pp. 711–724, 2013.
38. I. Hoof, B. Peters, J. Sidney et al., "NetMHCpan, a method for MHC class I binding prediction beyond humans," Immunogenetics, vol. 61, no. 1, pp. 1–13, 2009.
39. G. L. Zhang, I. Bozic, C. K. Kwoh, J. T. August, and V. Brusic, "Prediction of supertype-specific HLA class I binding peptides using support vector machines," Journal of Immunological Methods, vol. 320, no. 1-2, pp. 143–154, 2007.
40. G. L. Zhang, A. M. Khan, K. N. Srinivasan, J. T. August, and V. Brusic, "Neural models for predicting viral vaccine targets," Journal of Bioinformatics and Computational Biology, vol. 3, no. 5, pp. 1207–1225, 2005.
41. L. Gragert, A. Madbouly, J. Freeman, and M. Maiers, "Six-locus high resolution HLA haplotype frequencies derived from mixed-resolution DNA typing for the entire US donor registry," Human Immunology, vol. 74, no. 10, pp. 1313–1320, 2013.
42. C. Reed, M. Biggerstaff, L. Finelli et al., "Novel framework for assessing epidemiologic effects of influenza epidemics and pandemics," Emerging Infectious Diseases, vol. 19, no. 1, pp. 85–91, 2013.
SECTION 2: BIG DATA METHODS
Chapter 6
Integrated Real-Time Big Data Stream Sentiment Analysis Service
Sun Sunnie Chung, Danielle Aring Department of Electrical Engineering and Computer Science, Cleveland State University, Cleveland, USA
ABSTRACT
Opinion (sentiment) analysis on big data streams, from the constantly generated text streams on social media networks to hundreds of millions of online consumer reviews, provides many organizations in every field with opportunities to discover valuable intelligence from massive user-generated text streams. However, traditional content analysis frameworks are inefficient at handling the unprecedentedly large volume of unstructured text streams and the complexity of the text analysis tasks required for real-time opinion analysis on big data streams. In this paper, we propose a parallel real-time sentiment analysis system, the Social Media Data Stream Sentiment Analysis Service (SMDSSAS), that performs multiple phases of sentiment analysis of social media text streams effectively in real time, with two fully analytic opinion mining models to combat the scale of text data streams and the complexity of sentiment analysis processing on unstructured text streams. We propose two aspect-based opinion mining models, Deterministic and Probabilistic sentiment models, for real-time sentiment analysis on user-given, topic-related data streams. Experiments on the social media Twitter stream traffic captured during the pre-election weeks of the 2016 Presidential election, for real-time analysis of public opinions toward the two presidential candidates, showed that the proposed system was able to correctly predict Donald Trump as the winner of the 2016 Presidential election. The cross-validation results showed that the proposed sentiment models, with the real-time streaming components in our proposed framework, effectively delivered analysis of the opinions on the two presidential candidates with average 81% accuracy for the Deterministic model and 80% for the Probabilistic model, which is a 1% - 22% improvement over the results of the existing literature.
Keywords: Sentiment Analysis, Real-Time Text Analysis, Opinion Analysis, Big Data Analytics
Citation: Chung, S. and Aring, D. (2018), "Integrated Real-Time Big Data Stream Sentiment Analysis Service". Journal of Data Analysis and Information Processing, 6, 46-66. doi: 10.4236/jdaip.2018.62004.
Copyright: © 2018 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
INTRODUCTION
In the era of web-based social media, user-generated content in any form, including blogs, wikis, forums, posts, chats, tweets, or podcasts, has become the norm of media for expressing people's opinions. The amounts of data generated by individuals, businesses, government, and research agents have undergone exponential growth. Social networking giants such as Facebook and Twitter had 1.86 and 0.7 billion active users as of Feb. 2018. These user-generated texts are valuable resources for discovering useful intelligence to help people in any field make critical decisions. Twitter has become an important platform of user-generated text streams where people express their opinions and views on new events, new products or news. Such new events or news, from the announcement of political parties and candidates for elections to a popular new product release, are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to measure the relationship between expressed public sentiment and the new events or the new products. Sentiment analysis can help explore how these events affect public opinion or how public opinion affects future sales of these new products. While traditional content analysis takes days or weeks to complete, opinion
analysis of such streams of large amounts of user-generated text has commanded research and development of a new generation of analytics methods and tools to process them effectively in real time or near-real time. Big data is often defined with three characteristics: volume, velocity and variety [1] [2], because it consists of constantly generated massive data sets having large, varied and complex structures, or is often unstructured (e.g., tweet text). These three characteristics of big data imply difficulties in storing, analyzing and visualizing it for further processing and results with traditional data analysis systems. Common problems of big data analytics are, firstly, that traditional data analysis systems are not reliable enough to handle the volume of data to process at an acceptable rate. Secondly, big data processing commonly requires complex data processing in multiple phases of data cleaning, preprocessing, and transformation, since data is available in many different formats, either semi-structured or unstructured. Lastly, big data is constantly generated at high speed by systems, so that none of the traditional data preprocessing architectures are suitable for processing it efficiently in real time or near-real time. Two common approaches to process big data are batch-mode big data analytics and streaming-based big data analytics. Batch processing is an efficient way to process high volumes of data where a group of transactions is collected over time [3]. Frameworks that are based on a parallel and distributed system architecture, such as Apache Hadoop with MapReduce, currently dominate batch-mode big data analytics. This type of big data processing addresses the volume and variety components of big data analytics but not velocity. In contrast, stream processing is a model that computes a small window of recent data at one time [3]. This makes computation real time or near-real time. In order to meet the demands of the real-time constraints, the stream-processing model must be able to calculate statistical analytics on the fly, since streaming data, like user-generated content in the form of repeated online user interactions, is continuously arriving at high speed [3]. This notable "high velocity" arrival characteristic of the big data stream means that the corresponding big data analytics should be able to process the stream in a single pass under strict constraints of time and space. Most of the existing works that leverage distributed parallel systems to analyze big social media data in real time or near-real time perform mostly statistical analysis in real time with pre-computed data warehouse aggregations [4] [5] or a simple frequency-based sentiment analysis model [6]
. More sophisticated sentiment analyses on the streaming data are mostly the MapReduce based batch mode analytics. While it is common to find batch mode data processing works for the sophisticated sentiment analysis on social media data, there are only a few works that propose the systems that perform complex real time sentiment analysis on big data streams [7] [8] [9] and little work is found in that the proposed such systems are implemented and tested with real time data streams. Sentiment Analysis otherwise known as opinion mining commonly refers to the use of natural language processing (NLP) and text analysis techniques to extract, and quantify subjective information in a text span [10] . NLP is a critical component in extracting useful viewpoints from streaming data [10] . Supervised classifiers are then utilized to predict from labeled training sets. The polarity (positive or negative opinion) of a sentence is measured with scoring algorithms to measure a polarity level of the opinion in a sentence. The most established NLP method to capture the essential meaning of a document is the bag of words (or bag of n-gram) representations [11] . Latent Dirichlet Allocation (LDA) [12] is another widely adopted representation. However, both representations have limitations to capture the semantic relatedness (context) between words in a sentence and suffer from the problems such as polysemy and synonymy [13] . A recent paradigm in NLP, unsupervised text embedding methods, such as Skip-gram [14] [15] and Paragraph Vector [16] [17] to use a distributed representation for words [14] [15] and documents [16] [17] are shown to be effective and scalable to capture the semantic and syntactic relationships, such as polysemy and synonymy, between words and documents. The essential idea of these approaches comes from the distributional hypothesis that a word is represented by its neighboring (context) words in that you shall know a word by the company it keeps [18] . Le and Mikolov [16] [17] show that their method, Paragraph Vectors, can be used in classifying movie reviews or clustering web pages. We employed the pre-trained network with the paragraph vector model [19] for our system for preprocessing to identify n-grams and synonymy in our data sets. An advanced sentiment analysis beyond polarity is the aspect based opinion mining that looks at other factors (aspects) to determine sentiment polarity such as “feelings of happiness sadness, or anger”. An example of the aspect oriented opinion mining is classifying movie reviews based on a thumbs up or downs as seen in the 2004 paper and many other papers by Pang and Lee [10] [20] . Another technique is the lexical approach to opinion
mining, developed famously by Taboada et al. in their SO-CAL calculator [21]. The system calculated the semantic orientation, i.e., subjectivity, of a word in the text by capturing the strength and potency to which the word was oriented either positively or negatively towards a given topic, using advanced techniques like amplifiers and polarity shift calculations. The single most important information need in a sentiment analysis is to find out about opinions and perspectives on a particular topic, otherwise known as topic-based opinion mining [22]. Topic-based opinion mining seeks to extract personal viewpoints and emotions surrounding social or political events by semantically orienting user-generated content that has been correlated by topic word(s) [22]. Despite the success of these sophisticated sentiment analysis methods, little is known about whether they can scale when applied in a multi-phased opinion analysis process to a huge text stream of user-generated expressions in real time. In this paper, we examined whether a stream-processing big data social media sentiment analysis service can offer scalability in processing these multi-phased, state-of-the-art sentiment analysis methods, while offering efficient near-real time processing of enormous data volumes. This paper also explores the methodologies of opinion analysis of social network data. To summarize, we make the following contributions:
• We propose a fully integrated, real-time text analysis framework that performs complex multi-phase sentiment analysis on massive text streams: Social Media Data Stream Sentiment Analysis Service (SMDSSAS).
• We propose two sentiment models that combine topic-, lexicon- and aspect-based sentiment analysis and can be applied to a real-time big data stream in cooperation with the most recent natural language processing (NLP) techniques:
• Deterministic Topic Model, which accurately measures user sentiments in the subjectivity and the context of user-provided topic word(s).
• Probabilistic Topic Model, which effectively identifies the polarity of sentiments in topic-correlated messages over the entire data stream.
• We fully experimented on the popular social media Twitter message streams captured during the pre-election weeks of the 2016 Presidential Election to test the accuracy of our two proposed sentiment models and the performance of our proposed system
SMDSSAS for the real time sentiment analysis. The results show that our framework can be a good alternative for an efficient and scalable tool to extract, transform, score and analyze opinions for the user generated big social media text streams in real time.
RELATED WORKS
Many existing works in the related literature concentrate on topic-based opinion mining models. In topic-based opinion mining, sentiment is estimated from the messages related to a chosen topic of interest such that topic and sentiment are jointly inferred [22]. There are many works on topic-based sentiment analysis where the models are tested in a batch setting, as listed in the reference Section. While there are many works on topic-based models for batch processing systems, there are few works in the literature on topic-based models for real-time sentiment analysis on streaming data. Real-time topic sentiment analysis is imperative to meet the strict time and space constraints of efficiently processing streaming data [6]. Wang et al. in the paper [6] developed a system for Real-Time Twitter Sentiment Analysis of the 2012 Presidential Election Cycle using the Twitter firehose with a statistical sentiment model and a Naive Bayes classifier on unigram features. A full suite of analytics was developed for monitoring the shift in sentiment utilizing expert-curated rules and keywords in order to gain an accurate picture of the online political landscape in real time. However, these works in the existing literature lacked the complexity of sentiment analysis processes. The sentiment analysis model in their system is based on simple aggregations for statistical summary with minimal, primitive language preprocessing. More recent research [23] [24] has proposed big data stream processing architectures. The first work, in 2015 [23], proposed a multi-layered Storm-based approach for the application of sentiment analysis on big data streams in real time, and the second work, in 2016 [24], proposed a big data analytics framework (ASMF) to analyze consumer sentiments embedded in hundreds of millions of online product reviews. Both approaches leverage probabilistic language models, either by mimicking "document relevance" with the probability of the document generating a user-provided query term found within the sentiment lexicon [23] or by adapting a classical language modeling framework to enhance the prediction of consumer sentiments [24]. However, the major limitation of these works is that neither of the proposed frameworks has been implemented and tested in an empirical setting or in real time.
ARCHITECTURE OF BIG DATA STREAM ANALYTICS FRAMEWORK
In this Section, we describe the architecture of our proposed big data analytics framework, which is illustrated in Figure 1. Our sentiment analysis service, namely the Social Media Data Stream Sentiment Analysis Service (SMDSSAS), consists of six layers―Data Storage/Extraction Layer, Data Stream Layer, Data Preprocessing and Transformation Layer, Feature Extraction Layer, Prediction Layer, and Presentation Layer. For these layers, we employed well-proven methods and tools for real-time parallel distributed data processing. For the real-time data analytics component, SMDSSAS leverages the Apache Spark [1] [7] and NoSQL Hive [25] big data ecosystem, which allows us to develop a streamlined pipeline with natural language processing techniques for fully integrated, complex multi-phase sentiment analysis that stores, processes and analyzes user-generated content from the Twitter Streaming API.
Figure 1: Architecture of social media data stream sentiment analysis service (SMDSSAS).
The first layer is the Data Storage/Extraction Layer for extraction of user tweet fields from the Twitter stream, which are to be converted to topic-filtered DStreams through Spark in the next Data Stream layer. A DStream is a memory unit of data in Spark. It is the basic abstraction in Spark Streaming: a continuous sequence of Resilient Distributed Datasets (RDDs of the same type) that represents a continuous stream of data. The extracted Tweet messages are archived in Hive's data warehouse store via Cloudera's interactive web-based analytics tool Hue and direct streaming into HDFS. The second layer, the Data Stream Layer, processes the extracted live stream of user-generated raw text of Twitter messages into Spark contexts and DStreams. This layer is bidirectional with both the Data Storage/Extraction layer and the next layer, the Data Preprocessing and Transformation Layer. The third layer, the Data Preprocessing and Transformation Layer, is in charge of building relationships in the English natural language and cleaning the raw text of Twitter messages, with functions to remove both control characters sensitive to the Hive data warehouse scanner and non-alphanumeric characters from the text. We employ natural language processing techniques in the Data Preprocessing layer with the pre-trained network in the paragraph vector model [16] [17]. This layer can also employ the Stanford Dependency Parser [26] and Named Entity Recognizer [27] to build an additional pipeline of dependency parsing, tokenization, sentence splitting, POS tagging and semantic tagging to build more sophisticated syntax relationships in the Data Preprocessing stage. The transformation component of this layer preprocesses, in real time, the streaming text in JSON into CSV-formatted Twitter statuses for Hive table inserts with Hive DDL. The layer is also in charge of removing the ambiguity of a word, which is determined with pre-defined word corpora, for the sentiment scoring process later. The fourth layer, the Feature Extraction Layer, is comprised of a topic-based feature extraction function for our Deterministic and Probabilistic sentiment models. The topic-based feature extraction method employs the Opinion Finder Subjectivity Lexicon [28] for identification and extraction of sentiment based on the related topics of the user Twitter messages. The fifth layer of our framework, the Prediction Layer, uses our two topic- and lexicon-based sentiment models, Deterministic and Probabilistic, for sentiment analysis. The accuracy of each model was measured using the supervised classifier Multinomial Naive Bayes to test the capability of each model for correctly identifying and correlating users' sentiments on the topic-related data streams for a given topic (event).
Our sixth and final layer is the Presentation Layer, which consists of a web-based user interface.
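The streaming layers described above can be sketched in a few lines of PySpark. The sketch below is illustrative only: it assumes a separate collector process pushes raw tweet JSON lines to a local TCP socket as a stand-in for the Twitter Streaming API connector, and the topic keywords, host, and port are made-up values rather than parts of SMDSSAS.

```python
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

TOPIC_WORDS = {"donald trump", "hillary clinton"}  # illustrative seed topics

def is_on_topic(text):
    lower = text.lower()
    return any(topic in lower for topic in TOPIC_WORDS)

sc = SparkContext(appName="SMDSSAS-sketch")
ssc = StreamingContext(sc, 30)  # 30-second micro-batches, as in the experiments

# Raw tweet JSON lines pushed to a local socket by a separate collector process
raw = ssc.socketTextStream("localhost", 9009)
tweets = raw.map(json.loads).map(lambda status: status.get("text", ""))
topic_stream = tweets.filter(is_on_topic)

# Each micro-batch would be handed to the preprocessing/scoring stages
topic_stream.foreachRDD(lambda rdd: print("batch size:", rdd.count()))

ssc.start()
ssc.awaitTermination()
```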
SENTIMENT MODEL
Extracting useful viewpoints (aspects) in context and subjectivity from streaming data is a critical task for sentiment analysis. Classical approaches to sentiment analysis have their own limitations in identifying accurate contexts; for instance, for lexicon-based methods, common sentiment lexicons may not be able to detect the context-sensitive nature of opinion expressions. For example, while the term "small" may have a negative polarity in a mobile phone review that refers to a "small" screen size, the same term could have a positive polarity, such as "a small and handy notebook", in consumer reviews about computers. In fact, the token "small" is defined as a negative opinion word in the well-known sentiment lexicon list Opinion-Finder [28]. The sentiment models developed for SMDSSAS are based on the aspect model [29]. Aspect-based opinion mining techniques identify and extract personal opinions and emotions surrounding social or political events by capturing semantically oriented content, in subjectivity and context, that is correlated by aspects, i.e., topic words. The design of our sentiment model was based on the assumption that positive and negative opinions can be estimated per the context of a given topic [22]. Therefore, in generating data for model training and testing, we employed a topic-based approach to perform sentiment annotation and quantification on related user tweets. The aspect model is the core of probabilistic latent semantic analysis, a probabilistic language model for general co-occurrence data, which associates a class (topic) variable t∈T={t1,t2,⋯,tk} with each occurrence of a word w∈W={w1,w2,⋯,wm} in a document d∈D={d1,d2,⋯,dn}. The aspect model is a joint probability model that can be defined as selecting a document d with probability P(d), picking a latent class (topic) t with probability P(t|d), and generating a word (token) w with probability P(w|t). As a result one obtains an observed pair (d,w), while the latent class variable t is discarded. Translating this process into a joint probability model results in the expression
P(d, w) = P(d) P(w|d)   (1)
where
P(w|d) = Σt∈T P(w|t) P(t|d)   (2)
Essentially, to derive (2) one has to sum over the possible choices of t that could have generated the observation. The aspect model is based on two independence assumptions: first, any pairs (d,w) are assumed to occur independently; this essentially corresponds to the bag-of-words (or bag of n-gram) approach. Secondly, the conditional independence assumption is made that, conditioned on the latent class t, words w occur independently of the specific document identity di. Given that the number of class states is smaller than the number of documents (K≪N), t acts as a bottleneck variable in predicting w conditioned on d. Following the likelihood principle, P(d), P(t|d), and P(w|t) can be determined by maximization of the log-likelihood function
L = Σd∈D Σw∈W n(d,w) log P(d,w)   (3)
where n(d,w) denotes the term frequency, i.e., the number of times w occurred in d. An equivalent symmetric version of the model can be obtained by inverting the conditional probability P(t|d) with the Bayes’ theorem, which results in
P(d,w) = Σt∈T P(t) P(d|t) P(w|t)   (4)
In the Information Retrieval context, this Aspect model is used to estimate the probability that a document d is related to a query q [2] . Such a probabilistic inference is used to derive a weighted vector in Vector Space Model (VSM) where a document d contains a user given query q [2] where q is a phrase or a sentence that is a set of classes (topic words) as d∩q=T={t1,t2,⋯,tk}.
wt,d = tf.idft,d = tft,d × idft   (5)
where tf.idft,d is defined as the term weight wt,d of a topic word t, with tft,d being the term frequency with which the topic word t occurs in di, and idft being the inverse document frequency defined from dft, the number of documents that contain t, as below, where N is the total number of documents:
idft = log(N/dft)   (6)
Then d and q are represented with the weighted vectors for the common terms. score(q,d) can be derived using the cosine similarity function to capture the concept of document "relevance" of d with respect to q in the context of the topic words in q. The cosine similarity function is then defined as the
score function with the length normalized weighted vectors of q and d as follow.
score(q,d) = Σt wt,q wt,d / ( √(Σt wt,q²) √(Σt wt,d²) )   (7)
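The tf.idf weighting and cosine score sketched in Equations (5)-(7) can be reproduced with standard tooling. The example below uses scikit-learn purely as an illustration; the library choice and the toy documents are assumptions of convenience, not the components used in SMDSSAS.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "trump rally draws large crowd",      # toy tweets
    "clinton outlines new policy plan",
    "debate between trump and clinton",
]
query = "trump clinton debate"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)   # tf-idf weighted vectors
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors)[0]
print(scores)  # score(q, d) for each document, as in Equation (7)
```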
Context Identification
We derive a topic set T(q) by generating the set of all related topic words from a user-given query (topics) q={t1,t2,⋯,tk}, where q is a set of tokens. For each token ti in q, we derive the related topic words to add to the topic set T(q) based on the related language semantics R(ti) as follow.
T(q) = ∪ti∈q R(ti), with R(ti) = {ti} ∪ {ti.* | *.ti} ∪ {ti_tj : tj∈q} ∪ label_synonym(ti)   (8)
where ti,tj∈T. ti.*|*.ti denotes any word concatenated with ti, ti_tj denotes a bi-gram of ti and tj, and label_synonym(ti) is the set of labeled synonyms of ti in the dictionary identified in WordNet [23]. For context identification, we can also choose to employ the pre-trained network with the paragraph vector model [16] [17] for preprocessing. The paragraph vector model is more robust in identifying synonyms of a new word that is not in the dictionary.
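A minimal sketch of this context identification step is shown below. It uses NLTK's WordNet interface as the synonym dictionary and simple string rules for the concatenation and bi-gram cases; the vocabulary and query tokens are invented, and this is not the SMDSSAS implementation.

```python
from nltk.corpus import wordnet  # requires the WordNet corpus (nltk.download('wordnet'))

def related_topic_words(query_tokens, corpus_vocabulary):
    """Expand user query tokens into a topic set T(q) following R(ti):
    the token itself, vocabulary words containing it, bi-grams of the
    query tokens, and labeled synonyms from a dictionary (WordNet here)."""
    topic_set = set()
    for ti in query_tokens:
        topic_set.add(ti)
        # ti.* | *.ti : vocabulary words that contain the token
        topic_set.update(w for w in corpus_vocabulary if ti in w and w != ti)
        # ti_tj : bi-grams built from the query tokens
        topic_set.update(f"{ti}_{tj}" for tj in query_tokens if tj != ti)
        # label_synonym(ti)
        for synset in wordnet.synsets(ti):
            topic_set.update(lemma.name().lower() for lemma in synset.lemmas())
    return topic_set

vocab = {"trump", "protrump", "trumps", "clinton", "proclinton"}
print(related_topic_words(["trump", "clinton"], vocab))
```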
Measure of Subjectivity in Sentiment: CMSE and CSOM
The design of our experiments for each model was intended to capture social media Twitter data streams surrounding special social or political events, so we targeted data streams during two special events―the 2016 US Presidential election and the 2017 Inauguration. The real-time user tweet streams were collected from Oct. 2016 to Jan. 2017. The time frames chosen are a pre-election time of the October 23rd week and the pre-election week of November 5th, as well as a pre-inauguration time in the first week of January 2017. We define the document-level polarity sentiment(di) with a simple polarity function that counts the number of positive and negative words in a document (a Twitter message) to determine an initial sentiment measure sentiment(di) and the sentiment label sentimenti for each document di as follow:
sentiment(di) = Σk=1..m (Pos(wk) + Neg(wk)) = FreqPos(di) − FreqNeg(di)   (9)
where di is a document (message) in a tweet stream D of a given topic set T with 1 ≤ i < n and di={w1,⋯,wm} , m is the number of words in di. Pos(wk) = 1 if wk is a positive word and Neg(wk) = −1 if wk is a negative word. sentiment(di) is the difference between the frequency of the positive words denoted as FreqPos(di) and the frequency of negative words denoted as FreqNeg(di) in di indicating an initial opinion polarity measure with −m ≤ sentiment(di) ≤ m and a sentiment label of di sentimenti = 1 for positive if sentiment(di) ≥ 1, 0 if sentiment(di) = 0 for neutral, and −1 for negative if sentiment(di) ≤ −1. Then, we define w(di) a weight for a sentiment orientation for di to measure a subjectivity of sentiment orientation of a document, then a weighted sentiment measure for di senti_score(di) is defined with w(di) and sentimenti the sentiment label of di as a score of sentiment of di as follow:
(10)
(11)
where −1 ≤ w(di) ≤ 1, and α is a control parameter for learning. When α = 0, senti_score(di) = sentimenti. senti_score(di) gives more weight toward a short message with strong sentiment orientation. w(di) = 0 for neutral.
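The count-based polarity measure of Equation (9) and the resulting document label can be sketched as follows. The positive/negative word sets are toy stand-ins for the Opinion Finder lists, and the weighted senti_score of Equations (10)-(11) is intentionally omitted because its exact form is specified there.

```python
POSITIVE = {"great", "win", "strong", "support"}   # stand-in lexicon entries
NEGATIVE = {"bad", "lose", "weak", "corrupt"}

def sentiment_count(tokens):
    """sentiment(di) = FreqPos(di) - FreqNeg(di), as in Equation (9)."""
    pos = sum(1 for w in tokens if w in POSITIVE)
    neg = sum(1 for w in tokens if w in NEGATIVE)
    return pos - neg

def sentiment_label(tokens):
    """Document label: 1 = positive, 0 = neutral, -1 = negative."""
    s = sentiment_count(tokens)
    return 1 if s >= 1 else (-1 if s <= -1 else 0)

tweet = "trump rally was great great turnout despite bad weather".split()
print(sentiment_count(tweet), sentiment_label(tweet))  # 1, 1
```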
Class Max Sentiment Extraction (CMSE): To test the performance of our models and to predict the outcomes of events, such as the 2016 Presidential election, from the extracted user opinions embedded in tweet streams, we quantify the level of sentiment in the data set with Class Max Sentiment Extraction (CMSE), which generates statistically relevant absolute sentiment values measuring the overall sentiment orientation of a data set for a given topic set, per sentiment polarity class, so that different data sets can be compared. To quantify the sentiment of a data set D of a given topic set T, we define CMSE(D(T)) as follow. For a given topic set T, for each di∈D(T) where di contains at least one of the topic words of interest in T in a given tweet stream D, CMSE(D(T)) returns a weighted sum of senti_score(di) of the data set D on T as follow:
CMSEpos(D(T)) = Σdi∈D(T), sentimenti=1 senti_score(di)   (12)
CMSEneg(D(T)) = Σdi∈D(T), sentimenti=−1 senti_score(di)   (13)
CMSEneu(D(T)) = Σdi∈D(T), sentimenti=0 senti_score(di)   (14)
where 1 ≤ i < n and D(T)={d1,⋯,dn}, n is the number of documents in D(T). CMSE measures the maximum sentiment orientation values of each polarity class for a given topic-correlated data set D(T). It is a sum of the weighted document sentiment scores for each sentiment class―positively labeled di, negatively labeled di, and neutrally labeled di respectively in a given data set D(T) for a user given topic word set T. CMSE is the same as an aggregated count of sentimenti when α = 0.
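A sketch of the per-class CMSE aggregation is shown below, using the α = 0 simplification noted above (senti_score equal to the sentiment label); the toy topic-correlated stream is invented for illustration.

```python
def cmse_by_class(scored_docs):
    """Class Max Sentiment Extraction for one topic-correlated data set.

    `scored_docs` is a list of (label, senti_score) pairs, with label in
    {1, 0, -1}. With alpha = 0 the score equals the label, so the sums
    reduce to signed class counts, as noted in the text."""
    totals = {"pos": 0.0, "neg": 0.0, "neu": 0.0}
    for label, score in scored_docs:
        key = "pos" if label == 1 else ("neg" if label == -1 else "neu")
        totals[key] += score
    return totals

trump_docs = [(1, 1), (1, 1), (-1, -1), (0, 0)]   # toy topic-correlated stream
print(cmse_by_class(trump_docs))   # {'pos': 2.0, 'neg': -1.0, 'neu': 0.0}
```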
CMSE is an indicator of how strongly positive or negative the sentiment is in a data set for a given topic word set T where D(T) is a set of documents (messages) in a tweet stream where each document di∈D(T) 1≤ i ≤ n, contains at least one of the topic words tj∈T={t1,⋯,tk} with 1 ≤ j ≤ k and T is a set of all the related topic words derived from a user given query q as a seed to generate T. Tj, which is a subset of T, is a set of topic words that is derived from a given topic tj∈T . D(Tj), a subset of D(T), is a set of documents where each document di contains at least one of the related topic words in a topic set Tj. Every topic word set is derived by the Context Identifier described in Section 4.1. With the Donald Trump and Hillary Clinton example, three topic-correlated data sets are denoted as below. D(Tj) is a set of documents with a topic word set Tj derived from {Donald Trump|Hillary Clinton}. D(TRj) is a set of documents, a subset of D(Tj), with a topic word set TRj derived from {Donald Trump}.
D(HCj) is a set of documents, a subset of D(Tj), with a topic word set HCj derived from{Hillary Clinton}.
where m, are the number of document di in D(TRj) and D(HCj) respectively. For example, CMSEpos(D(TRj)) is the maximum positive opinion measure in the tweet set that are talking about the candidate Donald Trump. CSOM (Class Sentiment Orientation Measure): CSOM is to measure a relative ratio of the level of the positive and negative sentiment orientation for a given topic correlated data set over the entire dataset of interest. For CSOM, we define two relative opinion measures: Semantic Orientation (SMO) and Sentiment Orientation (STO) to quantify a polarity
for a given data set correlated with a topic set Tj. SMO indicates a relative polarity ratio between two polarity classes within a given topic data set. STO indicates a ratio of the polarity of a given topic set over an entire data set. With our Trump and Hillary example from the 2016 Presidential Election event, the positive SMO for the data set D(TRj) with the topic word “Donald Trump” and the negative SMO for the Hillary Clinton topic set D(HCj) can be derived for each polarity class respectively as below. For example, the positive SMO for a topic set D(TRj) for Donald Trump and the negative SMO for a topic set D(HCj) for Hillary Clinton are defined as follow:
(15)
(16)
When α = 0, senti_score(di) = sentimenti, so CMSE and SMO are generated with count of sentimenti of the data set. Then, Sentiment Orientation (STO) for a topic set D(TRj) for Donald Trump and the negative STO for a topic set D(HCj) for Hillary Clinton are defined as follow:
(17)
(18)
where Weight(TRj) and Weight(HCj) are the weights of the topics over the entire dataset, defined as follow. Therefore, STO(TRj) indicates a weighted polarity of the topic TRj over the entire data set D(Tj) where D(Tj)=D(TRj)∪D(HCj).
Weight(TRj) = |D(TRj)| / |D(Tj)|   (19)
Weight(HCj) = |D(HCj)| / |D(Tj)|   (20)
Deterministic Topic Model
The Deterministic Topic Model considers the context of the words in the texts and the subjectivity of the sentiment of the words given the context. Given the presumption that topic and sentiment can be jointly inferred, the Deterministic Topic Model measures the polarity strength of sentiment in the context of user-provided topic word(s). The Deterministic Topic Model considers the subjectivity of each word (token) in di in D(Tj). Likelihoods were estimated as relative frequencies with the weighted subjectivity of a word. Using the Opinion Finder lexicon [28], the tokens of the tweets were categorized and labeled by subjectivity and polarity. The 6 weight levels below define subjectivity. Each token was categorized into one of the 6 strength scales and weighted on a subjectivity strength scale ranging from −2 to +2, where −2 denotes the strongest subjective negative and +2 the strongest subjective positive word. subjScale(wt) is defined as the subjectivity strength scale for each token wt in di. The weight of each group is assigned as below for the 6 subjectivity strength sets. Any token that does not belong to any of the 6 subjectivity strength sets is set to 0.
strSubjPosW := {set of strong positive subjective words}: +2
wkSubjPosW := {set of weak positive subjective words}: +1
strSubjNeuW := {set of strong neutral subjective words}: 0.5
wkSubjNeuW := {set of weak neutral subjective words}: −0.5
wkSubjNegW := {set of weak negative subjective words}: −1
strSubjNegW := {set of strong negative subjective words}: −2
None = none of the above: 0
SentimentSubj(di) = Σt=1..m subjScale(wt)   (21)
where m is the number of tokens in di and −2m ≤ SentimentSubj(di) ≤ 2m. Note that subjScale(wt) of each neutral word is not 0. We consider a strong neutral opinion as a weak positive and a weak neutral as a weak negative by assigning very small positive or negative weights. The sentiment of each di is then defined by the sum of the frequency of each subjectivity group with its weighted subjScale.
(22)
(23)
Then CMSESubj(D(T)) is a sum of subjectivity weighted opinion polarity for a given topic set D(T) with di∈D(T) . It can be defined with senti_score_ subj(di) as follow.
CMSESubjpos(D(T)) = Σdi∈D(T), sentimenti=1 senti_score_subj(di)   (24)
CMSESubjneg(D(T)) = Σdi∈D(T), sentimenti=−1 senti_score_subj(di)   (25)
CMSESubjneu(D(T)) = Σdi∈D(T), sentimenti=0 senti_score_subj(di)   (26)
Then, we define our deterministic model ρε(Pos_Tj) as a length-normalized sum of the subjectivity weighted senti_score_subj(di) for a given topic Tj with di∈D(Tj) as follow:
ρε(Pos_Tj) = (1/n) Σdi∈D(Tj) senti_score_subj(di)   (27)
where D(T) is a set of documents (messages) in a tweet stream in which each document di∈D(T), 1≤i≤n, contains one of the topic words tj∈T={t1,⋯,tk} with 1 ≤ j ≤ k, T is the set of all related topic words derived from the user-given topics, and Tj is the set of all topic words derived from a given query q as defined in Section 4.1. D(Tj), a subset of D(T), is the set of documents where each document di contains one of the related topic words in the topic set Tj, and n is the number of documents di in D(Tj).
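A minimal sketch of the deterministic scoring is shown below. The subjectivity scale follows the six groups listed above, but the lexicon entries are placeholders for the Opinion Finder lists, and the per-document sum plus length-normalized topic score are simplifications of SentimentSubj(di) and Equation (27) rather than the authors' implementation.

```python
SUBJ_SCALE = {
    "landslide": 2.0,    # strong positive subjective (placeholder entries)
    "decent": 1.0,       # weak positive subjective
    "reported": 0.5,     # strong neutral subjective
    "alleged": -0.5,     # weak neutral subjective
    "shaky": -1.0,       # weak negative subjective
    "disastrous": -2.0,  # strong negative subjective
}

def sentiment_subj(tokens):
    """SentimentSubj(di): sum of subjectivity scales over tokens (Eq. 21)."""
    return sum(SUBJ_SCALE.get(w, 0.0) for w in tokens)

def deterministic_topic_score(topic_docs):
    """Length-normalized sum over a topic-correlated set, as in Eq. (27)."""
    if not topic_docs:
        return 0.0
    return sum(sentiment_subj(d) for d in topic_docs) / len(topic_docs)

docs = [["landslide", "win", "reported"], ["disastrous", "shaky", "debate"]]
print(deterministic_topic_score(docs))   # (2.5 + -3.0) / 2 = -0.25
```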
Probabilistic Topic Model
The probabilistic model adopts the SMO and STO measures of CSOM with subjectivity to derive a modified log-likelihood of the ratio of the subjectivity-weighted PosSMO and NegSMO over a given topic set D(T) and a subset D(Tj).
Our probabilistic model ρ with a given topic set D(T) and D(Tj) measures the probability of the sentiment polarity of a given topic set D(Tj), where D(Tj) is a subset of D(T). For example, the probability of a positive opinion for Trump in D(T), denoted as P(Pos_TR), is defined as follows:
(28)
ϵ is a smoothing factor [30], and we treat a strong neutral subjectivity as a weak positivity here. Then, we define our probabilistic model ρ(Pos_TR) as
(29)
where NegativeInfo(TR) is essentially a subjectivity-weighted NegSMO(TRj), defined as follows.
(30)
Our probabilistic model thus penalizes the positive opinion of a topic with the weight of the negative opinion in the correlated topic set D(TR) when measuring it over the entire data set D(T).
(31)
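A compact way to see the intuition of the probabilistic model is a smoothed log-ratio over subjectivity-weighted positive and negative mass. The sketch below is only in the spirit of Equations (28)-(31); the epsilon smoothing, the log form, and the input format (one weighted score per document) are assumptions rather than the paper's exact definition.

import math

def probabilistic_positive(topic_scores, all_scores, eps=1e-6):
    # topic_scores / all_scores: per-document subjectivity-weighted sentiment scores.
    pos_topic = sum(s for s in topic_scores if s > 0) + eps
    neg_topic = -sum(s for s in topic_scores if s < 0) + eps
    pos_all = sum(s for s in all_scores if s > 0) + eps
    neg_all = -sum(s for s in all_scores if s < 0) + eps
    # Positive share of the topic, penalized by its negative share (both smoothed).
    return math.log(pos_topic / pos_all) - math.log(neg_topic / neg_all)

print(probabilistic_positive([2.0, 1.0, -0.5], [2.0, 1.0, -0.5, -2.0, 0.5]))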
Multinomial Naive Bayes
The fifth layer of our framework, the Prediction Layer, applies the Deterministic and Probabilistic sentiment models discussed in Section 4 to our predictive classifiers for event outcome prediction in a real-time environment. The predictive performance of each model was measured using a supervised predictive analytics model: Multinomial Naive Bayes. Naive Bayes is a supervised probabilistic learning method that is popular for text categorization problems, i.e., judging whether documents belong to one category or another, because it is based on the assumption that each word occurrence in a document is independent, as in the "bag of words" model.
Naive Bayes uses a simple technique to construct a classifier: it assigns class labels to problem instances represented as vectors of feature values, where the class labels are drawn from a finite set [31]. We utilized the Multinomial model for text classification, based on the "bag of words" model for a document [32]. Multinomial Naive Bayes models the distribution of words in a document as a multinomial: a document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other. For classification, we assume a fixed number of classes Ck∈{C1,C2,⋯,Cm}, each with a fixed set of multinomial parameters. The parameter vector for a class is Ck={Ck1,⋯,Ckn}, where n is the size of the vocabulary and ∑Ck=1.
(32)
In the multinomial event model, a document is an ordered sequence of word events representing the frequencies with which events have been generated by a multinomial (p1⋯pn), where pi is the probability that event i occurs and xi is the feature vector counting the number of times event i was observed in an instance [32]. Each document di is drawn from a multinomial distribution of words with as many independent trials as the length of di, yielding a "bag of words" representation for the documents [32]. Thus the probability of a document given its class is represented by k such multinomials [32].
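The paper does not state which implementation was used for the classifier; as one standard way to reproduce a bag-of-words Multinomial Naive Bayes, the sketch below uses scikit-learn with toy texts and labels, where alpha is the usual Laplace/Lidstone smoothing parameter.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy bag-of-words Multinomial Naive Bayes classifier (illustrative data).
texts = ["great rally tonight", "terrible debate performance",
         "huge win for the campaign", "what a disaster of a speech"]
labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(texts, labels)
print(model.predict(["what a great speech"]))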
EXPERIMENTS
We applied the sentiment models discussed in Sections 4.2 and 4.3 to the real-time Twitter stream for two events: the 2016 US Presidential election and the 2017 Inauguration. User opinion surrounding the political candidates and the corresponding election policies was identified, extracted, and measured in an effort to demonstrate SMDSSAS's capability for accurate, critical decision making. A total of 74,310 topic-correlated tweets were collected from randomly chosen, continuous 30-second intervals in an Apache Spark DStream accessing the Twitter Streaming API, covering the pre-election month of October 2016, the pre-election week of November 2016, and the pre-inauguration week in January 2017. The context detector generated the set of topic words for the following topics: Hillary Clinton, Donald Trump, and political policies. The number of topic-correlated tweets for the candidate Donald Trump was ~53,009, while the number for the candidate Hillary Clinton was ~8510, which is considerably smaller. Tweets were preprocessed with a custom cleaning function to remove all non-English characters, including the Twitter at "@" and hashtag "#" signs, image/website URLs, punctuation ("[. , ! " ']"), digits ([0-9]), and non-alphanumeric characters ($ % & ^ * () + ~), and were stored in a NoSQL Hive database. Each topic-correlated tweet was labeled for sentiment using the OpinionFinder subjectivity word lexicon and the subjScale(wt) defined in Section 4.3, associating a numeric value with each word based on polarity and subjectivity strength.
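The cleaning function itself is not published with the paper; the following sketch approximates the steps listed above (stripping mentions, hashtag signs, URLs, punctuation, digits, and other non-alphanumeric or non-English characters) with simple regular expressions.

import re

def clean_tweet(text):
    # Approximate the described preprocessing: URLs, @mentions/#hashtags,
    # then everything that is not an ASCII letter or space.
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    text = re.sub(r"[@#]\w*", " ", text)
    text = re.sub(r"[^A-Za-z ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("RT @user: Trump wins #Election2016! https://t.co/abc 100%"))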
Predicting the Outcome of the 2016 Presidential Election in the Pre-Election Weeks
Figure 2 shows the results of the sentiment orientation analysis for the two presidential candidates over several months of pre-election 2016 tweet traffic. The lowest positive polarity measure for Donald Trump (0.11) occurred in pre-election October, but it more than doubled to 0.26 in the election month of November and stood at 0.24 in pre-inauguration January 2017. His negative sentiment orientation was already much lower (0.022) than his positive orientation in October, and it kept dropping to 0.016 for November and January.
Figure 2: Measuring pre-election sentiment orientation shift for 2016 presidential election cycle.
In contrast, Hillary Clinton's positive and negative sentiment orientation measures were consistently low during October and November. Her positive sentiment measure ranged from 0.022 in October to 0.016 in November, almost ten times smaller than Trump's, and it kept dropping to 0.007 in January. Clinton's negative orientation measure was roughly ten times higher than Trump's, ranging from 0.03 in October to 0.01 in November, but it decreased to 0.009 in January.
Predicting with Deterministic Topic Model
Our Deterministic Topic Model, as discussed in Section 4.3, was applied to the November 2016 pre-election tweet streams. The positive polarity orientation for Donald Trump increased to 0.60, while the positive polarity measure for Hillary Clinton was 0.069. As shown in Figure 3(b) below, positive sentiment orientation for candidate Donald Trump increased sharply in the data streams during pre-election November, with Trump's volume of topic-correlated tweets (53,009) compared to Hillary Clinton's (8510) for the Subjectivity-Weighted CMSE shown in Figure 3(a). Our system thus identified Donald Trump as the definitive winner of the 2016 Presidential Election.
Cross Validation with the Multinomial Naive Bayes Classifier for the Deterministic and Probabilistic Models
Our cross validation was performed with the following experimental settings and the assumption that, for a user-chosen time period and a user-given topic (event), data streams are collected from randomly chosen time frames, each in a 30-second window, from the same social platform, where the messages occur randomly for both candidates.
Figure 3: (a) Polarity comparison of two candidates: Clinton vs Trump with CMSE and subjectivity weighted CMSE; (b) Comparison of positive sentiment measure of two candidates with Pos_SMO and deterministic model.
To validate parallel stream data processing, we adopted the method of evaluating big data stream classifiers proposed by Bifet et al. (2015) [7]. Standard K-fold cross-validation, used in other works with batch methods, treats each fold of the stream independently and may therefore miss concept drift occurring in the data stream. To overcome this problem, we employed the K-fold distributed cross-validation strategy [7] to validate stream data: assuming we have K different instances of the classifier that we want to evaluate running in parallel (the classifier does not need to be randomized), each time a new example arrives it is used in 10-fold distributed cross-validation, i.e., the example is used for testing by one randomly selected classifier and for training by all the others. 10-fold distributed cross-validation was performed on our stream data processing for two different data splits: 60%:40% training:test data, and 90%:10%. The average accuracy was taken for each split, for each of the deterministic and probabilistic models. Each cross-validation run was performed with classifier optimization parameters providing the model a range of smoothing factors, term-frequency features, and numeric values for minimum document frequency. Figure 4 illustrates the accuracies of the deterministic and probabilistic models. 10-fold cross-validation on the 90%:10% split with the Deterministic model showed the highest accuracy, with an average of 81%, and the Probabilistic model showed an almost comparable average accuracy of 80%. In comparison with the existing works, the overall average accuracy from the cross-validation of each model shows a 1% - 22% improvement over previous work [6] [7] [8] [9] [22] [23] [24] [29] [30]. Figure 4 below illustrates the cross-validation results of the Deterministic and Probabilistic models.
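The evaluation strategy itself can be sketched in a few lines: each arriving example tests exactly one randomly selected classifier instance and trains all the others. The sketch below uses scikit-learn's MultinomialNB with partial_fit as a stand-in for the streaming classifier; the feature vectors are assumed to be bag-of-words counts, and this is an illustration of the k-fold distributed idea from [7], not the paper's code.

import random
from sklearn.naive_bayes import MultinomialNB

def distributed_cv(stream, k=10, classes=("neg", "pos")):
    # stream: iterable of (count_vector, label) pairs arriving one at a time.
    models = [MultinomialNB() for _ in range(k)]
    correct, tested = 0, 0
    for x, y in stream:
        i = random.randrange(k)                 # the randomly chosen tester
        if hasattr(models[i], "classes_"):      # only test once this model has been trained
            correct += int(models[i].predict([x])[0] == y)
            tested += 1
        for j, m in enumerate(models):          # every other model trains on the example
            if j != i:
                m.partial_fit([x], [y], classes=list(classes))
    return correct / tested if tested else 0.0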
CONCLUSIONS
The main contribution of this paper is the design and development of a real-time big data stream analytic framework, providing a foundation for an infrastructure for real-time sentiment analysis on big text streams. Our framework proved to be an efficient, scalable tool to extract, score, and analyze opinions on user-generated text streams for user-given topics in real time or near real time. The experimental results demonstrated the ability of our system architecture to accurately predict the outcome of the 2016 Presidential race between candidates Hillary Clinton and Donald Trump. The proposed fully analytic Deterministic and Probabilistic sentiment models, coupled with the real-time streaming components, were tested on the user tweet streams captured during the pre-election month of October 2016 and the pre-election week of November 2016. The results showed that our system was able to correctly predict Donald Trump as the definitive winner of the 2016 Presidential election.
Figure 4: Average cross validation prediction accuracy on real time pre election tweet streams of 2016 presidential election for deterministic vs. probabilistic model.
The cross-validation results showed that, in real-time processing, the Deterministic Topic Model consistently improved accuracy to an average of 81% and the Probabilistic Topic Model to an average of 80%, compared to accuracies ranging from 59% to 80% in previous works [6] [7] [8] [9] [22] [23] [24] [29] [30] that lacked the complexity of our sentiment analysis processing, whether in batch or real-time mode. Finally, SMDSSAS performed efficient real-time data processing and sentiment analysis in terms of scalability. The system continuously processes a small window of the data stream (e.g., a consistent 30-second window of streaming data) in which machine learning analytics are performed on the context stream, resulting in more accurate predictions through the system's ability to continuously apply multi-layered, fully analytic processes with complex sentiment models to a constant stream of data. The improved and stable model accuracies demonstrate that the proposed framework with its two sentiment models offers a scalable real-time alternative for big data stream analysis over traditional batch-mode data analytic frameworks.
ACKNOWLEDGEMENTS
The research in this paper was partially supported by the Engineering College of CSU under the Graduate Research grant.
REFERENCES
1. Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A. and Zaharia, M. (2015) Spark SQL: Relational Data Processing in Spark. Proceedings of the ACM SIGMOD International Conference on Management of Data, Melbourne, 31 May-4 June 2015, 1383-1394. https://doi.org/10.1145/2723372.2742797
2. Sagiroglu, S. and Sinanc, D. (2013) Big Data: A Review. 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, 20-24 May 2013, 42-47. https://doi.org/10.1109/CTS.2013.6567202
3. Lars, E. (2015) What's the Best Way to Manage Big Data for Healthcare: Batch vs. Stream Processing? Evariant Inc., Farmington.
4. Hu, M. and Liu, B. (2004) Mining and Summarizing Customer Reviews. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, 22-25 August 2004, 168-177.
5. Liu, B. (2010) Sentiment Analysis and Subjectivity. In: Indurkhya, N. and Damerau, F.J., Eds., Handbook of Natural Language Processing, 2nd Edition, Chapman and Hall/CRC, London, 1-38.
6. Wang, H., Can, D., Kazemzadeh, A., Bar, F. and Narayanan, S. (2012) A System for Real-Time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle. Proceedings of ACL 2012 System Demonstrations, Jeju Island, 10 July 2012, 115-120.
7. Bifet, A., Maniu, S., Qian, J., Tian, G., He, C. and Fan, W. (2015) StreamDM: Advanced Data Mining in Spark Streaming. IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, 14-17 November 2015, 1608-1611.
8. Kulkarni, S., Bhagat, N., Fu, M., Kedigehalli, V., Kellogg, C., Mittal, S., Patel, J.M., Ramasamy, K. and Taneja, S. (2015) Twitter Heron: Stream Processing at Scale. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, 31 May-4 June 2015, 239-250. https://doi.org/10.1145/2723372.2742788
9. Nair, L.R. and Shetty, S.D. (2015) Streaming Twitter Data Analysis Using Spark for Effective Job Search. Journal of Theoretical and Applied Information Technology, 80, 349-353.
10. Pang, B. and Lee, L. (2004) A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Barcelona, 21-26 July 2004, 271-278. https://doi.org/10.3115/1218955.1218990
11. Harris, Z. (1954) Distributional Structure. WORD, 10, 146-162. https://www.tandfonline.com/doi/abs/10.1080/00437956.1954.11659520
12. Blei, D., Ng, A. and Jordan, N. (2003) Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993-1022.
13. Zhai, C. and Lafferty, J. (2004) A Study of Smoothing Methods for Language Models Applied to Information Retrieval. ACM Transactions on Information Systems, 22, 179-214.
14. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. and Dean, J. (2013) Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, 5-10 December 2013, 3111-3119.
15. Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013) Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR, Scottsdale, 2-4 May 2013, 1-11.
16. Dai, A., Olah, C. and Le, Q. (2015) Document Embedding with Paragraph Vectors. arXiv:1507.07998.
17. Le, Q. and Mikolov, T. (2014) Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, 21-26 June 2014, II1188-II1196.
18. Firth, J.R. (1930) A Synopsis of Linguistic Theory 1930-1955. In: Firth, J.R., Ed., Studies in Linguistic Analysis, Longmans, London, 168-205.
19. Tang, J., Qu, M. and Mei, Q.Z. (2015) PTE: Predictive Text Embedding through Large-Scale Heterogeneous Text Networks. arXiv:1508.00200.
20. Bo, P. and Lee, L. (2008) Opinion Mining and Sentiment Analysis. In: de Rijke, M., et al., Eds., Foundations and Trends® in Information Retrieval, James Finlay Limited, Ithaca, 1-135. https://doi.org/10.1561/1500000011
21. Maite, T., Brooke, J., Tofiloski, M., Voll, K. and Stede, M. (2011) Lexicon-Based Methods for Sentiment Analysis. Computational Linguistics, 37, 267-307. https://doi.org/10.1162/COLI_a_00049
22. O'Connor, B., Balasubramanyan, R., Routledge, B. and Smith, N. (2010) From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010), Washington DC, 23-26 May 2010, 122-129.
23. Cheng, K.M.O. and Lau, R. (2015) Big Data Stream Analytics for Near Real-Time Sentiment Analysis. Journal of Computer and Communications, 3, 189-195. https://doi.org/10.4236/jcc.2015.35024
24. Cheng, K.M.O. and Lau, R. (2016) Parallel Sentiment Analysis with Storm. Transactions on Computer Science and Engineering, 1-6.
25. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H. and Murthy, R. (2010) Hive—A Petabyte Scale Data Warehouse Using Hadoop. Proceedings of the International Conference on Data Engineering, Long Beach, 1-6 March 2010, 996-1005.
26. Manning, C., Surdeanu, A., Bauer, J., Finkel, J., Bethard, S. and McClosky, D. (2014) The Stanford CoreNLP Natural Language Processing Toolkit. Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, 23-24 June 2014, 55-60. https://doi.org/10.3115/v1/P14-5010
27. Finkel, J., Grenager, T. and Manning, C. (2005) Incorporating Non-Local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), Ann Arbor, 25-30 June 2005, 363-370.
28. Wilson, T., Wiebe, J. and Hoffman, P. (2005) Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Vancouver, 6-8 October 2005, 347-354. https://doi.org/10.3115/1220575.1220619
29. Wang, S., Zhiyuan, C. and Liu, B. (2016) Mining Aspect-Specific Opinion Using a Holistic Lifelong Topic Model. Proceedings of the 25th International Conference on World Wide Web, Montréal, 11-15 April 2016, 167-176. https://doi.org/10.1145/2872427.2883086
30. Yi, J., Nasukawa, T., Bunescu, R. and Niblack, W. (2003) Sentiment Analyzer: Extracting Sentiments about a Given Topic Using Natural Language Processing Techniques. Proceedings of IEEE International Conference on Data Mining (ICDM), Melbourne, 22-22 November 2003, 1-8. https://doi.org/10.1109/ICDM.2003.1250949
31. Tilve, A. and Jain, S. (2017) A Survey on Machine Learning Techniques for Text Classification. International Journal of Engineering Sciences and Research, 6, 513-520.
32. Rennie, J., Shih, L., Teevan, J. and Karger, D. (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on International Conference on Machine Learning, Washington DC, 21-24 August 2003, 616-623.
Chapter 7
The Influence of Big Data Analytics in the Industry
Haya Smaya
Mechanical Engineering Faculty, Institute of Technology, MATE Hungarian University of Agriculture and Life Science, Gödöllő, Hungary
ABSTRACT
Big data has appeared to be one of the most addressed topics recently, as every aspect of modern technological life continues to generate more and more data. This study is dedicated to defining big data, how to analyze it, the challenges, and how to distinguish between data and big data analyses. Therefore, a comprehensive literature review has been carried out to define and characterize Big-data and analyze processes. Several keywords, which are (big-data), (big-data analyzing), (data analyzing), were used in scientific research engines (Scopus), (Science direct), and (Web of Science) to acquire up-to-date data from the recent publications on that topic. This study shows the viability of Big-data analysis and how it functions in the fast-changeable world. In addition to that, it focuses on the aspects that describe and anticipate Big-data analysis behaviour. Besides that, it is important to mention that assessing the software used in analyzing would provide more reliable output than the theoretical overview provided by this essay.
Keywords: Big Data, Information Tools
Citation: Smaya, H. (2022), "The Influence of Big Data Analytics in the Industry". Open Access Library Journal, 9, 1-12. doi: 10.4236/oalib.1108383.
Copyright: © 2022 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
INTRODUCTION
The research background is dedicated to defining big data, how to analyze it, the challenges, and how to distinguish between data and big data analyses. Therefore, a comprehensive literature review has been carried out to define and characterize Big-data and its analysis processes. Several keywords, namely (big-data), (big-data analyzing), and (data analyzing), were used in scientific research engines (Scopus, Science Direct, and Web of Science) to acquire up-to-date data from recent publications on the topic. The problem this paper addresses is to show the viability of Big-data analysis and how it functions in a fast-changing world. In addition, it focuses on the aspects that describe and anticipate Big-data analysis behaviour.
Big Data is omnipresent, and there is an almost urgent need to collect and protect whatever data is generated. In recent years, big data has exploded in popularity, capturing the attention and investigations of researchers all over the world. Because data is such a valuable tool, making proper use of it may help people improve their projections, investigations, and decisions [1]. The growth of science has driven everyone to mine and consume large amounts of data for company, consumer, bank account, medical, and other studies, which has resulted in privacy breaches or intrusions in many cases [2]. The promise of data-driven decision-making is now widely recognized, and there is growing enthusiasm for the concept of "Big Data," as seen in the White House's recent announcement of new funding programs across many agencies. While Big Data's potential is real (Google, for example, is thought to have contributed 54 billion dollars to the US economy in 2009), there is no broad unanimity on this [3]. It is difficult to recall a topic that received so much hype as broadly and quickly as big data. While barely known a few years ago, big data is one of the most discussed topics in business today across industry sectors [4]. This study is dedicated to defining the Big-data concept, assessing its viability, and investigating the different methods of analyzing and studying it.
STATUS QUO OVERVIEW
This chapter will provide a holistic assessment of the Big-data concept based on several studies carried out in the last ten years, in addition to the behaviour of big data, its features, and the methodologies used to analyze it.
Big-Data Definition and Concept
Big data analytics is the often-complex process of analyzing large amounts of data to identify information such as hidden patterns, correlations, market trends, and customer preferences that can assist businesses in making better decisions [5]. Big Data is today's biggest buzzword, and with the quantity of data generated every minute by consumers and organizations around the world, Big Data analytics holds huge potential [6]. To illustrate the importance of Big-data in today's world, Spotify is a good example of how Big-data works. Spotify has nearly 96 million users that generate a vast amount of data every day. By analyzing this data, the cloud-based platform automatically suggests songs using a smart recommendation engine. This huge amount of data consists of likes, shares, search history, and every click in the application. Some researchers estimate that Facebook generates more than 500 terabytes of data every day, including photos, videos, and messages. Almost everything we do online, in every industry, rests on the same concept; this is why big data gets so much hype [7]. Generally, Big-data is a massive data set that cannot be stored, processed, or analyzed using traditional tools [8]. This data can exist in several forms: structured data, such as an Excel sheet with a definite format; semi-structured data, such as an email; and unstructured data, such as pictures and videos with no predetermined format. Combining all these types of data creates what is called Big-data (Figure 1) [9] [10].
Characteristics of Big-Data
Firstly, it is essential to differentiate between Big-data and structured data (which is usually stored in relational database systems) based on five parameters (Figure 2): 1) Volume, 2) Variety, 3) Velocity, 4) Value, and 5) Veracity. These parameters are usually referred to as the 5V's and represent the main challenges of Big-data management:
1. Volume: Volume is the major challenge for Big-data and the paramount aspect that distinguishes it. Big-data volume is measured not in gigabytes but in terabytes (1 tera = 1000 giga) and petabytes (1 peta = 1000 tera). The cost of storing this tremendous amount of data is a hurdle for the data scientist to overcome.
2. Variety: Variety refers to the different data types, such as structured, unstructured, and semi-structured data in relational database storage systems. The data format could take forms such as documents, emails, social media text messages, audio, video, graphics, images, graphs, and the output from all types of machine-generated data from various sensors, devices, machine logs, cell phone GPS signals, and more [11].
3. Velocity: The motion of the data sets is a significant aspect by which to categorize data types. Data-at-rest and data-in-motion are the terms that deal with velocity. The major concern is the consistency and completeness of fast-paced data streams and matching the desired result. Velocity also includes time and latency characteristics: whether the data is analyzed, processed, stored, managed, and updated at first rate or with a lag time between the events.
4. Value: Value deals with what value should result from a given set of data.
5. Veracity: Veracity describes the quality of data. Is the data noiseless and conflict-free? Accuracy and completeness are the concerns here.
Figure 1: Scheme of big-data analyzing the output [9].
Figure 2: 5V concept. The source [12].
BIG-DATA ANALYSIS
Viability of Big-Data Analysis
Big data analytics assists businesses in harnessing their data and identifying new opportunities. The result is smarter business decisions, more effective operations, higher profits, and happier consumers. More than 50 firms were interviewed for the publication Big Data in Big Companies (Figure 3) [13] to learn how they used big data.
Figure 3: Frequency distribution of documents containing the term “big data” in ProQuest Research Library. The source [6].
According to the report, they gained value in the following ways:
• Cost reduction. When it comes to storing massive volumes of data, big data technologies like Hadoop and cloud-based analytics provide significant cost savings, and they can also uncover more effective methods of doing business.
• Faster, better decision-making. Thanks to Hadoop's speed and in-memory analytics, as well as the ability to study new sources of data, businesses can evaluate information instantaneously and make decisions based on what they have learned.
• New products and services. With the capacity to use analytics to measure client requirements and satisfaction comes the potential to provide customers with exactly what they want. According to Davenport, more organizations are using big data analytics to create new goods to fulfill the needs of their customers.
Analyzing Process
Analyzing Steps
Data analysts, data scientists, predictive modellers, statisticians, and other analytics experts collect, process, clean, and analyze increasing volumes of structured transaction data, as well as other types of data not typically used by traditional BI and analytics tools. The four steps of the data preparation process are summarized below (Figure 4) [7]; a toy illustration of the four steps follows the list.
Figure 4: Circular process steps of data analysis [7].
1) Data specialists gather information from a range of sources. It is usually a mix of semi-structured and unstructured information. While each company will use different data streams, some of the most frequent sources are:
• clickstream data from the internet
• web server logs
• cloud apps
• mobile applications
• social media content
• text from consumer emails and survey replies
• mobile phone records
• machine data collected by internet of things (IoT) sensors.
2) The information is processed. Data professionals must organize, arrange, and segment the data effectively for analytical queries after it has been acquired and stored in a data warehouse or data lake. Analytical queries perform better when data is processed thoroughly.
3) The data is filtered to ensure its quality. Data scrubbers use scripting tools or corporate software to clean up the data. They organize and tidy up the data, looking for any faults or inconsistencies such as duplications or formatting errors.
4) Analytics software is used to analyze the data that has been collected, processed, and cleaned. This includes items such as:
• Data mining, which sifts through large data sets looking for patterns and connections.
• Predictive analytics, which involves developing models to predict customer behaviour and other future events.
• Machine learning, which makes use of algorithms to evaluate enormous amounts of data.
• Deep learning, a branch of machine learning that is more advanced.
• Programs for text mining and statistical analysis.
• Artificial intelligence (AI).
• Business intelligence software that is widely used.
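A toy end-to-end illustration of these four steps with pandas is shown below; the data sources, column names, and values are invented for the example.

import pandas as pd

# 1) Collect: pull records from two hypothetical sources.
web_logs = pd.DataFrame({"user": ["a", "b", "a"], "page": ["home", "cart", "cart"]})
crm = pd.DataFrame({"user": ["a", "b"], "spend": [10.0, None]})

# 2) Process: organize and join into one analytical table.
df = web_logs.merge(crm, on="user", how="left")

# 3) Clean: remove duplicates and fix missing values.
df = df.drop_duplicates().fillna({"spend": 0.0})

# 4) Analyze: a simple descriptive aggregation per user.
print(df.groupby("user").agg(visits=("page", "count"), spend=("spend", "mean")))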
Analyzing Tools
To support big data analytics procedures, a variety of tools and technologies are used [14]. The following are some of the most common technologies and techniques used to facilitate big data analytics processes:
• Hadoop, a free and open-source framework for storing and analyzing large amounts of data. Hadoop is capable of storing and processing enormous amounts of structured and unstructured data.
• Predictive analytics, in which hardware and software process large volumes of complicated data and use machine learning and statistical algorithms to forecast future event outcomes. Predictive analytics technologies are used by businesses for fraud detection, marketing, risk assessment, and operations.
• Stream analytics, used to filter, combine, and analyze large amounts of data in a variety of formats and platforms.
• Distributed storage, in which data is usually replicated on a non-relational database. This can be a safeguard against independent node failures, the loss or corruption of large amounts of data, or the provision of low-latency access.
• NoSQL databases, non-relational data management methods that come in handy when dealing with vast amounts of scattered data. They are appropriate for raw and unstructured data because they do not require a fixed schema.
• A data lake, a big storage repository that stores raw data in native format until it is needed. A flat architecture is used in data lakes.
• A data warehouse, a data repository that holds vast amounts of data gathered from many sources. Predefined schemas are used to store data in data warehouses.
• Knowledge discovery/big data mining tools, with which businesses are able to mine vast amounts of structured and unstructured big data.
• In-memory data fabric, which distributes large volumes of data across system memory resources. This contributes to minimal data access and processing delay.
• Data virtualization, which allows data to be accessed without any technical limitations.
• Data integration software, which enables big data to be streamlined across different platforms, including Apache, Hadoop, MongoDB, and Amazon EMR.
• Data quality software, which cleans and enriches massive amounts of data.
• Data preprocessing software, which prepares data to be analyzed further: unstructured data is cleaned and the data is prepared.
• Spark, a free and open-source cluster computing platform for batch and real-time data processing (see the sketch after this list).
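As a flavour of the programming model behind tools such as Spark, the following is a minimal PySpark batch sketch that aggregates a hypothetical CSV of customer transactions; the file name and column names are assumptions, and a real deployment would involve cluster configuration well beyond this snippet.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-analytics-sketch").getOrCreate()

# Hypothetical input: one row per transaction with customer_id and amount columns.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)
summary = (df.groupBy("customer_id")
             .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
             .orderBy(F.desc("revenue")))
summary.show(10)
spark.stop()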
Different Types of Big Data Analytics
Here are the four types of Big Data analytics:
1) Descriptive Analytics: This summarizes previous data in an easy-to-understand format. It aids in the creation of reports such as a company's income, profit, and sales, among other things. It also aids in the tallying of social media metrics.
2) Diagnostic Analytics: This is done to figure out what created an issue in the first place. Drill-down, data mining, and data recovery are all instances of techniques. Diagnostic analytics is used by businesses to gain a deeper understanding of a problem.
3) Predictive Analytics: This sort of analytics examines past and current data to create predictions. Predictive analytics analyzes current data and makes forecasts using data mining, artificial intelligence, and machine learning. It predicts customer and market trends, among other things.
4) Prescriptive Analytics: This type of analytics recommends a remedy to a specific issue. Both descriptive and predictive analytics are used in prescriptive analytics. Most of the time, AI and machine learning are used.
Big Data Analytics Benefits
Big data analytics has several advantages (Figure 5):
• The ability to swiftly evaluate massive amounts of data from numerous sources in a variety of forms and types.
• Better-informed decisions made more quickly for more successful strategizing, which can benefit and improve the supply chain, operations, and other strategic decision-making areas.
• Savings that can be realized because of increased business process efficiencies and optimizations.
• Greater marketing insights and information for product creation, coming from a better understanding of client demands, behaviour, and sentiment.
• Risk management tactics that are improved and more informed as a result of huge data sample sizes [15].
Big Data Analytics Challenges
Despite the numerous advantages of utilizing big data analytics, it is not without its drawbacks [16]:
• Data accessibility. Data storage and processing grow increasingly difficult as the amount of data increases. To ensure that less experienced data scientists and analysts can use big data, it must be appropriately stored and managed.
• Ensuring data quality. Data quality management for big data necessitates a significant amount of time, effort, and resources due to the large volumes of data coming in from multiple sources and in varied forms.
• Data protection. Large data systems pose unique security challenges due to their complexity. It can be difficult to properly handle security risks within such a sophisticated big data ecosystem.
• Choosing the appropriate tools. Organizations must know how to choose the appropriate tool that corresponds with users' needs and infrastructure from the huge diversity of big data analytics tools and platforms available on the market.
• Some firms are having difficulty filling the gaps due to a probable lack of internal analytics expertise and the high cost of acquiring professional data scientists and engineers.
Figure 5: Benefits of big-data analytics [15].
The Difference between Data Analytics and Big Data
1) Nature: Consider an example of the key distinction between Big Data and Data Analytics. Data Analytics is similar to a book in which you can discover solutions to your problems; Big Data, on the other hand, can be compared to a large library with answers to all the questions, in which it is tough to locate them.
2) Structure of Data: In data analytics, the data is already structured, making it simple to discover an answer to a question. Big Data, however, is a generally unstructured set of data that must be sorted through to discover an answer to any question, and processing such massive amounts of data is difficult. To gain useful insight from Big Data, a variety of filters must be applied.
3) Tools used in Big Data vs. Data Analytics: Because the data to be analyzed in data analytics is already structured and not difficult, simple statistical and predictive modelling tools can be used. Because processing the vast volume of Big Data is difficult, advanced technological solutions such as automation tools or parallel computing tools are required to manage it.
4) Type of Industry using Big Data and Data Analytics: Industries such as IT, travel, and healthcare are among the most common users of data analytics. Using historical data and studying prior trends and patterns, Data Analytics assists these businesses in developing new advances. At the same time, Big Data is utilized by a variety of businesses, including banking, retail, and others. In a variety of ways, big data assists these industries in making strategic business decisions.
Application of Data Analytics and Big Data
Data is the foundation for all of today's decisions; no choices or actions can be taken without it. To achieve success, every company now employs a data-driven strategy. Data Scientists, Data Experts, and other data-related careers abound these days.
Job Responsibilities of Data Analysts
1) Analyzing Trends and Patterns: Data analysts must foresee and predict what will happen in the future, which can be very useful in company strategic decision-making. In this situation, a data analyst must recognize long-term trends [17] and must also give precise recommendations based on the patterns discovered.
2) Creating and Designing Data Reports: A data analyst's reports are a necessary component of a company's decision-making process. Data analysts need to construct the data report and design it in such a way that the decision-maker can understand it quickly. Data can be displayed in a variety of ways, including pie charts, graphs, charts, diagrams, and more. Depending on the nature of the data to be displayed, data reporting can also be done in the form of a table.
3) Deriving Valuable Insights from the Data: To benefit their organizations, data analysts need to extract relevant and meaningful insights from the data. The company will be able to use those valuable and unique insights to make the best decisions for its long-term growth.
4) Collection, Processing, and Summarizing of Data: A data analyst must first collect data, then process it using the appropriate tools, and finally summarize the information such that it is easily understood. The summarized data can reveal a lot about the trends and patterns used for forecasting and prediction.
Job Responsibilities of Big Data Professionals
1) Analyzing Real-time Situations: Big Data professionals are in high demand for analyzing and monitoring scenarios that occur in real time. This assists many businesses in taking immediate and timely action to address any issue or problem, as well as capitalizing on opportunities [18]. Many businesses can cut losses, boost earnings, and become more successful this way.
2) Building a System to Process Large-Scale Data: Processing large amounts of data promptly is difficult; unstructured data that cannot be processed by a simple tool is sometimes referred to as Big Data. A Big Data professional must create a complex technological tool or system to handle and analyze Big Data in order to make better decisions [19].
3) Detecting Fraudulent Transactions: Fraud is on the rise, and it is critical to combat the problem. Big Data experts should be able to spot any potentially fraudulent transactions. Many sectors, particularly banking, have important duties in this area. Every day, many fraudulent transactions occur in the financial sector, and banks must act quickly to address this problem; otherwise, people will lose trust in the financial system and stop saving their hard-earned money in banks.
CONCLUSIONS
Gradually, the business sector is relying more and more on data science for its development. A tremendous amount of data is used to describe the behaviour of complex systems, anticipate the output of processes, and evaluate this output. Based on what has been discussed in this essay, it can be stated that Big-data analytics is the cutting-edge methodology in data science alongside every other technological aspect, and studying this field comprehensively is essential for further development. Several methods and software packages are commercially available for analyzing big-data sets, each of which can relate to technology, business, or social media. Further studies using analyzing software could enhance the depth of the knowledge reported and validate the results.
REFERENCES
1. Siegfried, P. (2017) Strategische Unternehmensplanung in jungen KMU—Probleme and Lösungsansätze. de Gruyter/Oldenbourg Verlag, Berlin.
2. Siegfried, P. (2014) Knowledge Transfer in Service Research—Service Engineering in Startup Companies. EUL-Verlag, Siegburg.
3. Divesh, S. (2017) Proceedings of the VLDB Endowment. Proceedings of the VLDB Endowment, 10, 2032-2033.
4. Su, X. (2012) Introduction to Big Data. In: Opphavsrett: Forfatter og Stiftelsen TISIP, Institutt for informatikk og e-læring ved NTNU, Zürich, Vol. 10, Issue 12, 2269-2274.
5. Siegfried, P. (2015) Die Unternehmenserfolgsfaktoren und deren kausale Zusammenhänge. In: Zeitschrift Ideen- und Innovationsmanagement, Deutsches Institut für Betriebswirtschaft GmbH/Erich Schmidt Verlag, Berlin, 131-137. https://doi.org/10.37307/j.2198-3151.2015.04.04
6. Gandomi, A. and Haider, M. (2015) Beyond the Hype: Big Data Concepts, Methods, and Analytics. International Journal of Information Management, 35, 137-144. https://doi.org/10.1016/j.ijinfomgt.2014.10.007
7. Lembo, D. (2015) An Introduction to Big Data. In: Application of Big Data for National Security, Elsevier, Amsterdam, 3-13. https://doi.org/10.1016/B978-0-12-801967-2.00001-X
8. Siegfried, P. (2014) Analysis of the Service Research Studies in the German Research Field, Performance Measurement and Management. Publishing House of Wroclaw University of Economics, Wroclaw, Band 345, 94-104.
9. Cheng, O. and Lau, R. (2015) Big Data Stream Analytics for Near Real-Time Sentiment Analysis. Journal of Computer and Communications, 3, 189-195. https://doi.org/10.4236/jcc.2015.35024
10. Abu-salih, B. and Wongthongtham, P. (2014) Chapter 2. Introduction to Big Data Technology. 1-46.
11. Sharma, S. and Mangat, V. (2015) Technology and Trends to Handle Big Data: Survey. International Conference on Advanced Computing and Communication Technologies, Haryana, 21-22 February 2015, 266-271. https://doi.org/10.1109/ACCT.2015.121
12. Davenport, T.H. and Dyché, J. (2013) Big Data in Big Companies. Baylor Business Review, 32, 20-21. http://search.proquest.com/docview/1467720121
13. Riahi, Y. and Riahi, S. (2018) Big Data and Big Data Analytics: Concepts, Types and Technologies. International Journal of Research and Engineering, 5, 524-528. https://doi.org/10.21276/ijre.2018.5.9.5
14. Verma, J.P. and Agrawal, S. (2016) Big Data Analytics: Challenges and Applications for Text, Audio, Video, and Social Media Data. International Journal on Soft Computing, Artificial Intelligence and Applications, 5, 41-51. https://doi.org/10.5121/ijscai.2016.5105
15. Begoli, E. and Horey, J. (2012) Design Principles for Effective Knowledge Discovery from Big Data. Proceedings of the 2012 Joint Working Conference on Software Architecture and 6th European Conference on Software Architecture, WICSA/ECSA, Helsinki, 20-24 August 2012, 215-218. https://doi.org/10.1109/WICSA-ECSA.212.32
16. Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N., Wald, R. and Muharemagic, E. (2015) Deep Learning Applications and Challenges in Big Data Analytics. Journal of Big Data, 2, 1-21. https://doi.org/10.1186/s40537-014-0007-7
17. Bätz, K. and Siegfried, P. (2021) Complexity of Culture and Entrepreneurial Practice. International Entrepreneurship Review, 7, 61-70. https://doi.org/10.15678/IER.2021.0703.05
18. Bockhaus-Odenthal, E. and Siegfried, P. (2021) Agilität über Unternehmensgrenzen hinaus—Agility across Boundaries, Bulletin of Taras Shevchenko National University of Kyiv. Economics, 3, 14-24. https://doi.org/10.17721/1728-2667.2021/216-3/2
19. Kaisler, S.H., Armour, F.J. and Espinosa, A.J. (2017) Introduction to Big Data and Analytics: Concepts, Techniques, Methods, and Applications Mini Track. Proceedings of the Annual Hawaii International Conference on System Sciences, Hawaii, 4-7 January 2017, 990-992. https://doi.org/10.24251/HICSS.2017.117
Chapter 8
Big Data Usage in the Marketing Information System
Alexandre Borba Salvador, Ana Akemi Ikeda
Faculdade de Administração, Economia e Ciências Contábeis, Universidade de São Paulo, São Paulo, Brazil
ABSTRACT
The increase in data generation, storage capacity, processing power, and analytical capacity has created a technological phenomenon named big data that could have a big impact on research and development. In the marketing field, the use of big data in research can represent a deep dive into consumer understanding. This essay discusses the uses of big data in the marketing information system and its contribution to decision-making. It presents a revision of the main concepts, the new possibilities of use, and a reflection about its limitations.
Keywords: Big Data, Marketing Research, Marketing Information System
Citation: Salvador, A. and Ikeda, A. (2014), “Big Data Usage in the Marketing Information System”. Journal of Data Analysis and Information Processing, 2, 77-85. doi: 10.4236/jdaip.2014.23010. Copyright: © 2014 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http:// creativecommons.org/licenses/by/4.0
INTRODUCTION
A solid information system is essential to obtain relevant data for the decision-making process in marketing: the more correct and relevant the information, the greater the probability of success. The 1990s were known as the decade of the network society and of transactional data analysis [1]. However, in addition to this critical data, there is a great volume of less structured information that can be analyzed in order to find useful information [2]. The growth of data generation, storage capacity, processing power, and data analysis produced a technological phenomenon called big data. This phenomenon causes great impacts on studies and leads to the development of solutions in different areas. In marketing, big data research can represent the possibility of a deep understanding of consumer behavior through profile monitoring (geo-demographic, attitudinal, behavioral), the statement of areas of interest and preferences, and the monitoring of purchase behavior [3] [4]. The triangulation of data available in real time with information previously stored and analyzed would enable the generation of insights that would not be possible through other techniques [5]. However, in order for big data information to be used correctly by companies, some measures are necessary, such as investment in people's qualifications and in equipment. More than that, the increase in information access may generate ethics-related problems, such as invasion of privacy and redlining. It may affect research as well, as in cases where information could be used without the consent of the surveyed. Predictive analytics are models that seek to predict consumer behavior through data generated by purchase and/or consumption activities, and with the advent of big data, predictive analytics grows in importance for understanding this behavior from the data generated in online interactions among these people. The use of predictive systems can also be controversial, as exemplified by the case of the American chain Target, which identified the purchase behavior of women at the early stage of pregnancy and sent a congratulation letter to a teenage girl who had not yet informed her parents about the pregnancy. The case generated considerable negative repercussions and the chain suspended the action [4]. The objective of this essay is to discuss the use of big data in the context of marketing information systems, present new possibilities resulting from its use, and reflect on its limitations. For that, the point of view of researchers and experts will be explored based on academic publications, which will
be analyzed and confronted so we may, therefore, infer conclusions on the subject.
THE USE OF INFORMATION IN THE DECISION-MAKING PROCESS IN MARKETING
The marketing information system (MIS) was defined by Cox and Good (1967, p. 145) [6] as a series of procedures and methods for the regular, planned collection, analysis and presentation of information for use in making marketing decisions. For Berenson (1969, p. 16) [7], the MIS would be an interactive structure of people, equipment, methods and controls, designed to create a flow of information able to provide an acceptable base for the decision-making process in marketing. The need for its implementation would derive from points that have not changed yet: 1) the increase in business complexity would demand more information and better performance; 2) the life cycle of products would be shortened, requiring more assertiveness from marketing managers to collect profits in shorter times; 3) companies would become so large that the lack of effort to create a structured information system would make their management impractical; 4) business would demand rapid decisions and, therefore, in order to support decision making, an information system would be essential for marketing areas; 5) although an MIS is not dependent on computers, the advances in hardware and software technologies would have spread its use in companies, and not using its best resources would represent a competitive penalty [7]. The data supplying an MIS can be structured or non-structured regarding its search mechanisms, and internal (company) or external (micro and macro environment) regarding its origin. The classic and most popular way of organizing it is through sub-systems [8]-[10]. The input and processing sub-systems of an MIS are the internal registration sub-system (structured and internal information), the marketing intelligence sub-system (information from secondary sources, non-structured and from external origins), and the marketing research sub-system (information from primary sources, structured, from internal or external origins, generated from a research question).
BIG DATA
The term big data applies to information that cannot be processed using traditional tools or processes. According to an IBM [11] report, the three
characteristics that would define big data are volume, speed and variety, as together they would have created the need for new skills and knowledge in order to improve the ability to handle the information (Figure 1). The Internet and the use of social media have transferred the power of creating content to users, greatly increasing the generation of information on the Internet. However, this represents a small part of the generated information. Automated sensors, such as RFID (radio-frequency identification), multiplied the volume of collected data, and the volume of stored data in the world is expected to jump from 800,000 petabytes (PB) in 2000 to 35 zettabytes (ZB) in 2020. According to IBM, Twitter would generate by itself over 7 terabytes (TB) of data a day, while some companies would generate terabytes of data in an hour, due to its sensors and controls. With the growth of sensors and technologies that encourage social collaboration through portable devices, such as smartphones, the data became more complex, due to its volume and different origins and formats, such as files originating from automatic control, pictures, books, reviews in communities, purchase data, electronic messages and browsing data. The traditional idea of data speed would consider its retrieval, however, due to the great number of sensors capturing information in real time, the concern with the capture and information analysis speed emerges, leading, therefore, to the concept of flow.
Figure 1. Three big data dimensions. Source: Adapted from Zikopoulos and Eaton, 2012.
Capture in batches is replaced by streaming capture. Big data, therefore, refers to a massive volume of information measured in zettabytes rather than terabytes, captured from different sources, in several formats, and in real time [11]. A work plan with big data should take three main elements into consideration: 1) collection and integration of a great volume of new data for fresh insights; 2) selection of advanced analytical models in order to automate operations and predict results of business decisions; and 3) creation of tools to translate model outputs into tangible actions and training of key employees to use these tools. Internally, the benefits of this work plan would be greater efficiency of the corporation, since it would be driven by more relevant, accurate, and timely information, more transparency in how the operation runs, better prediction, and greater speed in simulations and tests [12]. Another change presented by big data is in the ownership of information. The great information stores used to be owned only by governmental organizations and major traditional corporations. Nowadays, new technology corporations (such as Facebook, Google, LinkedIn) hold a great part of the information on people, and the volume is rapidly increasing. Altogether, this information creates a digital trail for each person, and its study can lead to the identification of their profile and preferences and even the prediction of their behavior [5]. Within business administration, new uses for this information are identified every day, with promises of benefits for operations (productivity gains), finance (control and scenario predictions), human resources (recruitment and selection, salary, identification of retention factors) and research and development (virtual prototyping and simulations). In marketing, big data information can help both to improve information quality for strategic planning in marketing and to support the definition of action programs.
USE OF BIG DATA IN THE MARKETING INFORMATION SYSTEM
Marketing can benefit from the use of big data information, and many companies and institutes are already being structured to offer digital research and monitoring services. The use of this information will be presented following the classical model of the marketing information system proposed by Kotler and Keller (2012) [10].
Input Sub-Systems
Internal Reports
Internal reports became more complete and complex, involving information and metrics generated by the company's digital properties (including websites and fan pages), which also increases the amount of information on consumers, reaching beyond customer profile data. With the increase of information from different origins and in different formats, a richer internal database becomes the research source for business, market, client and consumer insights, in addition to internal analysis.
Marketing Intelligence
If, on one hand, the volume of information originating from marketing intelligence increases, on the other hand it is concentrated in an area with more structured search and monitoring tools, with easier storage and integration. Reading newspapers, magazines and sector reports gains a new dimension with access to global information in real time, changing the challenge from accessing information to selecting valuable information and thereby increasing the value of digital clipping services. The monitoring of competitors also gains a new dimension, since brand changes, whether local or global, can be easily followed up. Brand monitoring services multiply, with products such as GNPD by Mintel [13] and the Buzz Monitor by E.Life [14], or SCUP and Bluefin.
Marketing Research
Since the growth of the Internet and of virtual communities, studying online behavior became, at the same time, an opportunity and a necessity. Netnography draws on ethnography when proposing to study group behavior through observation in its natural environment. In this regard, ethnography (and netnography) has the characteristic of minimizing behavior-change biases by not removing the object of study from its habitat, as many other group-study methods do. However, academic publications have not reached an agreement on the technique's application and depth of analysis [15]-[17]. Kozinets (2002, 2006) [16] [17] proposes a deep study, in which the researcher needs to acquire great knowledge of the studied group and monitor it for long periods, while Gebera (2008) [15] is not clear about such need for deep knowledge of the technique, allowing an understanding closer to a content analysis based on digital data. For
the former, just as in ethnography, ethical issues become more important, as the researcher should ask for permission to monitor the group and make their presence known; for the latter, netnography would not require such presentation by the observer when public data are collected. The great volume of data captured from social networks could thus be analyzed using netnography. One of the research techniques that has been gaining ground in the digital environment is content analysis, due to, on one hand, the great amount of data available for analysis on several subjects and, on the other hand, the spread of free automated analysis tools, such as Many Eyes by IBM [18], which offers cloud resources for term counts, term correlation, scores and charts, among others. The massive volume of information in big data provides a great increase in sample size and, in some cases, enables research on the whole population, with "n = all" [4].
Storage, Retrieval and Analysis
With the massive increase in information volume and complexity, the storage, retrieval and analysis activities become even more important with big data. Companies that are not prepared to deal with the challenge find support in outsourcing the process [11]. According to Soat (2013) [19], the attribution of scores to digitally available information (e-scores) would be one way of working with information from different origins, including personal data (collected from loyalty programs or e-mail messages), browsing data collected through cookies, and third-party data collected from financial institutions, censuses and credit cards. The analysis of this information would enable the company to develop the client's profile and produce predictive analyses to guide marketing decisions, such as the identification of clients with greater lifetime value.
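As a rough illustration of the e-score idea described above, the sketch below combines signals from different origins (loyalty-program data, browsing data collected via cookies, and third-party data) into a single score per client. The field names, normalization thresholds and weights are hypothetical and are not taken from Soat (2013) or any commercial e-score product.

```python
# Illustrative e-score sketch: weighted combination of signals from
# different data origins. All field names and weights are hypothetical.

def e_score(client, weights=None):
    """Combine normalized signals (each 0..1) into a weighted score (0..100)."""
    weights = weights or {"loyalty": 0.4, "browsing": 0.35, "external": 0.25}
    signals = {
        # loyalty-program data: purchase frequency over the last year
        "loyalty": min(client["purchases_last_year"] / 12.0, 1.0),
        # browsing data collected through cookies: product-page visits
        "browsing": min(client["product_page_visits"] / 50.0, 1.0),
        # third-party data: an external indicator assumed to be already 0..1
        "external": client["external_indicator"],
    }
    return 100 * sum(weights[k] * signals[k] for k in weights)

clients = [
    {"id": "A", "purchases_last_year": 10, "product_page_visits": 40, "external_indicator": 0.7},
    {"id": "B", "purchases_last_year": 1, "product_page_visits": 5, "external_indicator": 0.4},
]

# Rank clients by score, e.g. to prioritize those with greater expected lifetime value.
for client in sorted(clients, key=e_score, reverse=True):
    print(client["id"], round(e_score(client), 1))
```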
Information for the Decision-Making Process in Marketing
The marketing information system provides information for strategic (structure, segmentation and positioning) and operational (marketing mix) decision making. The use of big data in marketing will be analyzed below from those perspectives.
Segmentation and Positioning
For Cravens and Piercy (2008) [20], a segmentation strategy includes market analysis, identification of the market to be segmented, evaluation of
how to segment it, and definition of micro-segmentation strategies. A market analysis can identify segments that are unacknowledged or underserved by the competitors. To be successful, a segmentation strategy needs to seek groups that are identifiable and measurable, substantial, accessible, responsive and viable. Positioning can be understood as the key characteristic, benefit or image that a brand represents in the collective mind of the general public [21]. It is the action of projecting the company's offer or image so that it occupies a distinctive place in the mind of the target public [10]. Cravens and Piercy (2008, p. 100) [20] connect the segmentation activity to positioning through the identification of valuable opportunities within the segment. Segmenting means identifying the segment that is strategically important to the company, whereas positioning means occupying the desired place within the segment. Digital research and monitoring tools enable studies of consumer behavior to be used in behavioral segmentation. The assignment of scores and the use of advanced analyses help to identify and correlate variables and to define predictive algorithms to be used in market sizing and lifetime value calculations [19] [22]. Netnographic studies are also important sources for understanding consumer behavior, beliefs and attitudes, providing relevant information to generate insights and define brand and product positioning.
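As a minimal sketch of how behavioral segmentation could be operationalized on monitored data, the example below clusters clients on a few behavioral variables. It assumes the scikit-learn library is available; the features, values and number of segments are illustrative only and are not drawn from the cited authors.

```python
# Illustrative behavioral segmentation: cluster clients on behavioral
# variables derived from digital monitoring. All values are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [visits per month, average basket value, days since last purchase]
behavior = np.array([
    [30, 120.0, 3],
    [28, 150.0, 5],
    [4, 35.0, 60],
    [2, 20.0, 90],
    [15, 80.0, 20],
    [12, 60.0, 25],
])

# Standardize the columns so that no single variable dominates the distances.
scaled = (behavior - behavior.mean(axis=0)) / behavior.std(axis=0)

# Group the clients into three behavioral segments.
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(segments)  # one segment label per client, a starting point for positioning decisions
```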
Product
Starting from the positioning, the available information should be used to define the product attributes, considering the value created for the consumer. Information on consumer preferences and manifestations in communities and forums is an input for the development and adjustment of products, as well as for the definition of complementary services. The consumer could also participate in the product development process by offering ideas and evaluations in real time. Innovation could also benefit from big data, both by surveying insights with consumers and by using the information to develop the product, or even to improve the innovation process itself, drawing on the history of successful products, analyses of the process stages or queries to an idea archive [23]. As an improvement to the innovation process, studies based on big data would
enable the replication of Cooper's studies in order to define a more efficient innovation process, exploring the boundary between marketing research and research in marketing [24].
Distribution
In addition to the browsing location in the digital environment and the monitoring of visitor indicators, exit rate, bounce rate and time per page, geolocation tools enable the monitoring of consumers' physical location and how they commute. More than that, the market and consumer information from big data makes it possible to assess, in a more holistic manner, the variables that affect decisions on distribution and location [25].
Communication
Big data analysis enables new forms of communication research through the observation of how the audience interacts on social networks. From this behavior analysis, new insights on audience preferences and idols [3] may emerge to define concepts and adjust details of campaign execution. Moreover, online interaction during offline brand actions enables the creation and follow-up of indicators, whether quantitative or qualitative, to monitor the communication [3] [26]. The increase in information storage, processing and availability enables the application of the CRM concept to B2C clients, involving the activities of gathering, processing and analyzing information on clients, providing insights on how and why clients shop, optimizing the company's processes, facilitating the client-company interaction, and offering access to the client's information throughout the company.
Price
Even offline businesses will be strongly affected by the use of online price information. Research by the Google Shopper Marketing Council [27],
published in April 2013, shows that 84% of American consumers consult their smartphones while shopping in physical stores and 54% use them to compare prices. According to Vitorino (2013) [4], price information available in real time, together with an understanding of consumers' opinions and factors of influence (stated opinions, comments on experiences, browsing history, family composition, time since last purchase, purchase behavior), combined with the use of predictive algorithms, would change pricing dynamics and could, in the limit, provide inputs for a customized pricing decision at every purchase.
LIMITATIONS
Due to the lack of a culture that cultivates the proper use of information and to a history of high storage costs, a lot of historical information was lost or simply never collected. A McKinsey study of retail companies observed that the chains were not using the full potential of predictive systems due to the lack of: 1) historical information; 2) information integration; and 3) minimum standardization between the chain's internal and external information [28]-[30]. The longer the historical record, the greater the accuracy of the algorithm, provided that the environment in which the system is implemented remains stable. Biesdorf, Court and Willmott (2013) [12] highlight the challenge of integrating information from different functional systems, legacy systems and information generated outside the company, including information from the macro environment and social networks. Not having qualified people to guide studies and handle systems and interfaces is also a limiting factor for research [23], at least in the short term. According to Gobble (2013) [23], a McKinsey report identifies the need for 190,000 qualified people to work in data analysis-related posts today. The qualification of the front line should follow the development of user-friendly interfaces [12]. In addition to the people directly connected to analytics, Don Schultz (2012) [31] also highlights the need for people with "real life" experience, able to interpret the information generated by the algorithms: "If the basic understanding of the customer isn't there, built into the analytical models, it really doesn't matter how many iterations the data went through or how quickly. The output is worthless" (Schultz, 2012, p. 9). The differentiated management of clients through CRM already faces a series of criticisms and limitations. Regarding the application of CRM to service marketing, its limitations would lie in the fact that a reference
based only on past history may not reflect the client's real potential; the unequal treatment of clients could generate conflicts and dissatisfaction among clients not listed as priorities; and there are ethical issues involving privacy (improper information sharing) and differential treatment (such as redlining). These issues also apply, in a bigger dimension, to discussions about the use of big data information in marketing research and its application to clients and consumers. Predictive models are based on the assumption that the environment in which the analyzing system is implemented remains stable, which, by itself, is a limitation on the use of the information. In addition to this, and to the need to invest in a structure or spend on outsourcing, the main limitations in the use of big data are connected to three main factors: data shortage and inconsistency, qualified people, and proper use of the information. The full automation of decisions based on predictive models [5] also represents a risk, since no matter how good a model is, it is still a binary way of understanding a limited theoretical situation. At least for now, the analytical models would be responsible for performing the analyses and recommendations, but the decisions would still be the responsibility of humans. Nunan and Di Domenico (2013) [5] have also emphasized that people's behavior and their relationships in social networks may not accurately reflect their behavior offline, and the first important thing to do would be to increase the level of understanding of the relation between online and offline social behavior. However, if on one hand people control the content of the information they intentionally release in social networks, on the other hand a great amount of information is collected invisibly, compounding their digital trail. The use of information without the awareness and permission of the studied person raises research ethics issues [15]-[17]. Figure 2 shows a suggested continuum between the information that clients make available wittingly and the information made available unwittingly to the predictive systems. The ethical issues raised by Kozinets (2006) [17] and Nunan and Di Domenico (2013) [5] reinforce the importance of increasing clients' level of awareness regarding the use of their information, or of ensuring the non-customization of the analysis of information obtained unwittingly by the companies.
FINAL CONSIDERATIONS
This study discussed the use of big data in the context of the marketing information system. What became clear is that we are still at the beginning of a journey of understanding its possibilities and uses, and we can observe the great attention generated by the subject and the increasing ethical concern. As proposed by Nunan and Di Domenico (2013) [5], self-governance via ESOMAR (European Society for Opinion and Market Research) [32] is an alternative to fight abuses and excesses and to enable the good use of the information. Nunan and Di Domenico (2013) [5] propose to include in the current ESOMAR [32] rules the right to be forgotten (the possibility to request deletion of history), the right to have the data expire (complementing the right to be forgotten, transaction data could also expire), and the ownership of a social graph (an individual should be aware of the information collected about them).
Figure 2: Continuum between awareness and non-awareness regarding the use of information. Source: authors. The figure orders information types from non-confidential data provided with greater awareness and consent (open data from websites and personal pages, open comments, interest signals such as Like, Follow and RT, aware participation in online surveys) through personal information, interests, relationships and geographical location (GPS), to confidential data captured with greater unawareness and non-consent (the trail until purchase, online behavior, purchase registry, confidential documents, e-mail content, bank data).
In marketing communication, self-governance in Brazil has shown positive results, as in the examples of the liquor industry and the children's food industry, which, under the pressure of public opinion, adopted restrictive measures to repress abuses and preserve the communication of their categories [33]. Industries such as the cigarette industry are opposite examples of how excess has led to great restrictions on the category. As in the prisoners' dilemma [34], self-governance forces a solution in which all participants have to give up their best short-term individual options for the good of the group in the long term (Figure 3). On the other hand, while the consumer's consent to the use of their information would solve the ethical issues, companies have never had so much power to create value for their clients and consumers. Recovering the marketing application proposed in "Broadening the Concept of Marketing" [35], the exchange for consent could be performed by offering a major non-pecuniary value. This value offer could be the good use of the information to generate services or new proposals that increase the value perceived by the client [10]. Currently, many mobile applications offer services to consumers, apparently free of charge, in exchange for their audience for advertisements and access to their information in social networks. By understanding which service, consistent with its business proposal, the consumer sees value in, and by making this exchange clear, the combination of service and consent to the use of information could be a solution to access information in an ethical manner. From the point of view of marketing research, the development of retrieval systems and the analysis of great volumes of non-structured information could lead to the understanding of consumer behaviors. Issues regarding the discovery and understanding of consumers in marketing research are usually addressed qualitatively. However, given the volume of cases, could studies through big data provide, at the same time, an understanding of the consumer and a measurement of the groups showing that behavior?
Figure 3: Free exercise of the prisoners' dilemma application. Source: Authors, based on Pindyck and Rubinfeld (1994). The figure can be read as a 2 x 2 matrix:
- All people exceed / I exceed: excesses in invasion of privacy and excesses in communication; high investment and results shared among all; society is impaired.
- All people exceed / I do not exceed: excesses in invasion of privacy and excesses in communication; little information and visibility for those who do not exceed; society and those who do not exceed are impaired.
- All people do not exceed / I exceed: no invasion of privacy and little relevant communication; low investments in communication; high visibility and results for those who exceed; society and those who exceed are benefited, while those who do not exceed are impaired.
- All people do not exceed / I do not exceed: no invasion of privacy and little relevant communication; low investments in communication and results shared among all; society is benefited.
A suggestion for further research would be to study the combination of qualitative and quantitative research objectives, with the use of big data and analytical systems, in understanding consumer behavior and measuring the importance of consumer groups.
REFERENCES
1. Chow-White, P.A. and Green, S.E. (2013) Data Mining Difference in the Age of Big Data: Communication and the Social Shaping of Genome Technologies from 1998 to 2007. International Journal of Communication, 7, 556-583.
2. ORACLE: Big Data for Enterprise. http://www.oracle.com/br/technologies/big-data/index.html
3. Paul, J. (2012) Big Data Takes Centre Ice. Marketing, 30 November 2012.
4. Vitorino, J. (2013) Social Big Data. São Paulo, 1-5. www.elife.com.br
5. Nunan, D. and Di Domenico, M. (2013) Market Research and the Ethics of Big Data. International Journal of Market Research, 55, 2-13.
6. Cox, D. and Good, R. (1967) How to Build a Marketing Information System. Harvard Business Review, May-June, 145-154. ftp://donnees.admnt.usherbrooke.ca/Mar851/Lectures/IV
7. Berenson, C. (1969) Marketing Information Systems. Journal of Marketing, 33, 16. http://dx.doi.org/10.2307/1248668
8. Chiusoli, C.L. and Ikeda, A. (2010) Sistema de Informação de Marketing (SIM): Ferramenta de apoio com aplicações à gestão empresarial. Atlas, São Paulo.
9. Kotler, P. (1998) Administração de marketing. 5th Edition, Atlas, São Paulo.
10. Kotler, P. and Keller, K. (2012) Administração de marketing. 14th Edition, Pearson Education, São Paulo.
11. Zikopoulos, P. and Eaton, C. (2012) Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw Hill, New York, 166.
12. Biesdorf, S., Court, D. and Willmott, P. (2013) Big Data: What's Your Plan? McKinsey Quarterly, 40-41.
13. MINTEL. www.mintel.com
14. E.Life. www.elife.com.br
15. Gebera, O.W.T. (2008) La netnografía: Un método de investigación en Internet. Quaderns Digitals: Revista de Nuevas Tecnologías y Sociedad, 11. http://dialnet.unirioja.es/servlet/articulo?codigo=3100552
16. Kozinets, R. (2002) The Field behind the Screen: Using Netnography for Marketing Research in Online Communities. Journal of Marketing Research, 39, 61-72. http://dx.doi.org/10.1509/jmkr.39.1.61.18935
17. Kozinets, R.W. (2006) Click to Connect: Netnography and Tribal Advertising. Journal of Advertising Research, 46, 279-288. http://dx.doi.org/10.2501/S0021849906060338
18. Many Eyes. http://www.manyeyes.com/software/analytics/manyeyes/
19. Soat, M. (2013) E-Scores: The New Face of Predictive Analytics. Marketing Insights, September, 1-4.
20. Cravens, D.W. and Piercy, N.F. (2008) Marketing estratégico. 8th Edition, McGraw Hill, São Paulo.
21. Crescitelli, E. and Shimp, T. (2012) Comunicação de Marketing: Integrando propaganda, promoção e outras formas de divulgação. Cengage Learning, São Paulo.
22. Payne, A. and Frow, P. (2005) A Strategic Framework for Customer Relationship Management. Journal of Marketing, 69, 167-176. http://dx.doi.org/10.1509/jmkg.2005.69.4.167
23. Gobble, M.M. (2013) Big Data: The Next Big Thing in Innovation. Research-Technology Management, 56, 64-67. http://dx.doi.org/10.5437/08956308X5601005
24. Cooper, R.G. (1990) Stage-Gate Systems: A New Tool for Managing New Products. June.
25. Parente, J. (2000) Varejo no Brasil: Gestão e Estratégia. Atlas, São Paulo.
26. Talbot, D. (2011) Decoding Social Media Patterns in Tweets: A Social-Media Decoder. Technology Review, December 2011.
27. Google Shopper Marketing Agency Council (2013) Mobile In-Store Research: How In-Store Shoppers Are Using Mobile Devices, 37. http://www.marcresearch.com/pdf/Mobile_InStore_Research_Study.pdf
28. Bughin, J., Byers, A. and Chui, M. (2011) How Social Technologies Are Extending the Organization. McKinsey Quarterly, 1-10. http://bhivegroup.com.au/wp-content/uploads/socialtechnology.pdf
29. Bughin, J., Livingston, J. and Marwaha, S. (2011) Seizing the Potential of "Big Data." McKinsey …, October. http://whispersandshouts.typepad.com/files/using-big-data-to-drive-strategy-and-innovation.pdf
30. Manyika, J., Chui, M., Brown, B. and Bughin, J. (2011) Big Data: The Next Frontier for Innovation, Competition, and Productivity, 146. www.mckinsey.com/mgi
31. Schultz, D. (2012) Can Big Data Do It All? Marketing News, November, 9.
32. ESOMAR. http://www.esomar.org/utilities/news-multimedia/video.php?idvideo=57
33. CONAR. Conselho Nacional de Auto-regulamentação Publicitária. http://www.conar.org.br/
34. Pindyck, R.S. and Rubinfeld, D.L. (1994) Microeconomia. Makron Books, São Paulo.
35. Kotler, P. and Levy, S.J. (1969) Broadening the Concept of Marketing. Journal of Marketing, 33, 10-15. http://dx.doi.org/10.2307/1248740
Chapter 9
Big Data for Organizations: A Review
Pwint Phyu Khine 1, Wang Zhao Shun 1,2
1 School of Information and Communication Engineering, University of Science and Technology Beijing (USTB), Beijing, China
2 Beijing Key Laboratory of Knowledge Engineering for Material Science, Beijing, China
ABSTRACT
Big data challenges the current information technology (IT) landscape while promising more competitive and efficient contributions to business organizations. What big data can contribute is what organizations have wanted for a long time. This paper presents the nature of big data and how organizations can advance their systems with big data technologies. By improving the efficiency and effectiveness of organizations, people can take advantage of a more convenient life enabled by information technology.
Citation: Khine, P. and Shun, W. (2017), “Big Data for Organizations: A Review”. Journal of Computer and Communications, 5, 40-48. doi: 10.4236/jcc.2017.53005. Copyright: © 2017 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http:// creativecommons.org/licenses/by/4.0
Keywords: Big Data, Big Data Models, Organization, Information System
INTRODUCTION
Business organizations have been using big data to improve their competitive advantages. According to McKinsey [1], organizations that can fully apply big data gain competitive advantages over their competitors. Facebook users upload hundreds of terabytes of data each day, and these social media data are used to develop more advanced analyses whose aim is to extract more value from user data. Search engines like Google and Yahoo are already monetizing data by associating appropriate ads with user queries (i.e., Google uses big data to serve the right ads to the right user in a split second). In applying information systems to improve their operations, most government organizations lag behind business organizations [2]. Meanwhile, some governments have already taken initiatives to gain the advantages of big data; for example, the Obama administration announced an investment of more than $200 million in big data R&D in 2012 [3]. Today, people are living in the data age, where data has become like oxygen and organizations are producing more data than they can handle, leading to the big data era. The rest of this paper is organized as follows: the next section describes big data definitions, the differences and sources within data, big data characteristics, and the databases and ELT process of big data. The following section is concerned with the relationship between big data information systems and organizations, how big data systems should be implemented, and big data core techniques for organizations. The final section concludes the paper.
BIG DATA FOR ORGANIZATIONS
The nature of big data can be understood by studying its definition, the data hierarchy and sources of big data, its prominent characteristics, and its databases and processes. According to [1], there are five organization domains in which big data can create value, based on the size of the potential: health care, manufacturing, public sector administration, retail and global personal location data. There are many other potential domains that require big data solutions, such as scientific discovery (e.g. astronomical organizations, weather prediction) with huge amounts of data.
Big Data Definition
Big data refers to the world of digital data that has become too enormous to be handled by traditional data handling techniques. Big data is defined here as "a large volume of digital data which require different kinds of velocity based on the requirements of the application domains which has a wide variety of data types and sources for the implementation of the big data project depending on the nature of the organization." Big data can be further categorized into big data science and big data frameworks [4]. Big data science is "the study of techniques covering the acquisition, conditioning, and evaluation of big data", whereas big data frameworks are "software libraries along with their associated algorithms that enable distributed processing and analysis of big data problems across clusters of computer units." It is also stated that "an instantiation of one or more big data frameworks is known as big data infrastructure."
Big Data Differences and Sources within Data
According to the basic data hierarchy described in Table 1, different levels of computer systems have emerged, based on the nature of the application domains and organizations, to extract the required value (the required hierarchy level) from the data. Big data, instead, tries to extract value starting from the "data" step, by applying big data theories and techniques regardless of the type and level of information system. Based on the movement of data, data can be classified into "data in motion" and "data at rest". "Data in motion" means data that have not been stored in a storage medium, i.e., moving data such as streaming data coming from IoT devices; they need to be controlled almost in real time and require interactive control. "Data at rest" are data that can be retrieved from storage systems, such as data from warehouses, RDBMS (Relational Database Management System) databases and file systems, e.g. HDFS (Hadoop Distributed File System).

Table 1: Hierarchy of data
- Data: any piece of raw information that is unprocessed, e.g. a name, quality, sound or image.
- Information: data processed into a useful form, e.g. employee information (data about an employee).
- Knowledge: information enriched with content from human experts, e.g. pension data about an employee.
- Business insight: information extracted and used in a way that helps improve business processes, e.g. predicting trends in customer buying patterns based on current information.
The traditional "bring the data to the operations" style is not suitable for voluminous big data because it would waste a huge amount of computational power. Therefore, big data adopts the style of "operations go where the data exist" to reduce computational costs, which is achieved by using already well-established distributed and parallel computing technology [5]. Big data also differs from the traditional data paradigm. Traditional data warehouse approaches map data into a predefined schema and use a "schema-on-write" approach. When big data handles data, there is no predefined schema; instead, the required schema definition is retrieved from the data itself. Therefore, the big data approach can be considered "schema-on-read". In the information age, with the proliferation of data in every corner of the world, the sources of big data can be difficult to differentiate. Big data is sourced from the proliferation of social media, IoT, traditional operational systems and people's involvement. The sources of big data stated in [4] are IoT (Internet of Things) devices such as sensors, social networks such as Twitter, open data permitted to be used by government or some business organizations (e.g. Twitter data), and crowdsourcing, which encourages people to provide and enter data, especially for massive-scale projects (e.g. census data). The popularity, major changes or new emergence of different organizations will create new sources of big data. For example, in the past, data from social media organizations such as Facebook and Twitter were not predicted to become a big data source. Currently, data from mobile phones handled by telecommunication companies, and from IoT devices for different scientific research, have become important big data sources. In the future, transportation vehicles with machine-to-machine communication (data for automobile manufacturing firms) and data from smart cities with many interconnected IoT devices will become big data sources because of their involvement in people's daily life.
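The schema-on-read approach mentioned above can be illustrated with a small sketch: records of different shapes are stored as raw JSON lines, and the fields of interest are projected out only when the data are read. The records and field names are hypothetical.

```python
# Schema-on-read sketch: raw records are stored as-is (JSON lines) and the
# structure needed by a given analysis is derived only at read time.
import json

raw_lines = [
    '{"user": "u1", "event": "click", "page": "/home"}',
    '{"user": "u2", "event": "purchase", "amount": 19.9}',
    '{"sensor": "s7", "temp_c": 21.4}',   # a differently shaped record is accepted as-is
]

def read_with_schema(lines, fields):
    """Project each record onto the fields this particular analysis cares about."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# The "schema" (user, event) is decided by the reader, not imposed by the writer.
for row in read_with_schema(raw_lines, ["user", "event"]):
    print(row)
```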
Characteristics of Big Data The most prominent features of big data are characterized as Vs. The first three Vs of Big data are Volume for huge data amount, Variety for different
types of data, and Velocity for the different data rates required by different kinds of systems [6].
Volume: when the scale of the data surpasses traditional stores or techniques, that volume of data can generally be labeled as big data volume. Depending on the type of organization, the amount of data can vary from gigabytes and terabytes to petabytes and beyond [1]. Volume is the original characteristic behind the emergence of big data.
Variety: includes structured data defined with a specific type and structure (e.g. string, numeric and other data types found in most RDBMS databases), semi-structured data with no rigid type but some defined structure (e.g. XML tags, location data), unstructured data with no structure (e.g. audio, voice), whose structure has yet to be discovered [7], and multi-structured data, which combines structured, semi-structured and unstructured features [7] [8]. Variety comes from the complexity of data from the different information systems of the target organization.
Velocity: the rate of data required by the application systems, based on the target organization domain. The velocity of big data can be considered, in increasing order, as batch, near real-time, real-time and stream [7]. The bigger the data volume, the more challenges velocity will likely face; velocity is one of the most difficult big data characteristics to handle [8].
As more and more organizations try to use big data, further V characteristics have appeared one after another, such as value, veracity and validity. Value means that the data retrieved from big data must support the objectives of the target organization and should create a surplus value for the organization [7]. Veracity addresses confidentiality in the available data, providing the required data integrity and security. Validity means that the data must come from a valid source and be clean, because these big data will be analyzed and the results applied in the business operations of the target organization. Another V of data is "viability", or volatility, of data. Viability means the time data need to survive, i.e., the data lifetime regardless of the systems. Based on viability, data in organizations can be classified as data with unlimited lifetime and data with limited lifetime; these data also need to be retrieved and used at a point in time. Viability is also a reason the volume challenge occurs in organizations.
Database Systems and Extract-Load-Transform (ELT) in Big Data
Traditional RDBMSs with ACID properties (Atomicity, Consistency, Isolation and Durability) are intended only for structured data; they cannot handle all the V requirements of big data and cannot provide horizontal scalability, availability and performance [9]. Therefore, NoSQL (not only SQL) databases need to be used based on the domains of the organizations, such as MongoDB and CouchDB for document databases, Neo4j for graph databases, the HBase columnar database for sparse data, etc. NoSQL databases use the BASE properties (Basically Available, Soft state, Eventual consistency). Because big data is based on parallel computing and distributed technology, the CAP (Consistency, Availability, Partition tolerance) theorem affects big data technologies [10]. Data warehouses and data marts store valid and cleaned data through the process of ETL (Extract-Transform-Load): preprocessed, highly summarized and integrated (transformed) data are loaded into the data warehouses for further usage [11]. Because of the heterogeneous sources of big data, the traditional transformation process would impose a huge computational burden. Therefore, big data first "loads" all the data and then transforms only the required data based on the needs of the systems in the organization; the process changes into Extract-Load-Transform. As a result, new ideas like the "data lake" have also emerged, which try to store all data generated by an organization and go beyond data warehouses and data marts, although critics warn that a data lake can become a "data swamp" [12].
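A minimal Extract-Load-Transform sketch, under assumed file names and record layouts, might look as follows: the raw source is first copied unchanged into a "lake" directory, and a transformation is applied later, only to the rows and fields a particular analysis needs.

```python
# ELT sketch: load raw data first, transform on demand. Paths and the
# record layout are hypothetical.
import csv
import pathlib
import shutil

lake = pathlib.Path("data_lake")
lake.mkdir(exist_ok=True)

# Create a small sample source file so the sketch runs end to end.
src = pathlib.Path("orders_2020.csv")
src.write_text("order_id,amount,country\n1,120.5,BR\n2,9.9,US\n3,310.0,BR\n")

def load_raw(source_file):
    """'Load' step: copy the source into the lake unchanged, with no cleaning or schema."""
    shutil.copy(source_file, lake / pathlib.Path(source_file).name)

def transform_orders(min_amount):
    """'Transform' step, run on demand: keep only the rows and fields one analysis needs."""
    selected = []
    for path in lake.glob("orders*.csv"):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if float(row["amount"]) >= min_amount:
                    selected.append({"order_id": row["order_id"], "amount": float(row["amount"])})
    return selected

load_raw(src)
print(transform_orders(min_amount=100))  # only the large orders are transformed
```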
BIG DATA IN ORGANIZATIONS AND INFORMATION SYSTEMS
Many different kinds of organizations are now applying and implementing big data in various types of information systems based on their organizational needs. Information systems emerge according to the requirements of the organizations, which are based on what organizations do, how they do it and their organizational goals. According to Mintzberg, five different kinds of organization can be distinguished based on the organization's structure, shape and management: (1) entrepreneurial structure, a small startup firm; (2) machine bureaucracy, a medium-sized manufacturing firm with a definite structure; (3) divisionalized bureaucracy, a multi-national organization which produces different kinds of products controlled by the central
headquarters; (4) professional bureaucracy, an organization that relies on the efficiency of individuals, such as law firms and universities; and (5) adhocracy, such as a consulting firm. Different kinds of information systems are required based on the work the target organization does. The information systems required by an organization, and the nature of the problems within them, reflect the type of organizational structure. Systems are structured procedures for the regulation of the organization, limited by the organization's boundary. This boundary expresses the relationship between systems and their environment (the organization). Information systems collect and redistribute data within the internal operations of the organization and the organization's environment using three basic procedures: inputting data, processing, and outputting information. Linking the organization and its systems are "business processes", which are logically related tasks with formal rules to accomplish a specific piece of work and which need to be coordinated throughout the organization hierarchy [2]. These organizational theories hold true regardless of old or newly evolving data methodologies.
Relationship between Organization and Information Systems
The relationship between organization and information systems is called a socio-technical effect. The socio-technical model suggests that all of these components, namely organizational structure, people, job tasks and information technology (IT), must be changed simultaneously to achieve the objectives of the target organization and its information systems [2]. Sometimes these changes can result in changing business goals, relationships with people and business processes for the target organization, blur the organizational boundaries and cause a flattening of the organization [1] [2]. Big data transforms traditional siloed information systems in organizations into digital nervous systems, with information flowing in and out of the related organizational systems. Organizational resistance to change needs to be considered in every implementation of information systems. The most common reason for the failure of large projects is not the failure of the technology, but organizational and political resistance to change [2]. Big data projects need to avoid this kind of mistake and be implemented not only from the information system perspective but also from the organizational perspective.
Implementing Big Data Systems in Organizations
The work in [13] provides a layered view of a big data system. To make the complexity of a big data system manageable, it can be decomposed into a layered structure according to a conceptual hierarchy. The layers are the "infrastructure layer" with raw ICT resources; the "computing layer", which encapsulates various data tools into a middleware layer that runs over the raw ICT resources; and the "application layer", which exploits the interfaces provided by the programming models to implement various data analysis functions and develop field-related applications in different organizations. Different scholars have considered the system development life cycle of a big data project. Based on IBM's three phases for building big data projects, the work in [4] proposed a holistic view for implementing big data projects.
Phase 1. Planning: involves global strategy elaboration, where the main idea is that the most important thing to consider is not the technology but the business objectives.
Phase 2. Implementation: this phase is divided into 1) data collection from the major big data sources; 2) data preprocessing, i.e., data cleaning for valid data, integration of different data types and sources, transformation (mapping data elements from source to destination systems) and reduction of data into a smaller structure (sometimes with data discretization as a part of it); 3) smart data analysis, i.e., using advanced analytics to extract value from a huge set of data and applying advanced algorithms to perform complex analytics on either structured or unstructured data; and 4) representation and visualization for guiding the analysis process and presenting the results in a meaningful way.
Phase 3. Post-implementation: this phase involves 1) an actionable and timely insight extraction stage, based on the nature of the organization and the value the organization is seeking, which decides the success or failure of the big data project; and 2) an evaluation stage, in which the big data project is evaluated, taking into account the diverse data inputs, their quality and the expected results.
Based on this big data project life cycle, organizations can develop their own big data projects. The best way to implement big data projects is to use technologies from both before and after big data, e.g. use both Hadoop and the data warehouse, because they complement each other. The US government considers "all content as data" when implementing big data projects. In the digital era, data has the power to change the world and needs careful implementation.
Big Data Core Techniques for Organizations
There are generally two types of processing in big data, batch processing and real-time processing, based on the domain nature of the organization. The foundation of big data technology is the MapReduce model [14] by Google, created for processing batch workloads of their user data. It is based on a scale-out model of commodity servers. Later, real-time processing models such as Twitter's Storm and Yahoo's S4 appeared because of the near-real-time, real-time and stream processing requirements of organizations. The core of the MapReduce model is the power of the "divide and conquer" method, distributing jobs over clusters of commodity servers in two steps, Map and Reduce [14]. Jobs are divided and distributed over the cluster, and the completed jobs (intermediate results) from the Map phase are sent to the Reduce phase to perform the required operations. In the MapReduce paradigm, the Map function performs filtering and sorting and the Reduce function carries out grouping and aggregation operations. There are many implementations of the MapReduce algorithm, both open source and proprietary. Among the open source frameworks, the most prominent one is "Hadoop", with two main components, the "MapReduce engine" and the "Hadoop Distributed File System (HDFS)". In an HDFS cluster, files are broken into blocks that are stored in the DataNodes; the NameNode maintains metadata about these file blocks and keeps track of the operations of the DataNodes [7]. MapReduce provides scalability through distributed execution and reliability by reassigning failed jobs [9]. Beyond the MapReduce engine and HDFS, Hadoop has a wide ecosystem, with Hive for warehousing, Pig for queries, YARN for resource management, Sqoop for data transfer, Zookeeper for coordination, and many others. The Hadoop ecosystem will continue to grow as new big data systems appear according to the needs of different organizations. Organizations of an interactive nature that require short response times need real-time processing. Although MapReduce is the dominant batch processing model, real-time processing models are still competing with each other, each with their own competitive advantages.
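The divide-and-conquer idea behind MapReduce can be sketched as a toy, single-machine word count in plain Python (this is an illustration of the Map and Reduce phases, not Hadoop's actual API): documents are distributed to worker processes, each Map call emits (word, 1) pairs, a shuffle step groups the pairs by key, and Reduce aggregates each group.

```python
# Toy MapReduce word count: Map emits (key, value) pairs, a shuffle groups
# them by key, and Reduce aggregates each group. Single-machine sketch only.
from collections import defaultdict
from multiprocessing import Pool

def map_phase(document):
    """Map: filtering/sorting-style step that emits (word, 1) for every word."""
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(item):
    """Reduce: grouping/aggregation step that sums the counts for one word."""
    word, counts = item
    return word, sum(counts)

documents = ["big data for organizations", "big data big value"]

if __name__ == "__main__":
    with Pool() as pool:
        mapped = pool.map(map_phase, documents)      # Map jobs distributed over workers

    grouped = defaultdict(list)                      # shuffle: group intermediate pairs by key
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)

    print(dict(map(reduce_phase, grouped.items())))  # e.g. {'big': 3, 'data': 2, ...}
```

In Hadoop, the same two functions would run on the MapReduce engine over blocks stored in HDFS, with failed tasks reassigned automatically.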
"Storm" is a prominent big data technology for real-time processing; its most famous user is Twitter. Different from MapReduce, Storm uses a topology, which is a graph of spouts and bolts connected with stream groupings. Storm consumes data streams, which are unbounded sequences of tuples, splits the consumed streams, and processes these split data streams. The processed data stream is consumed again, and this process is repeated until the operation is halted by the user. A spout acts as a source of streams in a topology, and a bolt consumes streams and produces new streams, with spouts and bolts executing in parallel [15]. There are other real-time processing tools for big data, such as Yahoo's S4 (Simple Scalable Streaming System), which is based on a combination of the actor model and the MapReduce model. S4 works with Processing Elements (PEs) that consume keyed data events. Messages are transmitted between PEs in the form of data events; each PE's state is inaccessible to other PEs, and event emission and consumption is the only mode of interaction between PEs. Processing Nodes (PNs) are the logical hosts of PEs and are responsible for listening to events, executing operations on the incoming events, dispatching events with the assistance of the communication layer, and emitting output events [16]. There is no clear winner among stream processing models, and organizations can use the data models that are consistent with their work. Regardless of batch or real-time processing, there are many open source and proprietary software frameworks for big data. Open source big data frameworks include Hadoop, HPCC (High Performance Computing Cluster), etc. [7]. Many proprietary big data tools, such as IBM BigInsights, Accumulo and Microsoft Azure, have been successfully used in many business areas of different organizations. Big data tools and libraries are now also available in languages such as Python and R for many different kinds of specific organizations.
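The spout/bolt model described above can be mimicked with Python generators. This is a conceptual sketch only, not Apache Storm's real API (which is JVM-based and also involves topology configuration and stream groupings omitted here): a spout produces an unbounded stream of tuples, and each bolt consumes one stream and emits a new one.

```python
# Conceptual spout/bolt sketch: a spout emits an unbounded stream of tuples
# and bolts transform that stream step by step, forming a tiny linear topology.
import itertools
import random

def sensor_spout():
    """Spout: source of an (in principle unbounded) stream of tuples."""
    for i in itertools.count():
        yield {"reading_id": i, "temp_c": round(random.uniform(15, 35), 1)}

def filter_bolt(stream, threshold):
    """Bolt: consumes one stream and emits a new one (only hot readings)."""
    for t in stream:
        if t["temp_c"] > threshold:
            yield t

def alert_bolt(stream):
    """Bolt: enriches each tuple with an alert message and re-emits it."""
    for t in stream:
        yield {**t, "alert": f"temperature {t['temp_c']} C above limit"}

# Wire the 'topology'; a real topology runs until halted by the user,
# here we just take the first five alerts for demonstration.
topology = alert_bolt(filter_bolt(sensor_spout(), threshold=30))
for alert in itertools.islice(topology, 5):
    print(alert)
```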
CONCLUSION
Big data is a very wide and multi-disciplinary field which requires collaboration between different research areas and organizations from various sectors. Big data may change the traditional ETL process into an Extract-Load-Transform (ELT) process, as big data gives more advantages in moving algorithms near where the data exist. Like other information systems, the
success of big data projects depends on organizational resistance to change. Organizational structure, people, tasks and information technologies need to change simultaneously to get the desired results. Based on the layered view of big data [13], big data projects can be implemented with a step-by-step roadmap [4]. Big data sources will vary based on the past, present and future of organizations and information systems. Big data has the power to change the landscape of organizations and information systems because of its nature, which differs from traditional paradigms. Using big data technologies can give organizations an overall advantage, with better efficiency and effectiveness. The future of big data will be the digital nervous system of the organization, where every possible system needs to consider big data as a must-have technology. The data age is here.
ACKNOWLEDGEMENTS
I want to express my gratitude to my supervisor, Professor Wang Zhao Shun, for encouraging me and giving suggestions for improving my paper.
REFERENCES
1. Manyika, J., et al. (2011) Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, San Francisco, CA, USA.
2. Laudon, K.C. and Laudon, J.P. (2012) Management Information Systems: Managing the Digital Firm. 13th Edition, Pearson Education, US.
3. White House (2012) Fact Sheet: Big Data across the Federal Government.
4. Mousanif, H., Sabah, H., Douiji, Y. and Sayad, Y.O. (2014) From Big Data to Big Projects: A Step-by-Step Roadmap. International Conference on Future Internet of Things and Cloud, 373-378.
5. Oracle Enterprise Architecture White Paper (March 2016) An Enterprise Architect's Guide to Big Data: Reference Architecture Overview.
6. Laney, D. (2001) 3D Data Management: Controlling Data Volume, Velocity and Variety. Gartner Report.
7. Sagiroglu, S. and Sinanc, D. (2013) Big Data: A Review. International Conference on Collaboration Technologies and Systems (CTS), 42-47.
8. de Roos, D., Zikopoulos, P.C., Melnyk, R.B., Brown, B. and Coss, R. (2012) Hadoop for Dummies. John Wiley & Sons, Inc., Hoboken, New Jersey, US.
9. Grolinger, K., Hayes, M., Higashino, W.A., L'Heureux, A., Allison, D.S. and Capretz, M.A.M. (2014) Challenges of MapReduce in Big Data. IEEE 10th World Congress on Services, 182-189.
10. Hurwitz, J.S., Nugent, A., Halper, F. and Kaufman, M. (2012) Big Data for Dummies. 1st Edition, John Wiley & Sons, Inc., Hoboken, New Jersey, US.
11. Han, J., Kamber, M. and Pei, J. (2006) Data Mining: Concepts and Techniques. 3rd Edition, Elsevier (Singapore).
12. Data Lake. https://en.m.wikipedia.org/wiki/Data_lake
13. Hu, H., Wen, Y.G., Chua, T.-S. and Li, X.L. (2014) Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. IEEE Access, 2, 652-687. https://doi.org/10.1109/ACCESS.2014.2332453
14. Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 107-113. https://doi.org/10.1145/1327452.1327492
15. Storm Project. http://storm.apache.org/releases/2.0.0-SNAPSHOT/Concepts.html
16. Neumeyer, L., Robbins, B., Nair, A. and Kesari, A. (2010) S4: Distributed Stream Computing Platform. 2010 IEEE International Conference on Data Mining Workshops (ICDMW). https://doi.org/10.1109/ICDMW.2010.172
Chapter 10
Application Research of Big Data Technology in Audit Field
Guanfang Qiao WUYIGE Certified Public Accountants LLP, Wuhan, China
ABSTRACT
The era of big data has brought great changes to various industries, and the innovative application of big data-related technologies shows obvious advantages. The introduction and application of big data technology in the audit field has likewise become a future development trend. Compared with the traditional mode of audit work, the application of big data technology can help achieve better results, which requires an adaptive transformation and adjustment of audit work. This paper gives a brief analysis of the application of big data technology in the audit field: it first introduces the characteristics of big data and its technical applications, then points out the new requirements for audit work in the era of big data, and finally discusses how to apply big data technology in the audit field, hoping that it can serve as a reference.
Citation: Qiao, G. (2020), “Application Research of Big Data Technology in Audit Field”. Theoretical Economics Letters, 10, 1093-1102. doi: 10.4236/tel.2020.105064. Copyright: © 2020 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http:// creativecommons.org/licenses/by/4.0
Keywords: Big Data, Technology, Audit, Application
INTRODUCTION
With the rapid development of information technology in today's world, the amount of data and information is getting larger and larger, which presents the characteristics of big data. Big data refers to a collection of data that cannot be captured, managed and processed by conventional software tools within a certain time. It is a massive, high-growth, diversified information asset that requires new processing models to provide greater decision-making power, insight and process-optimization capabilities. Against the background of the big data era, all walks of life should actively adapt in order to bring about positive change. With the development of the new social economy, audit work faces higher requirements. Traditional audit methods and concepts can hardly remain adequate, and many problems and defects easily appear. Positive changes should therefore be made, and the proper and scientific integration of big data technology is an effective measure that deserves high attention. Of course, the application of big data technology in the audit field does face considerable difficulties; for example, the development of audit software and the establishment of audit analysis models need to be adjusted at multiple levels in order to give full play to the application value of big data technology.
OVERVIEW OF BIG DATA TECHNOLOGY
Big data technology is a set of technical means emerging with the development of the big data era. It mainly involves big data platforms, big data index systems and other related technologies, and has been applied well in many fields. Big data refers to massive data whose corresponding information cannot be intuitively observed and used. Its acquisition, storage, analysis and application face high difficulties, which gives the supporting technologies strong application significance and makes them an important topic attracting growing attention in the current information age. From the point of view of big data itself, in addition to the obvious characteristic of large volume, it is often characterized by obvious diversity, rapidity, complexity and low value density. This inevitably brings great difficulty to the application of these massive data and puts forward higher requirements for the application of big data technology, which needs to be paid high attention (Ingrams, 2019).
In the era of big data, the core issue is not obtaining massive data and information, but how to conduct professional analysis and processing of this massive information so that it can play its due role and deliver value. This requires strengthening research on big data technology, so that all fields can realize optimized analysis and processing of massive data and information with the assistance of big data technology and meet their original application requirements. Among the big data technologies currently developed and applied, data mining, massively parallel processing databases, distributed databases, extensible storage systems and cloud computing are commonly used. These big data technologies can be effectively applied to the acquisition, storage, analysis and management of massive information. Big data has ushered in a major era of transformation, which has changed our lives, our work and even our thinking. More and more industries maintain a very optimistic attitude towards the application of big data, and more and more users are trying, or considering, how to use big data to solve problems and improve their business. With the gradual advance of digitization, big data will become the fourth strategy that enterprises can choose, after the three traditional competitive strategies of cost leadership, differentiation and focus.
REQUIREMENTS ON AUDITING IN THE ERA OF BIG DATA
Change Audit Objectives
In the era of big data, in order to better realize the flexible application of big data technology in the auditing field, it is necessary to pay close attention to the characteristics of this era and to carry out an adaptive transformation, so as to create good conditions for the application of big data technology. Audit work should first pay attention to the effective transformation of its own objectives, and the tasks of audit work need to be gradually broadened in order to improve the value of audit work and give play to the application value of big data technology. In addition to finding all kinds of abnormal clues in the audit target and controlling illegal behaviors, audit work also needs to promote the development of the audited entity, which requires supporting the optimization of the relevant operating systems and playing an active role in risk assessment and benefit
improvement, better explore the laws of development, and provide reference value for decision analysis (Gepp, 2018).
Change Audit Content

The transformation of the audit field must also address the specific content of audits; changing the audit content is a basic requirement for applying big data technology, and the technology chosen must suit that content. Under the big data background, audit work faces far more complicated material: not only the simple figures of the past, such as amounts and expenses, but also text, audio and video information. Richer content inevitably makes analysis and processing harder and places higher requirements on big data technology. It also requires future audit work to collect data that is as rich and detailed as possible, so that audit tasks can be completed and the objectives described above achieved with the technology's assistance (Alles, 2016).
Change Audit Thinking

The audit field must also change its thinking, which is the key premise for audit staff to adapt actively to the new situation. Only when audit staff adopt new audit thinking can they apply big data technology flexibly, handle the rich practical content of the big data era, and ultimately enhance the value of auditing. The change in thinking is mainly reflected in three aspects. First, staff need to move from sampling-based auditing to comprehensive auditing, analyzing all of the information to avoid omissions. Second, the demand for precision should be gradually relaxed (Shukla, 2019): because big data has a relatively low value density, which can affect data accuracy, it needs to be handled with appropriate big data techniques. Third, audit thinking needs to shift from causal relationships to correlation relationships, placing emphasis on exploring the correlations between different factors and indicators so as to inform decision-making and other work.
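To make the shift toward correlation analysis concrete, the following is a minimal sketch (not from the original study) of computing pairwise correlations between a few hypothetical audit indicators with pandas; the column names and figures are invented for illustration.

```python
import pandas as pd

# Hypothetical audit indicators; in practice these would be extracted
# from the auditee's information systems.
df = pd.DataFrame({
    "monthly_expense": [120, 135, 150, 160, 210, 140],
    "reimbursement_count": [30, 33, 37, 40, 61, 35],
    "overtime_hours": [100, 110, 118, 125, 190, 112],
})

# Pairwise Pearson correlations between indicators; coefficients close to
# +1 or -1 flag relationships worth a closer audit look.
corr_matrix = df.corr()
print(corr_matrix)
```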
Change Audit Techniques

The transformation of the audit field must also be embodied at the technical level. Because audit content is more complex and comes in many types, traditional audit technology can no longer achieve satisfactory results, and the technical means must be innovated and optimized. Given the higher requirements of audit work under this new situation, the relevant analysis technologies should meet the following conditions. First, they should be suitable for analyzing complex data, able to handle a variety of data types comprehensively so that inappropriate analysis techniques are avoided. Second, they should be intuitive, making use of visual analysis so that audit results can be presented clearly for reference and application. In addition, audit technologies in the big data era often need to emphasize data mining, extracting valuable clues from massive data and significantly improving the speed of mining and analysis to match the characteristics of big data.
APPLICATION OF BIG DATA TECHNOLOGY IN AUDIT FIELD

Data Mining Analysis

The application of big data technology in the audit field should focus on data mining analysis, which differs clearly from the data verification analysis of the traditional audit mode and makes more efficient use of data. Previously, auditors usually drew random samples from the collected financial data and checked them one by one for obvious abnormalities, relying mainly on query analysis, multi-dimensional analysis and similar means. With big data technology, data warehouses, data mining, prediction and analysis can be used to process massive data comprehensively and to explore the laws hidden in it (Harris, 2020). Commonly used methods include classification analysis, correlation analysis, cluster analysis and sequence analysis. This transformation of data analysis further raises the value of audit work: it is no longer confined to verifying problems, but seeks out additional relationships between data, so that the discovered correlations give the data stronger application value and prevent it from being wasted (Sookhak, 2017). In a financial loan audit, for example, such data mining analysis can be used to classify all the data associated with loans, better distinguishing non-performing loans from normal loans and providing a reference for subsequent loan business.
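As an illustration of the loan-classification idea described above, the sketch below trains a decision tree to separate normal from non-performing loans. It is a minimal, hedged example using scikit-learn with invented toy records; it is not the method used in any of the cited studies.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Toy loan records: [loan_amount, debt_to_income_ratio, days_past_due]
X = np.array([
    [50_000, 0.20, 0], [120_000, 0.35, 5], [80_000, 0.55, 40],
    [30_000, 0.15, 0], [200_000, 0.60, 90], [90_000, 0.45, 20],
    [60_000, 0.25, 2], [150_000, 0.70, 120],
])
# 1 = non-performing loan, 0 = normal loan (labels are invented)
y = np.array([0, 0, 1, 0, 1, 0, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit a shallow decision tree and report how well it separates the classes.
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```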
Real-Time Risk Warning

Applying big data technology in the audit field also supports risk prevention: possible risk factors can be analyzed and identified more conveniently and efficiently, warnings can be issued in time, major accidents caused by risk problems can be avoided, and economic losses can be controlled. This is a clear advance over the traditional audit mode, which generally only found problems and reported clues to violations after the fact and could rarely provide risk warnings. Big data technology is typically applied continuously: it can dynamically analyze data as it is enriched and updated, continuously monitor the audit target and track changes in its status, and then give timely feedback and early warning on abnormalities so that the relevant personnel can take preventive measures. Follow-up (continuous) auditing therefore needs to be promoted gradually so that audit work can be optimized with the help of big data technology. The follow-up audit mode places higher requirements on the data analysis platform: technical innovation is needed to establish a comprehensive audit data analysis platform that uses the Internet and information technology, creating favorable conditions for big data technology and avoiding hidden problems in data collection. In land tax auditing, for example, the value of the follow-up audit mode is prominent: staff need full access to the huge volume of provincial land tax information, which is linked and updated in real time, so that they can perform follow-up audit analysis, identify abnormal problems promptly and issue timely warnings to resolve them.
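One simple way such continuous monitoring and early warning could be implemented is a rolling-baseline threshold check, sketched below with pandas. The figures and the three-standard-deviation rule are illustrative assumptions, not a prescription from the audit literature.

```python
import pandas as pd

# Hypothetical daily tax-revenue figures streamed from the auditee's system.
revenue = pd.Series([100, 102, 98, 101, 99, 103, 97, 160, 100, 101])

# Rolling baseline built only from past observations (shifted by one day).
baseline_mean = revenue.rolling(window=5).mean().shift(1)
baseline_std = revenue.rolling(window=5).std().shift(1)

# Flag days that deviate more than 3 standard deviations from the baseline.
z_score = (revenue - baseline_mean) / baseline_std
alerts = revenue[z_score.abs() > 3]
print(alerts)  # the day with value 160 is flagged for follow-up audit
```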
Multi-Domain Data Fusion

To realize the full value of big data technology in the audit field, attention must also be paid to the extensive collection and collation of data, so that judgments can be made from multiple angles and incomplete data does not undermine the analysis. With big data technology, audit work often involves cross-analysis of several different databases, which poses a higher challenge: the technology must support cross-database analysis and use appropriate tools to identify possible abnormalities. Fusing data from multiple domains, that is, processing several databases together, yields richer and more detailed analysis results and plays a stronger role in subsequent applications. For example, analyzing and clarifying China's macroeconomic and social risks usually requires the combined analysis of government debt audit data, macroeconomic operation data, social security data and financial industry data, so that more accurate results can be obtained and early risk warnings issued. Economic responsibility audits likewise place high demands on multi-domain data fusion: data must be obtained from finance, social security, industry and commerce, housing management, taxation, public security and education, integrated effectively, and analyzed through horizontal correlation and vertical comparison in order to define economic responsibility and support adjustment (Xiao, 2020).
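A minimal sketch of multi-domain data fusion is shown below: extracts from several hypothetical domain databases are joined on a shared entity key with pandas, and a simple cross-domain ratio is computed as an audit clue. Table names, columns and values are all invented for illustration.

```python
import pandas as pd

# Hypothetical extracts from three domain databases, keyed by a shared
# entity identifier.
finance = pd.DataFrame({"entity_id": [1, 2, 3], "reported_revenue": [500, 800, 300]})
tax = pd.DataFrame({"entity_id": [1, 2, 3], "tax_paid": [60, 20, 40]})
social_security = pd.DataFrame({"entity_id": [1, 2, 3], "employees": [50, 200, 30]})

# Fuse the domains into one analysis table (cross-database association).
fused = finance.merge(tax, on="entity_id").merge(social_security, on="entity_id")

# A simple cross-domain check: unusually low tax paid relative to revenue
# may be a clue worth verifying in the follow-up audit.
fused["tax_ratio"] = fused["tax_paid"] / fused["reported_revenue"]
print(fused.sort_values("tax_ratio"))
```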
Build a Large Audit Team

Applying big data technology in the audit field is not only a technical innovation but also a transformation of organizational mode and personnel structure, so that audit work can adapt to the new situation and avoid serious defects. Previous audit work, for instance, was clearly siloed: audits conducted department by department could find some problems, but they could hardly produce a comprehensive and detailed audit result, so their value was limited and needed to be improved with big data technology.
Building a large audit team has therefore become an important application mode. Future audit work should rely on a large audit team whose organizational structure is divided by function, such as leadership decision-making, data analysis and problem verification, so that follow-up audit work can proceed in an orderly way. A leading group, for example, can steer implementation of the audit plan and make decisions for the audit as a whole. A data analysis group is needed to analyze the massive data about the target with the help of diverse big data technologies, finding clues and problems and exploring rules and relationships. The clues and rules discovered then need to be examined further by a problem verification team and checked against the actual situation so that the audit task can be completed (Castka et al., 2020). This large audit team mode gives full play to the value of big data technology, avoids interference from organizational factors, and is an important trend in the future optimization of the audit field. Making the most of a large audit team also requires attention to the individual auditors, ensuring that all staff are highly competent. Auditors must not only master big data-related technical means but also adopt big data thinking, embrace the transformation and avoid obstacles caused by human factors. It is therefore important to provide the necessary education and training for audit staff, explaining in detail the concepts, technologies and new audit modes of big data so that they can adapt to the new situation.
Data Analysis Model and Audit Software Development

As an important development trend in the audit field, the application of big data technology shows clear advantages in many respects. Because audit work is complex, however, the technology must keep pace with the times and be targeted if it is to improve its service value. Future audit work should therefore focus on developing data analysis models and professional audit software, creating good conditions for applying big data technology. First, in-depth and comprehensive analysis in the audit field requires a thorough grasp of all the audit objectives and tasks in the industry, so that the corresponding data analysis models and special software can be developed purposefully and applied more efficiently and conveniently in subsequent audit work. The query analysis, mining analysis and multi-dimensional analysis involved in auditing, for example, each need to be matched with a corresponding data analysis model to improve execution. Audit software development must also take multiple functions into account: besides discovering and clarifying defects in the audit target, the software should provide risk warning, so that the audit function is realized and the effect of big data technology is highlighted.
Cloud Audit

The application of big data technology in the auditing field is also moving toward cloud auditing, one of the important manifestations of the big data era. Big data technologies are closely tied to cloud computing and are difficult to separate from it: making good use of big data requires the cloud computing model to realize distributed processing, cloud storage and virtualization, which facilitate the efficient use of massive data and resolve data-handling problems. Future work should therefore also attend to building a cloud audit platform to optimize and implement audit work. Constructing a cloud audit platform requires the full use of big data technology, intelligent technology, Internet technology and other information means to store, analyze and apply massive data in an orderly way, while also sharing that data in an orderly way to enhance its application value. The cross-database analysis mentioned above, for example, can be carried out on a cloud audit platform with higher overall processing efficiency, meeting the trend of increasingly difficult audits. The cloud audit mode also allows remote storage and analysis of data, which makes audit work noticeably more convenient, removes the original constraints of location, and strengthens data sharing among the relevant organizations, thereby solving the problem of information silos (Appelbaum, 2018).
RISK ANALYSIS OF BIG DATA AUDIT

Although big data auditing plays an important role in improving both the audit working mode and audit efficiency, several risks in data acquisition, use and management still deserve attention:
Data Acquisition and Collation Risks

Data acquisition risks arise in two main ways. On the one hand, there is no effective means of verifying the auditee's data, so its integrity and authenticity cannot be guaranteed and can only be confirmed through later extended investigation. On the other hand, the quality of collected data may be poor, and large amounts of invalid data seriously degrade the quality of the analysis. Data collected outside the auditee, for instance from network media and social networking sites, also carries high risk. As for data collation, many audit institutions have collected data from a number of industries, but data standards and formats differ across industries, and even within the same industry the formats used by different organizations vary widely. In the absence of a unified audit data standard, collation is difficult, and multi-domain data association analysis remains hard to apply in practice.
Data Analysis and Usage Risks

The risk of data analysis lies mainly in the analytical thinking and methods of the auditors. When auditors are unfamiliar with the business or weak in data modeling, they are likely to make logical errors in the analysis, biasing the results. As for data usage, factors such as data authenticity, integrity and the logical association of data tables often cause the analysis results to deviate considerably from the actual situation. Using these results directly therefore carries considerable risk, and auditors need to treat them with caution.
Data Management Risk

Data management risks mainly take the form of data loss, disclosure and destruction during storage and transmission. The data collected during an audit involves many industries; its loss or disclosure would cause great losses to the relevant units and damage the authority and credibility of the audit institution. The most important data management risk concerns storage equipment, for example the loss of auditors' computers and mobile storage media, weak disaster protection in the machine room, and insufficient encryption of data networks; these should be the key areas of attention in preventing data management risk.
CONCLUSION

In short, the introduction and application of big data technology has become an important trend in the innovative development of the audit field in China. With big data, audit work shows clear advantages and more prominent functions. The integration of big data technology into auditing should therefore be explored from multiple perspectives in the future, striving to innovate and optimize audit concepts, organizational structure, auditors and specific technologies so as to create good conditions for the technology's application. This paper has mainly discussed how big data technology transforms the traditional audit working mode and how it is applied in practice. Because big data has only recently been applied to auditing, the research here is necessarily preliminary. As global economic integration deepens, multi-directional and multi-field data fusion will make audit work more complex, so big data auditing will become the norm and provide better reference for decision-making.
REFERENCES
1. Alles, M., & Gray, G. L. (2016). Incorporating Big Data in Audits: Identifying Inhibitors and a Research Agenda to Address Those Inhibitors. International Journal of Accounting Information Systems, 22, 44-59. https://doi.org/10.1016/j.accinf.2016.07.004
2. Appelbaum, D. A., Kogan, A., & Vasarhelyi, M. A. (2018). Analytical Procedures in External Auditing: A Comprehensive Literature Survey and Framework for External Audit Analytics. Journal of Accounting Literature, 40, 83-101. https://doi.org/10.1016/j.acclit.2018.01.001
3. Castka, P., Searcy, C., & Mohr, J. (2020). Technology-Enhanced Auditing: Improving Veracity and Timeliness in Social and Environmental Audits of Supply Chains. Journal of Cleaner Production, 258, Article ID: 120773. https://doi.org/10.1016/j.jclepro.2020.120773
4. Gepp, A., Linnenluecke, M. K., O'Neill, T. J., & Smith, T. (2018). Big Data Techniques in Auditing Research and Practice: Current Trends and Future Opportunities. Journal of Accounting Literature, 40, 102-115. https://doi.org/10.1016/j.acclit.2017.05.003
5. Harris, M. K., & Williams, L. T. (2020). Audit Quality Indicators: Perspectives from Non-Big Four Audit Firms and Small Company Audit Committees. Advances in Accounting, 50, Article ID: 100485. https://doi.org/10.1016/j.adiac.2020.100485
6. Ingrams, A. (2019). Public Values in the Age of Big Data: A Public Information Perspective. Policy & Internet, 11, 128-148. https://doi.org/10.1002/poi3.193
7. Shukla, M., & Mattar, L. (2019). Next Generation Smart Sustainable Auditing Systems Using Big Data Analytics: Understanding the Interaction of Critical Barriers. Computers & Industrial Engineering, 128, 1015-1026. https://doi.org/10.1016/j.cie.2018.04.055
8. Sookhak, M., Gani, A., Khan, M. K., & Buyya, R. (2017). WITHDRAWN: Dynamic Remote Data Auditing for Securing Big Data Storage in Cloud Computing. Information Sciences, 380, 101-116. https://doi.org/10.1016/j.ins.2015.09.004
9. Xiao, T. S., Geng, C. X., & Yuan, C. (2020). How Audit Effort Affects Audit Quality: An Audit Process and Audit Output Perspective. China Journal of Accounting Research, 13, 109-127. https://doi.org/10.1016/j.cjar.2020.02.002
SECTION 3: DATA MINING METHODS
Chapter 11
A Short Review of Classification Algorithms Accuracy for Data Prediction in Data Mining Applications
Ibrahim Ba'abbad, Thamer Althubiti, Abdulmohsen Alharbi, Khalid Alfarsi, Saim Rasheed
Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, KSA
ABSTRACT

Many business applications rely on their historical data to predict their future. The process of marketing products is one of the core processes of any business, and customer needs provide useful information for marketing the appropriate products at the appropriate time. Services, too, are increasingly treated as products: the development of education and health services depends on historical data, and reducing problems and crime on online social media networks requires a significant source of information. Data analysts need an efficient classification algorithm to predict the future of such businesses. However, dealing with a huge quantity of data requires a great deal of processing time. Data mining offers many useful techniques for predicting statistical data in a variety of business applications, and classification is one of the most widely used, with a variety of algorithms. In this paper, various classification algorithms are reviewed in terms of accuracy in different areas of data mining applications. A comprehensive analysis is made after a dedicated reading of 20 papers from the literature. The paper aims to help data analysts choose the most suitable classification algorithm for different business applications, including business in general, online social media networks, agriculture, health, and education. Results show that FFBPN is the most accurate algorithm in the business domain, the Random Forest algorithm is the most accurate for classifying online social network (OSN) activities, the Naïve Bayes algorithm is the most accurate for classifying agriculture datasets, OneR is the most accurate for classifying instances in the health domain, and the C4.5 Decision Tree algorithm is the most accurate for classifying students' records to predict degree completion time.

Keywords: Data Prediction Techniques, Accuracy, Classification Algorithms, Data Mining Applications

Citation: Ba'abbad, I., Althubiti, T., Alharbi, A., Alfarsi, K. and Rasheed, S. (2021), "A Short Review of Classification Algorithms Accuracy for Data Prediction in Data Mining Applications". Journal of Data Analysis and Information Processing, 9, 162-174. doi: 10.4236/jdaip.2021.93011.

Copyright: © 2021 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0.
INTRODUCTION

Decision-makers in the business sector are always concerned about the future of their business. Since data collections form the core resource of information, digitalizing business activities helps to collect operational data in enormous stores known as data warehouses. Data analysts can use these historical data to predict the future behavior of the business. However, dealing with a huge quantity of data requires a great deal of processing time. Data mining (DM) is a technique that uses information technology and statistical methods to search large databases for potentially valuable information that can support administrative decisions. DM matters because it converts data into useful information and knowledge automatically and intelligently. Enterprises also use data mining to understand how companies operate and to analyze the potential value of their information, while the mined information must be protected so that company secrets are not disclosed. Kaur [1] described different data mining concepts, functionalities, materials, and mechanisms. Data mining involves the use of sophisticated data analysis tools and techniques to find previously hidden patterns and relationships that are valid in large data sets. The best-known data mining technique is association, in which a pattern is discovered based on a relationship between items in the same transaction. Clustering is a data mining technique that automatically creates useful groups of objects with comparable features. The decision tree is another of the most common data mining techniques. One of the most difficult decisions when implementing a data mining framework is knowing which method to use and when. One of the most widely implemented data mining techniques across applications, however, is classification. The classification process needs two types of data: training data and testing data. Training data are the data a data mining algorithm uses to learn the classification rules that are then applied to the other data, i.e. the testing data.

Many business applications rely on their historical data to predict their future, and the literature presents various problems that were solved by prediction with data mining techniques. In business, DM techniques are used to predict the export abilities of companies [2]. In social media applications, missing links between online social network (OSN) nodes are a frequent problem, in which a link that should exist between two nodes is missing for some reason [3]. In the agriculture sector, analyzing soil nutrients through automation and data mining promises large profits for growers [4]. Data mining has been used to improve building energy performance by determining the target multi-family housing complex (MFHC) for green remodeling [5]. In crime prevention, different data mining techniques have been used to analyze the causes of offenses and violence against women [6]. In the healthcare sector, various data mining tools have been applied to detect diseases such as breast cancer, skin diseases, and blood diseases [7]. Furthermore, data analysts in education have used data mining techniques to develop learning strategies at schools and universities [8], to detect styles of learner behavior and forecast performance [9], and to forecast a student's salary after graduation based on the student's previous record and behavior during study [10]. In general, services are also considered products.

In this paper, various classification algorithms are reviewed in terms of accuracy in different areas of data mining applications. The paper aims to help data analysts choose the most suitable classification algorithm for different business applications, including business in general, reducing online social media network problems, and developing education, health and agriculture services. The paper is organized as follows: Section 2 presents the methodology of several data mining techniques in the literature, Section 3 summarizes and discusses the results obtained from the related literature, and Section 4 presents our conclusions and recommendations for future work.
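The training/testing split described above can be illustrated with a minimal scikit-learn sketch. The bundled Iris dataset and the Gaussian Naïve Bayes classifier are stand-ins chosen for brevity; they are not the datasets or exact algorithms of the reviewed studies.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split one dataset into training data (used to learn the model) and
# testing data (used to estimate how well it classifies unseen instances).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_train, y_train)                 # learn from training data
accuracy = accuracy_score(y_test, model.predict(X_test))   # evaluate on testing data
print(f"accuracy on held-out test data: {accuracy:.2f}")
```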
METHODS IN LITERATURE

The classification technique is one of the most widely implemented data mining techniques across applications. The classification process needs two types of data: training data and testing data. Training data are the data a data mining algorithm uses to learn the classification rules that are then applied to the other data, i.e. the testing data.

Besimi et al. [11] compared three traditional classification algorithms, the k-nearest neighbor (K-NN) classifier, the Naïve Bayes (NB) classifier, and the Centroid classifier, in terms of accuracy and execution time on two text-article datasets split into training and testing data. The K-NN classifier is the slowest because it uses the whole training set as a reference when classifying testing data, whereas the Centroid classifier uses the average vector of each class as its model and is therefore much faster. In terms of accuracy, the Centroid classifier achieved the highest rate of the three.

Silva et al. [2] used several data mining techniques to predict the export abilities of a sample of 272 companies. The Synthetic Minority Oversampling Technique (SMOTE) was used to oversample the unbalanced data, and the K-means method grouped the sample into three clusters. The Generalized Regression Neural Network (GRNN) minimizes the error between the actual input data points and the regression prediction vector of the model. The Feed Forward Back Propagation Neural Network (FFBPN) is a machine learning technique that learns the input/output behavior of a dataset within an Artificial Neural Network (ANN) structure. The Support Vector Machine (SVM) classifies data according to similarities between instances. A Decision Tree (DT) represents classes as a series of yes/no questions in a tree view, and Naïve Bayes classifies data according to the probability concept of the Bayes theorem. After applying these techniques, GRNN and FFBPN were the most accurate at predicting the companies' export abilities.

Social media applications are built on the Online Social Network (OSN) concept, where missing links between nodes are a frequent problem: a link that should exist between two nodes is missing for some reason. Sirisup and Songmuang [3] used Support Vector Machine (SVM), k-Nearest Neighbor (KNN), Decision Tree (DT), Neural Network, Naïve Bayes (NB), Logistic Regression, and Random Forest to predict missing links in two Facebook datasets, one with high density (DS1) and one with low density (DS2), where high density means a very large number of links between nodes. For the high-density dataset, Random Forest gave the best performance in terms of accuracy, precision, F-measure, and area under the receiver operating characteristic curve (AUC); the low-density dataset could be predicted equally well with either Random Forest or Decision Tree. Overall, Random Forest proved to be the best prediction technique for OSN data.

Analyzing soil nutrients promises large profits for growers, and agricultural studies have been capitalizing on technical advances such as automation and data mining. Chiranjeevi and Ranjana [4] carried out a comparative analysis of two algorithms, Naïve Bayes and J48, where J48 is an improvement of the C4.5 classifier. A decision tree is a flowchart-like structure in which each inner node represents a test on an attribute, while Naïve Bayes is a simple probabilistic classifier based on the Bayesian theorem with strong naive independence assumptions; the Naïve Bayes algorithm can be adapted to predict crop growth from a soil sample.

Jeong et al. [5] developed a decision support model for determining the target multi-family housing complex (MFHC) for green remodeling using a data mining technique, which requires a careful and sensible method of evaluating building energy performance. An energy benchmark for MFHC exists in South Korea, but that study was limited to MFHC using district heating; to locate the green remodeling target, the different heating systems used in MFHC, e.g. individual, district, and central heating, must all be considered. The study addressed two issues: first, an operational rating and energy benchmark system was proposed for the different heating-system variables; second, the model for locating the green remodeling target was developed for the different building characteristics. The resulting decision support model can serve as a sensible standard for selecting MFHC for green remodeling.

Preventing offenses and violence against women is an important goal for police, and different data mining techniques have been used to analyze the causes of offenses and the relationships between them, playing important roles in crime analysis and forecasting. Kaur et al. [6] reviewed the data mining techniques used in offense forecasting and analysis and concluded that most researchers used classification and clustering techniques for analyzing and detecting offense patterns; the classification techniques used included Naïve Bayes, Decision Tree, BayesNet, J48, JRip, and OneR. Furthermore, Kumar et al. [12] examined data mining for cyber-attack issues. Many applications fall under the cybersecurity concept and need to be analyzed with data mining techniques. Secret information can be leaked when an unauthorized user breaks through security, and malicious software and viruses such as trojan horses cause security infringements that lead to antisocial activities in the world of cyber-crime; data mining techniques can restrict secret information or data to legitimate users so that unauthorized access is blocked. Thongsatapornwatana [13] surveyed the techniques used to analyze crime patterns in previous research, covering violent crime, drugs, border control, and cyber criminality. The survey showed that most of the techniques contain research gaps and failed to predict crime accurately, so crime models, analysis, and data preparation are needed to find appropriate algorithms.

Data mining in the healthcare sector is just as important as in other areas, although extracting understanding from health care records is an exacting and complex task. Mia et al. [7] reviewed the academic literature on health care data to identify the existing data mining methods and techniques. Many data mining tools have been applied to detecting diseases such as breast cancer, skin diseases, and blood diseases, and data mining is highly effective in this domain because of the rapid growth in the volume of medical data. Kaur and Bawa [14] gave researchers in medical healthcare a detailed view of popular data mining techniques so that they can work more exploratively. Knowledge discovery in databases (KDD) analyzes large volumes of data and turns them into meaningful information. Data mining techniques are a boon because they support the early, highly accurate diagnosis of medical diseases, saving time and money in work involving computers, robots, and parallel processing. Among medical diseases, cardiovascular disease is the most critical, and data mining has proved effective where accuracy is a major concern; it has also been used successfully in the treatment of various other life-threatening diseases. In another study, Parsania et al. [15] conducted a comparative analysis to find the best data mining classification techniques for healthcare data in terms of accuracy, sensitivity, precision, false-positive rate, and F-measure. Naïve Bayes, Bayesian Network, JRipper (JRip), OneRule (OneR), and PART were applied to a dataset from a health database. PART was the best in terms of precision, false-positive rate, and F-measure; OneR was the best in terms of accuracy; and Bayesian Network was the best in terms of sensitivity.

Data mining techniques are also used widely in education, where data analysts use them to develop learning strategies at schools and universities, since education serves a large part of society. Amornsinlaphachai [8] introduced a cooperative learning model that groups learners into active learning groups via the web. Artificial Neural Network (ANN), K-Nearest Neighbor (KNN), Naïve Bayes (NB), Bayesian Belief Network (BN), RIPPER (JRip), ID3, and C4.5 (J48) classification algorithms were used to predict the performance of 474 students studying computer programming at Nakhon Ratchasima Rajabhat University in Thailand, and the algorithms were compared to select the most efficient among them.
As a result, C4.5 was the most efficient algorithm for predicting students' academic performance levels in terms of measures such as correctness of the predicted data, precision, recall, F-measure, mean absolute error, and processing time. Although C4.5 does not have the lowest processing time, it achieves the highest correctness, 74.89 percent, because it is a simple and reliable algorithm, while the ID3 algorithm gets the lowest correctness. Selecting learners to form active learning groups with the introduced C4.5-based model produced better learning levels than traditional selection by instructors, supporting decisions that improve learner performance and help learners progress in their education. Jalota and Agrawal [16] applied five classification techniques to an education dataset collected through a Learning Management System (LMS): the J48 algorithm, Support Vector Machine, Naïve Bayes, Random Forest, and Multilayer Perceptron, all run in the Waikato Environment for Knowledge Analysis (WEKA). The comparison showed that the Multilayer Perceptron outperformed the other techniques, achieving the highest accuracy and performance metrics. Roy and Garg [9] present a literature survey of data mining techniques used in Educational Data Mining (EDM), where techniques are used to detect styles of learner behavior and forecast performance. They concluded that most previous research collected data for predicting student performance through questionnaires, used the Cross-Industry Standard Process for Data Mining (CRISP-DM) model, and relied on WEKA and the R tool, open-source data mining tools for statistical and data analysis. As another application in education, Khongchai and Songmuang [10] created an incentive for students by predicting the learner's future salary. Learners are often bored with academic study, which can lead to poor grades or even dropping out, because they lose the motivation that encourages them to continue. A good incentive to continue studying and to develop academically can be provided by a model that forecasts the student's salary after graduation based on the student's previous record and behavior during the study.
The data mining techniques used in this model are K-Nearest Neighbors (K-NN), Naïve Bayes (NB), the J48 decision tree, Multilayer Perceptron (MLP), and Support Vector Machines (SVM). To determine the preferable technique for predicting future salary, a test was conducted using data on students who graduated from the same university between 2006 and 2015, and the WEKA (Waikato Environment for Knowledge Analysis) tool was used to compare the outputs of the techniques. The results showed that K-NN outperformed the others, reaching 84.69 percent for recall, precision, and F-measure, followed by J48 (73.96 percent), SVM (43.71 percent), Naïve Bayes (43.63 percent), and Multilayer Perceptron (38.8 percent). A questionnaire was then distributed to 50 current students at the university to see whether the model achieved its objectives; the results indicate that the proposed model increased students' motivation and helped them to focus on continuing their studies. Sulieman and Jayakumari [17] stressed the importance of using data mining for 11th grade in Oman, where the school administration holds comprehensive student data. The goal is to decrease the student dropout rate and improve school performance: data mining techniques help students choose the appropriate mathematics subject for 11th grade, and extracting student information from end-of-term grades provides analysis that can improve student performance. Knowledge derived from data mining helps decision-makers in education make decisions that support the development of educational processes, and the results obtained from the various algorithms and data in the study confirm that student choice and performance can be predicted using data mining techniques. Academic databases can likewise be analyzed with a data mining approach to gain new, helpful knowledge. Wati et al. [18] predicted the degree-completion time of bachelor's degree students using the C4.5 and Naïve Bayes classifier algorithms, concentrating on the ranking performance of these algorithms, in particular the decision-tree-based C4.5 algorithm and the Naïve Bayes classifier with gain ratio used to find the nodes. The results show that for predicting the degree-completion time of a bachelor's degree, the C4.5 algorithm performs better, with 78 percent precision, 85 percent measured mean class precision, and 65 percent measured mean class recall. Anoop Kumar and Rahman [19] examined data mining techniques in the educational setting known as educational data mining (EDM). The possibilities for data mining in education, and the data to be harvested, are nearly limitless. Knowledge discovered by data mining can help teachers manage their classes and understand their students' learning processes, ensuring students' academic advancement and enabling interventions when progress falls short of curricular and institutional expectations; a basic advantage of this kind of analysis is that it helps establish solutions for slow learners. Such educational data mining methods are presently used to improve teaching and to predict student performance in the learning process. In short, data mining techniques turn raw data into a helpful reference in the education environment. Data mining is widely implemented in educational environments, which generate large amounts of student data that can be used for purposes such as predicting students' needs. Rambola et al. [20] compared the data mining techniques and algorithms used in different implementations and assessed their efficiency. The objectives of educational data mining fall into three categories: prediction, clustering, and relationship mining. Some of the most commonly used approaches in educational data mining are association rule mining, classification, clustering, and outlier detection; association rule mining, for example, is applied to extract patterns of failure and to recommend the best course for a student.
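The studies above compare algorithms mainly by accuracy, precision, recall and F-measure. The short sketch below shows how these metrics are computed from a set of predictions with scikit-learn; the labels are toy values for illustration only.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground-truth labels and predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # 6/8 correct = 0.75
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4
print("f-measure:", f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```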
RESULTS AND DISCUSSION

In this section, we summarize the comparison results obtained from the literature for different business applications. Table 1 compares the classification algorithms used to predict data in the business, online social media network, agriculture, health, and education domains. In [11], the k-nearest neighbors (k-NN) classifier, the Naïve Bayes (NB) classifier, and the Centroid classifier are compared.
Table 1: Comparison of classification algorithms in multiple applications
A total of 237 politics, technology, and sports news articles are used. Experiments show that the Centroid classifier is the most accurate algorithm for classifying text documents, classifying 226 of the news articles correctly. The Centroid classifier calculates the average vector for each class and uses these vectors as a reference to classify each new test instance, whereas k-NN must compare the distance of the test instance to every training instance each time. In [2], 272 companies are taken as the study sample. Five classification algorithms are used to classify the companies into three classes: Generalized Regression Neural Network (GRNN), Feed Forward Back Propagation Neural Network (FFBPN), Support Vector Machine (SVM), Decision Tree (DT), and Naïve Bayes (NB). Results show that FFBPN is the most accurate algorithm for classifying instances in the business domain, with an accuracy of 85.2 percent. Two Online Social Network (OSN) datasets, one with high density (0.05, DS1) and one with low density (0.03, DS2), both obtained with the Facebook API and containing public user information such as interests, friends, and demographics, are used to compare seven classification algorithms: Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Decision Tree (DT), Neural Networks, Naïve Bayes (NB), Logistic Regression, and Random Forest. As the results in [3] show, the Random Forest algorithm is the most accurate at classifying OSN activities, even on the high-density dataset. A dataset of 1676 soil samples with 12 attributes is classified using the J48 Decision Tree (J48 DT) and Naïve Bayes (NB) algorithms; the results in [4] show that NB is more accurate than J48 DT for agriculture datasets, classifying 98 percent of instances correctly. An experiment in the health domain classifies data on 3163 patients, as reported in [15], using the Naïve Bayes (NB), Bayesian Network (BayesNet), JRipper (JRip), OneRule (OneR), and PART algorithms; OneR is the most accurate algorithm in the health domain, with an accuracy of 99.2 percent. Random Forest, Naïve Bayes (NB), Multilayer Perceptron (MLP), Support Vector Machine (SVM), and J48 Decision Tree (J48 DT) are applied to an experimental dataset of 163 instances of students' performance; the results in [16] show that MLP is the most accurate, classifying 76.1 percent of instances correctly. 13,541 students' profiles are used as a dataset to examine five classification algorithms.
k-Nearest Neighbors (k-NN), Naïve Bayes (NB), J48 Decision Tree (J48 DT), Multilayer Perceptron (MLP), and Support Vector Machine (SVM) were compared in terms of accuracy; as the results in [10] show, k-NN is the most accurate, with an accuracy of 84.7 percent. Finally, 297 students' records were used as the dataset in [18], where two classification algorithms were applied, the C4.5 Decision Tree (C4.5 DT) and Naïve Bayes (NB); the results show that C4.5 DT is more accurate than NB for classifying students' records, classifying 78 percent of instances correctly.
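To show how such accuracy comparisons are typically run, the hedged sketch below evaluates several of the reviewed classifier families on a single public dataset with cross-validation. The dataset and default hyperparameters are illustrative choices, so the ranking will not necessarily match the results reported in the reviewed papers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# A public stand-in dataset; feature scaling is omitted for brevity.
X, y = load_breast_cancer(return_X_y=True)

classifiers = {
    "k-NN": KNeighborsClassifier(),
    "Centroid": NearestCentroid(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
}

# 5-fold cross-validated accuracy for each algorithm on the same dataset.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name:14s} mean accuracy = {scores.mean():.3f}")
```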
CONCLUSIONS AND FUTURE WORK

Data mining provides many useful techniques for predicting statistical data in a variety of business applications, and classification is one of the most widely used, with a variety of algorithms. In this paper, various classification algorithms were reviewed in terms of accuracy in different areas of data mining applications, including business in general, online social media networks, agriculture, health, and education, to help data analysts choose the most suitable classification algorithm for each application. Experiments in the reviewed literature show that the Centroid classifier is the most accurate algorithm for classifying text documents; FFBPN is the most accurate for classifying instances in the business domain; the Random Forest algorithm is the most accurate for classifying OSN activities; the Naïve Bayes algorithm is more accurate than J48 DT for agriculture datasets; OneR is the most accurate in the health domain; the Multilayer Perceptron algorithm is the most accurate for classifying students' performance datasets; the K-Nearest Neighbors algorithm is the most accurate for classifying students' profiles to increase their motivation; and the C4.5 Decision Tree algorithm is more accurate than Naïve Bayes for classifying students' records. As future work, reviewing more related papers in the domains mentioned, as well as exploring new domains, would add significantly to this work, so that the paper can serve as a reference for business data analysts.
REFERENCES
1. Harkiran, K. (2017) A Study On Data Mining Techniques And Their Areas Of Application. International Journal of Recent Trends in Engineering and Research, 3, 93-95. https://doi.org/10.23883/IJRTER.2017.3393.EO7O3
2. Silva, J., Borré, J.R., Castillo, A.P.P., Castro, L. and Varela, N. (2019) Integration of Data Mining Classification Techniques and Ensemble Learning for Predicting the Export Potential of a Company. Procedia Computer Science, 151, 1194-1200. https://doi.org/10.1016/j.procs.2019.04.171
3. Sirisup, C. and Songmuang, P. (2018) Exploring Efficiency of Data Mining Techniques for Missing Link in Online Social Network. 2018 International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP), Pattaya, 15-17 November 2018. https://doi.org/10.1109/iSAI-NLP.2018.8692951
4. Chiranjeevi, M.N. and Nadagoudar, R.B. (2018) Analysis of Soil Nutrients Using Data Mining Techniques. International Journal of Recent Trends in Engineering and Research, 4, 103-107. https://doi.org/10.23883/IJRTER.2018.4363.PDT1C
5. Jeong, K., Hong, T., Chae, M. and Kim, J. (2019) Development of a Decision Support Model for Determining the Target Multi-Family Housing Complex for Green Remodeling Using Data Mining Techniques. Energy and Buildings, 202, Article ID: 109401. https://doi.org/10.1016/j.enbuild.2019.109401
6. Kaur, B., Ahuja, L. and Kumar, V. (2019) Crime against Women: Analysis and Prediction Using Data Mining Techniques. International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, 14-16 February 2019. https://doi.org/10.1109/COMITCon.2019.8862195
7. Mia, M.R., Hossain, S.A., Chhoton, A.C. and Chakraborty, N.R. (2018) A Comprehensive Study of Data Mining Techniques in Health-Care, Medical, and Bioinformatics. International Conference on Computer, Communication, Chemical, Material and Electronic Engineering (IC4ME2), Rajshahi, 8-9 February 2018. https://doi.org/10.1109/IC4ME2.2018.8465626
8. Amornsinlaphachai, P. (2016) Efficiency of Data Mining Models to Predict Academic Performance and a Cooperative Learning Model. 8th International Conference on Knowledge and Smart Technology (KST), Chiang Mai, 3-6 February 2016. https://doi.org/10.1109/KST.2016.7440483
9. Roy, S. and Garg, A. (2017) Analyzing Performance of Students by Using Data Mining Techniques: A Literature Survey. 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics (UPCON), Mathura, 26-28 October 2017. https://doi.org/10.1109/UPCON.2017.8251035
10. Khongchai, P. and Songmuang, P. (2017) Implement of Salary Prediction System to Improve Student Motivation Using Data Mining Technique. 11th International Conference on Knowledge, Information and Creativity Support Systems (KICSS), Yogyakarta, 10-12 November 2016. https://doi.org/10.1109/KICSS.2016.7951419
11. Besimi, N., Cico, B. and Besimi, A. (2017) Overview of Data Mining Classification Techniques: Traditional vs. Parallel/Distributed Programming Models. Proceedings of the 6th Mediterranean Conference on Embedded Computing, Bar, 11-15 June 2017, 1-4. https://doi.org/10.1109/MECO.2017.7977126
12. Kumar, S.R., Jassi, J.S., Yadav, S.A. and Sharma, R. (2016) Data-Mining a Mechanism against Cyber Threats: A Review. International Conference on Innovation and Challenges in Cyber Security (ICICCS-INBUSH), Greater Noida, 3-5 February 2016. https://doi.org/10.1109/ICICCS.2016.7542343
13. Thongsatapornwatana, U. (2016) A Survey of Data Mining Techniques for Analyzing Crime Patterns. Second Asian Conference on Defence Technology (ACDT), Chiang Mai, 21-23 January 2016. https://doi.org/10.1109/ACDT.2016.7437655
14. Kaur, S. and Bawa, R.K. (2017) Data Mining for Diagnosis in Healthcare Sector: A Review. International Journal of Advances in Scientific Research and Engineering.
15. Vaishali, S., Parsania, N., Jani, N. and Bhalodiya, N.H. (2014) Applying Naïve Bayes, BayesNet, PART, JRip and OneR Algorithms on Hypothyroid Database for Comparative Analysis. International Journal of Darshan Institute on Engineering Research & Emerging Technologies, 3, 60-64.
16. Jalota, C. and Agrawal, R. (2019) Analysis of Educational Data Mining Using Classification. International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, 14-16 February 2019. https://doi.org/10.1109/COMITCon.2019.8862214
17. Al-Nadabi, S.S. and Jayakumari, C. (2019) Predict the Selection of Mathematics Subject for 11th Grade Students Using Data Mining Technique. 4th MEC International Conference on Big Data and Smart City (ICBDSC), Muscat, 15-16 January 2019. https://doi.org/10.1109/ICBDSC.2019.8645594
18. Wati, M., Haeruddin and Indrawan, W. (2017) Predicting Degree-Completion Time with Data Mining. 3rd International Conference on Science in Information Technology (ICSITech), Bandung, 25-26 October 2017. https://doi.org/10.1109/ICSITech.2017.8257209
19. Anoopkumar, M. and Zubair Rahman, A.M.J.Md. (2016) A Review on Data Mining Techniques and Factors Used in Educational Data Mining to Predict Student Amelioration. International Conference on Data Mining and Advanced Computing (SAPIENCE), Ernakulam, 16-18 March 2016.
20. Rambola, R.K., Inamke, M. and Harne, S. (2018) Literature Review: Techniques and Algorithms Used for Various Applications of Educational Data Mining (EDM). 4th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, 14-15 December 2018. https://doi.org/10.1109/CCAA.2018.8777556
Chapter 12
Different Data Mining Approaches Based Medical Text Data
Wenke Xiao1, Lijia Jing2, Yaxin Xu1, Shichao Zheng1, Yanxiong Gan1, and Chuanbiao Wen1
1 School of Medical Information Engineering, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
2 School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
ABSTRACT
The amount of medical text data is increasing dramatically. Medical text data record the progress of medicine and imply a large amount of medical knowledge. As natural language, they are semistructured, high-dimensional, and high-volume, and they cannot participate in arithmetic operations. Therefore, how to extract useful knowledge or information from the available data is a very important task. Data mining techniques can extract valuable knowledge or information from such data. In the current study, we reviewed different approaches applied to medical text data mining. The advantages and shortcomings of each technique with respect to the different stages of processing medical text data were analyzed. We also explored the applications of the algorithms, providing insights to users and enabling them to use these resources for the specific challenges in medical text data. Further, the main challenges in medical text data mining were discussed. The findings of this paper can help researchers choose reasonable techniques for mining medical text data and make them aware of the main challenges in medical text data mining.

Citation: Wenke Xiao, Lijia Jing, Yaxin Xu, Shichao Zheng, Yanxiong Gan, Chuanbiao Wen, “Different Data Mining Approaches Based Medical Text Data”, Journal of Healthcare Engineering, vol. 2021, Article ID 1285167, 11 pages, 2021. https://doi.org/10.1155/2021/1285167.

Copyright: © 2021 by Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
INTRODUCTION
The era of big data is coming, with the mass of data growing at an incredible rate. The concept of big data was first put forward at the 11th EMC World conference in 2011; it refers to large-scale datasets that cannot be captured, managed, or processed by common software tools. With the arrival of the big data age, the amount of medical text data is increasing dramatically. Analyzing this immense amount of medical text data to extract valuable knowledge or information is useful for decision support, prevention, diagnosis, and treatment in the medical world [1]. However, analyzing huge amounts of multidimensional or raw data is a very complicated and time-consuming task. Data mining has the capabilities for this matter. Data mining is a methodology for discovering novel, valuable, and useful information, knowledge, or hidden patterns from enormous datasets by using various statistical approaches. Data mining has many advantages in contrast to the traditional model of transforming data into knowledge through manual analysis and interpretation: data mining approaches are quicker, favorable, time-saving, and objective. Summarizing the various data mining approaches applied to medical text data for clinical applications is essential for health management and medical research. This paper is organized in four sections. Section 2 presents the concepts of medical text data. Section 3 includes data mining approaches and their applications in medical text data analysis. Section 4 concludes this paper and presents future works.
MEDICAL TEXT DATA
The diversity of big data is inseparable from the abundance of data sources. Medical big data, including experimental data, clinical data, and medical imaging data, are increasing with the rapid development of medicine.
Medical big data are the application of big data in the medical field, after the data related to human health and medicine have been stored, searched, shared, analyzed, and presented in innovative ways [2]. Medical text data are an important part of medical big data; they are described in natural language, cannot participate in arithmetic operations, and are characterized as semistructured, high-dimensional, and high-volume [3]. They cannot be well applied in research because they have no fixed writing format and are highly specialized [4]. Medical text data contain clinical data, medical record data, medical literature data, etc., and this type of data records the progress of medicine and implies a large amount of medical knowledge. However, using human effort to extract the facts and relationships between entities from a vast amount of medical text is time-consuming. With the development of data mining technology, applying data mining to medical text to discover the relationships it contains has become a hot topic. Medical text data mining can assist the discovery of medical information. In the COVID-19 research field, medical text mining can help decision-makers control the outbreak by gathering and collating basic scientific data and research literature related to the novel coronavirus, predicting the population susceptible to COVID-19, virus variability, and potential therapeutic drugs [5–8].
MEDICAL TEXT DATA MINING
Data mining was defined at the First International Conference on Knowledge Discovery and Data Mining in 1995 and has been widely used in disease auxiliary diagnosis, drug development, hospital information systems, and genetic medicine to facilitate medical knowledge discovery [9–12]. Data mining used to process medical text data can be divided into four steps: data collection, data processing, data analysis, and data evaluation and interpretation. This study summarizes the algorithms and tools for medical text data based on these four steps of data mining.
Data Preparation
Medical text data include electronic medical records, medical images, medical record parameters, laboratory results, and pharmaceutical antiquities, according to the different data sources. The different data were selected based on the data mining task and stored in the database for further processing.
Data Processing
The quality of the data will affect the efficiency and accuracy of data mining and the effectiveness of the final pattern. Raw medical text data contain a large amount of fuzzy, incomplete, noisy, and redundant information. Taking medical records as an example, traditional paper-based medical records have many shortcomings, such as nonstandard terms, difficulty in supporting clinical decision-making, and scattered information distribution. After the emergence of electronic medical records, medical record data have gradually been standardized [13]. However, electronic medical records are still natural language and therefore difficult to mine directly. It is thus necessary to clean and filter the data to ensure data consistency and certainty by removing missing, incorrect, noisy, and inconsistent or low-quality data. Missing values in medical text data are usually handled by deletion or interpolation. Deletion is the easiest method, but some useful information is lost. Interpolation assigns reasonable substitution values to missing values through a specific algorithm. Many algorithms have emerged for this step of data processing: multiple imputation, regression algorithms, and K-nearest neighbors are often used to supplement missing values in medical text data. The detailed algorithm information is shown in Table 1. In order to further understand the semantic relationships of medical texts, researchers have used natural language processing (NLP) techniques to perform entity naming, relationship extraction, and text classification operations on medical text data with good results [19].

Table 1: The detailed algorithm information for missing values in medical text data

Multiple imputation [14, 15]. Principle: estimate the value to be interpolated and add different noises to form multiple groups of optional interpolation values; select the most appropriate interpolation value according to a certain selection basis. Purpose: repeat the simulation to supplement the missing value.

Expectation maximization [16]. Principle: compute maximum likelihood estimates or posterior distributions with incomplete data. Purpose: supplement missing values.

K-nearest neighbors [17, 18]. Principle: select the K closest neighbors according to a distance metric and estimate missing data with the corresponding mode or mean. Purpose: estimate missing values with samples.
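As an illustration of the K-nearest neighbors entry in Table 1, the short sketch below fills missing values in a toy numeric feature matrix with scikit-learn's KNNImputer; the feature values and the choice of two neighbors are assumptions made purely for illustration and are not taken from the original study.

```python
# Minimal sketch of K-nearest-neighbour imputation for missing values,
# assuming scikit-learn is available; the toy "record" matrix below is
# invented purely for illustration.
import numpy as np
from sklearn.impute import KNNImputer

# Rows are patients, columns are hypothetical numeric features extracted
# from medical records (e.g., age, a lab value, a score); np.nan marks
# values missing from the source text.
X = np.array([
    [63.0, 7.2, np.nan],
    [54.0, np.nan, 120.0],
    [71.0, 6.9, 135.0],
    [49.0, 7.8, 118.0],
])

# Each missing entry is replaced using the k nearest rows that have the
# value observed (Euclidean distance on the observed features).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```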
Natural Language Processing
Natural language processing (NLP), as a subfield of artificial intelligence, is mainly used for Chinese word segmentation, part-of-speech tagging, parsing, natural language generation, text categorization, information retrieval, information extraction, text proofing, question answering, machine translation, automatic summarization, and textual entailment, with the advantages of fast processing and lasting effect. It affirms positive motivation without negative influence, which can effectively stimulate potential and support continuous learning, growth, and development [20]. In medical text processing, NLP is often used for information extraction and entity naming, including word segmentation, sentence segmentation, syntactic analysis, grammatical analysis, and pragmatic analysis. The schematic of natural language processing is shown in Figure 1. Kuo et al. [21] used NLP tools to extract important disease-related concepts from clinical notes, forming a multichannel processing method and improving data extraction ability. Jonnagaddala et al. [22] proposed a hybrid NLP model to identify Framingham heart failure signs and symptoms from clinical notes and electronic health records (EHRs). Trivedi et al. [23] designed an interactive NLP tool to extract information from clinical texts, which can serve clinicians well after evaluation. Datta et al. [24] evaluated NLP technology for extracting cancer information from EHRs, summarized the functions implemented by each framework, and found many repetitive parts in different NLP frameworks, resulting in a certain waste of resources. The diversification of medical text data will also bring a transformation of medical data analysis and decision support modes. Roberts and Demner-Fushman [25] manually annotated tags on 468 electronic medical records to generate a corpus, which provided corpus support for medical data mining. The development of NLP technology greatly reduces the difficulty of manual data processing in data mining. Vashishth et al. [26] used semantic type filtering to improve the linking of medical entities across all toolkits and datasets, which provided a new semantic type prediction module for the biomedical NLP pipeline. Topaz et al. [27] used an NLP-based classification system, support vector machines (SVM), recurrent neural networks (RNN), and other machine learning methods to identify diabetic patients from clinical records and reduce the manual workload in medical text data mining.
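As a hedged sketch of the kind of NLP-based classification pipeline cited above (for example, the SVM systems in [27]), the snippet below builds a TF-IDF plus linear SVM classifier over a few invented clinical notes; the note texts, labels, and parameter choices are illustrative assumptions only and do not reproduce any cited system.

```python
# Minimal sketch of an NLP-style classification pipeline over clinical
# notes, assuming scikit-learn; the tiny labelled corpus is invented
# purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

notes = [
    "patient reports polyuria and elevated fasting glucose",
    "no history of hyperglycemia, routine follow-up visit",
    "HbA1c above target, metformin dose increased",
    "annual physical, all laboratory values within normal range",
]
labels = [1, 0, 1, 0]  # 1 = diabetes-related note, 0 = other (toy labels)

# Bag-of-words features weighted by TF-IDF feed a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(notes, labels)

print(model.predict(["fasting glucose remains high despite metformin"]))
```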
Figure 1: Schematic of natural language processing flow.
Data Analysis
Data analysis applies data mining methods to extract interesting patterns. Model establishment is essential for knowledge discovery in data analysis: modeling and analysis are performed according to the characteristics of the data, the model is parametrically adjusted after an initial test, and the advantages and disadvantages of different models are analyzed to choose the final optimal model. Depending on the goal, data analysis methods for medical text data include clustering, classification, association rules, and regression. Detailed information on these methods is shown in Table 2.

Table 2: The information of analysis methods for medical text data

Clustering (K-means [28, 29]). Purpose: classify similar subjects in medical texts. Advantages: simple and fast; scalable and efficient. Shortcomings: time-consuming on large amounts of data; more restrictions on use.

Classification (ANN [30, 31]). Purpose: read medical text data for intention recognition. Advantages: solves complex mechanisms in text data; high degree of self-learning; strong fault tolerance. Shortcomings: slow training; many parameters that are difficult to adjust.

Classification (Decision tree [32, 33]). Advantages: handles continuous variables and missing values; judges the importance of features. Shortcomings: overfitting; unstable results.

Classification (Naive Bayes [34]). Advantages: easy learning process; good classification performance. Shortcomings: higher requirements for data independence.

Association rules (Apriori [35, 36]). Purpose: mine frequent items and corresponding association rules from massive medical text datasets. Advantages: simple and easy to implement. Shortcomings: low efficiency and time-consuming.

Association rules (FP-tree [37]). Advantages: reduces the number of database scans; reduces the amount of memory space. Shortcomings: high memory overhead.

Association rules (FP-growth [38]). Advantages: improves the data density structure; avoids repeated scanning. Shortcomings: harder to achieve.

Logistic regression [39]. Purpose: analyze how variables affect results. Advantages: visual understanding and interpretation. Shortcomings: easy underfitting; cannot handle a large number of multiclass features or variables; very sensitive to outliers.
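To make the clustering row of Table 2 concrete, the sketch below groups a few invented short medical texts with K-means over TF-IDF features; the corpus, the number of clusters, and the scikit-learn dependency are assumptions made only for illustration.

```python
# Minimal sketch of K-means clustering of short medical texts, assuming
# scikit-learn; the tiny corpus is invented purely for illustration.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "chest pain radiating to left arm",
    "shortness of breath and chest tightness",
    "itchy rash on both forearms",
    "localized skin rash with mild swelling",
]

# Represent each text as a TF-IDF vector, then group similar vectors.
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster index assigned to each document
```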
Artificial Neural Network
An artificial neural network (ANN) is a nonlinear prediction model that is learned by training; it has the advantages of accurate classification, self-learning, associative memory, high-speed search for the optimal solution, and good stability in data mining. An ANN mainly consists of three parts: an input layer, a hidden layer, and an output layer [40]. The input layer is
responsible for receiving external information and data. The hidden layer is responsible for processing information and constantly adjusting the connection properties between neurons, such as weights and feedback, while the output layer is responsible for outputting the calculated results. ANN differs from traditional artificial intelligence and information processing technology: it overcomes the drawbacks of traditional, logic-symbol-based artificial intelligence in processing intuitive and unstructured information, and it has the characteristics of self-adaptation, self-organization, and real-time learning. It can complete data classification, feature mining, and other mining tasks. Medical text data contain massive amounts of patient health records, vital signs, and other data. An ANN can analyze patients' rehabilitation, find patterns in patient data, predict a patient's condition or rehabilitation, and help to discover medical knowledge [41]. Several ANN mining techniques are used for medical text data, such as backpropagation and the factorization-machine-supported neural network (FNN). Information on these ANN mining techniques is shown in Table 3.

Table 3: The information of ANN mining techniques

Backpropagation [42]. Advantages: strong nonlinear mapping capability; strong generalization ability; strong fault tolerance. Shortcomings: local minimization; slow convergence; sensitivity to the choice of network structure.

Radial basis function [43]. Advantages: fast learning speed; easy to solve text data classification problems. Shortcomings: complex structure.

FNN [44]. Advantages: reduces feature engineering; improves FM learning ability. Shortcomings: limited modeling capability.
(1) ANN Core Algorithm: BP Algorithm. The backpropagation (BP) algorithm, the classical ANN algorithm, is widely used for medical text data. The BP algorithm is developed on the basis of the single-layer neural network: it uses reverse propagation of errors to adjust the weights and construct a multilayer network, so that the system can continue to learn. A BP network is a multilayered feed-forward network whose signal propagation is forward. Compared with recurrent neural
network algorithms, the reverse spread of errors makes it faster and more powerful for high-throughput microarray or sequencing data modeling [45]. BP training is mainly divided into the following two stages: (1) forward propagation, in which the actual output values of each computing unit are processed layer by layer from the input layer; (2) backpropagation, in which, when the output value does not reach the expected value, the difference between the actual output and the expected output is calculated recursively and the weights are adjusted according to the difference. The total error is defined as the sum of squared differences between the desired and actual outputs over all samples and output units:

$E = \frac{1}{2}\sum_{k=1}^{m}\sum_{t=1}^{T}\left(d_{kt}-y_{kt}\right)^{2}$ (1)

where m is the total number of samples, k indexes the samples, t indexes the output units (T is their number), $d_{kt}$ is the desired output, and $y_{kt}$ is the actual output.
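A minimal sketch of the two BP stages and the total error of equation (1) is given below in plain numpy; the network size, learning rate, and toy data are illustrative assumptions and do not reproduce any model from the cited studies.

```python
# Minimal numpy sketch of BP training: forward propagation, then backward
# error propagation with weight updates, using the sum-of-squared-errors
# objective of equation (1); all sizes and data are invented.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 4))            # 8 samples, 4 input features
D = rng.integers(0, 2, (8, 1))    # desired outputs d_kt (toy labels)

W1 = rng.normal(scale=0.5, size=(4, 5))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(5, 1))   # hidden -> output weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(200):
    # Stage (1), forward propagation: activations layer by layer
    H = sigmoid(X @ W1)
    Y = sigmoid(H @ W2)

    # Total error E = 1/2 * sum_k sum_t (d_kt - y_kt)^2
    E = 0.5 * np.sum((D - Y) ** 2)

    # Stage (2), backpropagation: propagate the output error backwards
    # and adjust the weights along the negative gradient
    delta_out = (Y - D) * Y * (1 - Y)
    delta_hid = (delta_out @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ delta_out
    W1 -= lr * X.T @ delta_hid

print(f"final total error: {E:.4f}")
```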
In clinics, the judgment of disease is often determined by the integration of multidimensional data. In the establishment of disease prediction models, BP algorithms can not only effectively classify complex data but also provide good multifunctional mapping; the relationship between data and disease can be found in the process of repeated iteration [46]. (2) Application Examples. Adaptive learning based on ANN can find the laws of medical development from massive medical text data and assist the discovery of medical knowledge. Heckerling et al. [47] combined a neural network and a genetic algorithm to predict the prognosis of patients with urinary tract infections (as shown in Figure 2). In this study, nine indexes (e.g., frequent micturition, dysuria) from 212 women with urinary tract infections were used as predictor variables for training. The relationship between the symptom and urinalysis input data and the urine culture output data was determined using ANN. The predicted results were accurate.
Figure 2: ANN algorithm analysis process.
Miotto et al. [48] derived a general-purpose patient representation from aggregated EHRs based on ANN that facilitates clinical predictive modeling given the patient status. Armstrong et al. [49] used ANN to analyze 240 microcalcifications in 220 cases of mammography. The data mining results can accurately predict whether a microcalcification in the early stage of suspected breast cancer is benign or malignant.
Naive Bayes
Naive Bayes (NB) is a classification counting method based on Bayes' theorem [50]. The conditional independence hypothesis of the NB classification algorithm assumes that attribute values are independent of each other and that positions are independent of each other [51]. Attribute-value independence means there is no dependence between terms; position independence means that the position of a term in the document has no effect on the calculation of probability. However, conditional dependence does exist among terms in medical texts, and the location of terms in documents contributes differently to classification [52]. These two independence assumptions therefore weaken NB estimation. Nevertheless, NB has been widely used on medical texts because it plays an effective role in classification decision-making. (1) Core Algorithm: NBC4D. The Naive Bayes classifier for continuous variables using a novel method (NBC4D) is a new algorithm based on NB. It classifies continuous variables into Naive Bayes classes, replaces traditional distribution techniques with alternative distribution techniques, and improves classification accuracy by selecting appropriate distribution techniques [53].
The implementation of the NBC4D algorithm is mainly divided into five steps:

(1) Gaussian distribution: $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

(2) Exponential distribution: $f(x) = \frac{1}{\alpha}\exp\left(-\frac{x}{\alpha}\right),\ x \ge 0$

(3) Kernel density estimation: $\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x-x_i}{h}\right)$

(4) Rayleigh distribution: $f(x) = \frac{x}{\theta^2}\exp\left(-\frac{x^2}{2\theta^2}\right),\ x \ge 0$

(5) NBC4D method: find the product of the probability (likelihood) of each attribute given a specific class and the probability of that class, selecting the best-fitting distribution per attribute to improve accuracy.

Here x is the input value, μ is the mean value, σ² is the variance, α is the parameter that represents the average value (μ), θ represents the standard deviation (σ), K is the Gaussian kernel function, and h is the smoothing parameter.
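The snippet below sketches the plain Gaussian case of the Naive Bayes computation in step (5), using scikit-learn's GaussianNB on invented continuous features; it is not an implementation of NBC4D itself, which additionally selects among the distribution families listed above.

```python
# Minimal sketch of Naive Bayes on continuous clinical features using the
# plain Gaussian assumption (the first distribution family listed above).
# scikit-learn's GaussianNB stands in here; it is not the NBC4D method,
# and the toy data are invented purely for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Columns: hypothetical continuous attributes (e.g., age, a lab value).
X = np.array([[45, 5.1], [62, 7.9], [50, 5.4], [70, 8.3], [38, 4.9], [66, 7.5]])
y = np.array([0, 1, 0, 1, 0, 1])   # toy class labels

clf = GaussianNB().fit(X, y)

# For a new case the classifier multiplies the class prior by the
# per-attribute Gaussian likelihoods, as in step (5) above.
print(clf.predict([[58, 7.0]]), clf.predict_proba([[58, 7.0]]))
```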
(2) Application Examples. Ehsani-Moghaddam et al. [54] adopted electronic medical records (EMRs) extracted from the Canadian Primary Care Sentinel Surveillance Network, used the Naive Bayes algorithm to classify disease features, and found that the Naive Bayes classifier was an effective algorithm to help physicians diagnose Hunter syndrome and optimize patient management (as shown in Figure 3). In order to predict angiographic outcomes, Golpour et al. [55] used the NB algorithm to process hospital medical records and assessment scales and found that the NB model with three variables had the best performance and could well support physician decision-making.
Figure 3: NB algorithm analysis process.
Decision Tree
The decision tree is a tree structure in which each nonleaf node represents a test on a feature attribute, each branch represents the output of the feature attribute over a certain value domain, and each leaf node stores a category [56]. The process of using a decision tree to make a decision is to start from
the root node, test the corresponding characteristic attributes of the item to be classified, select the output branch according to the attribute value until a leaf node is reached, and finally take the category stored in the leaf node as the decision result [57]. The advantages of decision tree learning algorithms include good interpretability, the ability to process various types of data (categorical and numerical), white-box modeling, robust performance against noise, and the capacity to process large datasets. Medical text data are complex [58]. For instance, electronic medical record data include not only disease characteristics but also patient age, gender, and other characteristic data. Since the construction of a decision tree starts from a single node and the training data set is divided into several subsets according to the attributes of the decision node, the decision tree algorithm can deal with different data types and general attributes at the same time, which is an advantage for complex medical text data processing [59]. The construction of a decision tree is mainly divided into two steps: classification attribute selection and tree pruning. The most common algorithm is C4.5 [60]. (1) Core Algorithm: C4.5. Several decision tree algorithms have been proposed, such as ID3 and C4.5. The famous ID3 algorithm proposed by Quinlan in 1986 has the advantages of clear theory, a simple method, and strong learning ability. Its disadvantages are that it is only effective for small datasets and is sensitive to noise; when the training data set grows, the decision tree may change accordingly, and when selecting test attributes, ID3 tends to select attributes with more values. In 1993, Quinlan proposed the C4.5 algorithm based on ID3 [61]. Compared with ID3, C4.5 overcomes the bias toward attributes with many values in attribute selection, prunes the tree during construction, and processes incomplete data. It uses the gain ratio as the selection standard for each node attribute in the decision tree [62]. In particular, its extension S-C4.5-SMOTE can not only overcome the problem of data distortion but also improve overall system performance; its mechanism aims to effectively reduce the amount of data without distortion by maintaining balanced datasets and technical smoothness.
The processing formulas are as follows:

$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p(x_i)\log_{2} p(x_i), \qquad \mathrm{Gain}(S,A) = \mathrm{Entropy}(S) - \sum_{v}\frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$ (2)

where n is the number of classes, $p(x_i)$ represents the proportion of samples of class $x_i$, A is the feature used to divide data set S, and $|S_v|/|S|$ is the proportion of the number of samples in the subset $S_v$ to the total number of samples. (2) Application Examples. Decision tree algorithms can construct specific decision trees for multiattribute datasets and obtain feasible results in reasonable time, so they serve as a good method for data classification in medical text data mining. Byeon [63] used the C4.5 algorithm to develop a depression prediction model for Korean dementia caregivers based on a secondary analysis of the 2015 Korean Community Health Survey (KCHS) results, with an effective prediction rate of 70%. The overall research idea is shown in Figure 4.
Figure 4: C4.5 algorithm application flow.
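A small plain-Python sketch of the entropy and information-gain computation in formula (2) is shown below; the toy labels and the candidate split are invented for illustration, and C4.5 would further divide the gain by the split information to obtain the gain ratio.

```python
# Minimal sketch of the entropy and information-gain calculation of
# formula (2); the labels and the hypothetical split are invented.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, groups):
    # groups: labels of S partitioned by the values of a feature A
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

S = ["depressed", "depressed", "not", "not", "not", "depressed"]
# Partition of S induced by a hypothetical feature, e.g. "subjective stress"
split = [["depressed", "depressed", "not"], ["not", "not", "depressed"]]

print(round(entropy(S), 3), round(information_gain(S, split), 3))
```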
Wei et al. [64] selected reports from the Chinese spontaneous report database from 2010 to 2011 and used a decision tree to calculate the classification of adverse drug reaction (ADR) signals. Zheng et al. [65] adopted a decision tree algorithm to construct a basic data framework: 300 records were randomly selected from the EHRs of 23,281 diabetic patients to classify the type of diabetes. The performance of the framework was good, and the classification accuracy was as high as 98%.
However, decision tree algorithms have difficulty dealing with missing values, and there are many missing values in medical text data due to its high complexity. Therefore, when various types of data are inconsistent, decision tree algorithms will produce information deviation and cannot obtain correct results.
Association Rules
Association rules are often sought for very large datasets, and efficient algorithms for them are highly valued. They are used to discover correlations in large amounts of data and reflect dependent or related knowledge between events [66]. Medical text data contain a large number of associations, such as the association between symptoms and diseases and the relationship between drugs and diseases. Mining medical text data with an association rule algorithm is conducive to discovering the potential links in medical text data and promoting the development of medicine. Association rules are expressions of the form X ⇒ Y. There are two key measures over the transaction database: (1) Support{X ⇒ Y}: the ratio of the number of transactions containing both X and Y to all transactions; (2) Confidence{X ⇒ Y}: the ratio of the number of transactions containing both X and Y to the number of transactions containing X. Given a transaction data set, mining association rules means generating the association rules whose support and confidence are greater than the minimum support and minimum confidence given by the user, respectively. (1) Core Algorithm: Apriori. The Apriori algorithm is the earliest and most classic algorithm. It uses an iterative search method to find the relationships between items in the database layer by layer. The process consists of connection (a class of matrix operation) and pruning (removing unnecessary intermediate results). In this algorithm, an item set is a set of items: a set containing K items is a K-item set, and the item set frequency is the number of transactions that contain the item set. If an item set satisfies the minimum support, it is called a frequent item set. The Apriori algorithm finds the largest item set in two steps: (1) count the occurrence frequency of each item set and find the item sets whose support is not less than the minimum support, forming the one-dimensional maximum item set; (2) loop until no larger item set is generated. (2) Application Examples. Association rules are a data mining approach usually used to explore and interpret large transactional
datasets to identify unique patterns and rules. They are often used to predict the correlation between index data and diseases. Exarchos et al. [67] proposed an automation method based on association rules, used an association rule algorithm to classify and model electrocardiographic (ECG) data, and monitored ischemic beats in ECG over the long term. In this study, the specific application process of association rules is shown in Figure 5.
Figure 5: Application process of association rules.
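The support and confidence measures defined above can be sketched in a few lines of Python; the toy symptom "transactions" below are invented for illustration, and this counting step is what Apriori repeats while growing frequent item sets.

```python
# Minimal sketch of the support and confidence measures for a rule
# X => Y over a tiny invented set of symptom "transactions".
transactions = [
    {"cough", "fever", "fatigue"},
    {"cough", "fever"},
    {"fever", "headache"},
    {"cough", "fatigue"},
    {"cough", "fever", "headache"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# Rule {cough} => {fever}
print(support({"cough", "fever"}, transactions))       # 0.6
print(confidence({"cough"}, {"fever"}, transactions))  # 0.75
```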
Hrovat et al. [68] combined association rule mining, which was designed for mining large transaction datasets, with model-based recursive partitioning to predict temporal trends (e.g., behavioral patterns) for subgroups of patients based on discharge summaries. In the correlation analysis between adverse drug reaction events and drug treatment, Chen et al. [69] used the Apriori algorithm to explore the relationship between adverse events and drug treatment in patients with non-small-cell lung cancer, showing a promising method to reveal the risk factors of adverse events in the process of cancer treatment. In the association between drugs and diseases, Lu et al. [70] used the Apriori algorithm to find herbal combinations for the treatment of uremic pruritus from Chinese herb bath therapy and to explore the core drugs.
Model Evaluation
The classifications generated by data mining models on test sets are not necessarily optimal, which can lead to test set classification errors. In order to obtain a good data model, it is very important to evaluate the model. The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are common evaluation methods in medical text data mining. The ROC curve has TPR (sensitivity, also called recall) on the y-axis and FPR (1 - specificity) on the x-axis; the higher the TPR and the smaller the FPR, the better the model. AUC is defined as the area under the ROC curve, that is, the integral of the ROC curve, and its value is at most 1. If we randomly select a positive sample and a negative sample, the probability that the classifier scores the positive sample higher than the negative sample is the AUC value. Pourhoseingholi et al. [71] used the AUC method to evaluate a prognosis model for rectal cancer patients and found that the prediction accuracy of the random forest (RF) and BN models was high.
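As a hedged sketch of this evaluation step, the snippet below computes an ROC curve and its AUC with scikit-learn on invented labels and scores.

```python
# Minimal sketch of ROC/AUC evaluation, assuming scikit-learn; the true
# labels and predicted scores are invented purely for illustration.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]  # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # x-axis: FPR, y-axis: TPR
print("AUC =", roc_auc_score(y_true, y_score))      # area under the ROC curve
```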
DISCUSSION
Data mining is useful for extracting novel and usable information or knowledge from medical text data. This paper reviewed several research works on mining medical text data, organized around the four steps of data mining, which should help researchers choose reasonable approaches for mining medical text data. However, several difficulties in medical text data mining must also be considered. First, the lack of a publicly available annotation database affects the development of data mining to a certain extent, due to differences in medical information records and descriptions among countries; the information components are highly heterogeneous and the data quality is not uniform. Ultimately, this creates a key obstacle, an annotation bottleneck, in medical text data [72]. At present, the international standards include ICD (International Classification of Diseases), SNOMED CT (The Systematized Nomenclature of Human and Veterinary Medicine Clinical Terms), CPT (Current Procedural Terminology), DRG (Diagnosis-Related Groups), LOINC (Logical Observation Identifiers Names and Codes), MeSH (Medical Subject Headings), MDDB (Main Drug Database), and UMLS (Unified Medical Language System). Yet there are few corpora in the field of medical text. In the last 10 years, natural language processing has undergone a truly revolutionary paradigm shift, and many new technologies have been applied to the extraction of information from natural language. Many scholars have established corpora for particular diseases. However, medical entities are closely related; a single corpus cannot segment the data accurately, and keyword information is easily omitted. Second, medical text records differ across countries. For example, Ayurvedic medicine, traditional Arab-Islamic medicine, and traditional Malay medicine from India, the Middle East, and Malaysia have problems such as inconsistent treatment descriptions, complex treatment methods, and difficulty in statistical analysis, leading to great difficulty in medical data mining [73]. At the same time, the informatization of traditional medicine is insufficient. For example, the traditional North American Indigenous medical literature mainly involves clinical efficacy evaluation and disease application, which is complicated in its recording
methods, leading to difficulty in data mining [74]. Chinese medical texts have linguistic particularities: unlike English expressions, Chinese words are not separated from each other, which increases the difficulty of data analysis. In terms of semantics, Chinese medical texts have problems such as polysemy, synonymy, ambiguity of expression, complex relationships, and a lack of clear correlations. Building a standard database from these data is very difficult and requires very advanced and complex algorithms. In addition, electronic medical records contain personal privacy information, and clinical electronic medical record data will sometimes inevitably be used in medical text data mining; therefore, the protection of patient privacy data is also an issue that needs attention in data mining. In future work, we will attempt to establish and popularize medical text data standards with the help of intelligent agents and construct publicly available annotation databases for the mining of medical text data.
ACKNOWLEDGMENTS
This work was supported by the National Natural Science Foundation of China (81703825), the Sichuan Science and Technology Program (2021YJ0254), and the Natural Science Foundation Project of the Education Department of Sichuan Province (18ZB01869).
REFERENCES
1. R. J. Oskouei, N. M. Kor, and S. A. Maleki, “Data mining and medical world: breast cancers’ diagnosis, treatment, prognosis and challenges [J],” American Journal of Cancer Research, vol. 7, no. 3, pp. 610–627, 2017. 2. Y. Zhang, S.-L. Guo, L.-N. Han, and T.-L. Li, “Application and exploration of big data mining in clinical medicine,” Chinese Medical Journal, vol. 129, no. 6, pp. 731–738, 2016. 3. B. Polnaszek, A. Gilmore-Bykovskyi, M. Hovanes et al., “Overcoming the challenges of unstructured data in multisite, electronic medical record-based abstraction,” Medical Care, vol. 54, no. 10, pp. e65–e72, 2016. 4. E. Ford, M. Oswald, L. Hassan, K. Bozentko, G. Nenadic, and J. Cassell, “Should free-text data in electronic medical records be shared for research? A citizens’ jury study in the UK,” Journal of Medical Ethics, vol. 46, no. 6, pp. 367–377, 2020. 5. S. M. Ayyoubzadeh, S. M. Ayyoubzadeh, H. Zahedi, M. Ahmadi, and S. R. Niakan Kalhori, “Predicting COVID-19 incidence through analysis of Google Trends data in Iran: data mining and deep learning pilot study,” JMIR Public Health and Surveillance, vol. 6, no. 2, Article ID e18828, 2020. 6. X. Ren, X. X. Shao, X. X. Li et al., “Identifying potential treatments of COVID-19 from Traditional Chinese Medicine (TCM) by using a data-driven approach,” Journal of Ethnopharmacology, vol. 258, no. 1, Article ID 12932, 2020. 7. E. Massaad and P. Cherfan, “Social media data analytics on telehealth during the COVID-19 pandemic,” Cureus, vol. 12, no. 4, Article ID e7838, 2020. 8. J. Dong, H. Wu, D. Zhou et al., “Application of big data and artificial intelligence in COVID-19 prevention, diagnosis, treatment and management decisions in China,” Journal of Medical Systems, vol. 45, no. 9, p. 84, 2021. 9. L. B. Moreira and A. A. Namen, “A hybrid data mining model for diagnosis of patients with clinical suspicion of dementia [J],” Computer Methods and Programs in Biomedicine, vol. 165, no. 1, pp. 39–49, 2018.
10. S. Vilar, C. Friedman, and G. Hripcsak, “Detection of drug-drug interactions through data mining studies using clinical sources, scientific literature and social media,” Briefings in Bioinformatics, vol. 19, no. 5, pp. 863–877, 2018. 11. H. S. Cha, T. S. Yoon, K. C. Ryu et al., “Implementation of hospital examination reservation system using data mining technique,” Healthcare informatics research, vol. 21, no. 2, pp. 95– 101, 2015. 12. B. L. Gudenas, J. Wang, S.-z. Kuang, A.-q. Wei, S. B. Cogill, and L.-j. Wang, “Genomic data mining for functional annotation of human long noncoding RNAs,” Journal of Zhejiang University - Science B, vol. 20, no. 6, pp. 476–487, 2019. 13. R. S. Evans, “Electronic health records: then, now, and in the future,” Yearbook of medical informatics, vol. Suppl 1, no. Suppl 1, pp. S48–S61, 2016. 14. P. C. Austin, I. R. White, D. S. Lee, and S. van Buuren, “Missing data in clinical research: a tutorial on multiple imputation,” Canadian Journal of Cardiology, vol. 37, no. 9, pp. 1322–1331, 2021. 15. L. Yu, L. Liu, and K. E. Peace, “Regression multiple imputation for missing data analysis,” Statistical Methods in Medical Research, vol. 29, no. 9, pp. 2647–2664, 2020. 16. P. C. Chang, C. L. Wang, F. C. Hsiao et al., “Sacubitril/valsartan vs. angiotensin receptor inhibition in heart failure: a real‐world study in Taiwan,” ESC heart failure, vol. 7, no. 5, pp. 3003–3012, 2020. 17. E. Tavazzi, S. Daberdaku, R. Vasta, C. Andrea, C. Adriano, and D. C. Barbara, “Exploiting mutual information for the imputation of static and dynamic mixed-type clinical data with an adaptive k-nearest neighbours approach,” BMC Medical Informatics and Decision Making, vol. 20, no. Suppl 5, p. 174, 2020. 18. A. Idri, I. Kadi, I. Abnane, and J. L. Fernandez-Aleman, “Missing data techniques in classification for cardiovascular dysautonomias diagnosis,” Medical, & Biological Engineering & Computing, vol. 58, no. 11, pp. 2863–2878, 2020. 19. C. Wang, C. Yao, P. Chen, S. Jiamin, G. Zhe, and Z. Zheying, “Artificial intelligence algorithm with ICD coding technology guided by the embedded electronic medical record system in medical record
information management,” Journal of Healthcare Engineering, vol. 2021, Article ID 3293457, 9 pages, 2021. 20. K. Kreimeyer, M. Foster, A. Pandey et al., “Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review,” Journal of Biomedical Informatics, vol. 73, pp. 14–29, 2017. 21. T. T. Kuo, P. Rao, C. Maehara et al., “Ensembles of NLP tools for data element extraction from clinical notes,” AMIA Annual Symposium Proceedings, vol. 2016, pp. 1880–1889, 2017. 22. J. Jonnagaddala, S.-T. Liaw, P. Ray, M. Kumar, N.-W. Chang, and H.-J. Dai, “Coronary artery disease risk assessment from unstructured electronic health records using text mining,” Journal of Biomedical Informatics, vol. 58, no. Suppl, pp. S203–S210, 2015. 23. G. Trivedi, E. R. Dadashzadeh, R. M. Handzel, W. C. Wendy, V. Shyam, and H. Harry, “Interactive NLP in clinical care: identifying incidental findings in radiology reports,” Applied Clinical Informatics, vol. 10, no. 4, pp. 655–669, 2019. 24. S. Datta, E. V. Bernstam, and K. Roberts, “A frame semantic overview of NLP-based information extraction for cancer-related EHR notes [J],” Journal of Biomedical Informatics, vol. 100, no. 1, pp. 03–301, 2019. 25. K. Roberts and D. Demner-Fushman, “Annotating logical forms for EHR questions [J]. LREC,” International Conference on Language Resources & Evaluation: [proceedings] International Conference on Language Resources and Evaluation, vol. 2016, no. 3, pp. 772–778, 2016. 26. S. Vashishth, D. Newman-Griffis, R. Joshi, D. Ritam, and P. R. Carolyn, “Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets,” Journal of Biomedical Informatics, vol. 121, no. 10, pp. 38–80, 2021. 27. M. Topaz, L. Murga, O. Bar-Bachar, M. McDonald, and K. Bowles, “NimbleMiner,” CIN: Computers, Informatics, Nursing, vol. 37, no. 11, pp. 583–590, 2019. 28. D. M. Maslove, T. Podchiyska, and H. J. Lowe, “Discretization of continuous features in clinical datasets,” Journal of the American Medical Informatics Association, vol. 20, no. 3, pp. 544–553, 2013.
29. P. Yildirim, L. Majnarić, O. Ekmekci, and H. Andreas, “Knowledge discovery of drug data on the example of adverse reaction prediction,” BMC Bioinformatics, vol. 15, no. Suppl 6, p. S7, 2014. 30. H. Ayatollahi, L. Gholamhosseini, and M. Salehi, “Predicting coronary artery disease: a comparison between two data mining algorithms,” BMC Public Health, vol. 19, no. 1, p. 448, 2019. 31. M. Reiser, B. Wiebner, B. Wiebner, and J. Hirsch, “Neural-network analysis of socio-medical data to identify predictors of undiagnosed hepatitis C virus infections in Germany (DETECT),” Journal of Translational Medicine, vol. 17, no. 1, p. 94, 2019. 32. M. A. Rahman, B. Honan, T. Glanville, P. Hough, and K. Walker, “Using data mining to predict emergency department length of stay greater than 4 hours: d,” Emergency Medicine Australasia, vol. 32, no. 3, pp. 416–421, 2020. 33. J.-A. Lee, K.-H. Kim, D.-S. Kong, S. Lee, S.-K. Park, and K. Park, “Algorithm to predict the outcome of mdh spasm: a data-mining analysis using a decision tree,” World neurosurgery, vol. 125, no. 5, pp. e797–e806, 2019. 34. A. Awaysheh, J. Wilcke, F. Elvinger, L. Rees, W. Fan, and K. L. Zimmerman, “Review of medical decision support and machinelearning methods,” Veterinary pathology, vol. 56, no. 4, pp. 512–525, 2019. 35. X. You, Y. Xu, J. Huang et al., “A data mining-based analysis of medication rules in treating bone marrow suppression by kidneytonifying method [J]. Evidence-based complementary and alternative medicine,” eCAM, no. 1, p. 907848, 2019. 36. A. Atashi, F. Tohidinezhad, S. Dorri et al., “Discovery of hidden patterns in breast cancer patients, using data mining on a real data set,” Studies in Health Technology and Informatics, vol. 262, no. 1, pp. 42–45, 2019. 37. Z. Luo, G. Q. Zhang, and R. Xu, “Mining patterns of adverse events using aggregated clinical trial results [J],” AMIA Joint Summits on Translational Science proceedings AMIA Joint Summits on Translational Science, vol. 2013, no. 1, pp. 12–16, 2013. 38. X. Li, G. Liu, W. Chen, Z. Bi, and H. Liang, “Network analysis of autistic disease comorbidities in Chinese children based on ICD-10 codes,” BMC Medical Informatics and Decision Making, vol. 20, no. 1, p. 268, 2020.
39. M. M. Liu, L. Wen, Y. J. Liu, C. Qiao, T. L. Li, and M. C. Yong, “Application of data mining methods to improve screening for the risk of early gastric cancer,” BMC Medical Informatics and Decision Making, vol. 18, no. Suppl 5, p. 121, 2018. 40. Y.-c. Wu and J.-w. Feng, “Development and application of artificial neural network,” Wireless Personal Communications, vol. 102, no. 2, pp. 1645–1656, 2018. 41. A. Ramesh, C. Kambhampati, J. Monson, and P. Drew, “Artificial intelligence in medicine,” Annals of the Royal College of Surgeons of England, vol. 86, no. 5, pp. 334–338, 2004. 42. Y. Liang, Q. Li, P. Chen, L. Xu, and J. Li, “Comparative study of back propagation artificial neural networks and logistic regression model in predicting poor prognosis after acute ischemic stroke,” Open Medicine, vol. 14, no. 1, pp. 324–330, 2019. 43. S. Y. Park and S. M. Kim, “Acute appendicitis diagnosis using artificial neural networks [J]. Technology and health care,” Official Journal of the European Society for Engineering and Medicine, vol. 23, no. Suppl 2, pp. S559–S565, 2015. 44. R.-J. Kuo, M.-H. Huang, W.-C. Cheng, C.-C. Lin, and Y.-H. Wu, “Application of a two-stage fuzzy neural network to a prostate cancer prognosis system,” Artificial Intelligence in Medicine, vol. 63, no. 2, pp. 119–133, 2015. 45. L. Liu, T. Zhao, M. Ma, and Y. Wang, “A new gene regulatory network model based on BP algorithm for interrogating differentially expressed genes of Sea Urchin,” SpringerPlus, vol. 5, no. 1, p. 1911, 2016. 46. T. J. Cleophas and T. F. Cleophas, “Artificial intelligence for diagnostic purposes: principles, procedures and limitations [J],” Clinical Chemistry and Laboratory Medicine, vol. 48, no. 2, pp. 159–165, 2010. 47. P. Heckerling, G. Canaris, S. Flach, T. Tape, R. Wigton, and B. Gerber, “Predictors of urinary tract infection based on artificial neural networks and genetic algorithms,” International Journal of Medical Informatics, vol. 76, no. 4, pp. 289–296, 2007. 48. R. Miotto, L. Li, and B. A. Kidd, “Deep patient: an unsupervised representation to predict the future of patients from the electronic health,” Records [J]. Scientific reports, vol. 6, no. 2, p. 6094, 2016. 49. A. J. Armstrong, M. S. Marengo, S. Oltean et al., “Circulating t cells from patients with advanced prostate and breast cancer display both
epithelial and mm,” Molecular Cancer Research, vol. 9, no. 8, pp. 997–1007, 2011. 50. D. V. Lindley, “Fiducial distributions and Bayes’ theorem,” Journal of the Royal Statistical Society: Series B, vol. 20, no. 1, pp. 102–107, 1958. 51. S. Uddin, A. Khan, M. E. Hossain, and M. A. Moni, “Comparing different supervised machine learning algorithms for disease prediction,” BMC Medical Informatics and Decision Making, vol. 19, no. 1, p. 281, 2019. 52. H. H. Rashidi, N. K. Tran, E. V. Betts, P. H. Lydia, and G. Ralph, “Artificial intelligence and machine learning in pathology: the present landscape of supervised methods,” Academic Pathology, vol. 6, 2019. 53. P. Yildirim and D. Birant, “Naive Bayes classifier for continuous variables using novel method (NBC4D) and distributions,” in Proceedings of the 2014 IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA), pp. 110–115, IEEE, Alberobello, Italy, June 2014. 54. B. Ehsani-Moghaddam, J. A. Queenan, J. Mackenzie, and R. V. Birtwhistle, “Mucopolysaccharidosis type II detection by Naïve Bayes Classifier: an example of patient classification for a rare disease using electronic medical records from the Canadian Primary Care Sentinel Surveillance Network,” PLoS One, vol. 13, no. 12, Article ID e0209018, 2018. 55. P. Golpour, M. Ghayour-Mobarhan, A. Saki et al., “Comparison of support vector machine, naïve bayes and logistic regression for assessing the necessity for coronary angiography,” International Journal of Environmental Research and Public Health, vol. 17, no. 18, 2020. 56. D. Che, Q. Liu, K. Rasheed, X. Tao, and T. Xiuping, “Decision tree and ensemble learning algorithms with their applications in bioinformatics,” Advances in Experimental Medicine & Biology, pp. 191–199, 2011. 57. L. O. Moraes, C. E. Pedreira, S. Barrena, A. Lopez, and A. Orfao, “A decision-tree approach for the differential diagnosis of chronic lymphoid leukemias and peripheral B-cell lymphomas,” Computer Methods and Programs in Biomedicine, vol. 178, pp. 85–90, 2019. 58. W. Oh, M. S. Steinbach, M. R. Castro et al., “Evaluating the impact of data representation on EHR-based analytic tasks,” Studies in Health Technology and Informatics, vol. 264, no. 2, pp. 88–92, 2019.
59. C. T. Nakas, N. Schütz, M. Werners, and A. B. Leichtle, “Accuracy and calibration of computational approaches for inpatient mortality predictive modeling,” PLoS One, vol. 11, no. 7, Article ID e0159046, 2016. 60. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993. 61. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, 1993. 62. G. Franzese and M. Visintin, “Probabilistic ensemble of deep information networks,” Entropy, vol. 22, no. 1, p. 100, 2020. 63. H. Byeon, “Development of depression prediction models for caregivers of patients with dementia using decision tree learning algorithm,” International Journal of Gerontology, vol. 13, no. 4, pp. 314–319, 2019. 64. J.-X. Wei, J. Wang, Y.-X. Zhu, J. Sun, H.-M. Xu, and M. Li, “Traditional Chinese medicine pharmacovigilance in signal detection: decision tree-based data classification,” BMC Medical Informatics and Decision Making, vol. 18, no. 1, p. 19, 2018. 65. T. Zheng, W. Xie, L. Xu et al., “A machine learning-based framework to identify type 2 diabetes through electronic health records,” International Journal of Medical Informatics, vol. 97, pp. 120–127, 2017. 66. R. Veroneze, T. Cruz, S. Corbi et al., “Using association rule mining to jointly detect clinical features and differentially expressed genes related to chronic inflammatory diseases,” PLoS One, vol. 15, no. 10, Article ID e0240269, 2020. 67. T. P. Exarchos, C. Papaloukas, D. I. Fotiadis, and L. K. Michalis, “An association rule mining-based methodology for automated detection of ischemic ECG beats,” IEEE Transactions on Biomedical Engineering, vol. 53, no. 8, pp. 1531–1540, 2006. 68. G. Hrovat, G. Stiglic, P. Kokol, and M. Ojsteršek, “Contrasting temporal trend discovery for large healthcare databases,” Computer Methods and Programs in Biomedicine, vol. 113, no. 1, pp. 251–257, 2014. 69. W. Chen, J. Yang, H. L. Wang, Y. F. Shi, H. Tang, and G. H. Li, “Discovering associations of adverse events with pharmacotherapy in patients with non-small cell lung cancer using modified Apriori
algorithm,” BioMed Research International, vol. 2018, no. 12, Article ID 1245616, 10 pages, 2018. 70. P. H. Lu, J. L. Keng, K. L. Kuo, F. W. Yu, C. T. Yu, and Y. K. Chan, “An Apriori algorithm-based association rule analysis to identify herb combinations for treating uremic pruritus using Chinese herbal bath therapy,” Evidence-based Complementary and Alternative Medicine: eCAM, vol. 2020, no. 8, Article ID 854772, 9 pages, 2020. 71. M. Mlakar, P. E. Puddu, M. Somrak, S. Bonfiglio, and M. Luštrek, “Mining telemonitored physiological data and patient-reported outcomes of congestive heart failure patients,” PLoS One, vol. 13, no. 3, Article ID e0190323, 2018. 72. I. Spasic and G. Nenadic, “Clinical text data in machine learning: systematic review,” JMIR Medical Informatics, vol. 8, no. 3, Article ID e17984, 2020. 73. R. R. R. Ikram, M. K. A. Ghani, and N. Abdullah, “An analysis of application of health informatics in Traditional Medicine: a review of four Traditional Medicine Systems,” International Journal of Medical Informatics, vol. 84, no. 11, pp. 988–996, 2015. 74. N. Redvers and B. Blondin, “Traditional Indigenous medicine in North America: a scoping review,” PLoS One, vol. 15, no. 8, Article ID e0237531, 2020.
Chapter 13
Data Mining in Electronic Commerce: Benefits and Challenges
Mustapha Ismail, Mohammed Mansur Ibrahim, Zayyan Mahmoud Sanusi, and Muesser Nat
Management Information Systems Department, Cyprus International University, Haspolat, Lefkoşa via Mersin, Turkey
ABSTRACT Huge volume of structured and unstructured data which is called big data, nowadays, provides opportunities for companies especially those that use electronic commerce (e-commerce). The data is collected from customer’s internal processes, vendors, markets and business environment. This paper presents a data mining (DM) process for e-commerce including the three common algorithms: association, clustering and prediction. It also highlights some of the benefits of DM to e-commerce companies in terms of merchandise planning, sale forecasting, basket analysis, customer relationship management and market segmentation which can be achieved
Citation: Ismail, M., Ibrahim, M., Sanusi, Z. and Nat, M. (2015), "Data Mining in Electronic Commerce: Benefits and Challenges". International Journal of Communications, Network and System Sciences, 8, 501-509. doi: 10.4236/ijcns.2015.812045. Copyright: © 2015 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0.
with the three data mining algorithms. The main aim of this paper is to review the application of data mining in e-commerce by focusing on structured and unstructured data collected through various resources and cloud computing services, in order to justify the importance of data mining. Moreover, this study evaluates certain challenges of data mining, such as spider identification, data transformations and making data models comprehensible to business users. Other challenges, namely supporting the slowly changing dimensions of data and making data transformation and model building accessible to business users, are also evaluated. The paper also provides a clear guide for e-commerce companies sitting on huge volumes of data, showing how to exploit that data for business improvement, which in return will make them highly competitive among their competitors.
Keywords: Data Mining, Big Data, E-Commerce, Cloud Computing
INTRODUCTION
Data mining in e-commerce is all about integrating statistics, databases and artificial intelligence, together with related subjects, to form a new idea or a new integrated technology for the purpose of better decision making. Data mining as a whole is believed to be a good promoter of e-commerce. Presently, applying data mining to e-commerce has become a hot topic among businesses [1]. Data mining in cloud computing is the process of extracting structured information from unstructured or semi-structured web data sources. From a business point of view, the core concept of cloud computing is to render computing resources in the form of services to users, who buy them whenever they are in demand [2]. The end product of data mining creates an avenue for decision makers to track their customers' purchasing patterns, demand trends and locations, making their strategic decisions more effective for the betterment of their business. This can bring down the cost of inventory together with other expenses and maximize the overall profit of the company. With the wide availability of the Internet, 21st century companies highly utilize online tools and technologies for various reasons. Therefore, today many companies buy and sell through e-commerce, and the need for developing e-commerce applications by experts who take responsibility for running and maintaining the services is increasing. When businesses grow, the required resources for e-commerce maintenance may increase beyond the level the enterprise can handle. In that regard, data mining can
be used to handle e-commerce enterprise services and explore patterns for online customers so companies can boost sales and the general productivity of the business [3]. However, the cost of running such services is a challenge to almost all e-commerce companies. Therefore, cloud computing becomes a game changer in the way companies transact their businesses by offering comprehensive, scalable and flexible services over the Internet. Cloud computing provides a new breakthrough for enterprises, offering a service model that includes network storage, information resource sharing, on-demand access to information and processing mechanisms. It is possible to provide data mining software via cloud computing, which gives e-commerce companies the opportunity to centralize their software management and data storage with assurance of reliable, efficient and protected services to their users, which in turn cuts their cost and increases their profit [4]. Cloud computing is a technology that has to do with accessing products and services in the cloud without shouldering the burden of hosting or delivering these services. It can also be viewed as a "model that enables flexible on-demand network access to a shared pool of configurable computing resources like networks, servers, storage, applications and services that can be speedily provisioned and released with minimal management effort or service provider interaction". In cloud computing, everything is considered as a service. There are three service delivery models of cloud computing. Infrastructure as a Service (IaaS) is responsible for fundamental computing resources such as storage, processing and networks, and also some standardized services over the networks. The second is Platform as a Service (PaaS), which provides abstractions together with services for developing, testing, hosting and maintaining applications in a complex, managed environment. The third one is Software as a Service (SaaS), where the entire application or service is delivered over the web through a browser or via an application programming interface (API); with this service model the consumers only need to focus on administering users of the system. One of the most important applications of cloud computing is its storage capability. Cloud storage can cluster different types of storage equipment by employing cluster systems, grid technology or distributed systems in the network to provide external data storage and access services through software applications. Cloud computing in e-commerce is the idea of paying for bandwidth and storage space on a scale that depends on usage. It operates on a utility, on-demand basis whereby a
user pays less with pay-per-use models. Most e-commerce companies welcome the idea as it eliminates the high cost of storing large volumes of business data by keeping it in cloud data centers. The platform also gives the opportunity to use e-commerce business applications, e.g. B2B and B2C, with a smaller investment. Some other advantages of cloud computing for e-commerce include cost effectiveness, speed of operations, scalability and security of the entire service [3] [4]. The association between cloud computing and data mining is that the cloud is used to store the data on servers while data mining is offered as a service over this client-server relationship, although care must be taken that ethical issues such as the privacy and individuality of the collected information are not violated [5]. Considering the importance of data mining for today's companies, this paper discusses the benefits and challenges of data mining for e-commerce companies. Furthermore, it reviews the process of data mining in e-commerce together with the common types of databases and cloud computing in the field of e-commerce.
DATA MINING
Data mining is the process of discovering meaningful patterns and correlations by sifting through large amounts of data stored in repositories. There are several tools for this, which include abstractions, aggregations, summarizations and characterizations of data [6]. In the past decade, data mining has changed the e-commerce business. Data mining is not specific to one type of data; it can be applied to any type of information source, although algorithms and tactics may differ when applied to different kinds of data, and the challenges presented by different types of data vary. Data mining is used on many forms of databases such as flat files, data warehouses, object-oriented databases, etc. This paper concentrates on relational databases. A relational database consists of a set of tables containing either values of entity attributes or values of attributes from entity relationships. Tables have columns and rows, where columns represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a relationship between objects and is identified by a set of attribute values representing a unique key [6]. The most commonly used query language for relational databases is SQL, which allows users to manipulate and retrieve data stored in the tables. Data mining algorithms using relational databases can be more versatile than data
mining algorithms specifically written for flat files. Data mining can benefit from SQL for data selection, transformation and consolidation [7]. There are several core techniques in data mining that are used to build data mining applications. The most common techniques are as follows [8] [9]:
1) Association Rules: Association rule mining is among the most important methods of data mining. The essence of this method is extracting interesting correlations and associations among sets of items in transactional databases or other data pools. Association rules are used extensively in various areas. A typical association rule has an implication of the form A→B, where A is an item set and B is an item set that contains only a single atomic condition [10] (a short support/confidence sketch follows this list).
2) Clustering: This is the organisation of data in classes, or it refers to a collection of objects grouped by similarity to form one or more classes. Clustering class labels are unknown in advance, and it is up to the clustering algorithm to discover acceptable classes. Clustering is sometimes called unsupervised classification, because the classification is not dictated by given class labels. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects [10].
3) Prediction: Prediction has attracted substantial attention given the possible consequences of successful forecasting in a business context. There are two types of prediction: the first is predicting unavailable data values, and the second is that, as soon as a classification model is formed on a training set, the class label of an object can be predicted based on the attribute values of the object. Prediction is more often used to refer to the forecast of missing numerical values [10].
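The A→B rule form mentioned in item 1 can be made concrete with a small, self-contained sketch that computes the standard support and confidence measures; the transactions below are invented purely for illustration and are not from the paper.

```python
# Minimal support/confidence sketch for an association rule A -> B.
transactions = [
    {"printer", "paper", "ink"},
    {"printer", "ink"},
    {"paper", "pen"},
    {"printer", "paper"},
    {"ink", "pen"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A union B) / support(A): how often B appears when A does."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Evaluate a candidate rule {printer} -> {ink}.
a, b = {"printer"}, {"ink"}
print("support   :", support(a | b, transactions))     # 0.4
print("confidence:", confidence(a, b, transactions))   # ~0.67
```

An Apriori-style miner would simply enumerate many such candidate rules and keep those whose support and confidence exceed chosen thresholds.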
SOME COMMON DATA MINING TOOLS
1) Weka: Accurate data mining results require the right tool for the dataset being mined. Weka gives the ability to put learning-method algorithms into practice. The tool has many benefits, as it includes all the standard data mining procedures such as data pre-processing, clustering, association, classification, regression and attribute selection. It has both Java and non-Java versions together with a visualization application, and the tool is free for users to customize to their own specification [11] [12].
2) NLTK: It is mainly for language processing tasks, with a pool of different language processing tools together with machine learning, data mining, sentiment analysis, data scraping and other language processing tasks. NLTK requires a user to install the tool on their system to have access to the full package. It is built in Python, and a user can build applications on top of it and adapt the tool to their own specification [11] (a brief usage sketch follows this list).
3) Spider Miner: A data mining tool that does not require a user to write code, written in the Java programming language. Part of its capability is that it provides thorough analytics via template-based frameworks. It is a very flexible and user-friendly tool offered as a service, and apart from data mining functions, the tool supports visualization, prediction, data pre-processing, deployment, statistical modelling and evaluation. The tool includes learning schemes, algorithms and models from WEKA and R scripts, which makes it more powerful [12].
All three tools mentioned above are open source.
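As a small illustration of the kind of language-processing task NLTK supports (item 2 above), the sketch below tokenizes a hypothetical customer review and counts term frequencies. The review text is invented, and the one-time model download step may vary slightly across NLTK versions.

```python
import nltk

# One-time download of the sentence/word tokenizer models (network access required).
nltk.download("punkt", quiet=True)

# A hypothetical customer review left on an e-commerce product page.
review = "The printer arrived quickly. Great print quality, but the ink is expensive."

tokens = nltk.word_tokenize(review.lower())
words = [t for t in tokens if t.isalpha()]   # drop punctuation tokens
freq = nltk.FreqDist(words)                  # simple term frequencies

print(freq.most_common(5))
```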
DATA MINING IN E-COMMERCE
Data mining in e-commerce is a vital way of repositioning the e-commerce company by supporting the enterprise with the required information concerning the business. Recently, most companies have adopted e-commerce and are in possession of big data in their data repositories. The only way to get the most out of this data is to mine it to improve decision making or to enable business intelligence. In e-commerce data mining there are three important processes that data must pass through before turning into knowledge or application. Figure 1 shows the steps for data mining in e-commerce.
Figure 1: Data mining process in e-commerce [16].
The first and easiest process of data mining is data preprocessing, which is actually a step before the mining itself, whereby the data is cleaned by removing unwanted data that has no relation to the required analysis. Hence, this step boosts the performance of the entire data mining process, the accuracy of the data will be higher, and the time needed for the actual mining will be reduced considerably. This is usually straightforward if the company already has an existing target data warehouse; if not, the selection, cleaning and transformation of data, termed preprocessing, can consume at least 80% of the effort [13]. Pattern mining is the second step, and it refers to the techniques or approaches used to develop recommendation rules or to develop a model out of a large data set; it can also be referred to as the techniques or algorithms of data mining. The most common patterns used in e-commerce are prediction, clustering and association rules. The purpose of the third step, pattern analysis, is to verify and shed more light on the discovered model in order to give a clear path for the
subsequent application of the data mining results. The analysis lays much emphasis on the statistics and rules of the pattern used, by observing them after multiple users have accessed them [14]. However, all this depends on how iterative the overall process is and on the interpretation of the visual information obtained at each sub-step. Therefore, in general the data mining process iterates over the following five basic steps (a minimal code sketch follows the list):
• Data selection: This step is all about identifying the kind of data to be mined, the goals for it and the necessary tools to enable the process. At the end of it, the right input attributes and output information to represent the task are chosen.
• Data transformation: This step is all about organising the data based on the requirements by removing noise, converting one type of data to another, normalising the data if there is a need to, and also defining the strategy to handle missing data.
• Data mining step per se: Having mined the transformed data using any of the techniques to extract patterns of interest, the miner can also shape the data mining method by performing the preceding steps correctly.
• Result interpretation and validation: For a better understanding of the data and its synthesised knowledge, together with its validity span, the robustness is checked by testing the data mining application. The information retrieved can also be evaluated by comparing it with earlier expertise in the application domain.
• Incorporation of the discovered knowledge: This has to do with presenting the results of the discovered knowledge to decision makers so that it is possible to compare, or check and resolve, conflicts with earlier extracted knowledge where a newly discovered pattern can be applied [15].
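The sketch below walks through the first four steps under simplifying assumptions: attribute selection, a basic transformation (missing values and normalisation), an unsupervised mining step, and attaching the result back to the records for interpretation. The column names and the choice of clustering are illustrative only, not the paper's own implementation.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical clickstream/transaction export; column names are invented.
raw = pd.DataFrame({
    "customer_id":  [1, 2, 3, 4, 5],
    "visits":       [10, 3, None, 25, 7],
    "basket_value": [120.0, 15.5, 40.0, None, 60.0],
    "country":      ["TR", "TR", "DE", "FR", "TR"],
})

# 1) Data selection: keep only the attributes relevant to the task.
data = raw[["visits", "basket_value"]]

# 2) Data transformation: handle missing values and normalise.
data = data.fillna(data.median())
data = (data - data.mean()) / data.std()

# 3) Data mining step: here, an unsupervised clustering of customers.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# 4) Result interpretation: attach cluster labels back to the customers;
#    step 5 (incorporation) would feed these segments to decision makers.
raw["segment"] = model.labels_
print(raw[["customer_id", "segment"]])
```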
BENEFITS OF DATA MINING IN E-COMMERCE
Application of data mining in e-commerce refers to possible areas in the field of e-commerce where data mining can be utilised for the purpose of business enhancement. As we all know, while visiting an online store for shopping, users normally leave behind certain facts that companies can store in their database. These facts represent unstructured or structured
data that can be mined to provide a competitive advantage to the company. The following areas are where data mining can be applied in the field of e-commerce for the benefit of companies:
1) Customer Profiling: This is also known as a customer-oriented strategy in e-commerce. It allows companies to use business intelligence, through the mining of customers' data, to plan their business activities and operations as well as to develop new research on products or services for a prosperous e-commerce. Identifying the customers with great purchasing potential from the visiting data can help companies lessen the sales cost [17]. Companies can use users' browsing data to identify whether they are purposefully shopping, just browsing, or buying something they are familiar with or something new. This helps companies to plan and improve their infrastructure [18].
2) Personalization of Service: Personalization is the act of providing content and services geared to individuals on the basis of information about their needs and behavior. Data mining research related to personalization has focused mostly on recommender systems and related subjects such as collaborative filtering. Recommender systems have been explored intensively in the data mining community. These systems can be divided into three groups: content-based, social data mining and collaborative filtering. Such systems are built and learned from explicit or implicit feedback of users and are usually represented as a user profile. Social data mining, considering the source of data created by groups of individuals as part of their daily activities, can be an important source of information for companies. Alternatively, personalization can be achieved with the aid of collaborative filtering, where users are matched by particular interests and, in the same vein, the preferences of these users are used to make recommendations [19].
3) Basket Analysis: Every shopper's basket has a story to tell, and market basket analysis (MBA) is a common retail, analytic and business intelligence tool that helps retailers to know their customers better (a short affinity-computation sketch appears after this list). There are different ways to get the best out of market basket analysis and these include:
– Identification of product affinities; tracking not-so-apparent product affinities and leveraging them is the real
challenge in retail. Walmart customers purchasing Barbie dolls show an affinity towards one of three candy bars; obscure connections such as this can be discovered with advanced market basket analytics for planning more effective marketing efforts.
– Cross-sell and up-sell campaigns; these show the products purchased together, so customers who purchase a printer can be persuaded to pick up high-quality paper or premium cartridges.
– Planograms and product combos; these are used for better inventory control based on product affinities, developing combo offers and designing effective, user-friendly planograms focused on products that sell together.
– Shopper profiles; analyzing market baskets with the aid of data mining over time gives a glimpse of who the shoppers really are, providing insight into their ages, income range, buying habits, likes and dislikes, and purchase preferences, and leveraging this improves the customer experience [19].
4) Sales Forecasting: Sales forecasting considers the time an individual customer takes to buy an item and, in this process, tries to predict if the customer will buy again. This type of analysis can be used to determine a strategy of planned obsolescence or to figure out complementary products to sell. In sales forecasting, cash flow can be projected in three ways: pessimistic, optimistic and realistic. This helps in planning an adequate amount of capital to endure the worst possible scenario, that is, if sales do not actually go as planned [19].
5) Merchandise Planning: Merchandise planning is useful for both online and offline retail companies. In the case of online business, merchandise planning helps to determine stocking options and inventory warehousing, while in the case of offline companies, businesses that are looking to grow by adding stores can assess the required amount of merchandise they will need by looking at the exact layout of the current store [20].
Using the right approach to merchandise planning will definitely lead to answers on what to do with:
• Pricing: mining the database helps determine the best-suited price of products or services by revealing customer sensitivity.
• Deciding on products: data mining shows e-commerce businesses which products customers actually desire, including intelligence on competitors' merchandise.
• Balancing of stocks: mining the retail database helps determine the right, specific amount of stock needed, i.e. not too much and not too little, throughout the business year and also during the buying seasons.
6) Market Segmentation: Customer segmentation is one of the best uses of data mining. The large amount of data collected can be broken down into different, meaningful segments such as income, age, gender and occupation of customers, and this can be used when companies run email marketing campaigns or SEO strategies. Market segmentation can also help a company identify its own competitors. This information alone can help the retail company recognize that the usual respondents are not the only ones competing for the same customer money as the present company [21]. Segmenting the database of a retail company will improve conversion rates, as the company can focus its promotion on a closely fitted and highly desired market. This also helps the retail company to understand the competitors involved in each segment, permitting the customization of products that will actually satisfy the target audience [21].
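The product-affinity idea behind market basket analysis (item 3 above) can be illustrated with a short sketch that computes pairwise co-occurrence and lift from transaction data; the baskets below are invented for illustration.

```python
from itertools import combinations
from collections import Counter

# Hypothetical point-of-sale baskets.
baskets = [
    {"doll", "candy_bar"},
    {"doll", "candy_bar", "batteries"},
    {"printer", "paper"},
    {"doll", "batteries"},
    {"candy_bar", "paper"},
]
n = len(baskets)

item_count = Counter()
pair_count = Counter()
for b in baskets:
    item_count.update(b)
    pair_count.update(frozenset(p) for p in combinations(sorted(b), 2))

# Lift > 1 suggests two products are bought together more often
# than their individual popularities alone would predict.
for pair, c in pair_count.items():
    x, y = tuple(pair)
    lift = (c / n) / ((item_count[x] / n) * (item_count[y] / n))
    print(sorted(pair), f"co-occurrence={c}, lift={lift:.2f}")
```

The same counts can feed cross-sell campaigns or planogram decisions; segmentation would simply repeat the analysis per customer segment.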
CHALLENGES OF DATA MINING IN E-COMMERCE
Besides the benefits, data mining also poses challenges for e-commerce companies, which are as follows:
1) Spider Identification: As is commonly known, the main aim of data mining is to convert data into useful knowledge, and the main source of data for e-commerce companies is web pages. Therefore, it is critical for e-commerce companies to understand how search engines work in order to follow how quickly things happen, how they happen and when changes will show up in the search engines.
Spiders are software programs that are sent out by the search engine to find new information; they are also called bots or crawlers. A spider is a software program that the search engine uses to request pages and download them. It comes as a surprise to some people, but what the search engine does is use a link on an existing website to find a new website, request a copy of that page and download it to its server. This is what the search engines run the ranking algorithm against, and that is what shows up in the search engine result page. Therefore, the challenge here is that the search engines need to download a correct copy of the website. The e-commerce website needs to be readable and visible, since the algorithm is applied to the search engine's database. Tools need mechanisms that enable them to automatically remove unwanted data before it is transformed into information, in order for the data mining algorithm to provide reliable and sensible output (a minimal log-filtering sketch follows this list) [22].
2) Data Transformations: Data transformation poses a challenge for data mining tools. The data needed for transformation comes from two kinds of work: first, building the data warehouse from an active, operational system, and second, activities that involve assigning new columns, binning data and aggregating the data. The first needs to be modified only infrequently, that is, only when there is a change in the site, while the set of transformations in the second presents a significantly greater challenge in the data mining process [22].
3) Scalability of Data Mining Algorithms: With Yahoo, which had over 1.2 billion page views in a day, such large amounts of data raise significant scalability issues:
• Due to the large volume of data gathered from the website, the data mining algorithm may not be able to process it in a reasonable time, especially because many algorithms scale nonlinearly.
• The models that are generated tend to be too complicated for individuals to understand and interpret [22].
4) Make Data Mining Models Comprehensible to Business Users: The results of data mining should be clearly understood by business users, from the merchandisers who are in charge of decision making, to the creative designers who design the sites, to the marketers who spend advertising money. The challenge is to design and define extra model types and a strategic way to present them to business users: what regression models can we come up with and how can we present them? (Even linear regression is usually hard for business users to understand.) How can we present nearest-neighbor models, for example? How can we present the results of association rule algorithms without overwhelming users with tens of thousands of rules? [22].
5) Support Slowly Changing Dimensions: The demographic attributes of visitors change: they may get married, their salaries or income may increase, their children grow rapidly, and the needs on which the model is based change. Likewise, product attributes also change: new choices may become available, the design and the way the product or service is packaged may change, and quality may improve or degrade. These attributes that change over time are often known as "slowly changing dimensions". The main challenge here is to keep track of those changes and, in the same vein, to provide support for the identified changes in the analysis [2].
6) Make Data Transformation and Model Building Accessible to Business Users: Providing definite answers to questions from individual business users requires data transformations along with a technical understanding of the tools used in the analysis. Many commercial report designers and online analytical processing (OLAP) tools are basically hard for business users to understand. In this case, two preferred solutions are (i) provision of templates (e.g. online analytical processing cubes and recommended transformations for mining) for the expected questions, and (ii) provision of experts via consultation or even a service organization. This challenge is basically to find a way to enable business users to analyze the information themselves without any hiccups [2].
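A first, simplified pass at the spider-identification challenge (item 1 above) is to separate likely crawler requests from human traffic in the web server access log, for example by user-agent keywords and requests for robots.txt. Real deployments need far more robust heuristics, and the log lines below are invented.

```python
import re

# Invented access-log excerpts in the common combined log format.
log_lines = [
    '66.249.66.1 - - [10/Oct/2015:13:55:36] "GET /robots.txt HTTP/1.1" 200 310 "-" "Googlebot/2.1"',
    '10.0.0.7 - - [10/Oct/2015:13:55:40] "GET /product/42 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '66.249.66.1 - - [10/Oct/2015:13:55:41] "GET /product/42 HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
]

BOT_HINTS = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def looks_like_spider(line: str) -> bool:
    # Heuristic 1: the user-agent field names a known crawler keyword.
    if BOT_HINTS.search(line.rsplit('"', 2)[-2]):
        return True
    # Heuristic 2: the request asks for robots.txt, which browsers rarely do.
    return "/robots.txt" in line

human_traffic = [l for l in log_lines if not looks_like_spider(l)]
print(f"kept {len(human_traffic)} of {len(log_lines)} requests for mining")
```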
SUMMARY AND CONCLUSION
Data mining for e-commerce companies should no longer be a privilege but a requirement in order to survive and remain relevant in the competitive environment. On one hand, data mining offers a number of benefits to e-commerce companies and allows them to do merchandise planning, analyze customers' purchasing behaviors and forecast their sales, which in turn would place them above other companies and generate more revenue. On the other hand, there are certain challenges of data mining in the field of e-commerce, such as spider identification, data transformation, scalability of data mining algorithms, making data mining models comprehensible to business users, supporting slowly changing dimensions and making data transformation and model building accessible to business users. The data collected about customers and their transactions, which is the greatest asset of e-commerce companies, needs to be used consciously for the benefit of the companies. For such companies, data mining plays an important role in providing customer-oriented services to increase customer satisfaction. It has become apparent that utilizing data mining tools is a necessity for e-commerce companies in this globally competitive environment. Although the complexity and granularity of the mentioned challenges differ, e-commerce companies can overcome these problems by applying the right techniques. For example, developing the e-commerce website in a way that lets search engines read and access the latest version of the website helps companies to overcome the search engine spider identification problem. Another hot topic in e-commerce data mining is cloud computing, which is also covered in this paper. While the need for data mining tools is growing every day, the need to integrate them with cloud computing becomes more pressing. It is obvious that making good use of cloud computing technology in e-commerce helps companies use resources effectively and reduce costs, enabling efficient data mining.
REFERENCES
1. Cao, L., Li, Y. and Yu, H. (2011) Research of Data Mining in Electronic Commerce. IEEE Computer Society, Hebei.
2. Bhagyashree, A. and Borkar, V. (2012) Data Mining in Cloud Computing. Multi Conference (MPGINMC-2012). http://reserach.ijcaonline.org/ncrtc/number6/mpginme1047.pdf
3. Rao, T.K.R.K., Khan, S.A., Begun, Z. and Divakar, Ch. (2013) Mining the E-Commerce Cloud: A Survey on Emerging Relationship between Web Mining, E-Commerce and Cloud Computing. IEEE International Conference on Computational Intelligence and Computing Research, Enathi, 26-28 December 2013, 1-4. http://dx.doi.org/10.1109/iccic.2013.6724234
4. Wu, M., Zhang, H. and Li, Y. (2013) Data Mining Pattern Valuation in Apparel Industry E-Commerce Cloud. IEEE 4th International Conference on Software Engineering and Service Science (ICSESS), 689-690.
5. Srinniva, A., Srinivas, M.K. and Harsh, A.V.R.K. (2013) A Study on Cloud Computing Data Mining. International Journal of Innovative Research in Computer and Communication Engineering, 1, 1232-1237.
6. Carbone, P.L. (2000) Expanding the Meaning and Application of Data Mining. International Conference on Systems, Man and Cybernetics, 3, 1872-1873. http://dx.doi.org/10.1109/icsmc.2000.886383
7. Barry, M.J.A. and Linoff, G.S. (2004) On Data Mining Techniques for Marketing, Sales and Customer Relationship Management. Indianapolis Publishing Inc., Indiana.
8. Pan, Q. (2011) Research of Data Mining Technology in Electronic Commerce. IEEE Computer Society, Wuhan, 12-14 August 2011, 1-4. http://dx.doi.org/10.1109/icmss.2011.5999185
9. Verma, N., Verma, A., Rishma and Madhuri (2012) Efficient and Enhanced Data Mining Approach for Recommender System. International Conference on Artificial Intelligence and Embedded Systems (ICAIES2012), Singapore, 15-16 July 2012.
10. Kamba, M. and Hang, J. (2006) Data Mining Concept and Techniques. Morgan Kaufmann Publishers, San Francisco.
11. News Stack (2015). http://thenewstack.io/six-of-the-best-open-source-data-mining-tools/
12. Witten, I.H. and Frank, E. (2014) The Morgan Kaufmann Series on Data Mining Management Systems: Data Mining. 2nd Edition, Morgan Kaufmann, San Francisco, 365-528.
13. Liu, X.Y. and Wang, P.Z. (2008) Data Mining Technology and Its Application in Electronic Commerce. IEEE Computer Society, Dalian, 12-14 October 2008, 1-5.
14. Zeng, D.H. (2012) Advances in Computer Science and Engineering. Springer, Heidelberg, New York.
15. Ralph, K. and Caserta, J. (2011) The Data Warehouse ETL Toolkit: Practical Techniques for Extraction, Cleaning, Conforming and Delivering Data. Wiley Publishing Inc., USA.
16. Michael, L.-W. (1997) Discovering the Hidden Secrets in Your Data—The Data Mining Approach to Information. Information Research, 3. http://informationr.net/ir/3-2/
17. Li, H.J. and Yang, D.X. (2006) Study on Data Mining and Its Application in E-Business. Journal of Gansu Lianhe University (Natural Science), No. 2006, 30-33.
18. Raghavan, S.N.R. (2005) Data Mining in E-Commerce: A Survey. Sadhana, 30, 275-289. http://dx.doi.org/10.1007/BF02706248
19. Michael, J.A.B. and Gordon, S.L. (1997) Data Mining Techniques: For Marketing and Sales, and Customer Relationship Management. 3rd Edition, Wiley Publishing Inc., Canada.
20. Wang, J.-C., David, C.Y. and Chris, R. (2002) Data Mining Techniques for Customer Relationship Management. Technology in Society, 24, 483-502.
21. Christos, P., Prabhakar, R. and Jon, K. (1998) A Microeconomic View of Data Mining. Data Mining and Knowledge Discovery, 2, 311-324. http://dx.doi.org/10.1023/A:1009726428407
22. Yahoo (2001) Second Quarter Financial Report. Yahoo Inc., California.
Chapter 14
Research on Realization of Petrophysical Data Mining Based on Big Data Technology
Yu Ding1,2, Rui Deng2,3, Chao Zhu4
1 School of Computer Science, Yangtze University, Jingzhou, China
2 Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education, Wuhan, China
3 School of Geophysics and Oil Resource, Yangtze University, Wuhan, China
4 The Internet and Information Center, Yangtze University, Jingzhou, China
ABSTRACT
This paper studied how data mining for large-scale petrophysical data can be realized for logging interpretation, drawing on big data technology and data mining methods through a distributed architecture, cloud computing technology and the B/S mode. Based on a petrophysical data mining application of K-means clustering analysis, it elaborated the practical significance of applying big data technology in the well logging field, which
Citation: Ding, Y., Deng, R. and Zhu, C. (2018), "Research on Realization of Petrophysical Data Mining Based on Big Data Technology". Open Journal of Yangtze Oil and Gas, 3, 1-10. doi: 10.4236/ojogas.2018.31001. Copyright: © 2018 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
also provided a scientific reference for logging interpretation work and broadened the application of data analysis and processing methods.
Keywords: Big Data Technology, Data Mining, Logging Field Method
INTRODUCTION
With the increasing scale of oil exploration and development engineering, the application of high-tech logging tools is becoming more and more extensive, and structured, semi-structured and unstructured complex types of oil and gas exploration data have exploded. In this paper, petrophysical data is taken as the object; big data technology and data mining methods are used for data analysis and processing, mining effective and usable knowledge to assist routine interpretation work and to broaden the scientific means of enhancing interpretation precision. The research gives full play to the great potential of logging interpretation for the comparative study of geologic laws and for oil and gas prediction. The rapid development of network and computer technology, as well as the large-scale use of database technology, makes it possible to extract effective information from petrophysical data in more ways than those traditionally adopted by logging interpretation. Relying on the traditional database query mechanism and mathematical statistical analysis methods, it is difficult to process large-scale data effectively. The data often contains a lot of valuable information, but it cannot be used efficiently because the data is in an isolated state and cannot be transformed into useful knowledge applied to logging interpretation work. Too much useless information will inevitably lead to the loss of information distance [1] and useful knowledge, which is the "rich information and lack of knowledge" dilemma [2].
ANALYSIS OF BIG DATA MINING OF PETROPHYSICAL DATA
Processing Methods of Big Data
Big data can be characterized by its data scale: it is difficult to use existing software tools and mathematical methods to analyze and process, in a reasonable time, data that has the features of large scale, complex structure and many types [3].
At present, the amount of petrophysical data is gradually increasing and its types are multiplying, which is consistent with the basic characteristics of big data. Taking advantage of cloud computing's data processing performance and the good characteristics of a distributed architecture, the existing C/S-mode interpretation method is transformed into a B/S mode on the basis of a distributed architecture. Then, the situation in which the processing capacity of a single original client node is insufficient can be handled by horizontally scaling the individual processing nodes and node servers, under rational allocation and use of system resources. Meanwhile, an online method is adopted for the analysis and processing of the petrophysical data, which stores the data mining results and the analysis process on the server. Interpreters can query the server side for these process documents to make a more reasonable interpretation of the logging data in unknown areas or under the same type of geological conditions, which achieves the change from the lower stage of sharing (data sharing) to the advanced stage (knowledge sharing). The essence of big data processing methods can be seen as the development and extension of grid computing and earlier distributed computing. The significance of big data processing does not just lie in the amount of data, but in the fact that from these massive available data resources valuable information can be gained quickly and effectively, usable patterns can be mined, and the purpose of acquiring new knowledge can be achieved.
Overview of Big Data Technology
Distributed System Architecture
A distributed file system is mainly used to provide data access between the local underlying storage and the upper-level file system. It is a network-based software system with a high degree of cohesion and transparency. A distributed system architecture can be considered as a software architecture design that operates across multiple processors. This paper chooses the HDFS open source distributed file system to build the software operating environment [4]. The HDFS system architecture shown in Figure 1 adopts a master/slave architecture, and an HDFS cluster is composed of a Namenode and a number of Datanodes. The Namenode is used to manage the namespace of the file system and to handle client access to files. The Datanode is used
to manage the storage on its node and to handle the read and write requests from the file system clients.
Figure 1: HDFS system architecture.
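To make the Namenode/Datanode interaction concrete, the hedged sketch below stages a logging-data file into HDFS and lists it back using the standard `hdfs dfs` command-line client driven from Python. The paths and file name are illustrative only, and a configured Hadoop client must already be available on the PATH.

```python
import subprocess

def hdfs(*args: str) -> str:
    """Run an `hdfs dfs` subcommand and return its stdout."""
    result = subprocess.run(
        ["hdfs", "dfs", *args],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Illustrative paths; adjust to the actual cluster layout.
hdfs("-mkdir", "-p", "/data/petrophysics")
hdfs("-put", "-f", "well_logs.csv", "/data/petrophysics/")

# The Namenode resolves the namespace; the file's blocks live on the Datanodes.
print(hdfs("-ls", "/data/petrophysics"))
```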
Cloud Computing Technology
Cloud computing is Internet-based computing, put forward in a context where traditional computer storage technology and computing capacity had run into bottlenecks (Figure 2) [5] [6]. By sharing hardware resources and information among the network nodes of a cluster, large-scale parallel and distributed computing can be achieved to enhance the overall computing power of the system. Combined with the content of this study, cloud computing is applied to the mining of petrophysical data, which can meet the computing requirements of the mining algorithm and solve the problem of insufficient processing capacity of client nodes in the traditional C/S mode; this is the basis of the conversion to the B/S distributed online processing mode.
Figure 2: Cloud computing architecture.
The Combination and Application of Data Mining Methods
Clustering Mining Method
Data clustering is one of the important tasks of data mining. Through clustering, it is possible to clearly identify the inter-class and intra-class regions of a data set, which makes it convenient to understand the global distribution pattern and to discover correlations between data attributes [7]. In the pattern space S, given N samples X1, X2, ∙∙∙, XN, clustering is defined as finding corresponding regions R1, R2, ∙∙∙, Rm according to the degree of similarity between samples, such that any Xi (i = 1, 2, ∙∙∙, N) is classified into only one class rather than two classes at the same time, to wit, Xi ∈ Rj for exactly one j and Rj ∩ Rk = ∅ for j ≠ k [8]. Clustering analysis is mainly based on some features of the data set to divide it according to specific requirements or rules, and under normal circumstances it satisfies the following two characteristics: intra-class similarity, namely that data items in the same cluster should be as similar as possible, and inter-class dissimilarity, namely that data items in different clusters should be as different as possible [9].
Petrophysical Data Clustering Mining Analysis
At present, the analysis and accurate description of sedimentary facies, subfacies and microfacies for favorable reservoir facies zones is an important task in current oilfield exploration and development. The study of sedimentary facies is carried out on the basis of the composition, structure and sedimentary parameters under the guidance of facies pattern and facies sequence. The petrophysical data contains much potential stratigraphic information, and the lithology of the strata often leads to certain differences in the sampling values of the logging curves. This difference can be seen as the combined effect of many factors, such as the lithological mineral composition, its structure and the fluid properties contained in the pores. Because of this, a given logging physical value also indicates a particular lithology of the corresponding strata. Combined with differences in formation period and background, the inherent physical characteristics of rock strata in different geological periods, together with some random noise, can then be used to achieve lithological and stratigraphic division.
MINING BASED ON K-MEANS CLUSTERING ANALYSIS
K-Means Algorithm Principle
Assuming that there is a set of elements, the goal of K-means is to divide the elements of the set into K clusters or classes so that the elements within each cluster have a high degree of similarity while the similarity of elements from different clusters is low; that is, similar elements are clustered into a collection, eventually forming multiple clusters of feature-similar elements [10]. K-means first randomly selects k objects from the n data objects as the initial cluster centers, while the remaining data objects are assigned by calculating the similarity (distance) of each data object to the cluster centers and placing it in the class of the nearest center; the center of each cluster (the mean of all data objects in that cluster) is then recalculated and used as the class center for the next iteration. The clustering process is repeated until the criterion function converges. In this paper, the Euclidean distance is taken as the similarity measure, and the criterion function Er is defined as the sum of the squared errors of all data objects to their class centers. Obviously, the purpose of the K-means algorithm is to find the K divisions of the data set that optimize the criterion function.
$$E_r = \sum_{i=1}^{K} \sum_{X \in C_i} \left\| X - \bar{X}_i \right\|^2 \qquad (1)$$
Here, X represents a data object in the data set, Ci represents the ith cluster, and $\bar{X}_i$ represents the mean of cluster Ci.
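The criterion function in Equation (1) can be evaluated directly. The sketch below computes the within-cluster sum of squared errors for a given assignment of two-dimensional samples; the sample values and the assignment are synthetic, for illustration only.

```python
import numpy as np

# Synthetic two-dimensional samples and a cluster assignment (K = 2).
X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.5], [5.2, 5.1]])
labels = np.array([0, 0, 1, 1])

def criterion(X: np.ndarray, labels: np.ndarray) -> float:
    """E_r: sum of squared distances of each sample to its cluster mean."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        center = members.mean(axis=0)            # the cluster mean
        total += ((members - center) ** 2).sum()
    return total

print(criterion(X, labels))  # K-means iterates until this stops decreasing
```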
Lithological Division Based on K-Means
The logging physical values of the same lithological layer are relatively stable and generally do not deviate beyond an allowable error. The mean value of the samples in the same layer can therefore be used to represent the overall true value of similar parts of the surrounding lithology. When the difference between the value of an adjacent sampling point and the mean is within the given error range, the lithological type of that point can be assigned the lithology corresponding to the mean; otherwise, the search for its home class continues until the division of all sampling points is completed. To facilitate the study, this paper selects the natural gamma logging curve, which has strong longitudinal resolution, for the division of lithology, while the other available curves are used to adjust the division results and improve the accuracy of the decisions once the lithological division is completed. For any two points in the plane, (X1, Y1) and (X2, Y2), the Euclidean distance is as follows:
$$d = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2} \qquad (2)$$
Figure 3 is taken as an example to show the clustering process for K-means petrophysical data. In Figure 3(a), the black triangles are plotted in two-dimensional space with two-dimensional feature vectors as coordinates. They can be regarded as examples of two-dimensional data (composed of the data of two logging curves), that is, the original petrophysical data set in need of clustering. Three boxes of different colors represent the cluster center points (analogous to particular lithologies) given by random initialization. Figure 3(b) shows the result after clustering is completed, that is, the goal of lithological division is achieved. Figure 3(c) shows the trajectory of the centroids during the iterative process.
Figure 3: Clustering process of petrophysical data.
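A minimal, single-curve version of the lithological division described above can be sketched with scikit-learn's KMeans applied to natural gamma (GR) samples. The GR values, the number of classes and the class names below are assumptions chosen for illustration, not the study's actual parameters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic natural gamma (GR, API units) samples along depth.
rng = np.random.default_rng(0)
gr = np.concatenate([
    rng.normal(40, 5, 30),    # clean sandstone-like response (assumed)
    rng.normal(80, 6, 30),    # intermediate response (assumed)
    rng.normal(120, 7, 30),   # mudstone-like response (assumed)
]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(gr)

# Order clusters by their mean GR so labels map to illustrative lithologies.
order = np.argsort(kmeans.cluster_centers_.ravel())
names = {order[0]: "sandstone", order[1]: "argillaceous sandstone", order[2]: "mudstone"}
lithology = [names[label] for label in kmeans.labels_]

print(kmeans.cluster_centers_.ravel())
print(lithology[:5], "...")
```

In practice, the other logging curves would then be used, as the text describes, to adjust borderline assignments.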
The program flow chart is shown in Figure 4.
Figure 4: Program flow chart.
Software Implementation
Distributed Architecture and Cloud Computing Environment
Hadoop operates in three modes: stand-alone, pseudo-distributed and fully distributed. Taking into account the test environment required for the simulation software and the main content of this study, the test environment adopts Hadoop's fully distributed mode, in which VMware vSphere 5.5 is used to build two additional virtual machines running the CentOS 6 Linux system on a high-performance server equipped with CentOS 6, and the distributed computing is done by three nodes in the cluster (Table 1). Unlike physical nodes, the cluster nodes are composed by software virtualization, so their performance in actual operation differs somewhat.
Table 1: Description of the hosts and terminals in the cluster
Host type           Host name   OS          IP address     Node type
terminal            localhost   Windows 7   10.102.10.35   -
host machine        test.com    CentOS 6    10.211.6.1     -
virtual machine_1   master      CentOS 6    10.211.40.7    master
virtual machine_2   slave_1     CentOS 6    10.211.40.8    slave
virtual machine_3   slave_2     CentOS 6    10.211.40.9    slave
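On a cluster like the one in Table 1, one common way to run a clustering pass over large logging files is Hadoop Streaming, where the assignment step of K-means is written as a simple mapper reading samples from standard input. The sketch below is a hedged illustration of that idea, with hard-coded centers and an assumed "depth,GR" record format; the study's actual program may be organised differently.

```python
#!/usr/bin/env python
# mapper.py - assigns each logging sample to its nearest cluster center.
# Typically launched through Hadoop Streaming with a companion reducer, e.g.
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#          -input /data/petrophysics -output /data/assignments
import sys

# Illustrative cluster centers for a single GR curve (API units).
CENTERS = [40.0, 80.0, 120.0]

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    depth, gr = line.split(",")          # assumed "depth,GR" records
    gr = float(gr)
    nearest = min(range(len(CENTERS)), key=lambda k: (gr - CENTERS[k]) ** 2)
    # Emit: cluster id (key) TAB depth and value, for the reducer to average
    # when recomputing the centers in the next iteration.
    print(f"{nearest}\t{depth},{gr}")
```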
Application and Analysis
A total of three production wells in the SZ development zone of an oilfield are selected for conventional logging interpretation pretreatment, using the collected sidewall core material, relatively complete logging data, and geological and drilling data combined with the actual geological conditions. Samples possibly affected by borehole enlargement or by too high a mud proportion and viscosity, which distort the measurement curves of the logging instrument, are screened out so that the chosen sample data truly reflect the strata information. Then, according to the description of reservoir performance and the actual division of the corresponding oil and gas standards, the lithology of the working area is divided into four distinct classes, namely sandstone,
argillaceous sandstone, sandy mudstone and mudstone, based on the core material. After the K-means algorithm and the lithological judgment conditions are programmed, the B/S mode and cloud computing technology are used to run the petrophysical data mining program in the cluster, within the fully distributed Hadoop simulation environment that was built, to divide the lithology of the well section. The overall accuracy of the lithological division is about 78%, and the accuracy for sandstone and mudstone is relatively high at more than 85%; the results are shown in Figure 5.
Figure 5: Data mining result.
The data in Figure 5 show that identical SPLI values indicate points belonging to the same layer, i.e., lithological consistency or similarity as seen from the results of manual stratification. Compared with the data mining results, differences in the value of the empirical coefficient across stratigraphic ages mean that the division results based on the discriminant conditions differ at some logging sampling points. According to the correction process against the core data, the result is related to the value of the empirical coefficient: for a certain section of stratum, a value of 2 may give a relatively high degree of coincidence, while for some other layers a value of 3.7 gives a relatively high degree of coincidence. This also shows that selecting a single general empirical coefficient may affect the accuracy of the interpretation results. On the one hand, viewed from the adjacent layers above and below, the results correctly identify points belonging to the same kind of lithology. On the other hand, comparing the left and right division results, the lithological division changes while the corresponding SPLI value is not exactly the same, indicating that the data merit further fine study.
Therefore, in-depth study of inconsistent lithological division results can help to find valuable, unexpected patterns in petrophysical data. Compared with experimental and empirical methods, this approach of "letting the data talk", in which potential correlations are explored and knowledge is extracted, makes the scientific induction and summary of some regional empirical parameter values and empirical formulas more objective and scientific. Figure 6 shows the time consumed by executing the same program on a stand-alone node and in the distributed environment, which is 616,273 ms and 282,697 ms respectively. Considering factors such as algorithm optimization, compiler selection and differences in hardware performance, and speaking only from a qualitative perspective, the use of distributed computing can significantly reduce the time spent on large-scale data processing and improve the overall performance of the system to a certain extent; thus, the feasibility of using big data technology to realize petrophysical data mining is verified.
Figure 6: Running time of program in Windows and HDFS.
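The runtimes reported for Figure 6 translate directly into an overall speedup factor; the two numbers below are taken from the text, and the arithmetic is only the qualitative comparison the paragraph already makes.

```python
standalone_ms = 616_273   # single-node run time reported in the text
distributed_ms = 282_697  # three-node distributed run time reported in the text

speedup = standalone_ms / distributed_ms
print(f"speedup ≈ {speedup:.2f}x")   # roughly 2.2x faster in the cluster
```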
CONCLUSIONS
1) The advantages of distributed architecture and cloud computing are used to improve the overall processing capacity of the system, and in the process of large-scale petrophysical data processing the B/S mode is integrated to achieve data mining, combining the big data analysis and processing mechanism with conventional interpretation. An exploratory research idea for a new method of logging interpretation is put forward, with the starting point of discovering novel knowledge, to provide a scientific reference for broadening routine interpretation work and data analysis methods.
2) The combination of multidisciplinary knowledge and the rational application of cross-disciplinary technology can remedy deficiencies in existing logging interpretation to a certain extent, making the qualitative analysis and quantitative computation of logging data more scientific, with favorable theoretical and practical guidance significance.
ACKNOWLEDGEMENTS
This work is supported by the Yangtze University Open Fund Project of the Key Laboratory of Exploration Technologies for Oil and Gas Resources, Ministry of Education (K2016-14).
REFERENCES
1. Wang, H.C. (2006) DIT and Information. Science Press, Beijing.
2. Wang, L.W. (2008) The Summarization of Present Situation of Data Mining Research. Library and Information, 5, 41-46.
3. Pan, H.P., Zhao, Y.G. and Niu, Y.X. (2010) The Conventional Well Logging Database of CCSD. Chinese Journal of Engineering Geophysics, 7, 525-528.
4. Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003) The Google File System. ACM SIGOPS Operating Systems Review, 37, 29-43. https://doi.org/10.1145/1165389.945450
5. Sakr, S., Liu, A., Batista, D.M., et al. (2011) A Survey of Large Scale Data Management Approaches in Cloud Environments. IEEE Communications Surveys & Tutorials, 13, 311-336. https://doi.org/10.1109/SURV.2011.032211.00087
6. Low, Y., Bickson, D., Gonzalez, J., et al. (2012) Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. Proceedings of the VLDB Endowment, 5, 716-727. https://doi.org/10.14778/2212351.2212354
7. Song, Y., Chen, H.W. and Zhang, X.H. (2007) Short Term Electric Load Forecasting Model Integrating Multi Intelligent Computing Approach. Computer Engineering and Application, 43, 185-188.
8. Abraham, B. and Ledolter, J. (1983) Statistical Methods for Forecasting. John Wiley & Sons, Inc., New Jersey.
9. Farnstrom, F., Lewis, J. and Elkan, C. (2000) Scalability for Clustering Algorithms Revisited. ACM SIGKDD Explorations Newsletter, 2, 51-57. https://doi.org/10.1145/360402.360419
10. Rose, K., Gurewitz, E. and Fox, G.C. (1990) A Deterministic Annealing Approach to Clustering. Information Theory, 11, 373.
SECTION 4: INFORMATION PROCESSING METHODS
Chapter 15
Application of Spatial Digital Information Fusion Technology in Information Processing of National Traditional Sports
Xiang Fu, Ye Zhang, and Ling Qin
School of Physical Education, Guangdong Polytechnic Normal University, Guangzhou 510000, China
ABSTRACT
The rapid development of digital informatization has led to an increasing degree of reliance on informatization in various industries. Similarly, the development of national traditional sports is also inseparable from the support of information technology. In order to improve the informatization development of national traditional sports, this paper studies the fusion process of multisource vector image data and proposes an adjustment and merging algorithm, based on topological relationships and shape correction, for the mismatched points that constitute entities with the same name. Based on the topological relationship, the shape handling of the adjustment
Citation: Xiang Fu, Ye Zhang, Ling Qin, "Application of Spatial Digital Information Fusion Technology in Information Processing of National Traditional Sports", Mobile Information Systems, vol. 2022, Article ID 4386985, 10 pages, 2022. https://doi.org/10.1155/2022/4386985. Copyright: © 2022 by Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
and merging algorithm is modified, and finally a national traditional sports information processing system is constructed by combining digital information fusion technology. Experiments have proved that the method of national traditional sports information processing based on digital information fusion technology proposed in this paper is effective and can play a good role in the digital development of national traditional sports.
INTRODUCTION
National traditional sports, as a carrier of China's excellent traditional culture, have been preserved like a living fossil through changes in the times and social development. The 16th National Congress of the Communist Party of China regards the construction of socialist politics, economy, and culture with Chinese characteristics as the basic program of the primary stage of socialism, and cultural prosperity as an important symbol of comprehensive national strength. Therefore, cultural construction will also be a focus of today's new urbanization process. Achieving cultural development in urbanization requires a medium, and traditional national sports are a good medium. Traditional national sports are multifunctional, including fitness, entertainment, and education. Moreover, their content is rich, their forms are diversified and eclectic, and they suit all ages, men, women, and children alike. It can be said that traditional national sports activities are an indispensable part of people's lives. An information platform is an environment created for the construction, application, and development of information technology, including the development and utilization of information resources, the construction of information networks, the promotion of information technology applications, the development of information technology and related industries, the cultivation of information technology talent, and the formulation and improvement of information technology policy systems [1]. The network platform has the greatest impact and the best effect in the construction of an information platform. Therefore, it is necessary to make full use of network technology, communication technology, control technology, and information security technology to build a comprehensive, nonprofit information platform for traditional Mongolian sports culture. The information platform includes organizations, regulatory documents, categories, inheritors, news updates, columns, performance videos, and protection forums. Moreover, it uses text, pictures, videos, etc., to clearly promote and display the unique ethnicity of
Mongolian traditional sports culture in terms of clothing, event ceremonies, etiquette, techniques, customs, and historical inheritance. In addition, its display has circulated and fused with the ethnic imprints of different historical periods, regions, nations, and classes. The information platform can popularize relevant knowledge, push and forward messages, publish hot topics and comments, and enhance interaction between users. Therefore, the platform has become a carrier for the public to obtain knowledge of Mongolian national traditional sports culture and a shortcut for consultation and communication [2]. The emergence and application of computer technology mark the beginning of a new digital era for human beings, which has greatly changed the existing way of life and the form of information circulation [3]. In-depth study of the main problems and obstacles in the development of digital sports in China and active exploration of development paths and strategies will not only promote the rapid development of the sports performance market, the sports fitness market, and the sporting goods market but also expand the space for sports development, and will have a positive impact on improving the physical fitness of the whole population, cultivating reserve sports talents, and enriching the sports and cultural life of the masses. At the same time, advanced sports culture, healthy sports lifestyles, and scientific exercise methods will be integrated into the daily life of the public, attracting more people to participate in sports activities, experience the vitality of sports, and enjoy the joy of sports. Eventually, this will help realize the goal of transforming a large sports country into a sports power and complete the leap-forward development of the nationwide fitness industry and the sports industry. This article combines digital information fusion technology to construct a national traditional sports information processing system, so as to improve the development of national traditional sports in the information age.
RELATED WORK Integrating “digitalization” into sports can understand the concept of “digital sports” from both narrow and broad directions [4]. In a broad sense, digital sports is a new physical exercise method that combines computer information technology with scientific physical exercise content and methods. It can help exercisers improve their sports skills, enhance physical fitness, enrich social leisure life, and promote the purpose of spiritual civilization construction. In a narrow sense, digital sports is a related activity that combines traditional
sports with modern digital means. Through advanced digital technology, traditional sports exercises are reformed and sublimated, so as to achieve the purpose of scientifically disseminating sports knowledge and effectively improving physical skills [5]. Digital sports is a brand-new concept. It realizes the combination of digital game forms with competitive fitness, physical exercise, and interactive entertainment through technical means such as the Internet, communications, and computers [6]. It comes from the combination of traditional sports and digital technology [7]. At the same time, digital sports also involves cross-cutting fields such as cultural content, computer information, and sports. The emergence of digital sports has freed the public from the limitation of venues in traditional physical exercise. The general public is no longer limited to large, specialized sports venues and can make the best use of existing spaces such as community open spaces, small squares, street roads, and parks. For example, the Wii, a digital home game console sold by Nintendo of Japan, features an unprecedented stick-shaped motion controller, the “Wii Remote”, together with classic sports games. It uses the Wii Remote’s motion sensing and pointing positioning to detect rotation and movement in three-dimensional space and complete “somatosensory operation.” The Wii Remote is used as a fishing rod, baton, tennis racket, drum stick, and other tools in different games to help players complete exercise through somatosensory operations such as shooting, chopping, swinging, and spinning [8]. Undoubtedly, digital sports methods that get rid of the constraints of the geographical environment can not only increase the enthusiasm of all people to participate in sports but also expand the existing sports population. More importantly, digital sports can break free from geographical constraints and is no longer restricted by traditional stadiums [9]. Digital sports is conducive to meeting the sports participation needs of different groups of people. For companies that develop digital sports, whoever first captures the digital sports market of special groups such as the middle-aged and elderly, children, and women will occupy the commanding heights of the digital sports battlefield [10]. The emergence of digital sports brings advanced and scientific training methods and exercise content to ordinary sports enthusiasts, changing the past bias of research toward young people and better satisfying the needs of different groups such as children, the elderly, and women. At the same time, it can also help different sports hobby groups set multilevel exercise goals, find the
best exercise plan, and form a serialized and intelligent digital sports service system [11]. Regardless of the age of the participants, high or moderate weight, and female or male, digital sports methods will provide them with the most suitable activity method to help different sports enthusiasts complete exercise and demonstrate the charm of sports [12]. Digital sports deeply analyzes the activity habits or exercise methods of the elderly, women, children, and other special groups and provides more suitable sports services for every sports enthusiast [13]. Through local computing, digital sports accurately locates and perceives the personalized and unstructured data of different audience groups and conducts comprehensive analysis and processing of various data information in a short period of time, forming a portable mobile device for each sports group. In order to find out the real needs of more sports audiences, put forward effective exercise suggestions to help different exercise groups reach the best exercise state [14]. Through the connection of the bracelet and the digital sports terminal, the public can also see the comparison chart of the comprehensive sports data of different participating groups more intuitively, assist the public to set personalized sports goals, and urge each athlete to complete their own exercise volume. In the end, every exerciser’s exercise method and exercise effect will be improved scientifically and reasonably over time [15].
SPACE DIGITAL FUSION TECHNOLOGY This article applies spatial digital fusion technology to the national sports information processing. Combining the reality and needs of national sports, this article analyzes the spatial digital integration technology. First, the digital coordinate system is established. The mathematical formula for transforming digital coordinates to spatial rectangular coordinates is shown in formula (1) [16].
(1)
$$\begin{cases} X = (N + H)\cos B \cos L \\ Y = (N + H)\cos B \sin L \\ Z = \left[N(1 - e^{2}) + H\right]\sin B \end{cases}$$
Among them, B is the geodetic latitude, L is the geodetic longitude, H is the geodetic height, (X, Y, Z) are the spatial rectangular coordinates, $N = a/\sqrt{1 - e^{2}\sin^{2}B}$ is the radius of curvature in the prime vertical of the ellipsoid, and e is the eccentricity of the ellipse, with $e^{2} = (a^{2} - b^{2})/a^{2}$ (a and b represent the long and short radii of the ellipse, respectively) [17].
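As a concrete illustration of formula (1) and of the iterative inverse described in the next paragraph, the following Python sketch converts geodetic coordinates (B, L, H) to spatial rectangular coordinates (X, Y, Z) and back. The ellipsoid parameters are illustrative assumptions (the article does not name a specific reference ellipsoid, so WGS-84 values are used here), and all function names are hypothetical.

```python
import math

# Assumed ellipsoid (long and short radii); the article does not fix one, so WGS-84 is used.
A_RADIUS = 6378137.0
B_RADIUS = 6356752.314245
E2 = (A_RADIUS**2 - B_RADIUS**2) / A_RADIUS**2   # squared first eccentricity e^2

def blh_to_xyz(B, L, H):
    """Formula (1): geodetic (B, L in radians, H in metres) -> rectangular (X, Y, Z)."""
    N = A_RADIUS / math.sqrt(1.0 - E2 * math.sin(B) ** 2)   # prime-vertical radius
    X = (N + H) * math.cos(B) * math.cos(L)
    Y = (N + H) * math.cos(B) * math.sin(L)
    Z = (N * (1.0 - E2) + H) * math.sin(B)
    return X, Y, Z

def xyz_to_blh(X, Y, Z, iterations=4):
    """Iterative inverse (cf. formula (2)): a few iterations suffice for B, then H follows."""
    L = math.atan2(Y, X)            # longitude is obtained directly
    p = math.hypot(X, Y)
    B = math.atan2(Z, p)            # initial value of the latitude
    for _ in range(iterations):
        N = A_RADIUS / math.sqrt(1.0 - E2 * math.sin(B) ** 2)
        B = math.atan2(Z + N * E2 * math.sin(B), p)
    N = A_RADIUS / math.sqrt(1.0 - E2 * math.sin(B) ** 2)
    H = p / math.cos(B) - N
    return B, L, H

# Example round trip (illustrative values):
x, y, z = blh_to_xyz(math.radians(40.0), math.radians(116.0), 50.0)
print(xyz_to_blh(x, y, z))
```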
When converting from spatial rectangular coordinates to digital coordinates, the geodetic longitude can be obtained directly as $L = \arctan(Y/X)$. However, the calculation of the geodetic latitude B and the geodetic height H is more complicated, and it is often necessary to use an iterative method. From formula (1), the iterative formula can be derived as
(2)
$$B = \arctan\!\left[\frac{Z + N e^{2}\sin B}{\sqrt{X^{2} + Y^{2}}}\right], \qquad H = \frac{\sqrt{X^{2} + Y^{2}}}{\cos B} - N$$
In the iterative process, the initial value is taken as $\tan B_{0} = Z/\sqrt{X^{2} + Y^{2}}$. According to formula (2), B can be obtained after approximately four iterations, and then H can be obtained. Figure 1 shows the conversion between two spatial rectangular coordinate systems. The same point in the two rectangular coordinate systems has the following correspondence [18]:
(3)
$$\begin{bmatrix} X_{2} \\ Y_{2} \\ Z_{2} \end{bmatrix} = \begin{bmatrix} \Delta X_{0} \\ \Delta Y_{0} \\ \Delta Z_{0} \end{bmatrix} + (1 + k)\, R(\varepsilon_{X}, \varepsilon_{Y}, \varepsilon_{Z}) \begin{bmatrix} X_{1} \\ Y_{1} \\ Z_{1} \end{bmatrix}$$
Figure 1: Conversion between two spatial rectangular coordinate systems.
Among them, there are
(4)
$$R(\varepsilon_{X}, \varepsilon_{Y}, \varepsilon_{Z}) = R_{1}(\varepsilon_{X})\, R_{2}(\varepsilon_{Y})\, R_{3}(\varepsilon_{Z})$$
$(\Delta X_{0}, \Delta Y_{0}, \Delta Z_{0})$ is the coordinate translation parameter, $(\varepsilon_{X}, \varepsilon_{Y}, \varepsilon_{Z})$ is the coordinate rotation parameter, and k is the coordinate scale coefficient. In practical applications, the conversion parameters in the above two Cartesian coordinate conversion relations can be determined by applying the least squares method to the coordinates of common points. In mathematics, projection refers to the establishment of a one-to-one mapping relationship between two point sets. Image projection expresses the graticule on the sphere of the earth onto a plane in accordance with a certain mathematical law. A one-to-one correspondence function is established between the digital coordinates (B, L) of a point on the ellipsoid and the rectangular coordinates (x, y) of the corresponding point on the image. The general projection formula can be expressed as [19]
(5)
$$x = f_{1}(B, L), \qquad y = f_{2}(B, L)$$
In the formula, (B,L) is the digitized coordinates (longitude, latitude) of a point on the ellipsoid, and (x,y) is the rectangular coordinates of the point projected on the plane. The transformation of the positive solution of Gaussian projection is as follows: given the digitized coordinates (B,L), the plane rectangular coordinates (x,y) under the Gaussian projection are solved. The formula is shown in
(6) In the formula, X represents the arc length of the meridian from the equator to latitude B, N represents the radius of curvature in the prime vertical, $L_{0}$ represents the longitude of the origin (central meridian), $l = L - L_{0}$ represents the difference between the longitude of the ellipsoid point and the corresponding central meridian, and the auxiliary variables a, b, and e′ represent the long radius, short radius, and second eccentricity of the reference ellipsoid, respectively.
(7)
Among them, the variable $B_{f}$ represents the footpoint latitude, that is, the latitude value corresponding to the meridian arc length calculated from the equator. The longitude and latitude values obtained by the inverse Gaussian transformation are actually relative quantities, namely the differences in longitude and latitude relative to the lower left corner of the figure. Therefore, to get the final correct digitized coordinates, the longitude and latitude values of the lower left corner point need to be added.
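In practice, the forward and inverse Gauss (transverse Mercator) transformations described above are usually delegated to a tested projection library rather than re-implemented from the series expansions. The snippet below is a minimal sketch using the third-party pyproj package, which is not mentioned in the original article; the WGS-84 ellipsoid, the central meridian of 117°E, and the false easting are illustrative assumptions only.

```python
from pyproj import CRS, Transformer

# Assumed setup: WGS-84 ellipsoid and a transverse Mercator (Gauss) projection
# with an illustrative central meridian of 117 degrees east.
geodetic = CRS.from_epsg(4326)
gauss = CRS.from_proj4("+proj=tmerc +lon_0=117 +k=1 +x_0=500000 +ellps=WGS84")

forward = Transformer.from_crs(geodetic, gauss, always_xy=True)   # (L, B) -> (x, y)
inverse = Transformer.from_crs(gauss, geodetic, always_xy=True)   # (x, y) -> (L, B)

x, y = forward.transform(116.4, 39.9)      # longitude, latitude in degrees
lon, lat = inverse.transform(x, y)          # recovers the input up to rounding
print(x, y, lon, lat)
```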
The transformation of the positive solution of the Mercator projection is as follows: given the digitized coordinates (B,L), the plane rectangular coordinates (x,y) under the Mercator projection are calculated, and the formula is as shown in [21]
(8)
$$x = r_{0}\,(L - L_{0}), \qquad y = r_{0}\ln\!\left[\tan\!\left(\frac{\pi}{4} + \frac{B}{2}\right)\left(\frac{1 - e\sin B}{1 + e\sin B}\right)^{e/2}\right], \qquad r_{0} = \frac{a\cos B_{0}}{\sqrt{1 - e^{2}\sin^{2}B_{0}}}$$
In the formula, $L_{0}$ is the longitude of the origin, and $B_{0}$ is called the reference latitude. When $B_{0} = 0$, the cylinder is tangent to the earth ellipsoid, and the radius of the tangent cylinder is a. The inverse solution transformation of the Mercator projection is as follows: given the plane rectangular coordinates (x, y) under the Mercator projection, the digitized coordinates (B, L) are calculated, and the formula is shown in
(9)
In the formula, exp denotes the exponential function with base e, and the latitude B converges quickly under iterative calculation. For the geometric matching of point entities, the commonly used matching similarity index is the Euclidean distance. The algorithm compares the calculated Euclidean distance between the two points with a threshold, and a pair within the threshold is determined to be, or to possibly be, an entity with the same name. If multiple entities with the same name are obtained by matching, repeated matching can be performed by reducing the threshold or by reverse matching. If no entity with the same name can be matched, the threshold can be increased appropriately until an entity with the same name is matched. The calculation formula of the Euclidean distance is shown in (10):
$$D = \sqrt{(x_{1} - x_{2})^{2} + (y_{1} - y_{2})^{2}}$$
where D is the Euclidean distance between two point entities $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$. For the geometric matching method of line entities, the total length L of the line, the direction θ of the line, the maximum chord Lmax of the line, etc., are usually used as matching similarity indices. Their definitions and calculation formulas are as follows:
The Total Length of the Line
The total length of the line is defined as the sum of the lengths of the sub-line segments that make up the line. We assume that the points that make up the line are $P_{1}(x_{1}, y_{1}), P_{2}(x_{2}, y_{2}), \ldots, P_{n}(x_{n}, y_{n})$, as shown in Figure 2. The total length of the line is calculated as
(11)
$$L = \sum_{i=1}^{n-1} \sqrt{(x_{i+1} - x_{i})^{2} + (y_{i+1} - y_{i})^{2}}$$
Figure 2: The direction of the line.
The Direction of the Line
The direction of the line is defined as the angle between the x-axis and the straight line connecting the first end point and the last end point. It is specified that clockwise is positive and counterclockwise is negative, as shown in Figure 3. We assume that the first end point of the line is $P_{1}(x_{1}, y_{1})$ and the last end point is $P_{n}(x_{n}, y_{n})$; the direction of the line is then calculated as $\theta = \arctan\dfrac{y_{n} - y_{1}}{x_{n} - x_{1}}$.
Figure 3: The direction of the line.
The Maximum Chord of the Line
The maximum chord of a line is defined as the distance between the two furthest points that make up the line, as shown in Figure 4, and the calculation formula is shown in
(12)
$$L_{\max} = \max_{1 \le i < j \le n} \sqrt{(x_{j} - x_{i})^{2} + (y_{j} - y_{i})^{2}}$$
Figure 4: Maximum chord of line.
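To make the similarity indices above concrete, the following Python sketch computes the Euclidean distance of formula (10) for point matching and the total length, direction, and maximum chord used for line matching. The function and variable names are illustrative and not taken from the original system, and the simple threshold-based matcher is only one plausible reading of the matching rule described in the text.

```python
import math
from itertools import combinations

def euclidean_distance(p, q):
    """Formula (10): distance between two point entities p = (x1, y1), q = (x2, y2)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def total_length(points):
    """Formula (11): sum of the lengths of the sub-segments making up the line."""
    return sum(euclidean_distance(points[i], points[i + 1])
               for i in range(len(points) - 1))

def direction(points):
    """Angle between the x-axis and the chord from the first to the last point."""
    (x1, y1), (xn, yn) = points[0], points[-1]
    return math.atan2(yn - y1, xn - x1)

def max_chord(points):
    """Formula (12): largest distance between any two points of the line."""
    return max(euclidean_distance(p, q) for p, q in combinations(points, 2))

def match_point(point, candidates, threshold):
    """Threshold-based matching of a point entity against candidate same-name points."""
    hits = [(euclidean_distance(point, c), c) for c in candidates]
    hits = [h for h in hits if h[0] <= threshold]
    return min(hits)[1] if hits else None   # nearest candidate within the threshold
```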
The main idea of the graph data adjustment and merging algorithm based on the same-name point triangulation is as follows: in the “reference map”
and “adjustment map,” a topologically isomorphic Delaunay triangulation is constructed with the matched points of the same name as the starting point. After the corresponding regions in the two figures are divided into small triangles, the coordinate conversion relationship is established through the three vertices of each small triangle, and the other points falling into the triangle undergo coordinate conversion according to this relationship. Figure 5 is a part of the triangulation network constructed according to the above method in the two vector diagrams, where ΔABC and ΔA′B′C′ are, respectively, triangles formed by pairs of points with the same name in the two vector diagrams.
Figure 5: Example of partial division of triangulation.
The linear transformation relationship between △ABC and △A′B′C′ is shown in
(13)
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = F \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad F = \begin{bmatrix} a_{1} & a_{2} & a_{3} \\ a_{4} & a_{5} & a_{6} \end{bmatrix}$$
Among them, $(x, y)$ and $(x', y')$ are the coordinates of corresponding points in △ABC and △A′B′C′, respectively. Substituting the coordinates of the three vertex pairs into formula (13), the coefficients in F can be obtained, and the transformation formula follows. The basic idea of the graph adjustment and merging algorithm based on the principle of adjustment is as follows: the algorithm takes the coordinate adjustment values of the points that constitute the entity as the parameter
to be solved, that is, the correction number of the adjustment value. Various error formulas, such as the displacement formula, shape formula, relative displacement formula, area formula, parallel line formula, line segment length formula, and distance formula of adjacent entities, are established according to actual application needs. Finally, the calculation is carried out according to the principle of least squares indirect adjustment, and the calculation formula is shown in
(14)
(15) In the formula, $constraint_{k}$ is the limit value of the k-th factor, $\Delta_{i}$ is the adjustment of the i-th entity coordinate point, and n is the total number of entity coordinate points. A is the coefficient matrix of the adjustment model, and v, x, and l are the corresponding residual vector, parameter vector, and constant vector, respectively. The adjustment and merging algorithm based on topological relations is mainly used to adjust the geometric positions of unmatched points in entities with the same name. The basic idea is as follows: first, the algorithm determines which matched points with the same name affect the unmatched points that need to be adjusted. Secondly, the algorithm analyzes and calculates the geometric position adjustment of each matched point with the same name. Finally, the algorithm uses the weighted average method to calculate the total geometric position adjustment of the unmatched points. We assume that the position adjustment of an unmatched point P is affected by N matched points with the same name $Q_{1}, Q_{2}, \ldots, Q_{N}$, that the distance from P to each matched point $Q_{i}$ with the same name is $d_{i}$, and that the coordinate adjustment amount contributed by the matched point $Q_{i}$ to the point P is $\Delta_{i}$; then the total adjustment amount of the coordinates of point P is calculated as
(16)
$$\Delta P = \sum_{i=1}^{N} w_{i}\, \Delta_{i}$$
Among them, the weight $w_{i}$ is determined by the distance $d_{i}$ from P to the matched point $Q_{i}$, with closer matched points receiving larger weights and $\sum_{i=1}^{N} w_{i} = 1$.
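A minimal Python sketch of the topology-based adjustment of formula (16) follows. The inverse-distance weighting used here is an assumption made for illustration (the article only states that the weights depend on the distances d_i), and all names are hypothetical.

```python
import math

def adjust_unmatched_point(point, matched_points, adjustments):
    """Weighted-average adjustment of an unmatched point (cf. formula (16)).

    point          : (x, y) of the unmatched point P
    matched_points : list of matched same-name points Q_i as (x, y)
    adjustments    : list of coordinate adjustments (dx_i, dy_i) of those points
    """
    eps = 1e-9
    dists = [math.hypot(point[0] - q[0], point[1] - q[1]) for q in matched_points]
    # Assumed weighting scheme: closer matched points influence P more strongly.
    raw = [1.0 / (d + eps) for d in dists]
    total = sum(raw)
    weights = [w / total for w in raw]
    dx = sum(w * a[0] for w, a in zip(weights, adjustments))
    dy = sum(w * a[1] for w, a in zip(weights, adjustments))
    return point[0] + dx, point[1] + dy
```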
Adjust and Merge Algorithm Based on Multiple Evaluation Factors
This article divides the points that constitute entities with the same name into two categories: points with the same name that are successfully matched and points that are not successfully matched. A point with the same name refers to the description of the same point of the entity with the same name in different vector images. A successfully matched point with the same name means that a point constituting the entity with the same name on one of the vector images can find its matching point with the same name on the corresponding entity with the same name in the other vector image. An unmatched point means that a point constituting the entity with the same name on one of the vector images cannot find a matching point on the corresponding entity with the same name in the other vector image. Because the points that constitute the entities with the same name inevitably have positioning errors, missed matches are inevitable during the matching of points with the same name. Therefore, there are two situations for the unmatched points: one is a point that does have a counterpart with the same name but whose match was missed; the other is a point that genuinely has no counterpart with the same name. The classification of the points constituting the entity with the same name is shown in Figure 6.
Figure 6: Classification of points constituting entities with the same name.
The average angle difference refers to the absolute average value of the change of each turning angle of the entity before and after the entity is adjusted, and its calculation formula is shown in formula (17). The average angle difference can quantitatively describe the degree of change in the shape of the entity before and after adjustment. The larger the average angle difference, the greater the change in the shape of the graph before and after the adjustment, and vice versa, and the smaller the change in the shape of the graph before and after the adjustment.
(17)
$$\Delta\bar{\theta} = \frac{1}{r} \sum_{i=1}^{r} \left| \theta_{i}^{\mathrm{after}} - \theta_{i}^{\mathrm{front}} \right|$$
In the formula, $\theta_{i}^{\mathrm{front}}$ and $\theta_{i}^{\mathrm{after}}$, respectively, represent the value of the i-th turning angle of the entity before and after the adjustment, and r represents the total number of turning angles that constitute the entity. In order to enable the entity adjustment and merging algorithm based on topological relations to maintain the consistency of the shape of irregular entities before and after adjustment and merging, this paper takes the average angle difference as the indicator of the degree of shape change; that is, starting from the average angle difference, an adjustment and merging algorithm based on topological relations and shape correction is proposed for the points that are not successfully matched on the entities with the same name. The detailed steps of the algorithm are as follows:
(1) The algorithm first processes the point that is not matched successfully according to the adjustment and merging algorithm based on the topological relationship; that is, the adjusted position coordinates $(x_{1}', y_{1}')$ are calculated by formula (16).
(2) On the basis of the adjustment and merging algorithm based on the topological relationship, shape correction is performed. According to the principle that an unmatched point on the entity with the same name should maintain, before and after the adjustment, the same included angle with its two nearest matched points with the same name, the adjusted position coordinates $(x_{2}', y_{2}')$ are calculated. As in Figure 7, we assume that A1, B1 and A2, B2 are the successfully matched point pairs with the same name on the entities with the same name in vector image 1 and vector image 2, where A1 matches A2 and B1 matches B2, and that they are adjusted to A′ and B′ after being processed by the entity adjustment and merging algorithm. In the figure, X is an unmatched point in vector image 1, and the two matched points closest to X are A1 and B1. The algorithm now adjusts and merges the unmatched point X and finds its adjusted position X′. Before the adjustment and merging, the included angle formed by X with A1 and B1 is ∠A1XB1. In order to ensure that this included angle remains unchanged before and after the adjustment and merging, the adjusted included angle ∠A′X′B′ should be equal to ∠A1XB1. Therefore, lines parallel to l1 and l2 are drawn through A′ and B′, respectively, and the intersection point of the two straight lines obtained is the desired X′.
Figure 7: Schematic diagram of shape correction.
It should be noted that, before the entity is adjusted and merged, if A1, B1, and X are on the same straight line, the shape correction in this step can be omitted, and the result of step (1) can be taken directly as the adjusted position.
(3) Using the weighted average method, the algorithm calculates the final adjusted and merged position coordinates, as shown in formula (18). In this way, the adjustment and merging algorithm based on the topological relationship realizes the correction of the entity shape.
(18)
$$x' = a_{1}x_{1}' + a_{2}x_{2}', \qquad y' = a_{1}y_{1}' + a_{2}y_{2}'$$
In the formula, $(x_{1}', y_{1}')$ and $(x_{2}', y_{2}')$ are the results of steps (1) and (2), and $a_{1}$ and $a_{2}$ are the corresponding weights, whose values are determined according to the specific data, application, and experience.
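The sketch below illustrates the final step: blending the two candidate positions with the weights a1 and a2 of formula (18), and evaluating the average angle difference of formula (17) as the shape-change indicator. The equal weights and the helper names are illustrative assumptions only; the article leaves the weights to the data, application, and experience.

```python
import math

def blend_positions(pos_topology, pos_shape, a1=0.5, a2=0.5):
    """Formula (18): weighted average of the topology-based and shape-corrected results."""
    x = a1 * pos_topology[0] + a2 * pos_shape[0]
    y = a1 * pos_topology[1] + a2 * pos_shape[1]
    return x, y

def turning_angles(points):
    """Turning angle at every intermediate vertex of a polyline."""
    angles = []
    for i in range(1, len(points) - 1):
        ax, ay = points[i - 1][0] - points[i][0], points[i - 1][1] - points[i][1]
        bx, by = points[i + 1][0] - points[i][0], points[i + 1][1] - points[i][1]
        angles.append(math.atan2(by, bx) - math.atan2(ay, ax))
    return angles

def average_angle_difference(before, after):
    """Formula (17): mean absolute change of the turning angles before/after adjustment."""
    r = max(len(before) - 2, 1)
    return sum(abs(a - b) for a, b in zip(turning_angles(after), turning_angles(before))) / r
```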
INFORMATION PROCESSING OF NATIONAL TRADITIONAL SPORTS BASED ON SPATIAL DIGITAL INFORMATION FUSION The software function of the national sports digital training system is to collect the athlete’s action video, upload it to the computer, and process the video through the software, as shown in Figure 8.
Figure 8: National traditional sports training system based on spatial digital information fusion. (The figure shows CCD cameras 1 to n feeding a main processor equipped with graphic image analysis software, a digital video repository, and a touch display.)
The hardware part of the national sports digital information system includes video capture cards, industrial computers, industrial cameras, touch screens, and racks, as shown in Figure 9.
Figure 9: System hardware structure diagram. (The figure shows industrial cameras 1 and 2 with objective lenses connected through a video collection card to an industrial computer with a touchable display.)
The main development method of the system in this paper is shown in Figure 10.
Figure 10: System development method. (The figure covers computer development methods — the structured SDLC, prototyping, the process-oriented/structured method, the data-oriented/information engineering method, and the object-oriented method — together with development environments and tools such as visualization technology, computer-aided software engineering, software development environments, software reuse technology, integrated project/program support environments, and a central resource database.)
After constructing the system described in this paper, the model is tested and verified through simulation. Through simulation research, the national traditional sports information processing system based on digital information fusion is evaluated, the effectiveness of the proposed method is compared with that of the traditional method, and the results are shown in Figure 11.
[Figure 11 plots the information processing effect (approximately 82–96) against the sample number (1–65).]
Figure 11: Verification of the effectiveness of the information processing of traditional national sports based on digital information fusion.
It can be seen from the above that the effect of the traditional national sports information processing method based on digital information fusion proposed in this article is relatively significant. On this basis, the spatial digital processing of this method is evaluated, and the results shown in Table 1 and Figure 12 are obtained.
Table 1: Evaluation of the spatial digital processing effect of the national traditional sports information processing method based on digital information fusion

| Number | Digital effect | Number | Digital effect | Number | Digital effect |
|--------|----------------|--------|----------------|--------|----------------|
| 1  | 87.54 | 23 | 86.55 | 45 | 86.56 |
| 2  | 88.11 | 24 | 88.81 | 46 | 91.47 |
| 3  | 93.22 | 25 | 89.53 | 47 | 88.05 |
| 4  | 92.99 | 26 | 88.00 | 48 | 87.30 |
| 5  | 89.74 | 27 | 86.75 | 49 | 93.32 |
| 6  | 87.11 | 28 | 88.87 | 50 | 88.40 |
| 7  | 86.24 | 29 | 88.66 | 51 | 92.70 |
| 8  | 91.30 | 30 | 86.26 | 52 | 91.14 |
| 9  | 88.40 | 31 | 86.80 | 53 | 89.77 |
| 10 | 86.92 | 32 | 86.03 | 54 | 89.12 |
| 11 | 92.18 | 33 | 89.42 | 55 | 88.36 |
| 12 | 88.95 | 34 | 87.65 | 56 | 93.39 |
| 13 | 92.52 | 35 | 88.80 | 57 | 91.97 |
| 14 | 90.00 | 36 | 90.60 | 58 | 87.88 |
| 15 | 90.23 | 37 | 88.12 | 59 | 86.76 |
| 16 | 86.97 | 38 | 89.69 | 60 | 91.24 |
| 17 | 88.55 | 39 | 91.62 | 61 | 93.62 |
| 18 | 90.56 | 40 | 89.35 | 62 | 90.12 |
| 19 | 87.73 | 41 | 88.46 | 63 | 93.59 |
| 20 | 86.26 | 42 | 91.01 | 64 | 92.30 |
| 21 | 89.78 | 43 | 87.52 | 65 | 92.50 |
| 22 | 93.69 | 44 | 89.18 | 66 | 88.78 |
Figure 12: Statistical diagram of the spatial digital processing effect of the national traditional sports information processing method based on digital information fusion.
From the above research, it can be seen that the national traditional sports information processing method based on digital information fusion proposed in this article also has a good effect in the digital processing of the national traditional sports space.
CONCLUSION Information technology has entered the field of sports, and brand-new activities such as sports digitalization and sports resource informationization have emerged. Unlike traditional online games and e-sports, which involve finger and eye movements in relatively static activities, digital sports puts more emphasis on “sweating” body movements. Moreover, it uses digital technologies such as motion capture devices and motion sensors to transform and upgrade traditional sports so as to achieve interaction and entertainment among humans, machines, and the Internet. Digital sports will also play a particularly important role in social criticism and cultural value orientation. This article combines digital information fusion technology to construct a national traditional sports information processing system and improve the development of national traditional sports in the information age. The research results show that the national traditional sports information processing method based on digital information fusion proposed in this paper has a good effect in the digital processing of the national traditional sports space.
REFERENCES 1.
K. Aso, D. H. Hwang, and H. Koike, “Portable 3D human pose estimation for human-human interaction using a chest-mounted fisheye camera,” in Augmented Humans Conference 2021, pp. 116– 120, Finland, February 2021. 2. A. Bakshi, D. Sheikh, Y. Ansari, C. Sharma, and H. Naik, “Pose estimate based yoga instructor,” International Journal of Recent Advances in Multidisciplinary Topics, vol. 2, no. 2, pp. 70–73, 2021. 3. S. L. Colyer, M. Evans, D. P. Cosker, and A. I. Salo, “A review of the evolution of vision-based motion analysis and the integration of advanced computer vision methods towards developing a markerless system,” Sports Medicine-Open, vol. 4, no. 1, pp. 1–15, 2018. 4. Q. Dang, J. Yin, B. Wang, and W. Zheng, “Deep learning based 2d human pose estimation: a survey,” Tsinghua Science and Technology, vol. 24, no. 6, pp. 663–676, 2019. 5. R. G. Díaz, F. Laamarti, and A. El Saddik, “DTCoach: your digital twin coach on the edge during COVID-19 and beyond,” IEEE Instrumentation & Measurement Magazine, vol. 24, no. 6, pp. 22–28, 2021. 6. S. Ershadi-Nasab, E. Noury, S. Kasaei, and E. Sanaei, “Multiple human 3d pose estimation from multiview images,” Multimedia Tools and Applications, vol. 77, no. 12, pp. 15573–15601, 2018. 7. R. Gu, G. Wang, Z. Jiang, and J. N. Hwang, “Multi-person hierarchical 3d pose estimation in natural videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 4245–4257, 2019. 8. G. Hua, L. Li, and S. Liu, “Multipath affinage stacked—hourglass networks for human pose estimation,” Frontiers of Computer Science, vol. 14, no. 4, pp. 1–12, 2020. 9. M. Li, Z. Zhou, and X. Liu, “Multi-person pose estimation using bounding box constraint and LSTM,” IEEE Transactions on Multimedia, vol. 21, no. 10, pp. 2653–2663, 2019. 10. S. Liu, Y. Li, and G. Hua, “Human pose estimation in video via structured space learning and halfway temporal evaluation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 2029–2038, 2019.
11. A. Martínez-González, M. Villamizar, O. Canévet, and J. M. Odobez, “Efficient convolutional neural networks for depth-based multi-person pose estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 4207–4221, 2019. 12. W. McNally, A. Wong, and J. McPhee, “Action recognition using deep convolutional neural networks and compressed spatio-temporal pose encodings,” Journal of Computational Vision and Imaging Systems, vol. 4, no. 1, pp. 3–3, 2018. 13. D. Mehta, S. Sridhar, O. Sotnychenko et al., “VNect,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1–14, 2017. 14. M. Nasr, H. Ayman, N. Ebrahim, R. Osama, N. Mosaad, and A. Mounir, “Realtime multi-person 2D pose estimation,” International Journal of Advanced Networking and Applications, vol. 11, no. 6, pp. 4501–4508, 2020. 15. X. Nie, J. Feng, J. Xing, S. Xiao, and S. Yan, “Hierarchical contextual refinement networks for human pose estimation,” IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 924–936, 2019. 16. Y. Nie, J. Lee, S. Yoon, and D. S. Park, “A multi-stage convolution machine with scaling and dilation for human pose estimation,” KSII Transactions on Internet and Information Systems (TIIS), vol. 13, no. 6, pp. 3182–3198, 2019. 17. I. Petrov, V. Shakhuro, and A. Konushin, “Deep probabilistic human pose estimation,” IET Computer Vision, vol. 12, no. 5, pp. 578–585, 2018. 18. G. Szűcs and B. Tamás, “Body part extraction and pose estimation method in rowing videos,” Journal of Computing and Information Technology, vol. 26, no. 1, pp. 29–43, 2018. 19. N. T. Thành and P. T. Công, “An evaluation of pose estimation in video of traditional martial arts presentation,” Journal of Research and Development on Information and Communication Technology, vol. 2019, no. 2, pp. 114–126, 2019. 20. J. Xu, K. Tasaka, and M. Yamaguchi, “Fast and accurate whole-body pose estimation in the wild and its applications,” ITE Transactions on Media Technology and Applications, vol. 9, no. 1, pp. 63–70, 2021. 21. A. Zarkeshev and C. Csiszár, “Rescue method based on V2X communication and human pose estimation,” Periodica Polytechnica Civil Engineering, vol. 63, no. 4, pp. 1139–1146, 2015.
Chapter 16
Effects of Quality and Quantity of Information Processing on Design Coordination Performance
R. Zhang1, A. M. M. Liu2, I. Y. S. Chan2
1 Department of Quantity Survey, School of Construction Management and Real Estate, Chongqing University, Chongqing, China
2 Department of Real Estate and Construction, Faculty of Architecture, The University of Hong Kong, Hong Kong, China
ABSTRACT It is acknowledged that a lack of interdisciplinary communication amongst designers can result in poor coordination performance in building design. Viewing communication as an information processing activity, this paper aims to explore the relationship between interdisciplinary information processing (IP) and design coordination performance. Both the amount and the quality of information processing are considered. 698 project-based samples are collected by questionnaire survey from design institutes in mainland China.
Citation: Zhang, R., M. M. Liu, A. and Y. S. Chan, I. (2018), “Effects of Quality and Quantity of Information Processing on Design Coordination Performance”. World Journal of Engineering and Technology, 6, 41-49. doi: 10.4236/wjet.2018.62B005.
Copyright: © 2018 by authors and Scientific Research Publishing Inc. This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0
Statistical data analysis shows that the relationship between information processing amount and design coordination performance follows a nonlinear exponential expression, performance = 3.691(1 − 0.235^(IP amount)), rather than an inverted U curve. This implies that the design period is too short to allow information overload and indicates that the main problem in interdisciplinary communication in design institutes in China is insufficient information. In addition, the correlation between IP quality and coordination process performance is found to be much stronger than that between IP amount and coordination process performance. For practitioners, this reminds design managers to pay more attention to the quality of information processing rather than its amount. Keywords: Inter-Disciplinary, Information Processing, Design Coordination Performance
INTRODUCTION Changes in construction projects are very common and could lead to project delays and cost overruns. Lu and Issa believe that the most frequent and most costly changes are often related to design, such as design changes and design errors [1]. Hence, design stage is of primary importance in construction project life-cycle [2]. Common types of design deficiencies include design information inconsistency (e.g. location of a specific wall differing on the architectural and structural drawings), mismatches/physical interference between connected components (e.g. duct dimensions in building service drawings not matching related pass/hole dimensions in structural drawings), and component malfunctions (e.g. designing a room’s electrical supply to suit classroom activities, while architectural drawings designate the room as a computer lab) [3] [4]. Based on a questionnaire survey of 12 leading Canadian design firms, Hegazy, Khalifa and Zaneldin report eight common problems―all of which are due to insufficient and inadequate communication and information exchange (e.g., delay in obtaining information, not everyone on the team getting design change information) [5]. Communication and information exchange is termed information processing in this paper. Information processing includes the collection, processing and distribution of information [6], and can be either personal or impersonal (e.g. accomplished using a program) [7] [8]. Building design is a multi-disciplinary task. The process of designing a building is the process of integrating information from multiple disciplinary
professionals (e.g. architects, structural engineers, building service engineers, surveyors). Compared to intra-disciplinary coordination, inter-disciplinary coordination is much more challenging. Information processing in the latter situation may run into knowledge boundaries, geographical remoteness, goal heterogeneity, and organizational boundaries (as in most western project design teams). In Mainland China, most of the time, all disciplinary teams are employed in the same design institute. Hence, this paper does not consider the effect of organizational boundaries in the context of Mainland China. It is acknowledged that a lack of interdisciplinary information processing amongst designers can result in poor design coordination performance (e.g. suboptimal solutions, design changes, construction delays).
Information Processing Amount and Design Coordination Performance
Although a number of studies show that an increase in communication, or a shift in the nature of information communicated, is related to good performance in high-workload situations [9], it is incorrect to posit a linear positive relationship between information processing amount and performance, as too much information processing leads to information overload. Unrestricted communication can also detract from project efficiency and effectiveness [10]. It is well acknowledged that too little information processing will result in poor performance (e.g. problems in new project development, project failures), as it cannot supply the necessary information [11]. However, too much information exchange may still allow good performance but with low effectiveness and, even worse, may tax performance due to information overload [12] [13]. Redundant information processing overloads people’s cognitive capacity, which impedes the normal processing of necessary information. Processing more information than necessary may help to ensure good quality, but it does so at the cost of reduced effectiveness. Coordination and information processing impose additional task loads on project team actors and should be kept to the minimum necessary to achieve integration. In theory, the relationship between information processing amount and design coordination performance follows an inverted U curve. Due to tight design schedules, however, most design institutes lack interdisciplinary communication, and overloaded communication is quite rare.
Hence, it is hypothesized that: the relationship between information processing amount and design coordination performance follows a nonlinear exponential expression of:
(1) performance = b1(1 − b2^x), where x denotes the information processing amount.
What is the relationship between information processing amount and information processing quality? According to Chinese philosophy, the accumulation of quantitative increase brings about an improvement in quality. In the context of interdisciplinary information processing in Chinese design institutes, it is hypothesized that: the relationship between information processing amount and perceived information quality follows a nonlinear exponential expression of the same form, quality = b1(1 − b2^x). Literature in the field of communication studies is reviewed here to investigate the concept of information processing quality, for two reasons. The first is that information processing quality should be constructed as a multidimensional construct to properly investigate its rich content; however, little research within the information processing theory literature discusses the multiple dimensions of information processing quality, perhaps due to the short history of information processing theory. Fortunately, in the communication studies community, researchers have discussed the content of information quality in communication in depth [14] [15]. Usually, communication refers to communication between people; here, communication is not limited to people talking to other people directly, but also includes people getting information from media sources, such as online management systems on which other people have posted information. Information processing in design coordination includes both personal communication and communication through programs; in this sense, communication and information processing are the same issue, which is the second reason why research findings from the communication studies field can be used. Perceived information quality (PIQ) is a concept applied in the communication literature to measure information processing quality, and refers to the extent to which an individual perceives information received from a sender as being valuable. At the cognitive level, people choose sources that are perceived to have a greater probability of providing information that will be relevant, reliable, and helpful to the problem at hand―attributes that may be summarized under the label of perceived source quality [16].
A substantial body of literature suggests that a receiver’s perceptions of information quality influence the degree to which he or she is willing to act on it. Six critical communication variables are identified by Thomas et al. [15], four of which are highly related to information processing quality: clarity (how clear the information received is, as indicated by the frequency of conflicting instructions, poor communications, and lack of coordination); understanding (shared with supervisors and other groups) of information expectations; timeliness (of the information received, including design and schedule changes); and completeness (the amount of relevant information received). Four similar variables are used by Maltz [14] in discussing perceived information quality: credibility (the degree to which information is perceived by the receiver to be a reliable reflection of the truth); comprehensibility (perceived clarity of the information received); relevance (the degree to which information is deemed appropriate for the user’s task or application); and timeliness (whether information is transmitted quickly enough to be utilized). In a study of coordination quality in construction management, Chang and Shen [17] used two dimensions: perceived utility (i.e., the relevance, credibility and completeness of information) and clarity of communication (i.e., comprehensibility, conciseness, and consistency of representation). The two sets of concepts seem to have some overlap. In this study, accuracy, relevance, understanding and timeliness have been selected to represent the multi-dimensional construct PIQ, as shown in Table 1. Accuracy herein refers to the degree to which information is perceived by the receiver to be a reliable reflection of the real situation. Relevance denotes the degree to which the information is appropriate for the user’s task. Understanding refers to the perceived clarity of the information received. Timeliness represents whether the information is transmitted quickly enough to allow the receiver to complete the task on time.
METHODS Data Collection Method Web-based questionnaire survey is applied to collect data for this investigation.
Table 1: Measurement scale of perceived information quality

| Dimension | Items |
| accuracy | The information sent by them is accurate. / They sent me conflicting information. (R) |
| relevance | They communicated important details of design information. / They provided information necessary in design decision making. |
| understanding | It is easy to follow their logic. / Their terminology and concepts are easy to understand. / They presented their ideas clearly. |
| timeliness | They provided information in a timely manner. / Their information on design change is too late. / They gave me information that are “old hat”. |
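As an illustration of how the multi-item scale in Table 1 could be turned into one PIQ score per dimension, the sketch below averages item ratings and reverse-codes the item marked (R). The 5-point scale, the item keys, and the dimension/item grouping are assumptions inferred from the table, not the study's actual coding scheme.

```python
SCALE_MAX = 5  # assumed 5-point Likert scale

# Hypothetical item keys; only the "conflicting" item is marked (R) in Table 1.
# The negatively worded timeliness items are not marked (R) there, so they are
# left un-reversed here; the original instrument may treat them differently.
PIQ_ITEMS = {
    "accuracy":      [("accurate", False), ("conflicting", True)],
    "relevance":     [("important_details", False), ("decision_info", False)],
    "understanding": [("easy_logic", False), ("clear_terms", False), ("clear_ideas", False)],
    "timeliness":    [("timely", False), ("late_changes", False), ("old_hat", False)],
}

def piq_scores(responses):
    """responses: dict item_key -> rating (1..SCALE_MAX). Returns one score per dimension."""
    scores = {}
    for dim, items in PIQ_ITEMS.items():
        vals = []
        for key, reverse in items:
            rating = responses[key]
            vals.append(SCALE_MAX + 1 - rating if reverse else rating)
        scores[dim] = sum(vals) / len(vals)
    return scores
```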
The target respondents in this survey are participants in building design project teams from design institutes in Mainland China. Respondents are chosen based on three criteria; specifically, respondents should: 1) have participated in a project that was completed within the past year (as they would be asked to recall design coordination activity); 2) have been either a project design manager, discipline leader, or designer/engineer (top managers are excluded); and 3) have been in one of the following disciplines during the project: project management, architecture, structural engineering, mechanical engineering, electrical engineering, plumbing engineering, or BIM engineering. 1174 questionnaire responses are received, of which 219 are completely answered, yielding a completion rate of 18.7%. Ten of the completed questionnaires are dropped from the data analysis as obvious outliers. Each respondent reported data on his/her dyadic interdisciplinary design coordination with from two to seven disciplines (see Table 2). As the level of analysis is dyadic interdisciplinary design coordination, each questionnaire is split into two to seven samples. Data on both intra- and inter-discipline coordination are collected, although the study’s focus is on inter-discipline coordination. The total sample size in the inter-disciplinary coordination data set is 698 (the sum of the figures shown on a grey background in the original table).
Measurement Interdisciplinary Communication Amount Interdisciplinary communication frequency is applied to measure Interdisciplinary communication amount, using a five-point scale. Respondents are asked to indicate the frequency with which they
communicated with designers from other discipline teams (1 = zero, 2 = less than once monthly, 3 = several times monthly, 4 = several times weekly, 5 = several times daily). Generally, building design has three stages: conceptual design, preliminary design, and detailed design. In each of the different stages, information exchange frequency differs. The conceptual design stage is dominated by architects, with the exception of limited advice-seeking from other disciplines. The most frequent information exchange happens in the detailed design stage, where all disciplines are heavily involved, with each producing detailed designs to ensure the final product can function well. Hence, frequency of interdisciplinary communication in the detailed design stage is used to test hypotheses in this study.
Table 2: Matrix of dyadic coordination samples

|        | GD | Archi. | SE | ME | EE | PE | BIM |
| GD     | 2  | 5  | 5  | 2  | 2  | 2  | 2  |
| Archi. | 15 | 65 | 65 | 44 | 44 | 44 | 44 |
| SE     | 12 | 66 | 66 | 38 | 38 | 38 | 38 |
| ME     | 2  | 15 | 15 | 6  | 6  | 6  | 6  |
| EE     | 1  | 12 | 12 | 2  | 2  | 2  | 2  |
| PE     | 2  | 11 | 11 | 7  | 7  | 7  | 7  |
| BIM    | 3  | 19 | 19 | 9  | 9  | 9  | 9  |
Notes: GD: General Drawing; Archi.: Architecture; SE: Structure Engineering; ME: Mechanical Engineering; EE: Electrical Engineering; PE: Plumbing Engineering; BIM: Building Information Modelling.
Coordination Performance Coordination process performance refers to the extent to which the respondent (focal unit a) has effective information processing with another person in the design team (unit j). It is a dyadic concept, and the five-item dyadic coordination performance scale used by Sherman and Keller [18] is applied. The scale includes items examining: 1) the extent to which the focal unit a had an effective working relationship with unit j; 2) the extent to which unit j fulfilled its responsibilities to unit a; 3) the extent to which unit a fulfilled its responsibilities to unit j; 4) the extent to which the coordination
is satisfactory; and, 5) the positive or negative effect on productivity, as a result of the coordination.
DATA ANALYSIS
Information Processing Amount and Design Coordination Performance
For coordination process performance (Table 3), b1 and b2 are quite significant in Model 1. This suggests that the relationship between the frequency of interdisciplinary communication in the detailed design stage and coordination process performance can be expressed as: performance = 3.691(1 − 0.235^x). H1 is thus strongly supported. In Models 2, 3 and 4, b2 is not significant. One possible reason is that many other factors besides coordination process performance influence design project performance.
Table 3: Information processing amount and design coordination performance

|                      | Model 1 | Model 2 | Model 3 | Model 4 |
| Dependent variable   | Coordination process performance | Design quality | Design schedule | Design cost control |
| Independent variable | Frequency of interdisciplinary communication | Frequency of interdisciplinary communication | Frequency of interdisciplinary communication | Frequency of interdisciplinary communication |
| b1                   | 3.691*** (0.0571) | 3.745*** (0.0472) | 3.626*** (0.0468) | 3.539*** (0.0472) |
| b2                   | 0.235*** (0.0448) | 0.0133 (0.0603) | −0.00138 (0.0618) | −0.0509 (0.0630) |
| N                    | 642 | 445 | 445 | 444 |
| Adjusted R-squared   | 0.904 | 0.937 | 0.934 | 0.928 |
Standard errors in parentheses *p < 0.05, **p < 0.01, ***p < 0.001; Difference on sample size due to missing data.
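The nonlinear exponential model reported in Table 3 can be fitted with standard nonlinear least squares. The sketch below uses SciPy's curve_fit on illustrative data (the study's raw survey data are not reproduced here), so the recovered b1 and b2 are not the paper's estimates.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, b1, b2):
    """performance = b1 * (1 - b2 ** x), the form hypothesized in H1."""
    return b1 * (1.0 - np.power(b2, x))

# Illustrative data only: communication frequency (1-5) and a performance rating.
freq = np.array([1, 2, 2, 3, 3, 4, 4, 5, 5, 5], dtype=float)
perf = np.array([2.7, 3.2, 3.1, 3.5, 3.4, 3.6, 3.7, 3.6, 3.7, 3.8])

(b1, b2), _ = curve_fit(model, freq, perf, p0=[3.5, 0.3], bounds=([0, 0], [5, 1]))
print(f"performance = {b1:.3f} * (1 - {b2:.3f}**x)")
```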
Information Processing Amount and Information Processing Quality
As for the relationship between information processing (IP) amount and information processing quality, the nonlinear exponential expression quality = b1(1 − b2^x) is fitted, as shown in Table 4. Both b1 and b2 are significant: IP quality = 3.353(1 − 0.311^(IP amount)). H2 is strongly supported. Under the assumption that both information processing amount and information processing quality are positively related to design coordination process performance, an explorative study is conducted to compare the correlations using regression. The results are shown in Table 5. As it is an explorative study, a P value less than 0.1 is accepted as significant. The results show that 1) both IP amount and IP quality are positively related to coordination process performance, and 2) the correlation between IP quality and coordination process performance is much stronger than that between IP amount and coordination process performance.
DISCUSSION On the one hand, insufficient interdisciplinary communication will lead to coordination failure. On the other hand, too much information processing will lead to information overload as well as coordination cost overruns. The challenge for cross-functional teams is to ensure that the level of information exchange amongst team members allows them to optimize their performance [11]. In this study, it is found that information processing amount is positively related to coordination process performance; specifically, the relationship between the frequency of interdisciplinary communication in the detailed design stage and coordination process performance follows the nonlinear exponential expression performance = 3.691(1 − 0.235^(IP amount)). Whether the finding can be generalized to regions other than Mainland China needs further study.
Table 4: Information processing amount and information processing quality

| Dependent variable   | Perceived information quality |
| Independent variable | Frequency of interdisciplinary communication |
| b1                   | 3.353*** (0.0530) |
| b2                   | 0.311*** (0.0371) |
| N                    | 852 |
| Adjusted R-squared   | 0.892 |

Standard errors in parentheses *p < 0.05, **p < 0.01, ***p < 0.001; Difference on sample size due to missing data.
Table 5: Information processing amount, information processing quality and coordination process performance

| Role | Path | Beta | Std. Err. | z | P>z | 90% Conf. Interval |
| designer | cp <- iefd | 0.136 | 0.071 | 1.930 | 0.054 | −0.002, 0.275 |
| disciplinary leader | cp <- iefd | 0.128 | 0.042 | 3.070 | 0.002 | 0.046, 0.210 |
| designer | cp <- PIQ | 0.614 | 0.046 | 13.23 | 0.000 | 0.523, 0.705 |
| disciplinary leader | cp <- PIQ | 0.748 | 0.027 | 27.89 | 0.000 | 0.695, 0.800 |
Although both IP amount and IP quality are positively related to coordination process performance, the correlation between IP quality and coordination process performance is much stronger than that between IP amount and coordination process performance. This result is consistent with previous research on decision effectiveness, in which the impact of information quality is also stronger [19]. It suggests that more attention should be paid to improving information processing quality. To improve information processing quality, effort can be made to improve information accuracy, relevance, understanding and timeliness. The role of building information modelling in improving interdisciplinary communication could be investigated in the future.
CONCLUSION This paper explores the relationship between interdisciplinary communication and design coordination performance in design institutes in Mainland China.
From an information processing perspective, interdisciplinary communication is viewed as an information processing activity. Both information processing amount and quality are considered. Information processing quality is measured by four dimensions: perceived information accuracy, relevance, understanding and timeliness. Based on 698 samples of quantitative survey data at the project level, it is found that the relationship between information processing amount and design coordination process performance follows a nonlinear exponential expression, performance = 3.691(1 − 0.235^(IP amount)), rather than an inverted U curve. This implies that the design period is too short to allow information overload and indicates that the main problem in interdisciplinary communication in design institutes in China is insufficient information. In addition, the correlation between IP quality and coordination process performance is found to be much stronger than that between IP amount and coordination process performance. For practitioners, this reminds design managers to pay more attention to the quality of information processing rather than its amount.
REFERENCES 1.
Lu, H. and Issa, R.R. (2005) Extended Production Integration for Construction: A Loosely Coupled Project Model for Building Construction. Journal of Computing in Civil Engineering, 19, 58-68. https://doi.org/10.1061/(ASCE)0887-3801(2005)19:1(58) 2. Harpum, P. (Ed.) (2004) Design Management. John Wiley and Sons Ltd., USA. https://doi.org/10.1002/9780470172391.ch18 3. Korman, T., Fischer, M. and Tatum, C. (2003) Knowledge and Reasoning for MEP Coordination. Journal of Construction Engineering and Management, 129, 627-634. https://doi.org/doi:10.1061/(ASCE)07339364(2003)129:6(627) 4. Mokhtar, A.H. (2002) Coordinating and Customizing Design Information through the Internet. Engineering Construction and Architectural Management, 9, 222-231. https://doi.org/10.1108/ eb021217 5. Hegazy, T., Khalifa, J. and Zaneldin, E. (1998) Towards Effective Design Coordination: A Questionnaire Survey 1. Canadian Journal of Civil Engineering, 25, 595-603. https://doi.org/10.1139/l97-115 6. Tushman, M.L. and Nadler, D.A. (1978) Information Processing as an Integrating Concept in Organizational Design. Academy of Management Review, 613-624. 7. Dietrich, P., Kujala, J. and Artto, K. (2013) Inter-Team Coordination Patterns and Outcomes in Multi-Team Projects. Project Management Journal, 44, 6-19. https://doi.org/10.1002/pmj.21377 8. Van de Ven, A.H., Delbecq, A.L. and Koenig Jr., R. (1976) Determinants of Coordination Modes within Organizations. American Sociological Review, 322-338. https://doi.org/10.2307/2094477 9. Mathieu, J.E., Heffner, T.S., Goodwin, G.F., Salas, E. and CannonBowers, J.A. (2000) The Influence of Shared Mental Models on Team Process and Performance. Journal of Applied Psychology, 85, 273. https://doi.org/10.1037/0021-9010.85.2.273 10. Katz, D. and Kahn, R.L. (1978) The Social Psychology of Organizations. 11. Patrashkova, R.R. and McComb, S.A. (2004) Exploring Why More Communication Is Not Better: Insights from a Computational Model of Cross-Functional Teams. Journal of Engineering and Technology Management, 21, 83-114. https://doi.org/10.1016/j. jengtecman.2003.12.005
12. Goodman, P.S. and Leyden, D.P. (1991) Familiarity and Group Productivity. Journal of Applied Psychology, 76, 578. https://doi.org/10.1037/0021-9010.76.4.578
13. Boisot, M.H. (1995) Information Space. Int. Thomson Business Press.
14. Maltz, E. (2000) Is All Communication Created Equal? An Investigation into the Effects of Communication Mode on Perceived Information Quality. Journal of Product Innovation Management, 17, 110-127. https://doi.org/10.1016/S0737-6782(99)00030-2
15. Thomas, S.R., Tucker, R.L. and Kelly, W.R. (1998) Critical Communications Variables. Journal of Construction Engineering and Management, 124, 58-66. https://doi.org/10.1061/(ASCE)0733-9364(1998)124:1(58)
16. Choo, C.W. (2005) The Knowing Organization. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780195176780.001.0001
17. Chang, A.S. and Shen, F.-Y. (2014) Effectiveness of Coordination Methods in Construction Projects. Journal of Management in Engineering. https://doi.org/10.1061/(ASCE)ME.1943-5479.0000222
18. Sherman, J.D. and Keller, R.T. (2011) Suboptimal Assessment of Interunit Task Interdependence: Modes of Integration and Information Processing for Coordination Performance. Organization Science, 22, 245-261. https://doi.org/10.1287/orsc.1090.0506
19. Keller, K.L. and Staelin, R. (1987) Effects of Quality and Quantity of Information on Decision Effectiveness. Journal of Consumer Research, 14, 200-213. https://doi.org/10.1086/209106
Chapter 17
Neural Network Optimization Method and Its Application in Information Processing
Pin Wang1, Peng Wang2, and En Fan3
1 School of Mechanical and Electrical Engineering, Shenzhen Polytechnic, Shenzhen 518055, Guangdong, China
2 Garden Center, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou 510650, Guangdong, China
3 Department of Computer Science and Engineering, Shaoxing University, Shaoxing 312000, Zhejiang, China
Citation: Pin Wang, Peng Wang, En Fan, “Neural Network Optimization Method and Its Application in Information Processing”, Mathematical Problems in Engineering, vol. 2021, Article ID 6665703, 10 pages, 2021. https://doi.org/10.1155/2021/6665703.
Copyright: © 2021 by Authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ABSTRACT
Neural network theory is the basis of massive parallel information processing and large-scale parallel computing. A neural network is not only a highly nonlinear dynamic system but also an adaptive organization system, and it can be used to describe the intelligent behavior of cognition, decision-making, and control. The purpose of this paper is to explore optimization methods for neural networks and their application in information processing.
This paper uses the topology-preserving property of the SOM feature map neural network to estimate the direction of arrival of array signals. For the estimation of the direction of arrival of single-source signals in array signal processing, it establishes uniform linear array and arbitrary array models based on the distance-difference-of-arrival (DDOA) vector to detect the DOA. The relationship between the DDOA vector and the direction-of-arrival angle is regarded as a mapping from the DDOA space to the AOA space. Through derivation and analysis of this mapping, it is found that the two variables of the sampled signal have a similar topological distribution. The network is trained with uniformly distributed simulated source signals, and the trained network is then used to test AOA estimation on simulated noiseless signals, simulated Gaussian-noise signals, and measured signals of sound sources in a lake. The neural network is also compared with the multiple signal classification (MUSIC) algorithm. This paper proposes a DOA estimation method using a two-layer SOM neural network and theoretically verifies the reliability of the method. The experiments show that when the signal-to-noise ratio drops from 20 dB to 1 dB in the Gaussian-noise experiment, the absolute error of the AOA prediction remains small with little fluctuation, indicating that the prediction performance of the SOM network optimization method established in this paper does not degrade as the signal-to-noise ratio decreases and that the method adapts well to noise.
INTRODUCTION
In the information society, the amount of information being generated keeps growing [1]. To make information available in a timely manner so that it can serve the development of the national economy, science and technology, and the defense industry, information data must be collected, processed, transmitted, stored, and used for decision-making, and theoretical innovations must be implemented to meet the needs of social development. Neural networks therefore have extremely extensive research significance and application value in information science fields such as communications, radar, sonar, electronic measuring instruments, biomedical engineering, vibration engineering, seismic prospecting, and image processing. This article focuses on neural network optimization methods and their applications in intelligent information processing.
Many scholars have studied neural network optimization methods and their use in information processing and have achieved good results. For example, Al Mamun MA developed a new image restoration method using neural network technology, which overcomes the shortcomings of traditional methods to a certain extent; neural networks have also been widely used in image edge detection, image segmentation, and image compression [2]. Hamza MF proposed a BP algorithm to train RBF weights; adding a momentum factor to the BP algorithm improves the training coefficient of the network and avoids oscillations, which increases the training rate of the network [3]. Tom B proposed an RBF-PLSR model based on genetic clustering, in which genetic-algorithm clustering determines the number of hidden-layer nodes and the hidden-node centers of the RBF network, and the PLSR method determines the network's connection weights [4, 5]. In China, an adaptive linear element model was proposed; Adaline was also implemented in hardware and successfully applied to cancel echo and noise in communications. Quan proposed the error backpropagation (BP) algorithm, which in principle solves the training problem of multilayer neural networks, gives neural networks strong computing power, and greatly increases the vitality of artificial neural networks [6]. Cheng used mathematical theory to prove the fundamental computational limitations of the single-layer perceptron; for multilayer neural networks with hidden layers, however, an effective learning algorithm had not yet been found [7].
In this paper, the problem of single-source azimuth detection is studied for both uniform linear sensor arrays and arbitrary arrays, and direction-of-arrival detection array models are established for each. For the uniform linear array case, this paper builds a two-layer SOM neural network. We first explain the theoretical basis of this network, namely, the shared topological structure between the input vector and the output result. To this end, we analyze the topological structures of the DDOA vector and the predicted AOA value for a uniform linear array. Derivation and simulation data show that the two indeed have similar topological structures, which motivates the SOM neural network system established here; it can be applied to AOA prediction problems based on DDOA. Finally, simulation experiments and lake-water experiments verify the practical feasibility of this method.
NEURAL NETWORK OPTIMIZATION METHOD AND ITS RESEARCH IN INFORMATION PROCESSING
Array Optimization and Orientation Based on DDOA and SOM Neural Network
Signal and information processing mainly includes three processes: information acquisition, information processing, and information transmission [8, 9]. Array signal processing can be regarded as an important branch of modern signal processing. Its main research object is the signal transmitted in the form of a wave propagating through space: a sensor array with a certain spatial distribution receives the wave signal, and information is extracted from the received signal. This paper mainly studies algorithms by which a sensor array detects the azimuth of a sound wave, that is, its direction of arrival (DOA).
Array Signal Model
Array signal processing is usually based on a strict mathematical model built on a series of assumptions about the observed signal. The objects explored in this article are all two-dimensional spatial signal problems, and these assumptions stem from the abstraction and generalization of the observed signal and noise.
(1) Narrowband signal: when the bandwidth of a spatial source signal is much smaller than its center frequency, we call it a narrowband signal; that is, it satisfies
WB / f0 ≪ 1, (1)
where WB is the signal bandwidth and f0 is the signal center frequency. A single-frequency signal with center frequency f0 can be used to simulate a narrowband signal; the familiar sine signal is a typical narrowband signal, and the analog signals used in this article are all single-frequency sine signals.
(2) Array signal processing model: suppose that there is a sensor array in the plane consisting of M sensor elements with arbitrary directivity, and that K narrowband plane waves propagate in this plane.
The center frequencies of these plane waves are all ω0, their wavelength is λ, and it is assumed that M > K (that is, the number of array elements is greater than the number of incident signals). The signal output received by the k-th element at time t is the sum of the K plane waves; namely,
xk(t) = Σ_{i=1}^{K} ak(θi) si(t − τk(θi)), (2)
where ak(θi) is the sound pressure response coefficient of element k to source i, si(t − τk(θi)) is the signal wavefront of source i, and τk(θi) is the time delay of element k relative to the reference element. According to the narrowband assumption, the time delay affects the wavefront only through a phase change,
si(t − τk(θi)) ≈ si(t) e^{−jω0 τk(θi)}. (3)
Therefore, formula (2) can be rewritten as
xk(t) = Σ_{i=1}^{K} ak(θi) si(t) e^{−jω0 τk(θi)}. (4)
Writing the output of the M sensors in vector form, the model becomes
X(t) = Σ_{i=1}^{K} a(θi) si(t), (5)
where
a(θi) = [a1(θi) e^{−jω0 τ1(θi)}, a2(θi) e^{−jω0 τ2(θi)}, . . . , aM(θi) e^{−jω0 τM(θi)}]^T (6)
is called the direction vector of the incoming wave direction θi. Let S(t) = [s1(t), s2(t), . . . , sK(t)]^T and denote the measurement noise by N(t); then the above array model can be expressed as
X(t) = A(θ) S(t) + N(t), (7)
where A(θ) = [a(θ1), a(θ2), . . . , a(θK)] is the direction matrix of the array model.
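As a concrete illustration of the array model in equations (2)–(7), the following minimal Python sketch builds a steering matrix for a uniform linear array and generates noisy snapshots X(t) = A(θ)S(t) + N(t). The element count, spacing, source angles, and SNR are illustrative values chosen here, not parameters taken from the chapter.

```python
import numpy as np

def ula_steering(thetas_deg, M, d, wavelength):
    """Steering matrix A for a uniform linear array:
    one column exp(-j*2*pi*d/lambda * m * sin(theta)) per source."""
    thetas = np.deg2rad(np.atleast_1d(thetas_deg))
    m = np.arange(M)[:, None]                      # element index 0..M-1
    return np.exp(-2j * np.pi * d / wavelength * m * np.sin(thetas)[None, :])

def simulate_snapshots(thetas_deg, M=8, d=0.5, wavelength=1.0,
                       n_snap=200, snr_db=10.0, rng=None):
    """Generate X = A S + N with unit-power complex signals and white noise."""
    rng = np.random.default_rng(rng)
    A = ula_steering(thetas_deg, M, d, wavelength)            # M x K
    K = A.shape[1]
    S = (rng.standard_normal((K, n_snap)) +
         1j * rng.standard_normal((K, n_snap))) / np.sqrt(2)  # unit-power sources
    noise_power = 10 ** (-snr_db / 10)
    N = np.sqrt(noise_power / 2) * (rng.standard_normal((M, n_snap)) +
                                    1j * rng.standard_normal((M, n_snap)))
    return A @ S + N

X = simulate_snapshots([-20.0, 35.0])   # two narrowband sources
print(X.shape)                          # (8, 200)
```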
Subspace Decomposition Based on Eigendecomposition of the Array Covariance Matrix
The DOA estimation problem is the estimation of the direction-of-arrival angles, the parameters θi (i = 1, 2, . . . , K), in natural space, which requires the covariance information between the different elements of the array. For this, first calculate the spatial covariance matrix of the array output:
R = E{X(t) X^H(t)}, (8)
where E{·} represents statistical expectation. Let
RS = E{S(t) S^H(t)} (9)
be the covariance matrix of the signal and
RN = E{N(t) N^H(t)} = σ² I (10)
be the covariance matrix of the noise, where it is assumed that the noise received by all elements has a common variance σ², which is also the noise power. From equations (9) and (10), we can get
R = A(θ) RS A^H(θ) + σ² I. (11)
It can be proved that R is a nonsingular, positive definite Hermitian matrix, that is, R^H = R. Therefore, the singular value decomposition of R can be performed to achieve diagonalization, and the eigendecomposition can be written as
R = U Λ U^H, (12)
where U is the unitary transformation matrix that diagonalizes R into the real-valued matrix Λ = diag(λ1, λ2, . . . , λM), with the eigenvalues ordered as
λ1 ≥ λ2 ≥ . . . ≥ λK > λK+1 = . . . = λM = σ². (13)
From equation (13), it can be seen that any vector orthogonal to A(θ) is an eigenvector of the matrix R belonging to the eigenvalue σ².
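The subspace decomposition described above can be sketched in a few lines of Python: the sample covariance is formed from snapshots, its eigendecomposition is computed, and the eigenvectors are split into signal and noise subspaces under the assumption that the number of sources K is known. The function and variable names are illustrative, not from the paper.

```python
import numpy as np

def subspace_split(X, K):
    """Sample covariance R = X X^H / N, eigendecomposition R = U diag(lam) U^H,
    then split the eigenvectors into signal (K largest) and noise subspaces."""
    M, n_snap = X.shape
    R = X @ X.conj().T / n_snap                 # Hermitian, positive semidefinite
    lam, U = np.linalg.eigh(R)                  # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]               # reorder: lam_1 >= ... >= lam_M
    lam, U = lam[order], U[:, order]
    Us, Un = U[:, :K], U[:, K:]                 # signal / noise subspaces
    noise_power_est = lam[K:].mean()            # average of the M-K smallest eigenvalues
    return R, lam, Us, Un, noise_power_est
```

The average of the M − K smallest eigenvalues then serves as an estimate of the noise power σ², in line with equation (13).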
RBF Neural Network Estimates the Direction of Arrival
An RBF neural network is a method that can perform curve fitting or interpolation in a high-dimensional space. If the relationship between the input space and the output space is regarded as a mapping, this mapping can be regarded as a hypersurface defined over the input data in the high-dimensional space.
A designed RBF neural network is then equivalent to fitting the height of this hypersurface; it establishes an approximate hypersurface by interpolating the input data points [10, 11]. The sensor array itself is equivalent to a mapping from the DOA space to the sensor array output space:
(14)
where K is the number of source signals, M is the number of elements of the uniform linear array, ak is the complex amplitude of the k-th signal, α is the initial phase, ω0 is the signal center frequency, d is the element spacing, and c is the propagation speed of the source signal [12, 13]. When the number of sources has been estimated as K, the function of the neural network on this problem is equivalent to the inverse problem of the above mapping, that is, the inverse mapping from the array output space back to the DOA space. To obtain this mapping, a neural network structure must be established in which preprocessed data derived from the incident signal is used as the network input, and the corresponding DOA is used as the network output after the hidden-layer activation function is applied. The whole process is a targeted training process, and fitting the mapping with the RBF neural network is equivalent to an interpolation process.
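A minimal sketch of the interpolation idea, not the paper's actual preprocessing pipeline: a Gaussian RBF interpolator whose hidden-node centers are the training inputs themselves, fitted by a regularized linear solve. The toy two-dimensional features and angle targets below are made up purely for illustration.

```python
import numpy as np

class GaussianRBF:
    """Minimal RBF interpolator: phi(r) = exp(-(r/sigma)^2); the training
    inputs themselves serve as the hidden-node centers."""
    def __init__(self, sigma=1.0, ridge=1e-8):
        self.sigma, self.ridge = sigma, ridge

    def _kernel(self, A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / self.sigma ** 2)

    def fit(self, X, y):
        self.centers = np.asarray(X, float)
        Phi = self._kernel(self.centers, self.centers)
        # a small ridge term keeps the interpolation system well conditioned
        self.w = np.linalg.solve(Phi + self.ridge * np.eye(len(Phi)),
                                 np.asarray(y, float))
        return self

    def predict(self, X):
        return self._kernel(np.asarray(X, float), self.centers) @ self.w

# toy use: learn an angle (degrees) from a 2-D feature vector
rng = np.random.default_rng(0)
feats = rng.uniform(-1, 1, size=(50, 2))
angles = np.degrees(np.arctan2(feats[:, 1], feats[:, 0]))
model = GaussianRBF(sigma=0.5).fit(feats, angles)
print(model.predict(feats[:3]), angles[:3])
```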
Estimation of Direction of Arrival of Uniform Linear Array SOM Neural Network
Kohonen Self-Organizing Neural Network
A SOM neural network consists of two layers: the input layer and the competition layer (also called the output layer). The number of nodes in the input layer is equal to the dimension of the input vector, and the neurons in the competition layer are usually arranged in a rectangle or hexagon on a two-dimensional plane. Output node j is connected to the input nodes by the weight vector
wj = [wj1, wj2, . . . , wjm]^T. (15)
The training steps of the Kohonen SOM neural network used in this article are as follows. The first step is network initialization [14, 15].
Normalize the input vector x to x' such that ||x'|| = 1:
x' = x / ||x||, (16)
where x = [x1, x2, . . . , xm]^T is a training sample vector of the network. Initialize the network weights wj (j = 1, 2, . . . , K) to equal some of the normalized input vectors. The second step is to calculate the Euclidean distance between the input vector and the weight vector ωj of each competition-layer neuron to obtain the winning neuron ωc [16, 17]. The winning neuron is selected as follows:
||x' − ωc|| = min_j ||x' − ωj||. (17)
The third step is to adjust the weights of the winning neuron ωc and its neighborhood ωj as follows:
ωj(t + 1) = ωj(t) + η(t) Uc(t) [x' − ωj(t)], (18)
where η(t) is the learning rate function, which decreases with the number of iteration steps t [18, 19], and Uc(t) is the neighborhood function, here a Gaussian function:
Uc(t) = exp(−||rj − rc||² / (2σ²)), (19)
where r is the position of a neuron in the competition layer on the two-dimensional plane and σ is the smoothing factor, a positive constant.
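The three training steps can be summarized in a short Python sketch. The grid size, learning-rate schedule, and neighborhood-width schedule below are illustrative choices rather than the parameters used in the chapter.

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, eta0=0.5, sigma0=3.0, rng=None):
    """Kohonen SOM following the steps above: unit-normalize inputs, pick the
    winning neuron by Euclidean distance, and update it and its Gaussian
    neighborhood with a learning rate that decays over iterations."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, float)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)        # step 1: normalization
    rows, cols = grid
    # neuron positions on the 2-D competition layer
    pos = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    # initialize weights from (normalized) training samples
    W = Xn[rng.integers(0, len(Xn), size=rows * cols)].copy()
    for t in range(n_iter):
        x = Xn[rng.integers(len(Xn))]
        c = np.argmin(((W - x) ** 2).sum(1))                 # step 2: winning neuron
        eta = eta0 * np.exp(-t / n_iter)                     # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)                 # shrinking neighborhood
        h = np.exp(-((pos - pos[c]) ** 2).sum(1) / (2 * sigma ** 2))  # Gaussian U_c
        W += eta * h[:, None] * (x - W)                      # step 3: weight update
    return W, pos
```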
DOA Estimation Model Based on SOM Neural Network
Build a two-layer SOM neural network. The first layer is the sorting layer, which maps the input training data onto a two-dimensional grid. According to which neuron nodes on this first two-dimensional grid are activated, the output of the corresponding neuron node in the second grid is defined by the following rules:
(1) If neuron node j is activated by only one training sample vector, then the output of the corresponding node of the second-layer grid is the direction angle of that sample's signal [20, 21], namely,
(20)
(2) If neuron node j is activated by more than one training sample vector, that is, nj > 1, then the output of the corresponding node of the second-layer grid is the average of the direction angles of these signals [22, 23], namely,
(21)
(3) If neuron node j has never been activated by any training sample vector, the corresponding output neuron node is regarded as an invalid node. When this node is activated by a new input vector, its output value is defined as the output direction angle of the valid node closest to it.
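The second-layer output rules can likewise be sketched as follows, assuming a first-layer SOM given by a weight matrix W and neuron grid positions pos (for example, from the training sketch above). The interface is a plausible reading of the three rules, not the authors' implementation.

```python
import numpy as np

def build_output_layer(W, pos, Xn_train, aoa_train):
    """Second-layer outputs per the three rules: the mean AOA of the samples
    that activate a node; nodes never activated borrow the value of the
    nearest (on the competition-layer grid) activated node."""
    aoa_train = np.asarray(aoa_train, float)
    winners = np.array([np.argmin(((W - x) ** 2).sum(1)) for x in Xn_train])
    n_nodes = len(W)
    out = np.full(n_nodes, np.nan)
    for j in range(n_nodes):
        hits = aoa_train[winners == j]
        if len(hits) > 0:
            out[j] = hits.mean()            # rules (1) and (2)
    valid = ~np.isnan(out)
    for j in np.where(~valid)[0]:           # rule (3): nearest valid node on the grid
        d2 = ((pos[valid] - pos[j]) ** 2).sum(1)
        out[j] = out[valid][np.argmin(d2)]
    return out

def predict_aoa(W, out, x_test):
    """AOA prediction: the output value of the node that wins for a test vector."""
    c = np.argmin(((W - x_test) ** 2).sum(1))
    return out[c]
```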
Method Reliability Analysis
The construction of the two-layer SOM neural network proposed above relies on the topological order of the AOA being similar to the topological distribution of the DDOA vectors; in other words, when the Euclidean distance between two DDOA vectors is small, the Euclidean distance between the corresponding AOA values must also be small. This is the theoretical basis of the proposed method, and we now analyze this property in detail. Suppose that the DDOA vectors of two adjacent source signals are d and d1 = d + Δd, and that the corresponding AOAs are θ and θ1 = θ + Δθ, respectively. The DDOA increment and the AOA increment are
(22)
Obviously the function di,j+1 is differentiable at the point (x, y) ∈ R², which shows that the DDOA vector d and the AOA value θ vary consistently [24, 25]. In other words, when the DDOA vectors of two source signals are similar, their direction-of-arrival angles must also be similar. Therefore, the topological orders and distributions of the DDOA vectors and the AOA are basically the same.
Genetic Clustering Method
In cluster analysis, K-means is one of the most frequently used clustering methods. Generally, when the structure of an RBF network is determined, this method is used to determine the number of hidden-layer nodes and the centers of the nodes.
Chromosome Coding and Population Initialization
To accelerate convergence, real-number coding is used [26]. For m-dimensional samples, if the number of classes to be formed is n, the n class centers are encoded, each of dimension m, so the length of a chromosome is n × m. In this way, one chromosome represents a complete classification strategy. A preset number of chromosomes is initialized to obtain the initial population.
Determination of the Fitness Function and Selection
For each chromosome, according to the classification information it carries and the idea of distance-based classification, the class of each sample in the original data can be determined, as can the distance (here, the Euclidean distance) between each sample and its class center [27, 28]. After the samples have been classified, the sum of the within-class distances can be calculated:
F = Σ_{i=1}^{k} Σ_{j=1}^{ni} ||xj − Ci||, (23)
and the sum of the between-class distances can also be found:
(24)
where F is the sum of distances within classes, Q is the sum of distances between classes, k is the number of classes in the classification, ni is the number of samples belonging to the i-th class, xj is the j-th sample of the i-th class, and Ci is the center of the i-th class.
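A minimal sketch of the fitness evaluation for one chromosome, assuming nearest-center classification and, since equation (24) is not reproduced here, taking the between-class term Q as the sum of pairwise distances between class centers; the fitness Q/F is one plausible combination consistent with the description in the next section, not the authors' exact formula.

```python
import numpy as np

def decode(chromosome, n_classes, dim):
    """A chromosome is the concatenation of n_classes real-valued centers."""
    return np.asarray(chromosome, float).reshape(n_classes, dim)

def fitness(chromosome, samples, n_classes):
    """Within-class distance sum F, between-class distance sum Q (here assumed
    to be the sum of pairwise distances between class centers), and a fitness
    value that grows with the ratio Q/F."""
    samples = np.asarray(samples, float)
    centers = decode(chromosome, n_classes, samples.shape[1])
    d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)                         # nearest-center classification
    F = d[np.arange(len(samples)), labels].sum()      # sum of within-class distances
    iu = np.triu_indices(n_classes, k=1)
    Q = np.linalg.norm(centers[iu[0]] - centers[iu[1]], axis=1).sum()
    return Q / (F + 1e-12), F, Q
```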
NEURAL NETWORK OPTIMIZATION METHOD AND ITS EXPERIMENTAL RESEARCH IN INFORMATION PROCESSING
Underwater Experimental Research in the Lake
The underwater experiment was carried out in a lake. The average water depth is 50 to 60 meters, the open-water area is more than 300 meters × 1200 meters, and the water body is relatively stable, making it suitable for DOA estimation experiments. The experimental equipment is a uniform linear array composed of 4 sound pressure hydrophones with an element spacing of 0.472 meters.
Experimental Methods and Data Collection
No Noise
In order to verify the effectiveness of the two-layer SOM neural network established in this paper for arbitrary array geometries, we conducted a simulation experiment in which an arbitrary underwater sensor array detects the direction of an acoustic signal. The sensor array contains 4 sensors, the frequency of the single sound source signal is f = 2 kHz, the propagation speed of the sound signal in water is c = 1500 m/s, and the distance between two adjacent sensors is Δi = 0.375 m, which is half the wavelength. The positions of the four sensor elements are (x1 = 0, y1 = 0), (x2 = 0.3, y2 = 0.225), (x3 = 0.5, y3 = −0.0922), and (x4 = 0.6, y4 = 0.2692). To obtain the training vectors, we uniformly sample 60 × 30 points from the rectangular area [−20, 20] × [0, 20] ⊂ R² as the emission positions of 1800 simulated sound sources, compute the corresponding 1800 DDOA vectors r, and input them into the network as training vectors. Calculate the value of Rmax(x, y):
(25)
Except for a few points near the origin (0, 0), the function Rmax(x, y) has a common upper bound at most of the remaining points, which belongs to the second case.
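Returning to the training-data setup described above, the following sketch generates the 60 × 30 grid of simulated source positions and the corresponding DDOA training vectors for the four-sensor array. Taking element 1 as the reference for the distance differences and measuring the AOA via arctan2(y, x) are assumptions made here for illustration; the chapter does not spell out these conventions.

```python
import numpy as np

# sensor positions from the simulation setup in the text (metres)
sensors = np.array([[0.0, 0.0], [0.3, 0.225], [0.5, -0.0922], [0.6, 0.2692]])

def ddoa_vector(src_xy, sensors=sensors, ref=0):
    """Distance-difference vector of a source: range to each element minus the
    range to a reference element (the reference choice is an assumption)."""
    ranges = np.linalg.norm(sensors - np.asarray(src_xy, float), axis=1)
    return np.delete(ranges - ranges[ref], ref)

# 60 x 30 grid over [-20, 20] x [0, 20], i.e. 1800 simulated source positions
xs = np.linspace(-20.0, 20.0, 60)
ys = np.linspace(0.0, 20.0, 30)
grid = np.array([(x, y) for x in xs for y in ys])
ddoa_train = np.array([ddoa_vector(p) for p in grid])       # 1800 x 3 training vectors
aoa_train = np.degrees(np.arctan2(grid[:, 1], grid[:, 0]))  # direction angles of sources
print(ddoa_train.shape, aoa_train.shape)
```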
Noise
In practice, the signal data collected by a sensor array is often noisy, and the noise energy is generally large; the signal-to-noise ratio often reaches very low values, even below 0 dB, that is, the signal is overwhelmed by environmental noise much stronger than itself. When the signal-to-noise ratio is particularly small, a denoising filter is usually applied in advance so that the filtered signal-to-noise ratio is at least above 0 dB. A good model that is to be applied in practice must therefore be applicable to noisy environments.
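For the noisy experiments, additive white Gaussian noise at a prescribed signal-to-noise ratio can be generated as in the following sketch; the 2 kHz test tone and the SNR values are illustrative rather than the exact experimental data.

```python
import numpy as np

def add_noise(signal, snr_db, rng=None):
    """Add white Gaussian noise scaled so that 10*log10(Ps/Pn) equals snr_db."""
    rng = np.random.default_rng(rng)
    signal = np.asarray(signal, float)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)

# e.g. sweep the SNR from 20 dB down to 1 dB as in the experiments
tone = np.sin(2 * np.pi * 2000 * np.linspace(0, 0.01, 400))   # 2 kHz test tone
noisy = {snr: add_noise(tone, snr) for snr in (20, 10, 5, 1)}
```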
Performing Genetic Clustering on the Standardized Training Sample Data
The number of preselected clustering categories lies in the interval between 1/7 and 1/4 of the total number of samples (to facilitate training of the network, since too few or too many categories lead to poor training results). The population size is 30, the crossover rate is 75%, and the mutation rate is 5%. The fitness function is chosen so that the ratio of the between-class distance to the within-class distance increases as the fitness increases, and a convergent solution is obtained in about 50 generations. Within this interval, the number of classes is varied one by one until the best fitness value is obtained. The number of categories at that point is the number of hidden nodes in the RBF network, and the center of each category is the center of the corresponding node.
NEURAL NETWORK OPTIMIZATION METHOD AND ITS EXPERIMENTAL RESEARCH ANALYSIS IN INFORMATION PROCESSING
Noise-Free Simulation Experiment
To test the performance of the network, we select six sets of source signals, that is, six sets of points, at different distances from the origin as test data. The distances from the origin of the coordinates are 8 meters, 16 meters, 20 meters, 30 meters, 50 meters, and 100 meters.
Each group contains 21 simulated signals with different AOA values. The DDOA vectors corresponding to these simulated signal emission points are calculated and then fed as test vectors into the trained two-layer SOM neural network; the output of the network is the corresponding AOA predicted value. The experimental results are shown in Figure 1.
Figure 1: Absolute error of AOA predicted value of source signal at different distances using SOM neural network.
The absolute error of the AOA prediction results is shown in Figure 1. It can be seen that the SOM network trained with near-field simulated signals (signal positions within the area [0, 21] × [0, 21]) not only gives good AOA predictions for the near-field test signals (4 m, 8 m, and 12 m) but, except for individual points, also achieves very high AOA prediction accuracy for the far-field test signals (16 m, 32 m, and 64 m); in the far field the error is basically confined to a small interval, is smaller than in the near field, and fluctuates less. To illustrate the effectiveness and scalability of this method for predicting AOA, we set up an RBF neural network for comparison. The RBF neural network established here uses the DDOA vectors of the same simulated signals (within the area [0, 20] × [0, 20]) as the network's training inputs and the corresponding AOA values as the training targets.
As shown in Table 1 and Figure 2, the average absolute AOA prediction error for the noise-free signals in the simulation experiment is approximately 0.1° to 0.4°, with a minimum of 0.122° and a maximum of only 0.242°, and for most of the test signals (70%–80% of them) the absolute prediction error is less than 0.1°.
Table 1: SOM neural network prediction results of noise-free signal AOA
Distance | 4 | 8 | 12 | 16 | 32 | 64
Average error
Pr (err