Studies in Computational Intelligence 966
Julian Andres Zapata-Cortes · Giner Alor-Hernández · Cuauhtémoc Sánchez-Ramírez · Jorge Luis García-Alcaraz Editors
New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques
Studies in Computational Intelligence Volume 966
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/7092
Julian Andres Zapata-Cortes · Giner Alor-Hernández · Cuauhtémoc Sánchez-Ramírez · Jorge Luis García-Alcaraz Editors
New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques
Editors
Julian Andres Zapata-Cortes, CEIPA Business School, Fundación Universitaria CEIPA, Sabaneta, Colombia
Giner Alor-Hernández, Division of Research and Postgraduate Studies, Instituto Tecnológico de Orizaba, Tecnológico Nacional de México, Orizaba, Mexico
Cuauhtémoc Sánchez-Ramírez, Division of Research and Postgraduate Studies, Instituto Tecnológico de Orizaba, Tecnológico Nacional de México, Orizaba, Mexico
Jorge Luis García-Alcaraz, Autonomous University of Ciudad Juarez, Ciudad Juárez, Chihuahua, Mexico; Division of Research and Postgraduate Studies, Tecnológico Nacional de México/Instituto Tecnológico de Ciudad Juárez, Ciudad Juárez, Chihuahua, Mexico
ISSN 1860-949X ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-71114-6 ISBN 978-3-030-71115-3 (eBook)
https://doi.org/10.1007/978-3-030-71115-3
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
With the increase in computing power over the last decade, the ability to analyze large datasets and the development of new algorithms have made it possible to build mechanisms that support decision-making processes which only a few years ago were thought impossible to carry out. These mechanisms can surpass human capacity for such tasks, performing the analysis autonomously and learning on their own, which is why they are conceived as artificial intelligence. Artificial Intelligence (AI) techniques cover the automation of cognitive and physical tasks. They help people perform tasks faster and make better decisions, and they enable the automation of decision-making processes without human intervention. AI techniques can enhance automation by reducing intensive human labor and tedious tasks. Artificial Intelligence is making a difference for enterprise decision-making in many other areas: marketing, customer relationship management, recommender systems, problem-solving, opinion mining, and augmented analytics, to mention but a few. In marketing, it is necessary to understand customer needs and desires and to align products with them; a handle on changing customer behavior is vital for making the best marketing decisions. AI simulation and modeling techniques provide reliable insight into consumer personas, which helps predict consumer behavior. Through real-time data gathering, trend analysis, and forecasting, an AI system can help businesses make insightful marketing decisions. Furthermore, organizations can estimate a consumer's lifetime value with the help of AI-based buyer persona modeling. AI can also help organizations manage multiple inputs: during a complex decision-making process, it can efficiently manage and control different factors at the same point in time, and it can source and process large amounts of data within minutes while providing valuable business insights. While humans face decision fatigue, algorithms have no such limitation, which makes AI-based decisions faster and better. AI techniques have also given businesses invaluable insight about consumers, which helps them enhance their communication with consumers, and they help retailers predict product demand and respond to it quickly. To this end, opinion mining helps businesses understand why people feel
the way they feel. Often, a single customer's concerns are shared by many others. When sufficient opinions are gathered and analyzed correctly, the information gleaned helps organizations gauge and predict the concerns of the silent majority. AI has improved this mining process through automation, which is quicker and more reliable, helping organizations make critical business decisions. In e-commerce, an AI system learns a consumer's preferences from "explicit" or "implicit" feedback. Such systems are called recommender systems. A recommender system can provide information that helps the organization reduce bounce rates and craft better customer-specific targeted content. Wise business decisions are made when business executives and decision-makers have reliable data and recommendations. AI improves not only the performance of individual team members but also the competitive edge of the business. In medicine, artificial intelligence improves many tasks, ranging from better management of resources such as rooms and personnel to faster and more accurate diagnoses, as in computer-assisted radiology or the use of robots to perform less invasive surgeries. In the service industry, AI is used in automated online assistants that provide faster, lower-cost service and learn from past interactions, thus improving customer service. Beyond these examples, artificial intelligence has applications in many other knowledge fields, and this is the main reason for and the core of the present book, which aims to disseminate current trends in innovative and high-quality research on conceptual frameworks, strategies, techniques, methodologies, informatics platforms, and models for Enterprise Decision-Making Applying Artificial Intelligence Techniques. The specific objectives of the book can be summarized as follows:
• Provide a dissemination venue for both academia and industry on the topics studied in the book, presenting cases of new approaches, applications, methods, and techniques of Artificial Intelligence in enterprise decision-making.
• Compile a collection of theoretical and practical research works in the field of Artificial Intelligence techniques applied to enterprise decision-making.
• Establish the state of the art in the field of Artificial Intelligence techniques applied to enterprise decision-making.
This book is composed of chapters in the form of regular research papers, edited according to the norms and guidelines of Springer. Several calls for chapters were distributed among the main mailing lists of the field for researchers to submit their works. In total, 25 expressions of interest in the form of abstracts were received and screened for clarity, authenticity, and relevance to this book. These proposals came from several countries, including Colombia, Mexico, Spain, Perú, and Ukraine. After the abstract reviewing process, 25 proposals were accepted and their authors were asked to submit full versions. These versions were reviewed by at least two peers in order to
ensure the relevance and quality of the documents. After this process, 21 chapters were finally accepted for publication once the corrections requested by the peer reviewers and the editors were completed. The book content is structured in three parts: (1) Industrial Applications, (2) Decision-Making Systems for Industry, and (3) Artificial Intelligence Techniques. The chapters in each of these parts are as follows.
Part I Industrial Applications: This part contains seven chapters.
Chapter 1, entitled "Merging Event Logs for Inter-organizational Process Mining," presents a methodology to merge the event logs of the different partners of a collaborative business process so that they can serve as input for a process mining algorithm. On the one hand, the methodology consists of a set of methods for searching for correlations between events in the logs of the different partners involved in the collaboration; these methods are implemented at the trace level and the activity level. On the other hand, it consists of a set of methods and rules for discovering the process choreography. From the knowledge gained by the above methods, message-type tasks are identified and marked in each event log, and then, using a formal set of rules, the message task sub-type (send or receive) is discovered. Finally, links using message sequence flow connectors between message tasks identified as paired activities in the event logs are defined automatically. The proposed approach is tested using a real-life event log, which confirms its effectiveness and efficiency in the automatic specification of the message flows of the discovered process choreography, allowing a collaborative business process model to be built.
Chapter 2, entitled "Towards Association Rule-Based Item Selection Strategy in Computerized Adaptive Testing," proposes the integration of association rule mining as an item selection criterion in a CAT system. Specifically, it presents an analysis of association rule mining algorithms such as Apriori, FPGrowth, PredictiveApriori, and Tertius on three datasets obtained from the subject databases, in order to know the advantages and disadvantages of each algorithm and choose the most suitable one for an association rule-based CAT system that is being developed as a Ph.D. project. The algorithms are compared considering the number of rules discovered, average support and confidence, lift, and execution time. According to the experiments, Apriori found rules with greater confidence, support, and lift, and in less time.
Chapter 3, entitled "Uncertainty Linguistic Summarizer to Evaluate the Performance of Investment Funds," proposes a methodology to implement the uncertain linguistic summarizer posed in Liu's uncertain logic to measure the performance of investment funds in the Colombian capital market. The algorithm extracts a truth value for a set of linguistic summaries, written as propositions in predicate logic, where the terms for the quantifier, subject, and predicate are unsharp. The linguistic summarizer proves to be autonomous, successful, efficient, and close to human language. Furthermore, the implementation has a general scope and could become a data mining tool under uncertainty. The propositions found characterize the investment fund data meaningfully. Finally, a corollary that accelerates obtaining the summaries is presented.
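To make the rule metrics in Chapter 2's comparison concrete, the following minimal sketch shows how support, confidence, and lift are computed for a single candidate rule over a hypothetical toy transaction set; this is only an illustration of the metrics, not the chapter's code or data.

```python
# Hypothetical sketch: support, confidence, and lift for a rule {A} -> {B}.
# The transactions and item names are invented for illustration only.
transactions = [
    {"algebra", "geometry"},
    {"algebra", "calculus"},
    {"algebra", "geometry", "calculus"},
    {"geometry"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"algebra"}, {"geometry"}
supp_rule = support(antecedent | consequent, transactions)
confidence = supp_rule / support(antecedent, transactions)
lift = confidence / support(consequent, transactions)

print(f"support={supp_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.50 confidence=0.67 lift=0.89
```

A lift below 1, as in this toy case, indicates that the antecedent and consequent co-occur less often than expected under independence, which is why lift is reported alongside support and confidence when ranking mined rules.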
Chapter 4, entitled "Map-Bot: Mapping Model of Indoor Work Environments in Mobile Robotics," presents a mapping model of indoor work environments in mobile robotics, called Map-Bot. The model integrates hardware and software modules for navigation, data acquisition, data transfer, and mapping. Additionally, the model incorporates a computer that runs the software responsible for the construction of two-dimensional representations of the environment (the Vespucci module), a mobile robot that collects sensory information from the workplace, and a wireless communication module for data transfer between the computer and the robot. The results obtained allow the implementation of the reactive behavior "follow walls" (with the wall on the robot's right side) on paths of 560 cm. The model achieved safe and stable navigation in indoor work environments using this distributed approach.
Chapter 5, entitled "Production Analysis of the Beekeeping Chain in Vichada, Colombia. A System Dynamics Approach," presents a model of beekeeping production in the region of Vichada in Colombia. The beekeeping chain was chosen because it is a sector of great economic importance in a region that has the highest indices of multidimensional poverty in Colombia but is also one of the places where biodiversity is best conserved. A system dynamics approach is used, starting from a causal diagram, to explain the interactions among bee rearing, wax production, honey production, and transformation; simulations were then performed to determine the behavior of inventories with respect to production and demand. The model highlights the dynamics of the system and the management of the supply chain and is presented as a useful tool to predict production-demand scenarios in the beekeeping sector, where similar studies are scarce. As future research, it is recommended to include the economic nature of the products in this kind of model, so that scenarios can be proposed to help beekeepers make production decisions according to demand and develop inventory policies.
Chapter 6, entitled "Effect of TPM and OEE on the Social Performance of Companies," reports a structural equation model that integrates three independent variables, Total productive maintenance, Just in time, and Overall equipment efficiency, and the relationship they have with Social sustainability as a dependent variable. The four variables are related through six hypotheses that are validated with information gathered from 239 questionnaires answered by executives working in the Mexican maquiladora industry. The partial least squares technique is used to statistically validate the relationships among variables. Findings indicate that Total productive maintenance has a strong impact on Overall equipment efficiency and Just in time, and that the variables that most influence Social sustainability are Total productive maintenance and Just in time. It is concluded that Social sustainability can be obtained through the proper use and maintenance of machines and the timely fulfillment of production orders.
Chapter 7, entitled "ENERMONGRID: Intelligent Energy Monitoring, Visualization and Fraud Detection for Smart Grids," presents a tool for intelligent energy monitoring, data visualization, and fraud detection in electric networks, which generates rich information in near real time that can be used to make decisions for optimal energy production, generation, distribution, and consumption. This tool helps solve many problems that arise when dealing with energy load
estimates, loss estimates, and fraud detection and prevention for the entities in charge of managing an electric network, with a particular focus on smart grids.
Part II Decision-Making Systems for Industry: This part contains seven chapters.
Chapter 8, entitled "Measuring Violence Levels in Mexico Through Tweets," proposes a novel way to evaluate what people say and how they feel about violence by analyzing Twitter data. To do this, the authors describe a methodology to create an indicator of social perception. This methodology uses technologies such as Big Data, Twitter analytics, Web mining, and the Semantic Web, relying on software such as ELK (Elasticsearch for data storage, Logstash for collecting data, and Kibana for data visualization); SPSS and R for statistical data analysis; and Atlas.ti, Gephi, and Wordle for semantic analysis. At the end of the chapter, the results are shown with a word cloud, a social graph, and the indicator of social perception of violence in Mexico at the federal entity and metropolitan zone levels.
Chapter 9, entitled "Technology Transfer from a Tacit Knowledge Conservation Model into Explicit Knowledge in the Field of Data Envelopment Analysis," presents a model for preserving production engineers' tacit knowledge of Data Envelopment Analysis (DEA). Their expertise was explicitly coded into a computer system, and the model was developed by applying techniques and procedures from the fields of engineering and knowledge management. Technology transfer makes it possible to solve the problem of selecting criteria and interpreting results with DEA techniques when the efficiency of similar organizations is compared using an efficient frontier derived from the non-parametric approximations of such techniques; misunderstanding the techniques leads to misinterpretation of DEA results. The model was created by applying Knowledge Engineering, which makes it possible to preserve and extend specific experience and expertise over time by means of computer solutions. The model had an efficient and positive impact on strategic self-learning processes for the community interested in production engineering, knowledge transfer, and management.
Chapter 10, entitled "Performance Analysis of Decision Aid Mechanisms for Hardware Bots Based on ELECTRE III and Compensatory Fuzzy Logic," proposes two novel cognitive preference models viable for hardware with few memory cells and small processing capacity. These are Outranking Relations (OR) and Compensatory Fuzzy Logic (CFL), two techniques whose hardware performance as intelligence modelers has not been studied and which, despite their simple definition and generalization capacity, hardware agents almost never use. The chapter analyzes the feasibility of implementing OR and CFL on hardware platforms with low resources, highlighting the competitiveness of the proposed tools and the emergence of new research lines.
Chapter 11, entitled "A Brief Review of Performance and Interpretability in Fuzzy Inference Systems," presents a scoping review of fuzzy logic with emphasis on Compensatory Fuzzy Logic (CFL), Archimedean Compensatory Fuzzy Logic (ACFL), and inference systems. It analyzes the literature on surveys and general reviews to contrast with the scoping review, and also presents the research analysis through a case study, a comparison of compensatory fuzzy logic with other fuzzy logic structures, and other related works.
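Chapters 10 and 11 revolve around compensatory fuzzy logic, in which a low truth value in a conjunction can be partially offset by high ones, unlike the classical minimum t-norm. The following is a minimal sketch of a geometric-mean-based conjunction and its dual disjunction; it illustrates the general idea only, and the exact operators and predicates used in those chapters may differ.

```python
import math

# Hypothetical sketch of compensatory fuzzy logic operators based on the
# geometric mean; the operators used in Chapters 10 and 11 may differ.
def c_and(*truths):
    """Compensatory conjunction: geometric mean of the truth values."""
    return math.prod(truths) ** (1.0 / len(truths))

def c_or(*truths):
    """Dual disjunction obtained from the conjunction via De Morgan's laws."""
    return 1.0 - c_and(*(1.0 - t for t in truths))

# A low truth value is partially compensated by high ones, whereas the
# classical min t-norm would collapse the conjunction to 0.2.
print(c_and(0.2, 0.9, 0.9))   # ~0.545
print(min(0.2, 0.9, 0.9))     # 0.2
print(c_or(0.2, 0.9, 0.9))    # 0.8
```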
Chapter 12, entitled "Quality and Human Resources, Two JIT Critical Success Factors," presents a structural equation model associating the JIT elements of quality planning and quality management with human resources and economic performance in the context of Mexican maquiladoras (cross-border assembly plants). Results indicate that even though there is no direct relationship between quality planning and management and economic benefits, these variables are indirectly related through human resources. In conclusion, human resources are key to achieving the financial success of maquiladoras.
Chapter 13, entitled "Operational Risks Management in the Reverse Logistics of Lead-Acid Batteries," identifies and prioritizes operational risks in the reverse logistics of lead-acid batteries in Colombia using questionnaires and the FQFD (fuzzy quality function deployment) approach. Operational risks in the reverse logistics of lead-acid batteries were identified, and a probability-impact matrix was used to define which of these risks should be prioritized with FQFD; in this way, the priority of the risks considered was established. The prioritized risks help organizations involved in this activity to develop action plans, and once the most critical risks were defined, actions to mitigate or eliminate them were proposed.
Chapter 14, entitled "Dynamic Evaluation of the Livestock Feed Supply Chain from the Use of Ethanol Vinasses," proposes a conceptual design of the vinasse-based livestock feed supply chain using system dynamics to identify the key variables of the chain and assess how vinasse can be used efficiently to produce animal feed alongside other products such as ethanol. Both industrial and scientific efforts are being made to find alternative uses for vinasse in energy generation, soil fertilization, and livestock feed production; however, this last alternative has not been sufficiently explored. The results demonstrate that a continuous supply of molasses ensures continuous production of ethanol, which in turn guarantees constant vinasse availability to produce livestock feed.
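Chapters 5 and 14 both rely on system dynamics, in which stocks are integrated over time from inflow and outflow rates. The following is a minimal, hypothetical stock-and-flow sketch of that idea; the rates and structure are invented for illustration and are not taken from either chapter.

```python
# Hypothetical stock-and-flow sketch in the spirit of the system dynamics
# models of Chapters 5 and 14: an inventory stock integrated with Euler's
# method from production (inflow) and demand (outflow). All rates are invented.
dt = 1.0                    # time step (e.g., one week)
horizon = 52                # number of steps to simulate
inventory = 100.0           # initial stock level
production_rate = 40.0      # inflow per step
base_demand = 35.0          # average outflow per step

history = []
for step in range(horizon):
    demand = base_demand * (1.0 + 0.1 * ((step % 4) - 1.5))  # toy seasonality
    inflow = production_rate
    outflow = min(demand, inventory / dt)   # cannot ship more than the stock
    inventory += (inflow - outflow) * dt    # Euler integration of the stock
    history.append(inventory)

print(f"final inventory after {horizon} steps: {history[-1]:.1f}")
```

In the chapters, the same integration scheme is applied by the simulation software to the stocks and flows defined in the Forrester diagrams, which is what allows the production-demand scenarios described above to be explored.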
Part III Artificial Intelligence Techniques: This part contains seven chapters.
Chapter 15, entitled "Comparative Analysis of Decision Tree Algorithms for Data Warehouse Fragmentation," presents an analysis of different decision tree algorithms in order to select the best one for implementing the fragmentation method. The analysis was performed with Weka version 3.9.4, considering four evaluation metrics (precision, ROC area, recall, and F-measure) on datasets derived from the SSB (Star Schema Benchmark). Several experiments were carried out using two attribute selection methods, Best First and Greedy Stepwise; the datasets were preprocessed using the Class Conditional Probabilities filter, and the analysis of two datasets (24 and 50 queries) with this filter was included to study the behavior of the decision tree algorithms on each dataset. From the analysis, for the 24-query dataset the best algorithm was RandomTree, since it won under two methods, whereas for the 50-query dataset the best decision tree algorithms were LMT and RandomForest, because they obtained the best performance under all the methods tested. Finally, J48 was the selected algorithm when neither an attribute selection method nor the class probabilities filter was used; however, if only the latter is applied to the dataset, the best performance is given by the LMT algorithm.
Chapter 16, entitled "Data Analytics in Financial Portfolio Recovery Management," presents the application of data analytics and machine learning techniques to predict loan default behavior in a non-financial entity. Five classification algorithms (neural networks, decision trees, support vector machines, logistic regression, and K-nearest neighbors) were run on a dataset of credit behavior data. After running the five predictive models, decision trees showed the best prediction performance in determining whether a loan will be paid or become irrecoverable.
Chapter 17, entitled "Task Thesaurus as a Tool for Modeling of User Information Needs," describes examples of task thesaurus usage in intelligent applications that adapt to user needs. A task thesaurus is an element of the user model that reflects the dynamic aspects of the user's current work. Such a thesaurus is based on a domain ontology and contains the subset of its concepts related to the user's task. Task thesauri can be generated automatically by analyzing the task description or with the help of semantic similarity estimations over ontological concepts. A task thesaurus represents a personalized user view of the domain and depends on the user's abilities, experience, and aims.
Chapter 18, entitled "NHC_MDynamics: High-Throughput Tools for Simulations of Complex Fluids Using Nosé-Hoover Chains and Big Data Analytics," focuses on the implementation of the NVT molecular dynamics algorithm based on Nosé-Hoover thermostat chains, using high-performance computing technology such as Graphics Processing Units (GPUs) together with Big Data analytics, to generate knowledge that helps understand thermodynamic properties in simulated Lennard-Jones fluid systems, as well as to study the behavior of proteins related to conditions such as diabetes, refining their structures so that they can be used to improve foods in a better diet for people with this condition.
Chapter 19, entitled "Determination of Competitive Management Perception in Family Business Leaders Using Data Mining," seeks to determine the competitive management perception of family business leaders in order to establish working assumptions for new research and propose improvement and consolidation initiatives for these types of companies. This non-probabilistic, intentional study applied an instrument with 10 dimensions and 94 variables to a sample of 133 family business leaders from an intermediate city and a large city in Colombia. The data were analyzed using supervised machine learning algorithms in the Python programming language, together with techniques such as Cronbach's alpha, the KMO, Levene, and Bartlett tests, discriminant analysis, and decision trees. The results identify four main components across 19 variables: Management and technology, Quality Management, Compensation, and Country competitiveness.
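Chapter 19 lists Cronbach's alpha among the techniques applied in Python. As a small, self-contained illustration of that reliability coefficient (with made-up survey responses, not the chapter's instrument or data):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score columns (one list per item)."""
    k = len(items)                                    # number of items
    item_vars = [statistics.pvariance(col) for col in items]
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    total_var = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Made-up responses: three survey items answered by five respondents.
item_scores = [
    [3, 4, 5, 2, 4],
    [2, 4, 5, 3, 4],
    [3, 5, 4, 2, 5],
]
print(round(cronbach_alpha(item_scores), 3))  # 0.886
```

Values above roughly 0.7 are usually read as acceptable internal consistency, which is why the coefficient is computed before the dimensionality-reduction and classification steps mentioned above.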
Chapter 20, entitled "A Genetic Algorithm for Solving the Inventory Routing Problem with Time Windows," presents a genetic algorithm that simultaneously optimizes inventory allocation and transport routes to supply a set of customers over a specific time horizon. The model obtains a minimum total cost as a result of a better combination of the inventories at customers' facilities and the transportation required to supply them. The proposed model and the algorithm developed for its solution yielded significant savings compared with optimizing the routes to supply all customers in each period using the vehicle routing problem model with time windows, which optimizes customers' inventories and minimizes transport costs in each period; however, when that solution is compared against the total distribution cost over the whole time horizon, it generates higher costs than the solution produced by the IRP with time windows presented in this work.
Finally, Chapter 21, entitled "Emotion Detection from Text in Learning Environments: A Review," introduces a literature review of text-based emotion detection in learning environments. It analyzes the main APIs and tools available today for emotion detection and discusses their key characteristics. It also introduces a case study that detects the positive and negative polarities of two educational resources to assess the accuracy of the results obtained from five selected APIs, and closes with conclusions and suggestions for future work.
Having provided this brief summary of the chapters, the editors would like to express their gratitude to the reviewers who kindly agreed to contribute to the chapters' evaluation at all stages of the editing process.
Medellín, Colombia
Orizaba, Mexico
Orizaba, Mexico
Ciudad Juárez, Mexico
Julian Andres Zapata-Cortes Giner Alor-Hernández Cuauhtémoc Sánchez-Ramírez Jorge Luis García-Alcaraz
Acknowledgements
This book is part of the effort of several organizations that support research, such as the CEIPA Business School, the National Council of Science and Technology in Mexico (CONACYT), PRODEP, Tecnológico Nacional de México/IT Orizaba, and the Autonomous University of Ciudad Juarez (UACJ). We, the editors, also want to express our gratitude to other organizations that share the same purpose but are not mentioned here due to the long list. We likewise appreciate and recognize the trust, effort, collaboration, and patience that all authors and collaborators extended to us as editors. Finally, we thank the Springer publishing experts, especially Thomas Ditzinger, for his invaluable support at every moment, for the good editing advice drawn from his experience, and for his patience in materializing this book.
Contents
Part I Industrial Applications

1 Merging Event Logs for Inter-organizational Process Mining . . . 3
  Jaciel David Hernandez-Resendiz, Edgar Tello-Leal, Heidy Marisol Marin-Castro, Ulises Manuel Ramirez-Alcocer, and Jonathan Alfonso Mata-Torres

2 Towards Association Rule-Based Item Selection Strategy in Computerized Adaptive Testing . . . 27
  Josué Pacheco-Ortiz, Lisbeth Rodríguez-Mazahua, Jezreel Mejía-Miranda, Isaac Machorro-Cano, and Ulises Juárez-Martínez

3 Uncertainty Linguistic Summarizer to Evaluate the Performance of Investment Funds . . . 55
  Carlos Alexander Grajales and Santiago Medina Hurtado

4 Map-Bot: Mapping Model of Indoor Work Environments in Mobile Robotics . . . 75
  Gustavo Alonso Acosta-Amaya, Andrés Felipe Acosta-Gil, Julián López-Velásquez, and Jovani Alberto Jiménez-Builes

5 Production Analysis of the Beekeeping Chain in Vichada, Colombia. A System Dynamics Approach . . . 97
  Lizeth Castro-Mercado, Juan Carlos Osorio-Gómez, and Juan José Bravo-Bastidas

6 Effect of TPM and OEE on the Social Performance of Companies . . . 119
  Adrián Salvador Morales-García, José Roberto Díaz-Reza, and Jorge Luis García-Alcaraz

7 ENERMONGRID: Intelligent Energy Monitoring, Visualization and Fraud Detection for Smart Grids . . . 143
  Miguel Lagares-Lemos, Yuliana Perez-Gallardo, Angel Lagares-Lemos, and Juan Miguel Gómez-Berbís

Part II Decision-Making Systems for Industry

8 Measuring Violence Levels in Mexico Through Tweets . . . 169
  Manuel Suárez-Gutiérrez, José Luis Sánchez-Cervantes, Mario Andrés Paredes-Valverde, Erasto Alfonso Marín-Lozano, Héctor Guzmán-Coutiño, and Luis Rolando Guarneros-Nolasco

9 Technology Transfer from a Tacit Knowledge Conservation Model into Explicit Knowledge in the Field of Data Envelopment Analysis . . . 197
  Diana María Montoya-Quintero, Olga Lucia Larrea-Serna, and Jovani Alberto Jiménez-Builes

10 Performance Analysis of Decision Aid Mechanisms for Hardware Bots Based on ELECTRE III and Compensatory Fuzzy Logic . . . 217
   Claudia Castillo-Ramírez, Nelson Rangel-Valdez, Claudia Gómez-Santillán, M. Lucila Morales-Rodríguez, Laura Cruz-Reyes, and Héctor J. Fraire-Huacuja

11 A Brief Review of Performance and Interpretability in Fuzzy Inference Systems . . . 237
   José Fernando Padrón-Tristán, Laura Cruz-Reyes, Rafael Alejandro Espín-Andrade, and Carlos Eric Llorente-Peralta

12 Quality and Human Resources, Two JIT Critical Success Factors . . . 267
   Jorge Luis García-Alcaraz, José Luis Rodríguez-Álvarez, Jesús Alfonso Gil-López, Mara Luzia Matavelli de Araujo, and Roberto Díaz-Reza

13 Operational Risks Management in the Reverse Logistics of Lead-Acid Batteries . . . 289
   Daniela Sarria-Cruz, Fabio Andrés Álvarez-López, Carolina Lima-Rivera, and Juan Carlos Osorio-Gómez

14 Dynamic Evaluation of Livestock Feed Supply Chain from the Use of Ethanol Vinasses . . . 309
   Rocío Ramos-Hernández, Cuauhtémoc Sánchez-Ramírez, Yara Anahí Jiménez-Nieto, Adolfo Rodríguez-Parada, Martín Mancilla-Gómez, and Juan Carlos Nuñez-Dorantes

Part III Artificial Intelligence Techniques

15 Comparative Analysis of Decision Tree Algorithms for Data Warehouse Fragmentation . . . 337
   Nidia Rodríguez-Mazahua, Lisbeth Rodríguez-Mazahua, Asdrúbal López-Chau, Giner Alor-Hernández, and S. Gustavo Peláez-Camarena

16 Data Analytics in Financial Portfolio Recovery Management . . . 365
   Jonathan Steven Herrera Román, John W. Branch, and Martin Darío Arango-Serna

17 Task Thesaurus as a Tool for Modeling of User Information Needs . . . 385
   J. Rogushina and A. Gladun

18 NHC_MDynamics: High-Throughput Tools for Simulations of Complex Fluids Using Nosé-Hoover Chains and Big Data Analytics . . . 405
   Luis Rolando Guarneros-Nolasco, Manuel Suárez-Gutiérrez, Jorge Mulia-Rodríguez, Roberto López-Rendón, Francisco Villanueva-Mejía, and José Luis Sánchez-Cervantes

19 Determination of Competitive Management Perception in Family Business Leaders Using Data Mining . . . 435
   Ángel Rodrigo Vélez-Bedoya, Liliana Adriana Mendoza-Saboyá, and Jenny Lorena Luna-Eraso

20 A Genetic Algorithm for Solving the Inventory Routing Problem with Time Windows . . . 463
   Julian Andres Zapata-Cortes, Martin Darío Arango-Serna, Conrado Augusto Serna-Úran, and Hermenegildo Gil-Gómez

21 Emotion Detection from Text in Learning Environments: A Review . . . 483
   Maritza Bustos-López, Nicandro Cruz-Ramírez, Alejandro Guerra-Hernández, Laura Nely Sánchez-Morales, and Giner Alor-Hernández
Contributors
Gustavo Alonso Acosta-Amaya Facultad de Ingeniería, Departamento de Instrumentación y Control, Politecnico Colombiano, Medellín, Antioquia, Colombia Andrés Felipe Acosta-Gil Facultad de Minas, Departamento de Ciencias de la Computación y de la Decisión, Universidad Nacional de Colombia, Medellín, Antioquia, Colombia Giner Alor-Hernández Tecnológico Nacional de México/ IT Orizaba, Orizaba, Veracruz, Mexico Fabio Andrés Álvarez-López Escuela de Ingeniería Industrial, Universidad del Valle, Cali, Colombia Martin Darío Arango-Serna Facultad de Minas, Universidad Nacional de Colombia, Medellín, Antioquia, Colombia John W. Branch Universidad Nacional de Colombia, Medellín, Colombia Juan José Bravo-Bastidas Valle del Cauca, Escuela de Ingeniería Industrial, Logistic and Production Research Group, Universidad del Valle, Cali, Colombia Maritza Bustos-López Centro de Investigación en Inteligencia Artificial, Universidad Veracruzana, Xalapa, México Claudia Castillo-Ramírez Tecnológico Nacional de México, Instituto Tecnológico de Ciudad Madero, Ciudad Madero, Tamaulipas, Mexico Lizeth Castro-Mercado Valle del Cauca, Escuela de Ingeniería Industrial, Logistic and Production Research Group, Universidad del Valle, Cali, Colombia Nicandro Cruz-Ramírez Centro de Investigación en Inteligencia Artificial, Universidad Veracruzana, Xalapa, México Laura Cruz-Reyes Tecnológico Nacional de México, Instituto Tecnológico de Ciudad Madero, Ciudad Madero, Tamaulipas, Mexico Mara Luzia Matavelli de Araujo Department of Business and Economy, University of La Rioja, Logroño, Spain
Roberto Díaz-Reza Department of Electric Engineering and Computation, Universidad Autónoma de Ciudad Juárez, Chihuahua, México José Roberto Díaz-Reza Department of Electric Engineering and Computation, Universidad Autónoma de Ciudad Juárez, Ciudad Juárez, Chihuahua, Mexico Rafael Alejandro Espín-Andrade Autonomous University of Coahuila, Saltillo, Mexico Héctor J. Fraire-Huacuja Tecnológico Nacional de México, Instituto Tecnológico de Ciudad Madero, Ciudad Madero, Tamaulipas, Mexico Jorge Luis García-Alcaraz Department of Industrial Engineering and Manufacturing, Universidad Autónoma de Ciudad Juárez, Ciudad Juárez, Chihuahua, Mexico Hermenegildo Gil-Gómez Universidad Politécnica de Valencia, Valencia, España Jesús Alfonso Gil-López Department of Business and Economy, University of La Rioja, Logroño, Spain A. Gladun International Research and Training Center of Information Technologies and Systems of National Academy of Sciences of Ukraine and Ministry of Education and Science of Ukraine, Kyiv, Ukraine Juan Miguel Gómez-Berbís Department of Computer Science, Universidad Carlos III de Madrid, Madrid, Spain Claudia Gómez-Santillán Tecnológico Nacional de México, Instituto Tecnológico de Ciudad Madero, Ciudad Madero, Tamaulipas, Mexico Carlos Alexander Grajales Universidad de Antioquia, Medellín, Colombia Luis Rolando Guarneros-Nolasco Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Veracruz, Mexico Alejandro Guerra-Hernández Centro de Investigación en Inteligencia Artificial, Universidad Veracruzana, Xalapa, México Héctor Guzmán-Coutiño Universidad Veracruzana, Xalapa, Veracruz, México Jaciel David Hernandez-Resendiz Unidad Académica Multidisciplinaria Reynosa-RODHE, Universidad Autónoma de Tamaulipas, Reynosa, Tamaulipas, México Jonathan Steven Herrera Román Universidad Nacional de Colombia, Medellín, Colombia Jovani Alberto Jiménez-Builes Facultad de Minas, Departamento de Ciencias de la Computación y de la Decisión, Universidad Nacional de Colombia, Medellín, Antioquia, Colombia
Yara Anahí Jiménez-Nieto Faculty of Accounting and Administration, Universidad Veracruzana Campus Ixtaczoquitlán, Ixtaczoquitlán, Veracruz, Mexico Ulises Juárez-Martínez Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Veracruz, Mexico Angel Lagares-Lemos Department of Computer Science, Universidad Carlos III de Madrid, Madrid, Spain Miguel Lagares-Lemos Department of Computer Science, Universidad Carlos III de Madrid, Madrid, Spain Olga Lucia Larrea-Serna Departamento de Calidad y Producción, Facultad de Ciencias Económicas y Administrativas, Instituto Tecnológico Metropolitano, Medellín, Antioquia, Colombia Carolina Lima-Rivera Escuela de Ingeniería Industrial, Universidad del Valle, Cali, Colombia Carlos Eric Llorente-Peralta Tecnológico Nacional de México, Instituto Tecnológico de Tijuana, Tijuana, Mexico
M. Lucila Morales-Rodríguez Tecnológico Nacional de México, Instituto Tecnológico de Ciudad Madero, Ciudad Madero, Tamaulipas, Mexico Jenny Lorena Luna-Eraso Universidad de Nariño, Pasto, Colombia Asdrúbal López-Chau Universidad Autónoma Del Estado de México, Centro Universitario UAEM Zumpango, Estado de México, Mexico Roberto López-Rendón Universidad Autónoma del Estado de México, Toluca, Mexico Julián López-Velásquez Facultad de Ingeniería, Departamento de Instrumentación y Control, Politecnico Colombiano, Medellín, Antioquia, Colombia Isaac Machorro-Cano Universidad del Papaloapan, Tuxtepec, Oaxaca, Mexico Martín Mancilla-Gómez Faculty of Accounting and Administration, Universidad Veracruzana Campus Ixtaczoquitlán, Ixtaczoquitlán, Veracruz, Mexico Heidy Marisol Marin-Castro Cátedras CONACYT, Facultad de Ingeniería y Ciencias, Universidad Autónoma de Tamaulipas, Victoria, Tamaulipas, México Erasto Alfonso Marín-Lozano Universidad Veracruzana, Xalapa, Veracruz, México Jonathan Alfonso Mata-Torres Unidad Académica Multidisciplinaria Reynosa-RODHE, Universidad Autónoma de Tamaulipas, Reynosa, Tamaulipas, México Santiago Medina Hurtado Universidad Nacional de Colombia, Medellín, Colombia
Jezreel Mejía-Miranda Centro de Investigación en Matemáticas CIMAT, A.C, Guanajuato, Mexico Liliana Adriana Mendoza-Saboyá ISLP, International Statistics Institute, Bogotá, Colombia Diana María Montoya-Quintero Departamento de Calidad y Producción, Facultad de Ciencias Económicas y Administrativas, Instituto Tecnológico Metropolitano, Medellín, Antioquia, Colombia Adrián Salvador Morales-García Department of Industrial Engineering and Manufacturing, Universidad Autónoma de Ciudad Juárez, Ciudad Juárez, Chihuahua, Mexico Jorge Mulia-Rodríguez Universidad Autónoma del Estado de México, Toluca, Mexico Juan Carlos Nuñez-Dorantes Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Mexico Juan Carlos Osorio-Gómez Valle del Cauca, Escuela de Ingeniería Industrial, Logistic and Production Research Group, Universidad del Valle, Cali, Colombia Josué Pacheco-Ortiz Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Veracruz, Mexico José Fernando Padrón-Tristán Tecnológico Nacional de México, Instituto Tecnológico de Tijuana, Tijuana, Mexico Mario Andrés Paredes-Valverde Instituto Tecnológico Superior de Teziutlán, Teziutlán, Puebla, México S. Gustavo Peláez-Camarena Tecnológico Nacional de México/ IT Orizaba, Orizaba, Veracruz, Mexico Yuliana Perez-Gallardo Department of Computer Science, Universidad Carlos III de Madrid, Madrid, Spain Ulises Manuel Ramirez-Alcocer Unidad Académica Multidisciplinaria ReynosaRODHE, Universidad Autónoma de Tamaulipas, Reynosa, Tamaulipas, México Rocío Ramos-Hernández Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Mexico Nelson Rangel-Valdez Cátedras CONACyT/Tecnológico Nacional de México, Instituto Tecnológico de Ciudad Madero, Ciudad Madero, Tamaulipas, Mexico Lisbeth Rodríguez-Mazahua Tecnológico Nacional de México/ IT Orizaba, Orizaba, Veracruz, Mexico Nidia Rodríguez-Mazahua Tecnológico Nacional de México/ IT Orizaba, Orizaba, Veracruz, Mexico
Adolfo Rodríguez-Parada Faculty of Accounting and Administration, Universidad Veracruzana Campus Ixtaczoquitlán, Ixtaczoquitlán, Veracruz, Mexico José Luis Rodríguez-Álvarez Doctoral Program in Engineering Sciences, Instituto Tecnológico y de Estudios Superiores de Occidente (ITESO), Tlaquepaque, Jalisco, México J. Rogushina Institute of Software Systems of National Academy of Sciences of Ukraine, Kyiv, Ukraine Daniela Sarria-Cruz Escuela de Ingeniería Industrial, Universidad del Valle, Cali, Colombia Conrado Augusto Serna-Úran Instituto Tecnológico Metropolitano, Medellín, Antioquia, Colombia Manuel Suárez-Gutiérrez Universidad Veracruzana, Xalapa, Veracruz, Mexico José Luis Sánchez-Cervantes CONACYT—Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Veracruz, Mexico Laura Nely Sánchez-Morales Tecnológico Nacional de México/I. T. Orizaba, Orizaba, México Cuauhtémoc Sánchez-Ramírez Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Mexico Edgar Tello-Leal Facultad de Ingeniería y Ciencias, Universidad Autónoma de Tamaulipas, Victoria, Tamaulipas, México Francisco Villanueva-Mejía Instituto Tecnológico de Aguascalientes, Aguascalientes, Mexico Ángel Rodrigo Vélez-Bedoya Fundación Universitaria CEIPA Business School, Antioquia, Colombia Julian Andres Zapata-Cortes Fundación Universitaria CEIPA, Antioquia, Colombia
List of Figures
Fig. 1.1 Fig. 1.2 Fig. 1.3 Fig. 1.4 Fig. 1.5 Fig. 2.1 Fig. 2.2 Fig. 2.3 Fig. 2.4
Fig. 2.5 Fig. 2.6 Fig. 2.7
Fig. 2.8 Fig. 2.9 Fig. 2.10
Model of the process choreography within a collaboration diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Overview of the methodology approach . . . . . . . . . . . . . . . . . . . Example of a bag-of-words at the case level generated by method 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Intra-organizational business process model discovered for the M-Repair organization . . . . . . . . . . . . . . . . . . . . . . . . . . . Process choreography discovered from the merged event log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Traditional CAT process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Integration of association rule mining in the item selection phase of the CAT process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of support for Exa1 data set . . . Comparison of Apriori FPGrowth, PredictiveApriori and Tertius algorithms in terms of confidence for Exa1 data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of time for Exa1 data set . . . . . Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of support for Exa2 data set . . . Comparison of Apriori FPGrowth, PredictiveApriori and Tertius algorithms in terms of confidence for Exa2 data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of time for Exa2 data set . . . . . Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of support for Exa3 data set . . . Comparison of Apriori FPGrowth, PredictiveApriori and Tertius algorithms in terms of confidence for Exa3 data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5 10 15 19 21 30 37 46
46 47 47
48 48 49
49 xxv
xxvi
Fig. 2.11 Fig. 3.1 Fig. 3.2
Fig. 3.3 Fig. 3.4
Fig. 3.5
Fig. 3.6
Fig. 3.7
Fig. 4.1 Fig. 4.2 Fig. 4.3 Fig. 4.4 Fig. 4.5 Fig. 4.6 Fig. 4.7
Fig. 4.8
List of Figures
Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of time for Exa3 data set . . . . . Uncertain sets and their membership functions: (a) ξ1 and μ1 ; (b) ξ2 and μ2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Membership functions for the unsharp concepts most, young, and tall are represented, respectively, through λ(x), ν(y), and μ(z) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Histograms, from left to right, of the Equity Loss Ratio and Annual profitability of the investment funds . . . . . . . . . . . . The diagram is read clockwise starting at Financial Data. Liu’s uncertain data mining poses a linguistic summarizer on dataset A to evaluate the performance of the investment funds. Implementation of the summarizer is general in scope to ease the fit toward other problems . . . . . . . . . . . . . . Equity Loss Ratio vs Annual Profitability on investment funds. A posteriori human verification through filter 1 of the logical sense for the first linguistic summary found by the summarizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Equity Loss Ratio versus Annual Profitability on investment funds. A posteriori human verification through filters 1 and 2 of the logical sense for the first linguistic summary found by the summarizer . . . . . . . . . . . . . . . Estimation of the truth values T and probability p for the linguistic summaries found. The summaries describe the performance of investment funds in Colombia in the period of study. Calculations of T are made under the Liu’s uncertainty framework, and the summaries, written in Human language, are found by implementing the linguistic summarizer in (3.18) . . . . . . . . . Measurements of distances by Time of Flight (TOF) of the ultrasonic sensors. Source Authors . . . . . . . . . . . . . . . . . . Map-Bot architecture for mapping of indoor environments. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . Four-tier architecture of the mobile robot Walling. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Electronic diagram of the SCoI wireless communications subsystem. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Distribution of components on the PCB. Source Authors . . . . . Typical navigation path of the robotic agent Walling in a structured environment. Source Authors . . . . . . . . . . . . . . . Electronic diagram for the exteroceptive perception and wireless communications of the Walling robot. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Walling robot fuzzy navigation controller block diagram. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
50 59
60 64
67
69
70
72 78 79 80 83 83 84
85 85
List of Figures
Fig. 4.9 Fig. 4.10
Fig. 4.11
Fig. 4.12 Fig. 4.13
Fig. 4.14 Fig. 5.1
Fig. 5.2
Fig. 5.3 Fig. 5.4 Fig. 5.5 Fig. 5.6 Fig. 5.7 Fig. 5.8 Fig. 5.9
Fig. 5.10 Fig. 5.11 Fig. 5.12 Fig. 5.13 Fig. 5.14 Fig. 5.15
Fig. 5.16 Fig. 5.17 Fig. 5.18
xxvii
Sketch of the Walling robot control structure. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Navigation tests on the Walling robot, a path of 560 cm with a spurious reading of zero centimeters from the S0 sensor after traveling 400 cm. Source Authors . . . . . . . . . . . . . . Navigation tests on the Walling robot, a path of 560 cm with a false reading greater than 100 cm after 420 cm of travel. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Specular reflection-free navigation in the Walling robot. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Partition of linguistic variables in fuzzy sets and control surface of the fuzzy navigation system of the Walling robot. Source Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Visual environment and “Adquisicion” mode of operation of the Vespucci mapping module. Source Authors . . . . . . . . . . . Publication trend in the area of systems dynamic (a), system dynamic supply chain (b), and c systems dynamic food supply chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Visualization of network clusters of research topics in publications related to system dynamics with application in the food supply chain, period 2007– 2020. Note The minimum number of occurrences of a keyword is two . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Countries with reported research on system dynamic in the food supply chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Methodological approach (Aracil 1995) . . . . . . . . . . . . . . . . . . . Causal diagram of beekeeping production . . . . . . . . . . . . . . . . . Reinforcement loop R1 and R2 . . . . . . . . . . . . . . . . . . . . . . . . . . Balancing loop B1 y B2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Balancing loop B3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Forrester diagram of the beekeeping production in Vichada. a Bees for breeding and Bees for honey, bHoney hive inventory, Virgin honey collection center, Pasteurized honey, Diversified product . . . . . . . . . . . . . . . . . . . . Behavior at the stock of bees for breeding in two scenarios . . . Behavior at the stock of bees for honey in two scenarios . . . . . . Behavior at the flow bee feedback in two scenarios . . . . . . . . . . Behavior at the flow sales of bees in the two scenarios . . . . . . . Behavior in the wax inventory in the two scenarios . . . . . . . . . . Inventory behavior of pasteurized honey, collection center honey, and diversified product inventory in scenario 1 (a) and scenario 2 (b) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Virgin honey collection center inventory . . . . . . . . . . . . . . . . . . . Pasteurized honey inventory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Inventory diversified product inventory . . . . . . . . . . . . . . . . . . . .
86
87
87 88
91 93
99
100 101 105 106 106 106 107
108 111 111 112 112 113
114 115 115 115
xxviii
Fig. 6.1 Fig. 6.2 Fig. 7.1 Fig. 7.2 Fig. 7.3 Fig. 7.4 Fig. 7.5 Fig. 7.6 Fig. 7.7 Fig. 7.8 Fig. 7.9 Fig. 7.10 Fig. 7.11 Fig. 7.12 Fig. 7.13 Fig. 7.14 Fig. 7.15 Fig. 7.16 Fig. 7.17 Fig. 7.18 Fig. 7.19 Fig. 7.20 Fig. 7.21 Fig. 7.22 Fig. 7.23 Fig. 7.24 Fig. 7.25 Fig. 8.1 Fig. 8.2 Fig. 8.3 Fig. 8.4 Fig. 8.5 Fig. 8.6 Fig. 8.7 Fig. 8.8 Fig. 8.9 Fig. 8.10 Fig. 9.1 Fig. 9.2 Fig. 9.3 Fig. 9.4
List of Figures
Proposed model and relationships among variables
Evaluated model
System architecture
Energy flow of transformation centers
Types of reports and data dependencies
Analysis of energy balances
Side effect of discarding measurements with quality bit other than ‘00’
Apparent profits
Data from TC1 to TC4
Data from TC5 to TC7
Reading rates of September
Reading rate of TC1
Reading rate of TC3
UTLB readings
Availability of reports
UTLA analysis
UTLB analysis
IoT reference architecture
Main concepts of the SSN ontology
INDIGO software architecture
Major parts of SSN
Software architecture layers
Indigo home
Indigo current data
Indigo read rate
Indigo multiple dashboards
Indigo datamodel menu
Model for knowledge acquisition from Twitter
Methodology model applied
The cluster of keywords
Server cluster configuration
Summary of the pre-processing data layer
Tweets captured between June 11 and 13 of 2019
Kibana Dashboard
Veracity degree
Word Cloud for most used hashtags
Hashtag correlation
Layer model for the conservation of tacit knowledge for DEA methodology. Source The authors
Layer of human knowledge. Source The authors
Layer of elements to find DEA processes. Source The authors
Layer of declarative knowledge elements. Source The authors
Fig. 9.5 Layer of cognitive knowledge. Source The authors
Fig. 9.6 Layer of elements of knowledge transformation into processing. Source The authors
Fig. 9.7 Technology transfer of tacit knowledge in the DEA. Source The authors
Fig. 10.1 Proposed architecture
Fig. 10.2 Generation method for historical data with max entries (s stands for satisfaction)
Fig. 11.1 Inference system based on compensatory fuzzy logic (Espín-Andrade et al. 2014)
Fig. 11.2 Restrictions and interpretation criteria (Alonso et al. 2015)
Fig. 11.3 Workflow for the order-picking problem using Eureka-Universe (Padrón-Tristán et al. 2020)
Fig. 11.4 The balance between precision and interpretability (Cpałka 2017)
Fig. 11.5 Improving the balance between precision and interpretability (Megherbi et al. 2019)
Fig. 11.6 Approaches to design rules considering accuracy and interpretability
Fig. 11.7 Execution of VS for the BPP (Padrón-Tristán 2020)
Fig. 11.8 Results of the search in the number of papers using eight different queries combining four keywords (interpretability, interpretable, accuracy, and fuzzy) and two filters (review type and keyword location)
Fig. 11.9 Results of the search in several papers (blue) and citations (red) per year, using two queries that search in the paper title and differing in the keyword fuzzy
Fig. 11.10 Number of research papers per journal
Fig. 12.1 Proposed model
Fig. 12.2 Evaluated model
Fig. 13.1 Operational risk management system in supply chains (Manotas et al. 2014)
Fig. 13.2 Methodological design
Fig. 13.3 Risk identification approach
Fig. 13.4 Probability and impact matrix (Osorio-Gomez et al. 2018)
Fig. 13.5 Methodological approach to risk prioritization (Osorio-Gomez et al. 2018)
Fig. 13.6 Reverse logistics network to lead-acid batteries
Fig. 13.7 The battery recovery process in a Colombian company
Fig. 13.8 Battery recovery process (https://www.ambientebogota.gov.co/, 2008)
Fig. 13.9 Risk probability-impact matrix
Fig. 13.10 Cause effect diagram to operational risk in reverse logistics of lead-acid batteries
Fig. 13.11 Questionnaire for knowledge validation in new income and monitoring in old workers
Fig. 13.12 Personal protective equipment
Fig. 14.1 Most common industrial applications of sugarcane vinasses
Fig. 14.2 A conceptual model of the vinasse-based LF supply chain
Fig. 14.3 Illustration of the vinasse-based LF supply chain
Fig. 14.4 Different ways to system feedback
Fig. 14.5 Mass balance of LF
Fig. 14.6 Balancing loop B1
Fig. 14.7 Balancing loop B2
Fig. 14.8 Balancing loop B3
Fig. 14.9 Behavior of utilities within the vinasse-based LF supply chain
Fig. 14.10 Causal diagram of vinasse-based LF production
Fig. 14.11 Feedback loop used in outlier test
Fig. 14.12 Model validation results
Fig. 14.13 Molasses storage at XYZ
Fig. 14.14 Daily ethanol production at XYZ
Fig. 14.15 Daily ethanol production versus vinasse generated
Fig. 14.16 Vinasse inventory
Fig. 14.17 Inventory for 500 h
Fig. 14.18 Behavior of LF production on a yearly basis
Fig. 14.19 LF inventory behavior under different LF demand scenarios
Fig. 15.1 Data set with 24 queries and 2 fragments
Fig. 15.2 Algorithm 1. Generation of data sets
Fig. 15.3 Algorithm 2. Get PPM
Fig. 15.4 Results of the Recall metric for 24 queries data sets
Fig. 15.5 Results of the Precision metric for 24 queries data sets
Fig. 15.6 Results of the ROC Area metric for 24 queries dataset
Fig. 15.7 Results of the F-Measure metric for 24 queries dataset
Fig. 15.8 Decision tree created by J48
Fig. 15.9 Results of decision tree algorithms for 24 queries and 2 fragments
Fig. 15.10 Results of decision trees algorithms for 24 queries and 3 fragments
Fig. 15.11 Results of decision trees algorithms for 24 queries and 4 fragments
Fig. 15.12 Results of decision trees algorithms for 24 queries and 5 fragments
Fig. 15.13 Results of decision trees algorithms for 50 queries and 2 fragments
Fig. 15.14 Results of decision trees algorithms for 50 queries and 3 fragments
Fig. 15.15 Results of decision trees algorithms for 50 queries and 4 fragments
Fig. 15.16 Results of decision trees algorithms for 50 queries and 5 fragments
Fig. 15.17 Horizontal Fragmentation method diagram
Fig. 16.1 Increase in loan of P2P lending since January 2016 (Superintendencia Financiera de Colombia 2019)
Fig. 16.2 Personal loans default versus P2P lending default (Superintendencia Financiera de Colombia 2019)
Fig. 16.3 Accuracy versus depth on decision trees
Fig. 16.4 Sub-tree from the optimized decision tree
Fig. 17.1 Generalized algorithm of task thesaurus generation
Fig. 17.2 Use of OntoSearch for selection of ontological classes to initial task thesaurus
Fig. 17.3 Expansion of thesaurus on base of taxonomic relation
Fig. 17.4 Expansion of thesaurus on base of various types of hierarchical relation
Fig. 17.5 Use of task thesaurus for semantic retrieval in MAIPS
Fig. 17.6 e-VUE ontology and user interface
Fig. 18.1 Interactions in the atom-atom model between different molecules where the atom a = 1 of molecule i interacts with the atoms of molecule j and so on with the other atoms
Fig. 18.2 Periodic boundary conditions in a periodic two-dimensional system. The shaded cell corresponds to the central cell
Fig. 18.3 Lennard-Jones potential. The blue line represents the attraction of atoms and the red line their repulsion
Fig. 18.4 NVT algorithm incorporating NHC thermostats. The NVE ensemble is located in the central part
Fig. 18.5 Algorithm of the application of the Nosé-Hoover chain thermostat
Fig. 18.6 Processing diagram of a simulation with molecular dynamics and big data analytics
Fig. 18.7 Big data logical diagram
Fig. 18.8 Big data physical diagram
Fig. 18.9 CQL sentences used in the creation of the database schema for Cassandra™
Fig. 18.10 Initial configuration of a 2048 particle LJ fluid system with temperature 2.0 and density 0.7
Fig. 18.11 Representation of: A Kinetic energy; B Potential energy and, C Temperature calculated to balance the system. Note the initial values of each of them and later the constant values
Fig. 18.12 3D model of the new positions and velocities
Fig. 18.13 Representation of: A Kinetic energy; B Potential energy and, C Calculated temperature in a production dynamic with 40,000 steps. In each one of them, the constant conservation of the calculated values is noted, characteristic of the NVT ensemble
Fig. 18.14 3D model of the new positions and velocities
Fig. 18.15 Initial configuration of a 10,976 particle LJ fluid system with temperature 2.0 and density 0.7
Fig. 18.16 Representation of: A Kinetic energy; B Potential energy and, C Temperature calculated to balance the system. Note the initial values of each of them and later the constant values
Fig. 18.17 3D model of the new positions and velocities
Fig. 18.18 Representation of: A Kinetic energy; B Potential energy and, C Calculated temperature in a production dynamic with 40,000 steps. In each one of them, the constant conservation of the calculated values is noted, characteristic of the NVT ensemble
Fig. 18.19 3D model of the new positions and velocities
Fig. 18.20 Initial configuration of a 23,328 particle LJ fluid system with temperature 2.0 and density 0.7
Fig. 18.21 Representation of: A Kinetic energy; B Potential energy and, C Temperature calculated to balance the system. Note the initial values of each of them and later the constant values
Fig. 18.22 3D model of the new positions and velocities
Fig. 18.23 Representation of: A Kinetic energy; B Potential energy and, C Calculated temperature in a production dynamic with 40,000 steps. In each one of them, the constant conservation of the calculated values is noted, characteristic of the NVT ensemble
Fig. 18.24 3D model of the new positions and velocities
Fig. 19.1 Family entrepreneurship
Fig. 19.2 Financial need
Fig. 19.3 Economic solvency
Fig. 19.4 Variance contribution of each component
Fig. 19.5 Panel A origin
Fig. 19.6 Panel B origin
Fig. 19.7 Panel C origin
Fig. 19.8 Panel D origin
Fig. 19.9 Panel E origin
Fig. 19.10 Panel F origin
Fig. 19.11 Panel A generational change
Fig. 19.12 Panel B generational change
Fig. 19.13 Panel C generational change
Fig. 19.14 Panel D generational change
Fig. 19.15 Panel E generational change
Fig. 19.16 Panel F generational change
Fig. 19.17 Tree for economic solvency
Fig. 19.18 Generation of founders
Fig. 19.19 Second generation
Fig. 19.20 Third generation
Fig. 20.1 Collaborative scheme for the VMI
Fig. 20.2 Chromosome used in the genetic algorithm
Fig. 20.3 Crossover operator used in the genetic algorithm
Fig. 20.4 Mutation operator used in the genetic algorithm
Fig. 20.5 Best individual obtained by the genetic algorithm
Fig. 20.6 Solution representation of using the VRP with time windows
Fig. 21.1 Educational resource R298—Perimeter and area of geometric figures
Fig. 21.2 Educational resource R280—Area formulas for geometric shapes
List of Tables
Table 1.1 Example of a case in the M-Parts participant's event log
Table 1.2 Example of a case in the M-Repair participant's event log
Table 1.3 Score matrix at the case level
Table 1.4 Score matrix at the activity level
Table 1.5 Type and sub-type of discovered tasks for the event log BPMNI of the M-Repair participant
Table 1.6 Type and sub-type of discovered tasks for the event log BPMNR of the M-Parts participant
Table 2.1 Related works (A)
Table 2.2 Related works (B)
Table 2.3 Related works (C)
Table 2.4 Test results for Apriori and FPGrowth for the Exa1 data set
Table 2.5 Test results for Apriori and FPGrowth for the Exa2 data set
Table 2.6 Test results for Apriori and FPGrowth for the Exa3 data set
Table 2.7 Test results for PredictiveApriori and Tertius for the Exa1 data set
Table 2.8 Test results for PredictiveApriori and Tertius for the Exa2 data set
Table 2.9 Test results for PredictiveApriori and Tertius for the Exa3 data set
Table 2.10 Test results for Apriori and FPGrowth in terms of Lift for the Exa1, Exa2 and Exa3 data set
Table 3.1 Linguistics and m.f. of the uncertain quantifier
Table 3.2 Linguistics and m.f. of the uncertain subject
Table 3.3 Linguistics and m.f. of the uncertain predicate
Table 3.4 Results: linguistic summaries—part I (Colombian investment funds)
Table 3.5 Results: linguistic summaries—part II (Colombian investment funds)
Table 3.6 Linguistic summaries - Truth value T vs. probability measure p
Table 4.1 Technical specifications of the SFR02 sensor. Source (Acosta 2010)
Table 4.2 Characterization of linguistic variables for the fuzzy navigation controller of the robot walling
Table 4.3 Simplified FAM for the fuzzy navigation controller of the robot walling
Table 5.1 State of the art on the systems dynamic in food supply chain
Table 5.2 Description and role of major variables used to model the Beekeeping Supply Chain
Table 6.1 Questionnaire validation indexes
Table 6.2 Model fit and quality indexes
Table 6.3 Gender versus years of experience
Table 6.4 Number of employees versus job position
Table 6.5 Validation of latent variables
Table 6.6 Descriptive analysis of the items
Table 6.7 Direct effects contribution
Table 6.8 Total effects
Table 6.9 Sensitivity analysis
Table 8.1 Selection of related works about violence in social media
Table 8.2 List of user key Twitter accounts
Table 8.3 Top 20 keyword validation list
Table 8.4 Detail of node configuration
Table 8.5 Principal emoticons and special characters deleted
Table 8.6 Descriptive analysis of frequencies at the federal states' level
Table 8.7 Descriptive analysis of frequencies at the metropolitan zones level
Table 8.8 Classification range of analysis intervals at a state level
Table 8.9 Classification of analysis intervals at metropolitan zones
Table 8.10 Intervals for the indicators created at the federal entity level
Table 8.11 Intervals for the indicators created at metropolitan zones
Table 9.1 Multiplicative and envelopment form of the models
Table 10.1 Instance composition
Table 10.2 Range value considered for the sensors
Table 10.3 Training set of the historical data, HE
Table 10.4 Test set of the historical data, HP
Table 10.5 Initial configuration of weights and thresholds used by the PDA strategy
Table 10.6 ELECTRE III parameters' values obtained through PDA strategy
Table 10.7 Chosen reference value for outranking
Table 10.8 CFL rules generated using EUREKA UNIVERSE
Table 10.9 Value of S1 for HE
Table 10.10 Value of S2 for HP
Table 10.11 Description of the success case
Table 10.12 Memory consumption in comparison with other mechanisms
Table 11.1 Comparison of surveys and reviews related to this work
Table 11.2 MF parameters values of linguistic states for premise 1
Table 11.3 Sample of classification accuracy for premise 1
Table 11.4 Accuracy of premises obtained
Table 11.5 Skill comparison by tasks with the principal fuzzy ways to use productively the natural language Logic
Table 11.6 Comparison of selected papers on accuracy and interpretability
Table 12.1 Quality planning
Table 12.2 Quality management
Table 12.3 JIT benefits for human resources
Table 12.4 JIT Economic benefits
Table 12.5 Descriptive analysis of the sample
Table 12.6 Industrial sectors and number of employees
Table 12.7 Latent variable coefficients
Table 12.8 Descriptive analysis of items and variables
Table 12.9 Sum of indirect effects
Table 12.10 Total effects
Table 13.1 Linguistic scale for the risk identification and fuzzy equivalence for FQFD (Pastrana-jaramillo and Osorio-gómez 2018)
Table 13.2 Validation and weighted averages of operational risks
Table 13.3 Internal variables and their relative importance
Table 13.4 Weight of the how's
Table 13.5 Results of prioritization
Table 14.1 Physicochemical characteristics of raw vinasse
Table 14.2 Variables involved in the vinasse-based LF supply chain
Table 14.3 Scenarios for sensitivity analysis of LF demand
Table 15.1 Comparative table of works on horizontal fragmentation (A)
Table 15.2 Comparative table of works on horizontal fragmentation (B)
Table 15.3 Results of decision trees algorithms with 50 queries for two fragments
Table 15.4 Results of decision tree algorithms with 50 queries for three fragments
Table 15.5 Results of decision tree algorithms with 50 queries for four fragments
Table 15.6 Results of decision trees algorithms with 50 queries for five fragments
Table 15.7 Results of decision tree algorithms for 24 and 50 queries
Table 16.1 Health category for microcredits, according default age
Table 16.2 Transition matrix
Table 16.3 Date variables conversion
Table 16.4 Variables with null data
Table 16.5 Variables variance
Table 16.6 Evaluation metrics comparison of the five techniques
Table 16.7 Optimized decision tree performance
Table 16.8 Variables and its importance in prediction
Table 18.1 Comparative table of thermostats and architecture in which it is designed
Table 19.1 Linear decomposition based on the original variables
Table 19.2 Main components composition
Table 19.3 Levels of the variables in the solvency classification
Table 20.1 Summary of selected works about IRPTW
Table 20.2 Data used to test the model
Table 20.3 Inventory in each customer for every period
Table 20.4 Cost components of the IRPTW solution
Table 20.5 Comparison between IRPTW and VRPTW
Table 21.1 Comparative analysis of related works
Table 21.2 APIs for emotion detection, sentiment analysis, and named-entities recognition
Table 21.3 Programming languages for emotion detection, sentiment analysis, and named-entities recognition
Table 21.4 Results of the comparative analysis among APIs for scenario 1
Table 21.5 Results of the comparative analysis of APIs in scenario 2
Part I
Industrial Applications
Chapter 1
Merging Event Logs for Inter-organizational Process Mining
Jaciel David Hernandez-Resendiz, Edgar Tello-Leal, Heidy Marisol Marin-Castro, Ulises Manuel Ramirez-Alcocer, and Jonathan Alfonso Mata-Torres
Abstract In an inter-organizational environment, the discovery of process choreography is challenging because the different organizations involved have to put together their partial knowledge about the overall collaborative business process. This chapter presents a methodology to merge the event logs of the different partners of a collaborative business process so that they can serve as input for the process mining algorithm. On the one hand, the methodology consists of a set of methods for searching for correlations between events in the logs of the different partners involved in the collaboration. These methods are implemented at the trace level and the activity level. On the other hand, the methodology consists of a set of methods and rules for discovering process choreography. From the knowledge gained by the above methods, message-type tasks are identified and marked in each event log; then, using a formal set of rules, the message task sub-type (send or receive) is discovered. Finally, links using message sequence flow connectors between message tasks identified as pair activities in the event logs are automatically defined. The proposed approach is tested using a real-life event log that confirms its effectiveness and efficiency in the automatic
specification of message flows of the process choreography discovered, allowing a collaborative business process model to be built.
J. D. Hernandez-Resendiz · U. M. Ramirez-Alcocer · J. A. Mata-Torres Unidad Académica Multidisciplinaria Reynosa-RODHE, Universidad Autónoma de Tamaulipas, Reynosa, Tamaulipas, México e-mail: [email protected]
U. M. Ramirez-Alcocer e-mail: [email protected]
J. A. Mata-Torres e-mail: [email protected]
E. Tello-Leal (B) Facultad de Ingeniería y Ciencias, Universidad Autónoma de Tamaulipas, Victoria, Tamaulipas, México e-mail: [email protected]
H. M. Marin-Castro Cátedras CONACYT, Facultad de Ingeniería y Ciencias, Universidad Autónoma de Tamaulipas, Victoria, Tamaulipas, México e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. A. Zapata-Cortes et al. (eds.), New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Studies in Computational Intelligence 966, https://doi.org/10.1007/978-3-030-71115-3_1
1.1 Introduction

Public and private organizations require proper knowledge asset management to maintain a competitive advantage in current global markets. All businesses face global competition, and it is imperative that they reduce costs, improve their operations and their relationships with customers, suppliers, and partners, and also shorten delivery times by optimizing their logistics processes (Sousa et al. 2017). The adoption of new technologies in this context encourages organizations to establish close relationships of integration, cooperation, and collaboration between them, giving rise to inter-organizational collaborations (Pradabwong et al. 2015; Long 2017). This collaboration contributes to improving the efficiency of supply chain management, which focuses on the inter-organizational management of goods flows between independent companies in a supply chain (Sousa et al. 2017). Supply chain collaboration enables chain members to take advantage of business opportunities and enhance their competitiveness. This means that, in a supply chain collaboration, two or more chain members work together to create a competitive advantage by sharing information, making decisions jointly, and sharing the benefits of the increased profitability that results from satisfying customer needs (Salam 2017; Simatupang and Sridharan 2018), as well as by executing collaborative business processes (also called inter-organizational business processes) through process-aware information systems (Dumas et al. 2018a).

In this sense, Business Process Management (BPM) is concerned with the interactions between processes and information systems, with the modeling and analysis of processes standing out as important elements of the BPM approach (van der Aalst 2018). Thus, BPM can be considered a set of methods, techniques, and tools to identify, discover, analyze, redesign, execute, and monitor business processes in order to optimize their performance (van der Aalst 2013; Dumas et al. 2018b). Business processes are the core of BPM approaches, and various languages have been introduced for business process modeling, with the Business Process Model and Notation (BPMN) language (BPMN 2.0 2011) being the de facto standard for business process notation. In the BPM domain, inter-organizational collaboration implies a process-oriented integration between heterogeneous and autonomous organizations, which can be achieved through the definition and execution of collaborative business processes. A collaborative business process defines the behavior of the interactions between organizations and their roles from a global viewpoint (Tello-Leal et al. 2016; Barcelona et al. 2018), that is, how they coordinate their actions and exchange information in order to make decisions together to achieve a common business goal (Tello-Leal et al. 2016; Köpke et al. 2019). In inter-organizational collaboration, the interaction between participants is done via message exchanges; the interaction is only achieved
Fig. 1.1 Model of the process choreography within a collaboration diagram
by sending and receiving messages. In a BPMN-based collaboration diagram, the pools represent specific process participants or roles, such as the supplier role or customer role. The interactions of a set of business processes of multiple organizations are specified in a process choreography (Weske 2019). A collaboration diagram describes the control flow layer of the choreography and depicts what messages and in which sequence are exchanged. Figure 1.1 shows an interaction between a customer and a supplier with respect to a request for a demand forecast. The customer sends the forecast request to a chosen supplier, which internally processes it and sends the forecast-request response, which then is evaluated internally by the customer. Hence, choreographies have a central role in ensuring interoperability between business processes (Weske 2019), each of which is performed by a participant in inter-organizational collaboration. One of the current challenges in inter-organizational collaborations is the analysis of the execution of collaborative business processes. Process mining makes use of recorded historical data (event log) of the execution of the business process instances to discover and analyze as-is business process models (van der Aalst 2016). However, current process mining approaches focus on analyzing the execution of the business process from an intra-organizational perspective (Kalenkova et al. 2017; Nguyen et al. 2019; Mehdiyev et al. 2020). An analysis of the execution of a collaborative business process has high complexity, due to the correlation and synchrony that must be identified in the interactions between the parties involved in the collaboration, as well as to the difficulty to capture the behavior of artifacts (business documents, input, and output objects) that are distributed among the collaborative processes. Moreover, in an inter-organizational environment, the event logs are distributed over different sources, that is; one event log per enterprise that participates in the collaboration. Each event log encompasses partial information about the overall business process.
Therefore, inter-organizational process mining requires these historical data to be merged into one structured event log. In this book chapter, we propose an inter-organizational process mining approach for discovering the interaction between the business processes of multiple organizations via message exchange. In particular, we propose a methodology that guides the procedure of merging event logs and discovering process choreography. In this regard, we propose the integration of event logs through a merge at the case level and the activity level. This merge is performed by computing the cosine similarity at the case and activity levels, allowing the correlation of the message tasks to be discovered. Furthermore, we define a formal rule set to identify the message-type tasks in each event log involved in the collaboration, as well as the sub-type of the message tasks (send or receive), which enables us to discover the direction of the message flow. Finally, message flow connectors are specified between message-type tasks contained in a diagram of the collaborative business process, enabling the discovery of the process choreography among the participants in the inter-organizational collaboration. The proposed approach is validated in a real scenario using an event log generated from the execution of a purchase order management process.

The remainder of this chapter is organized as follows: Section 1.2 discusses the related work found in the literature. Section 1.3 introduces the preliminary concepts and definitions that support the proposed approach. Section 1.4 details the phases that compose our methodology. Section 1.5 reports the results achieved by implementing the proposed approach. Finally, the conclusion and future work are given in Sect. 1.6.
1.2 Related Work

In this section, we review the current literature on the pre-processing and merging of event logs and on the end-to-end analysis of related processes through correlation methods. The technique proposed in Claes and Poels (2014) consists of an algorithm that searches for links between the data of the different partners, suggesting rules to the user on how to merge the data, and a method for configuring and executing the merge, as well as their implementation in the process mining tool ProM (Van Dongen et al. 2005). The algorithm discovers links between the two event logs to indicate which data in both event logs are considered to belong to the same process instance. The merging rules are formulated in terms of relations between attribute values at the event or trace level in the two event logs. The rules are based on four relationship operators and two logical operators (and/or). The approach is tested through seven scenarios involving two artificially created and three real-life event logs, for demonstration and evaluation of the method and algorithm. The test results show positive results for effectiveness and efficiency.

In Raichelson et al. (2017), the authors describe an automated technique to merge event logs supporting two granularity levels. They generate a merged log with a focus on the case view and a merged log that reveals the end-to-end instance view. The matching of cases is based on temporal relations and text similarity, using both structured
and unstructured attributes. Similarly, Bala et al. (2018) propose a semi-automatic technique for discovering event relations that are semantically relevant for business process monitoring. The challenge faced in this proposal is that the events are available at different levels of granularity and that more than one event can correspond to an activity; the authors therefore propose identifiers for the events and relationships that are relevant to monitoring the process. This is done under the assumption that these events contain data relevant to monitoring, but without prior knowledge of the event schema, using a set of heterogeneous events as input. This approach can be considered a contribution to the pre-processing of event logs for subsequent analysis using process mining methods.

On the other hand, in Engel et al. (2016), the authors propose an approach to associate Electronic Data Interchange (EDI) messages from different data sources belonging to the same case based on correlation conditions. This correlation is based on conjunctions and/or disjunctions of attribute values. Such conditions are formulated in reference to the attribute values of the messages. In this sense, Pourmirza et al. (2017) tackle the correlation problem by extracting models from logs without relying on case identifiers, or when events are not associated with a case identifier, by means of process graph rules, enabling the detection of which events belong to the same case. This approach relies only on event names and timestamps in the log; no additional event attributes are required. Similarly, Cheng et al. (2017) present a graph-based method that follows the filtering-and-verification principle and aims at efficient event correlation analytics over large logs on distributed platforms. This approach incorporates light-weight filter units into candidate correlation rules to prune large numbers of non-interesting rules before verification. In the verification phase, the correlation rules are modelled as a graph, and a graph partitioning approach is introduced to decompose the potentially correlated events into chunks by exploring efficient data locality assignment. Xu et al. (2018) present an algorithm based on artificial immune algorithms and simulated annealing algorithms to merge event log files generated by different systems. The set of factors used in the affinity function, occurrence frequency, and temporal relation can express the characteristics of matching cases more accurately than other factors. The proposed algorithm has been implemented as a plugin in the ProM platform (Van Dongen et al. 2005).

In summary, the approach we present here has several advantages. First, it enables the merging of event logs based on the correlation of event messages through a set of rules and patterns. This makes it possible to discover message tasks and the sub-type of each message task. Second, it provides two abstraction levels, case and activity. Third, the resulting combination of abstraction levels enables mining that is targeted at discovering the process choreography among the participants in an inter-organizational collaboration. However, our proposal can only discover simple relationships, that is, one-to-one and one-to-many relationships, which can be considered a limitation of the algorithm. It is therefore necessary to define a set of rules so that the methods can identify complex many-to-one or many-to-many relationships in the traces contained in each participant's event log.
1.3 Preliminaries and Definitions

Process mining aims to discover, verify, and improve real business processes from the events available in process-aware information systems (van der Aalst 2016), that is, it aims at extracting process knowledge from event logs. An event log consists of a set of traces, and each trace is composed of a sequence of events produced by the execution of one case. Each event captures relevant data about the execution of a given activity in the business process. Therefore, through the application of process mining, organizations can discover how their processes were executed, verify whether the defined business rules and practices were followed, and gain insights into bottlenecks, resource utilization, and other performance-related aspects of their processes (Rovani et al. 2015).

The merge criterion for our proposal is based on using cosine similarity to derive a measure of trace similarity and activity similarity between two event logs. Cosine similarity analyzes the similarity of two vectors of an inner product space by measuring the cosine of the angle between them (Han and Kamber 2012), determining whether the two vectors point in roughly the same direction. The cosine similarity measure (Manning and Raghavan 2008) CosSim(j, q) can be computed as follows (Eq. 1.1):

$$\mathrm{CosSim}(j,q)=\frac{\sum_{i=1}^{t} W_{ij}\,W_{iq}}{\sqrt{\left(\sum_{i=1}^{t} W_{ij}^{2}\right)\left(\sum_{i=1}^{t} W_{iq}^{2}\right)}} \tag{1.1}$$

where $\sum_{i=1}^{t} W_{ij} W_{iq}$ is the sum of the products of the weight of term i in word vector j and the weight of term i in word vector q, and the denominator is the square root of the product of the sum of the squared weights of word vector j and the sum of the squared weights of word vector q.

The proposed approach is based on the following definitions, which formalize the phases of the methodology that enable the fusion of event logs and the discovery of the correlation of the messages exchanged in a collaboration.

Definition 1.1 (Event log) An event log L is composed of a set of cases T, where T contains all instances of the business process execution. Each case Ti is composed of a finite set of activities A, which are the tasks contained in the business process. These tasks are described by a set of attributes At (for example, activity name, execution date, the user who executed it, among others), which detail the context of the execution of the business process activities.

Definition 1.2 (The bag-of-words of cases) Refers to a vector of words that represents each case Ti in the event log L. This bag-of-words is generated from the unique values of the attributes At of each activity Aj ∈ Ti.
Definition 1.3 (The bag-of-words of activities) Refers to a vector of words that represents each activity Aj of a case. This bag-of-words is generated from the unique values of the attributes Atk that compose Aj.

Definition 1.4 (Inter-organizational relationship) Indicates that two or more organizations share event information from the same execution of a collaborative business process. This relationship is identified in the events contained in the traces of the event logs L and L', which correspond to the organizations participating in the inter-organizational collaboration. In our proposal, two levels of the inter-organizational relationship are contemplated: (a) a relationship at the case level occurs when, for two cases T ∈ L and T' ∈ L', their similarity (calculated using the cosine measure) is above a threshold Ut; (b) a relationship at the activity level occurs when, for two activities A ∈ T and A' ∈ T', their similarity (cosine measure) is above a threshold Ua.

Definition 1.5 (BPMN process model) This statement is based on the definition proposed by Augusto et al. (2019), in which a BPMN process model is a component graph M = (i, o, T, G, Em), where i is the start event, o is the end event, T is a non-empty set of tasks, G = G+ ∪ Gx ∪ G* is the union of the set of AND gateways (G+), the set of XOR gateways (Gx), and the set of OR gateways (G*), and Em ⊆ (T ∪ G ∪ {i}) × (T ∪ G ∪ {o}) is the set of edges. Further, given g ∈ G, g is a split gateway if it has more than one outgoing edge, or a join gateway if it has more than one incoming edge.
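To make Eq. 1.1 concrete, the following sketch (not part of the original chapter) computes the cosine similarity of two bags-of-words in Python, using raw term counts as the weights Wij and Wiq; the function name and the choice of raw counts are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(bag_j, bag_q):
    """CosSim(j, q) of Eq. 1.1 for two bags-of-words given as lists of terms."""
    w_j, w_q = Counter(bag_j), Counter(bag_q)   # weights W_ij and W_iq (raw term counts)
    terms = set(w_j) | set(w_q)                 # terms i = 1..t of the shared vocabulary
    dot = sum(w_j[t] * w_q[t] for t in terms)   # numerator of Eq. 1.1
    norm = sqrt(sum(w * w for w in w_j.values())) * sqrt(sum(w * w for w in w_q.values()))
    return dot / norm if norm else 0.0

# Example: two small bags sharing two of three terms give a similarity of about 0.67
# cosine_similarity(["purchase", "order", "system"], ["purchase", "order", "vendor"])
```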
1.4 Methodology

The proposed methodology for merging event logs and discovering process choreography in an inter-organizational collaboration environment is composed of three phases: processing of event logs, identifying the correlation between events, and collaboration discovery. Figure 1.2 illustrates the procedure of the proposed methodology, where the event logs of the business processes involved in the inter-organizational collaboration are used as input to phase 1, and the collaborative business process model discovered is obtained as the output of phase 3.
1.4.1 Phase 1: Processing of Event Logs

1.4.1.1 Method 1: Construction of Bags-of-Words
This method consists of building bags-of-words per event log, which allows each bag-of-words to be represented in a matrix. This procedure is performed by every organization (one event log per organization) involved in the collaborative business
Fig. 1.2 Overview of the methodology approach
process. For each event log L and L', a matrix of bags-of-words BW and BW' is generated, where one row of the matrix (a trace in the event log) is the bag-of-words of T ∈ L and T' ∈ L', respectively. The length of each bag-of-words is equal to the number of attributes that describe the activities contained in a trace. The matrices BW and BW' are constructed as follows (a code sketch follows the list):

• Let us consider any case Ti ∈ L.
• A bag-of-words BWi is built with all the unique values of the attributes Atk of each task Aj ∈ Ti, based on Definition 1.2.
• In BWi, stop-words are removed.
• The remaining words of BWi compose the bag-of-words of Ti.
• This procedure is carried out independently for the event logs L and L'.
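The sketch below illustrates Method 1 under simplifying assumptions (not part of the original chapter): each trace is represented as a list of events, each event as a dictionary of attribute values, and the stop-word list is a placeholder.

```python
STOP_WORDS = {"the", "of", "and", "a", "to", "in"}   # illustrative stop-word list

def bag_of_words(trace):
    """BW_i for one trace T_i: unique attribute values of its events, lower-cased, minus stop-words."""
    words = set()
    for event in trace:                               # each event is a dict of attribute values
        for value in event.values():
            words.update(t for t in str(value).lower().split() if t not in STOP_WORDS)
    return words

def log_bags_of_words(log):
    """The matrix BW: one bag-of-words per trace of an event log (a list of traces)."""
    return [bag_of_words(trace) for trace in log]
```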
1.4.1.2 Method 2: Generation of the Scoring Matrix
In this method, each of the bags-of-words BWi belonging to BW is compared to all the bags-of-words that compose BW'. This comparison is performed using the cosine similarity, according to Definition 1.4. The above allows us to generate a scoring matrix with the similarity between the vectors of BW and BW'.
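A minimal sketch of Method 2, reusing cosine_similarity and log_bags_of_words from the sketches above: every bag of BW is scored against every bag of BW'.

```python
def scoring_matrix(log_l, log_l_prime):
    """score[i][k] = CosSim(BW_i, BW'_k) for every pair of traces of L and L'."""
    bw = log_bags_of_words(log_l)
    bw_prime = log_bags_of_words(log_l_prime)
    return [[cosine_similarity(list(b), list(bp)) for bp in bw_prime] for b in bw]
```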
1.4.2 Phase 2: Identifying the Correlation Between Events

1.4.2.1 Method 3: Selection at the Case Level
The selection of matching cases consists of finding the cases T ∈ L and T' ∈ L' whose cosine value is closest to 1. For each case Ti ∈ L, the case Ti' ∈ L' with the greatest cosine similarity in the scoring matrix is selected. This allows us to identify that there is a relationship at the case level between the cases T ∈ L and T' ∈ L', according to Definition 1.4. This process is performed for each trace of the event log L.
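A sketch of the case-level selection, assuming the scoring matrix from Method 2 and taking the threshold Ut as a parameter (0.40 in the experiments reported in Sect. 1.5).

```python
def select_case_pairs(score, u_t=0.40):
    """For each trace T_i of L, keep the trace T'_k of L' with the highest similarity above U_t."""
    pairs = []
    for i, row in enumerate(score):
        k = max(range(len(row)), key=lambda j: row[j])   # arg max of the i-th row
        if row[k] > u_t:
            pairs.append((i, k, row[k]))
    return pairs
```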
1.4.2.2 Method 4: Selection at the Activity Level
From the cases selected by Method 3, the tasks or events (activities) that contain any coincident data are identified, allowing a relationship between the activities of the traces to be defined by calculating their cosine similarity, that is, their similarity at the activity level. For this, the following procedure is implemented (a code sketch follows the list):

• The attribute or attributes that can provide information on the relationship between organizations are defined, for example, resource, timestamps, activity name, input, or output objects.
• For each activity Aj ∈ Ti and Aj' ∈ Ti' of each pair of cases Ti ∈ L and Ti' ∈ L' (which were selected by the previous method because their cosine similarity value exceeds the Ut threshold), two vectors of activities A and A' are obtained, which represent the bags-of-words BWA and BWA', according to Definition 1.3, respectively.
• For each of the bags-of-words BWAi, its cosine similarity with all the bags-of-words of BWA' is calculated. This measurement allows us to build a scoring matrix of the relationships identified at the activity level. If the similarity of a pair of bags-of-words BWAi and BWA'i exceeds the threshold Ua, these are considered message-type tasks. These tasks can be of sub-type send or receive (BPMN 2.0 2011). Then, a vector that contains the pairs of message-type activities (PMA) that exceeded the threshold is generated.
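The sketch below illustrates the activity-level step for one selected case pair; the attribute selection and the dictionary-based event layout are assumptions carried over from the earlier sketches, and Ua defaults to the 0.97 used in the experiments.

```python
CORRELATION_ATTRIBUTES = ("activity", "originator", "message")   # illustrative attribute choice

def activity_bag(event):
    """BWA for one event, restricted to the attributes chosen for the correlation."""
    return [t for a in CORRELATION_ATTRIBUTES
            for t in str(event.get(a, "")).lower().split()]

def select_message_pairs(trace, trace_prime, u_a=0.97):
    """PMA: pairs of activities of the two traces whose activity-level similarity reaches U_a."""
    return [(a, b)
            for a in trace
            for b in trace_prime
            if cosine_similarity(activity_bag(a), activity_bag(b)) >= u_a]
```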
1.4.3 Phase 3: Collaboration Discovery

1.4.3.1 Method 5: Business Process Discovery
The private business process model of each organization involved in the inter-organizational collaboration is discovered from the event logs L and L'. In our proposal, the Split Miner algorithm presented in Augusto et al. (2019) is reused to discover each business process model. This algorithm discovers the flow of the business process, with its behavior through its different paths, decision points, bifurcations,
convergences, and unions. The algorithm generates a business process diagram based on the BPMN language. We obtain the BPMNI ← L model that corresponds to the organization that initiates the collaboration, and the BPMNR ← L’ model that corresponds to the receptor organization of the collaboration, according to Definition 1.5.
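Split Miner is distributed as a separate tool, so the following sketch only illustrates the discovery step with the pm4py library, using its inductive miner as a stand-in; the file names are placeholders and the exact library interface is an assumption.

```python
import pm4py  # assumes the pm4py library's simplified interface

# One event log per participant (file names are placeholders)
log_initiator = pm4py.read_xes("m_repair.xes")
log_receptor = pm4py.read_xes("m_parts.xes")

# Stand-in for the Split Miner step: discover one BPMN model per organization
bpmn_i = pm4py.discover_bpmn_inductive(log_initiator)
bpmn_r = pm4py.discover_bpmn_inductive(log_receptor)

pm4py.write_bpmn(bpmn_i, "bpmn_initiator.bpmn")
pm4py.write_bpmn(bpmn_r, "bpmn_receptor.bpmn")
```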
1.4.3.2 Method 6: Assigning the Type and Sub-type of Message Tasks
For each pair of activities in PMAi, where a ∈ PMAi and b ∈ PMAi, and a ∈ L and b ∈ L', the message-type tasks and their sub-types must be found. Identifying the meaning of the messages in the choreography of a collaborative business process is essential because the sub-types of the message tasks a and b define the flow and direction of the interaction between the participants in a collaboration. The following rules were defined to discover the message tasks and their sub-types (a code sketch of Rules 1 and 2 follows).

Rule 1. Consider the set A = {generate, approval, question, information request, notification, decision making}, based on the proposal of process patterns to design integration processes (Lazarte et al. 2011). These patterns are intended to support inter-organizational business message exchange and to ensure interoperability in the message exchange. Given a BPMNI or BPMNR model, if the name of an activity that precedes and is directly connected to a contains, or is composed of, any of the values contained in the set A, then task a is of type message and sub-type send, and task b is of type message and sub-type receive.

Rule 2. Given a BPMNI or BPMNR model, if the name of an activity that follows and is directly connected to b contains, or is composed of, any of the values contained in the set C = {evaluate, approval, process, analyze, make a decision}, then task b is of type message and sub-type receive, and task a is of type message and sub-type send.

Rule 3. If neither of the above two rules is fulfilled for tasks a and b, the task type and sub-type are assigned as follows. On the one hand, given the BPMNI model, if an antecedent activity of a is a message-type task m of sub-type send, then task a is marked as a task of sub-type receive. On the other hand, if an antecedent activity of a is a message-type task m of sub-type receive, then task a is marked as a task of sub-type send. For the message task b, the same rules apply using the BPMNR model.
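A compact sketch of Rules 1 and 2 (Rule 3 is the fallback), assuming that the names of the activities directly preceding a and directly following b can be obtained from the discovered models; those helper lists are hypothetical inputs.

```python
SET_A = {"generate", "approval", "question", "information request", "notification", "decision making"}
SET_C = {"evaluate", "approval", "process", "analyze", "make a decision"}

def classify_pair(predecessors_of_a, successors_of_b):
    """Rules 1 and 2: return the sub-types of the pair (a, b), or None to fall back to Rule 3."""
    if any(key in name.lower() for name in predecessors_of_a for key in SET_A):
        return {"a": "send", "b": "receive"}   # Rule 1: an antecedent of a matches set A
    if any(key in name.lower() for name in successors_of_b for key in SET_C):
        return {"a": "send", "b": "receive"}   # Rule 2: a consequent of b matches set C
    return None                                # Rule 3 resolves the pair from earlier message tasks

# Example: classify_pair(["Generate purchase order"], ["Evaluate purchase order"])
# fires Rule 1 and returns {"a": "send", "b": "receive"}
```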
1.4.3.3 Method 7: Discovery of Process Choreography
After marking the message-type tasks and their sub-types, the two models BPMNI and BPMNR are combined into a collaborative process model. The message connector notation, based on the BPMN language (BPMN 2.0 2011), is then added in order to define the choreography between the pools of the collaboration participants.
The message flow connectors are specified in the model according to the information contained in PMA, and the task sub-type defines the direction of the message flow, that is, the send sub-type defines the start of the message sequence flow connector and the receive sub-type its end.
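As a final sketch (an assumption-laden illustration, not the chapter's implementation), each classified PMA pair yields one message-flow connector whose direction runs from the send task to the receive task.

```python
def message_flows(classified_pma):
    """One message-flow connector per PMA pair, directed from the send task to the receive task."""
    flows = []
    for a, b, subtypes in classified_pma:              # a from BPMN_I, b from BPMN_R
        source, target = (a, b) if subtypes["a"] == "send" else (b, a)
        flows.append({"from": source["activity"], "to": target["activity"]})
    return flows
```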
1.5 Results

In this section, we present a real scenario of inter-organizational collaboration between enterprises in the telecommunications industry to demonstrate the proposed approach. The event log of each participant in the collaboration was generated by executing the purchase order management collaborative business process. The event logs contain instances of the process execution from 2017 to 2018. The organization M-Repair plays the role of customer, and the organization M-Parts plays the role of supplier of components. The collaborative business process has the business goal of reducing the time for managing the acquisition of components and accelerating the purchase process in M-Repair by having the supplier automate confirmation decisions electronically. The collaborative business process allows the parties to negotiate the delivery times of the components and to propose changes to the purchase order. The M-Repair materials procurement department estimates the components required to repair the different models of the brand, based on its repair forecasts and the repair forecasts of its subsidiaries. The procurement department generates global purchase orders to obtain better prices, based on the purchase volume, which it negotiates with its counterpart M-Parts. An organized set of activities depicted in the given customer-supplier collaborative business process is implemented by the two organizations in fulfilling their common business goal.

Tables 1.1 and 1.2 show an excerpt from the event log of the business process that manages purchase orders between the parties. The data set contains traces with data on cases, activities, timestamps, originators, message names, and input and output data. An event log of a collaborative business process can contain attributes such as the activities performed, who is responsible for or originates each activity (human, system, or equipment), the time of execution of the activity, the events that caused an activity, the messages exchanged between the parties, and the business documents that are exchanged through the messages. A case in an event log represents the flow or behavior that an execution of the business process followed, that is, an instance of the process. Table 1.1 shows a complete example of a case contained in the event log of the M-Parts organization. In this registry, private and public tasks performed by different resources (user, service, and system) are observed, as well as input or output business documents contained in public activities of the message type. Table 1.2 shows two cases that describe possible behaviors of the execution of the collaborative business process from the M-Repair organization's viewpoint, displaying the different activities that compose the traces of the event log.
Table 1.1 Example of a case in the M-Parts participant's event log

CaseID | Activity | Timestamp | Originator | Message | Data output | Data input
2060 | Propose purchase order | 2018-12-14 15:10:10 | System message | Purchase order | – | Purchase order
2060 | Evaluate purchase order | 2018-12-14 17:01:00 | Vendor | – | – | –
2060 | Generate purchase order change | 2018-12-14 18:21:01 | Automated service | – | – | –
2060 | Propose purchase order change | 2018-12-14 18:25:59 | System message | Purchase order change | Purchase order change | –
2060 | Propose purchase order change update | 2018-12-15 10:31:00 | System message | Purchase order change update | – | Purchase order change update
2060 | Evaluate purchase order change update | 2018-12-15 10:50:44 | Selling supervisor | – | – | –
2060 | Accept-proposal purchase order change update | 2018-12-15 11:40:25 | System message | Purchase order change update | Purchase order change update | –
2060 | Generate purchase order confirmation | 2018-12-15 11:44:12 | Automated service | – | – | –
2060 | Confirm purchase order confirmation | 2018-12-15 11:47:05 | System message | Purchase order confirmation | Purchase order confirmation | –
In our experiment, a threshold U_t > 0.40 was defined to calculate the similarity between traces; with a low threshold value, a greater quantity of traces can be included, which can provide important information for the fusion. On the other hand, the threshold U_a >= 0.97 was configured with a high value, which ensures greater accuracy in the correlation between the message-type tasks. As a result of applying Methods 1 and 2, 100 bags-of-words at the case level were generated for each organization involved in the collaborative business process, from which a 100 × 100 similarity matrix was constructed, that is, 10,000 similarities were calculated between trace pairs. On the one hand, Fig. 1.3 shows the bag-of-words generated from CaseID 2060 (presented in Table 1.1), which is contained in the event log of the M-Parts participant. On the other hand, Table 1.3 shows an extract of the score matrix at the case level; the value of the cosine similarity calculated for each pair of cases is displayed. In the case-level selection using Method 3, the similarity measure of 7,610 case pairs exceeded the U_t threshold, out of a total of 10,000 case pairs. This allowed identifying the events in which there is a relationship at the case level between T_L and T'_L'.
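As an illustration of the scoring just described, the following sketch builds a bag-of-words per trace and computes the cosine similarity used to fill the score matrix; it is a simplified reconstruction under assumed data structures, not the authors' code.

```python
# Simplified sketch of the case-level scoring: build a bag-of-words per trace
# and compute the cosine similarity used to fill the score matrix.
from collections import Counter
import math

def bag_of_words(trace):
    # Tokenize every attribute value of every event in the trace.
    tokens = []
    for event in trace:
        for value in event.values():
            tokens.extend(str(value).lower().split())
    return Counter(tokens)

def cosine_similarity(bow_a, bow_b):
    common = set(bow_a) & set(bow_b)
    dot = sum(bow_a[t] * bow_b[t] for t in common)
    norm_a = math.sqrt(sum(v * v for v in bow_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bow_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical single-event traces from each participant's log.
trace_repair = [{"activity": "Propose purchase order", "doc": "Purchase order"}]
trace_parts = [{"activity": "Propose purchase order", "doc": "Purchase order"}]

U_T = 0.40  # case-level threshold used in the experiment
score = cosine_similarity(bag_of_words(trace_repair), bag_of_words(trace_parts))
if score > U_T:
    print(f"Case pair kept for activity-level comparison (score = {score:.2f})")
```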
Table 1.2 Example of a case in the M-Repair participant's event log

CaseID | Activity | Timestamp | Originator | Message | Data output | Data input
921 | Generate purchase order | 2017-11-29 12:18:15 | Automated service | – | Purchase order | –
921 | Propose purchase order | 2017-11-29 12:23:30 | System message | Purchase order | Purchase order | –
921 | Reject-proposal purchase order | 2017-11-29 19:28:43 | System message | Purchase order response | – | Purchase order response
2086 | Generate purchase order | 2018-12-19 09:13:31 | Automated service | – | Purchase order | –
2086 | Propose purchase order | 2018-12-19 09:18:19 | System message | Purchase order | – | –
2086 | Accept-proposal purchase order | 2018-12-19 16:38:05 | System message | Purchase order response | – | Purchase order response
2086 | Confirm purchase order confirmation | 2018-12-20 09:52:02 | System message | Purchase order confirmation | – | Purchase order confirmation
2086 | Store purchase order confirmation | 2018-12-20 10:14:19 | Automated service | – | – | –
Fig. 1.3 Example of a bag-of-words at the case level generated by method 1
Through the implementation of Method 4, 53,270 bags-of-words were obtained at the activity level (for both the M-Repair participant and the M-Parts participant), considering that, on average, each case has 7 events, for the 7,610 case pairs that exceeded the U_t threshold. In this scenario, for each case pair, there is a 7 × 7 scoring matrix. Table 1.4 shows an example of the score matrix at the activity level, exhibiting the cosine similarity calculated for the pair of traces with identifier 1.
Table 1.3 Score matrix at the case level (excerpt)

Rows: cases contained in the event log of the M-Parts organization (CaseID 1, 2, …, 100); columns: cases contained in the event log of the M-Repair organization (CaseID 1, 2, …, 100). Each cell contains the cosine similarity calculated for the corresponding case pair; the values in the excerpt lie roughly between 0.75 and 0.99.
Table 1.4 Score matrix at the activity level

Rows: activities contained in the trace ID 1 (M-Parts); columns: activities contained in the trace ID 1 (M-Repair)
0.50 | 0.35 | 0.15 | 0.12 | 0.32 | 0.12 | 0.12 | 0.32 | 0.50
0.15 | 0.73 | 1.00 | 0.85 | 0.65 | 0.81 | 0.81 | 0.65 | 0.15
0.35 | 0.75 | 0.73 | 0.62 | 0.89 | 0.62 | 0.62 | 0.67 | 0.35
0.14 | 0.67 | 0.88 | 0.75 | 0.60 | 0.75 | 0.75 | 0.60 | 0.14
0.50 | 0.35 | 0.15 | 0.12 | 0.32 | 0.12 | 0.12 | 0.32 | 0.50
Then, these bags-of-words generated 372,890 activity-level similarities, of which 10 pairs of activities exceeded the threshold U_a; these activities are considered message-type tasks. The first column of Tables 1.5 and 1.6 shows the activities selected as message-type tasks by applying Method 4 for M-Repair and M-Parts, which is described in greater detail in the following paragraphs. Figure 1.4 shows the business process model for the M-Repair organization, discovered by executing Method 5. The split-miner algorithm implemented in Method 5 was configured with the values epsilon = 0.2 and eta = 0.4 as input variables, where epsilon is a parallelism threshold and eta is a percentile for the frequency threshold. The discovered model only represents the intra-organizational business process of a participant involved in the collaboration.
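Purely as an illustration of this discovery step: the chapter uses the Split Miner (Augusto et al. 2019), which is distributed as a separate tool, so the short sketch below substitutes pm4py's Inductive Miner to obtain an intra-organizational BPMN model from one participant's log. The file names are hypothetical, and this is not the authors' implementation.

```python
# Illustrative only: the chapter uses the Split Miner (epsilon = 0.2, eta = 0.4),
# which ships as a separate tool. This sketch discovers an intra-organizational
# BPMN model with pm4py's Inductive Miner instead, assuming an XES export of
# one participant's event log is available (file names are hypothetical).
import pm4py

log = pm4py.read_xes("m_repair_event_log.xes")
bpmn_model = pm4py.discover_bpmn_inductive(log, noise_threshold=0.2)
pm4py.write_bpmn(bpmn_model, "m_repair_intra_org_model.bpmn")
```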
Table 1.5 Type and sub-type of discovered tasks for the event log BPMNI of the M-Repair participant

Activity (message task) | Task sub-type (M-Repair) | Task sub-type (M-Parts) | Rule
Propose purchase order | Send | Receive | R1
Accept-proposal purchase order | Receive | Send | R3
Reject-proposal purchase order | Receive | Send | R3
Propose purchase order change | Receive | Send | R2
Confirm purchase order confirmation | Receive | Send | R2
Reject-proposal purchase order change | Send | Receive | R3
Accept-proposal purchase order change | Send | Receive | R3
Propose purchase order change update | Send | Receive | R1
Reject-proposal purchase order change update | Receive | Send | R3
Accept-proposal purchase order change update | Receive | Send | R3
Table 1.6 Type and sub-type of discovered tasks for the event log BPMNR of the M-Parts participant

Activity (message task) | Task sub-type (M-Parts) | Task sub-type (M-Repair) | Rule
Propose purchase order | Receive | Send | R2
Accept-proposal purchase order | Send | Receive | R3
Reject-proposal purchase order | Send | Receive | R3
Propose purchase order change | Send | Receive | R1
Confirm purchase order confirmation | Send | Receive | R1
Reject-proposal purchase order change | Receive | Send | R3
Accept-proposal purchase order change | Receive | Send | R3
Propose purchase order change update | Receive | Send | R2
Reject-proposal purchase order change update | Send | Receive | R3
Accept-proposal purchase order change update | Send | Receive | R3
Through Method 6 and its rules, the identification and marking of the type and sub-type of a task are enabled for each participant in the inter-organizational collaboration. Tables 1.5 and 1.6 show the activities that were identified as message-type tasks, the sub-type discovered for each message task, as well as the rule that allowed identifying the type and sub-type of the activity. From the M-Repair viewpoint (see Table 1.5), when applying the rules defined in Method 6, ten message-tasks were discovered in the event log L. The message-task called propose purchase order was identified by rule 1 with a sub-type send. Rule 1 is fulfilled because the propose purchase order task has an antecedent task that is part of set A and is directly connected to the message-task. On the other hand, from the M-Parts viewpoint (see Table 1.6), message tasks and sub-types for those tasks were identified. When comparing Tables 1.5 and 1.6, it can be seen that different rules detected the same task but with a distinct task sub-type, which is correct because the direction of the message is identified from each participant's viewpoint in the interaction. Additionally, the task that was discovered may meet conditions on the antecedent or consequent task. For example, the propose purchase order task was identified with a sub-type receive through rule 2, because a consequent task that belongs to set C is found, allowing the message task to be marked as a receive sub-type.
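The marking step above can be pictured with the following rough sketch; the rule conditions are paraphrased from the text (an antecedent in set A triggers rule 1, a consequent in set C triggers rule 2), and the sets, maps, and function names are hypothetical, not the authors' code.

```python
# Rough sketch of the sub-type marking (Method 6), paraphrased from the text:
# rule 1 marks a message task as "send" when one of its antecedents belongs to
# set A, rule 2 marks it as "receive" when a consequent task belongs to set C.
# The sets A and C and the predecessor/successor maps are hypothetical inputs.

def mark_subtypes(message_tasks, predecessors, successors, set_a, set_c):
    subtypes = {}
    for task in message_tasks:
        if any(p in set_a for p in predecessors.get(task, [])):
            subtypes[task] = ("send", "R1")
        elif any(s in set_c for s in successors.get(task, [])):
            subtypes[task] = ("receive", "R2")
        else:
            # Remaining message tasks (e.g. on gateway branches) fall to rule 3;
            # their direction is then resolved against the counterpart's log.
            subtypes[task] = (None, "R3")
    return subtypes

marks = mark_subtypes(
    message_tasks=["Propose purchase order"],
    predecessors={"Propose purchase order": ["Generate purchase order"]},
    successors={},
    set_a={"Generate purchase order"},
    set_c=set(),
)
print(marks)  # {'Propose purchase order': ('send', 'R1')}
```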
Fig. 1.4 Intra-organizational business process model discovered for the M-Repair organization
Therefore, after discovering the models BPMNI and BPMNR (Method 5), as well as marking the message-type tasks and their sub-types on these models, the algorithm creates a collaborative model from the discovered process models (Method 7). As defined above, the merge is performed in a pair-wise manner, allowing the discovery of the choreography of the collaborative business process. A message sequence flow connector is used to show the flow of messages between two participants that are prepared to send and receive them (BPMN 2.0 2011). Message sequence flow connectors are depicted as dashed lines with an empty circle showing where the message originates and an empty arrowhead where the message terminates, as illustrated in Fig. 1.5. These connectors represent the interactions between the customer and supplier processes. Every interaction represents a message flow associated with the business document(s) sent and received between the two collaborating processes. In Fig. 1.5, the send message task propose purchase order from the M-Repair participant is connected to the receive message task propose purchase order from the M-Parts participant through a message flow connector called message_0.

The message flow connector name is assigned in ascending order according to the position in which the pair of message tasks was discovered and marked in the model. First, all pairs of tasks that were marked by the rule 1 definition are specified with the message flow connector, in the direction from send to receive. Next, the message connector is added between the tasks that were identified by rule 2. Finally, in the collaborative model, the message flow connectors are added between the tasks that were detected by rule 3, which in this case are characterized by being located on outgoing sequence flows of all gateways contained in the diagram, as shown in Fig. 1.5.

In the discovered collaborative business process (see Fig. 1.5), six different behaviors are identified, determined by the negotiations that can be carried out in this process. For example, in one of the behaviors, the process starts when the M-Repair participant generates the purchase order business document (see Table 1.2, column Data Output), which is sent to the M-Parts participant through the message propose purchase order. This document contains details of the required items and purchase policies. The M-Parts participant evaluates the purchase order business document (see Table 1.1, column Data Input) and can respond with an accept-proposal purchase order message, thereby committing to provide the components required in the order. Then, the M-Parts participant generates a purchase order confirmation business document confirming the acceptance of the proposal through a document that contains the confirmation number, details of the required components, item quantities, stipulated delivery times, and sales policies. This document is sent using the confirm purchase order confirmation message (see Table 1.1, column Data Output), ending the process successfully.

Another behavior observed in the collaborative process diagram occurs when the M-Parts participant receives a purchase order through a message and responds with a counterproposal to the purchase order of the M-Repair participant. Then, M-Parts generates the business document purchase order change, which contains the conditions under which the purchase requirements can be met. This document is sent through the propose purchase order change message (see Table 1.1, column Data Output), which starts a new negotiation between the parties within the same process. The M-Repair participant evaluates the proposed changes in the document and can respond by accepting, rejecting, or proposing changes to the counter-proposal purchase order.
Fig. 1.5 Process choreography discovered from the merged event log
If the M-Repair participant responds with an accept-proposal message (see Table 1.2, column Data Output), it thereby accepts the proposed changes in the purchase order. When the participant M-Parts receives the acceptance message, it processes the document and generates a business document confirming the acceptance of the purchase order. Then, the behavior of the process continues as in the first example.

An instance of the behavior of the collaborative process executed between the participants M-Repair and M-Parts, where the negotiation of the purchase order proposal ends in an unsatisfactory way, is described below. When the M-Repair participant sends a message that contains the purchase order business document, the M-Parts participant evaluates the business document, determines that it is unable to comply with the proposal, and decides to respond with a reject-proposal purchase order message (see Table 1.1, column Data Output), by means of which it indicates that it does not accept the conditions set out in the purchase order business document, finishing the execution of the process between the participants.

As a result of the set of methods and rules proposed, the interaction of messages between the participants, which represents the discovered process choreography, is described below. This choreography is illustrated in the collaborative business process diagram shown in Fig. 1.5. The interaction begins when the customer (M-Repair) generates and proposes a purchase order to its supplier (M-Parts). This interaction between the parties is represented by the message flow connector message_0. The purchase order business document contains the identifiers of the required parts, quantities, proposed delivery dates, the number of deliveries to be made during the validity of the purchase order, and the number of parts per delivery.

The supplier evaluates the proposal and can respond with an accept-proposal message, a reject-proposal message, or by proposing changes to the purchase order submitted by the customer. This is indicated by the control flow segment with the exclusive data-based gateway with three paths, each containing a message task (see the XOR1 data-based gateway of Fig. 1.5). If the supplier sends an accept-proposal message (message flow connector message_4), it then creates a business document with the confirmation of the purchase order, which represents the acceptance of the terms of the purchase order. Next, the supplier sends a confirm message (message flow connector message_3) with the purchase order confirmation document. When M-Repair receives the confirmation, it processes and stores the information contained in the business document. Then, the purchase order management process ends successfully, and the customer waits for the reception of the parts at the scheduled times.

On the other hand, when M-Parts rejects the proposed purchase order, it notifies the customer through a message task called reject-proposal (message flow connector message_5). The above may occur when the supplier is not able to comply with the terms of the proposal, which is notified in the business document contained in the message, ending the collaborative process on both sides. In addition, the supplier can respond with a counterproposal, which indicates the conditions under which
the original proposal can be fulfilled and the proposed changes. These changes may be related to delivery times, quantities, or delivery numbers. For this, it generates a business document (purchase order change) and sends it as a proposal to the customer (message flow connector message_1). When the customer receives the proposal (message_1), it evaluates the business document with the proposed changes and can respond with either an accept-proposal message or a reject-proposal message (see the XOR2 data-based gateway of Fig. 1.5). In the latter case, if the customer determines that the supplier's proposal does not meet its requirements, it sends a reject-proposal message (message flow connector message_7), which finishes the negotiation between the participants. Otherwise, the customer sends an accept-proposal message (message flow connector message_6), indicating that the purchase order changes are accepted. Then, the supplier must generate the document with the confirmation of the purchase order and send it as the content of a confirmation message (message_3) to the customer.

Moreover, the customer can respond with a business document containing changes to the supplier's counterproposal, generating a document with the new proposal and sending it with a propose purchase order change update message (message flow connector message_2). This happens when the customer decides to acquire the components for which the supplier complies with the previously stipulated terms and suspends the purchase of the parts for which its requirements are not met. Finally, when M-Parts receives the update proposal message (message_2), it executes an activity to evaluate the proposal. It can decide to accept the proposal (see the XOR3 data-based gateway of Fig. 1.5) by replying with an accept-proposal message (message flow connector message_8) and generate a document with the confirmation of the purchase order, which is sent within the content of a confirm message (message_3), with which the collaborative process ends satisfactorily. Otherwise, it refuses the proposal (see the XOR3 data-based gateway of Fig. 1.5) by replying with a reject-proposal message (message flow connector message_9), which finishes the negotiation of the collaborative business process unsatisfactorily.

The results obtained when implementing the methodology with its methods and rules allowed us to confirm, through the exposed scenario, its effectiveness in discovering the correlation between the events contained in the logs generated by the execution of a collaborative business process. The defined rules made it possible to merge the event logs correctly and automatically, by identifying the public activities of the process, that is, the message-type tasks and the sub-type of each detected message task. The knowledge acquired through the deployment of the methods made it possible to discover the choreography of the process, in other words, the interaction of messages between the participants of a collaborative business process, respecting the principles defined in the BPMN language. The automatic discovery of process choreography is an important contribution to the inter-organizational collaboration domain.
1.6 Conclusions

The methodology based on merging methods and rules facilitates the fusion of historical process data at a structured event-log level so that, in the next phase, the process choreography can be discovered, which is a relevant challenge in the inter-organizational process mining domain. The approach systematically analyzes and identifies the events of the log of each partner involved in the inter-organizational collaboration. The analysis discovers from the event logs a set of activity pairs used to represent the correlation between message-type tasks, according to a cosine similarity scoring matrix computed at the trace level and at the activity level. The set of methods and rules defined allows identifying the message-type tasks, as well as the sub-type of each message task. In conjunction with the results of phase 2 of the methodology, the approach has the knowledge to specify the message flow direction and assign the link between message tasks using a message sequence flow connector, enabling the discovery of the process choreography among the participants of the inter-organizational collaboration. The approach has been applied to a real-life event log of a collaborative business process of product purchasing. The obtained results have shown the effectiveness of the proposed approach in a real context, enabling process choreography discovery after merging the event logs of the participants in an inter-organizational environment.

The results obtained allow us to conclude that the methods of phase 2 of the methodology are fundamental for the functionality of the approach; the selection at the case level and at the activity level made it possible to identify the correlation between the event logs of the participants. This correlation is the basis for associating the public tasks between the business process models and enables the identification of the message-type tasks, as well as the sub-type of each message task, that is, send or receive. We reuse the split-miner algorithm to discover business processes from each participant's event log. In the discovered models, the identification and marking of the message-type tasks were carried out, and the direction of the message flow was defined, which is implemented using the sub-type of the message task. These functionalities make it possible to discover the relationship between the public tasks of the business process models involved in the collaboration, which is known as process choreography. In summary, a collaborative business process model is discovered, as verified in the scenario displayed.

In the current implementation of the algorithm, we only consider simple relationships between event logs. Future research will mainly focus on investigating the possibilities of more complex merging rules and an algorithm implementation able to identify complex many-to-one and many-to-many relationships. Additionally, the possibility of merging event logs in scenarios with more than two participants in an inter-organizational collaboration should be considered.
Acknowledgements The authors are grateful to the Autonomous University of Tamaulipas, Mexico, for supporting this work. This research chapter was also supported by Mexico's National Council of Science and Technology (CONACYT) under grant number 709404, as well as by the Cátedras CONACYT project 214.
References

Augusto A, Conforti R, Dumas M et al (2019) Split miner: automated discovery of accurate and simple business process models from event logs. Knowl Inf Syst 59:251–284. https://doi.org/10.1007/s10115-018-1214-x
Bala S, Mendling J, Schimak M, Queteschiner P (2018) Case and activity identification for mining process models from middleware. In: Lecture notes in business information processing. Springer, pp 86–102
Barcelona MA, García-Borgoñón L, Escalona MJ, Ramos I (2018) CBG-Framework: a bottom-up model-based approach for Collaborative Business Process Management. Comput Ind 102:1–13. https://doi.org/10.1016/j.compind.2018.06.002
BPMN 2.0 (2011) Business Process Modeling Notation 2.0
Cheng L, Van Dongen BF, Van Der Aalst WMP (2017) Efficient event correlation over distributed systems. In: Proceedings–2017 17th IEEE/ACM international symposium on cluster, cloud and grid computing, CCGRID 2017. Institute of Electrical and Electronics Engineers Inc., pp 1–10
Claes J, Poels G (2014) Merging event logs for process mining: a rule based merging method and rule suggestion algorithm. Expert Syst Appl 41:7291–7306. https://doi.org/10.1016/j.eswa.2014.06.012
Dumas M, La Rosa M, Mendling J et al (2018a) Process-aware information systems. Fundamentals of business process management. Springer, Berlin, pp 341–369
Dumas M, La Rosa M, Mendling J et al (2018b) Introduction to business process management. Fundamentals of business process management. Springer, Berlin, pp 1–33
Engel R, Krathu W, Zapletal M et al (2016) Analyzing inter-organizational business processes: process mining and business performance analysis using electronic data interchange messages. Inf Syst E-bus Manag 14:577–612. https://doi.org/10.1007/s10257-015-0295-2
Han J, Kamber MPJ (2012) Data mining: concepts and techniques. Morgan Kaufmann, USA
Kalenkova AA, van der Aalst WMP, Lomazova IA, Rubin VA (2017) Process mining using BPMN: relating event logs and process models. Softw Syst Model 16:1019–1048. https://doi.org/10.1007/s10270-015-0502-0
Köpke J, Franceschetti M, Eder J (2019) Optimizing data-flow implementations for interorganizational processes. Distrib Parallel Databases 37:651–695. https://doi.org/10.1007/s10619-018-7251-3
Lazarte IM, Villarreal PD, Chiotti O, Thom LH IC (2011) An MDA-based method for designing integration process models in B2B collaborations. In: Proceedings of the 13th international conference on enterprise information systems (ICEIS). SciTePress, pp 55–65
Long Q (2017) A framework for data-driven computational experiments of inter-organizational collaborations in supply chain networks. Inf Sci (Ny) 399:43–63. https://doi.org/10.1016/j.ins.2017.03.008
Manning CD, Raghavan PSH (2008) Introduction to information retrieval. Cambridge University Press, USA
Mehdiyev N, Evermann J, Fettke P (2020) A novel business process prediction model using a deep learning method. Bus Inf Syst Eng 62:143–157. https://doi.org/10.1007/s12599-018-0551-3
Nguyen H, Dumas M, ter Hofstede AHM et al (2019) Stage-based discovery of business process models from event logs. Inf Syst 84:214–237. https://doi.org/10.1016/j.is.2019.05.002
Pourmirza S, Dijkman R, Grefen P (2017) Correlation miner: mining business process models and event correlations without case identifiers. Int J Coop Inf Syst 26. https://doi.org/10.1142/S0218843017420023
Pradabwong J, Braziotis C, Pawar KS, Tannock J (2015) Business process management and supply chain collaboration: a critical comparison. Logist Res 8:1–20. https://doi.org/10.1007/s12159-015-0123-6
Raichelson L, Soffer P, Verbeek E (2017) Merging event logs: combining granularity levels for process flow analysis. Inf Syst 71:211–227. https://doi.org/10.1016/j.is.2017.08.010
Rovani M, Maggi FM, De Leoni M, Van Der Aalst WMP (2015) Declarative process mining in healthcare. Expert Syst Appl 42:9236–9251. https://doi.org/10.1016/j.eswa.2015.07.040
Salam MA (2017) The mediating role of supply chain collaboration on the relationship between technology, trust and operational performance: an empirical investigation. Benchmarking 24:298–317. https://doi.org/10.1108/BIJ-07-2015-0075
Simatupang TM, Sridharan R (2018) Complementarities in supply chain collaboration. Ind Eng Manag Syst 17:30–42. https://doi.org/10.7232/iems.2018.17.1.030
Sousa MJ, Cruz R, Dias I, Caracol C (2017) Information management systems in the supply chain. In: Handbook of research on information management for effective logistics and supply chains. IGI Global, pp 469–485
Tello-Leal E, Villarreal PD, Chiotti O et al (2016) A technological solution to provide integrated and process-oriented care services in healthcare organizations. IEEE Trans Ind Inform 12:1508–1518. https://doi.org/10.1109/TII.2016.2587765
van der Aalst W (2016) Data science in action. Process mining. Springer, Berlin, pp 3–23
van der Aalst WMP (2018) Business process management. Encyclopedia of database systems. Springer, New York, pp 370–374
van der Aalst WMP (2013) Business process management: a comprehensive survey. ISRN Softw Eng 2013:1–37. https://doi.org/10.1155/2013/507984
Van Dongen BF, De Medeiros AKA, Verbeek HMW et al (2005) The ProM framework: a new era in process mining tool support. In: Lecture notes in computer science. Springer, pp 444–454
Weske M (2019) Process choreographies. Business process management. Springer, Berlin, pp 259–306
Xu Y, Yuan F, Lin Q et al (2018) Merging event logs for process mining with a hybrid artificial immune algorithm. Ruan Jian Xue Bao/J Softw 29:396–416. https://doi.org/10.13328/j.cnki.jos.005253
Chapter 2
Towards Association Rule-Based Item Selection Strategy in Computerized Adaptive Testing

Josué Pacheco-Ortiz, Lisbeth Rodríguez-Mazahua, Jezreel Mejía-Miranda, Isaac Machorro-Cano, and Ulises Juárez-Martínez

Abstract One of the most important stages of Computerized Adaptive Testing (CAT) is the selection of items, in which various methods are used, each with certain weaknesses at the time of implementation. Therefore, in this chapter, the integration of Association Rule Mining is proposed as an item selection criterion in a CAT system. Specifically, we present the analysis of association rule mining algorithms such as Apriori, FPGrowth, PredictiveApriori, and Tertius on three data sets obtained from the subject Databases, to know the advantages and disadvantages of each algorithm and choose the most suitable one to employ in an association rule-based CAT system that is being developed as a Ph.D. project. We compare the algorithms considering the number of rules discovered, average support and confidence, lift, and speed. According to the experiments, Apriori found rules with greater confidence, support, and lift, and in less time.

Keywords Computerized adaptive testing · Association rules · E-Learning · Intelligent systems
2.1 Introduction

The development of technology in recent years has revolutionized the way in which various sectors carry out their activities, providing new methods and means that allow them to simplify, adapt, and improve processes traditionally carried out by hand. One of the sectors that has benefited is education, which has found a great source of help in electronic tools. As García-Peñalvo and Seoane Pardo (2015) rightly establish, the emergence of Information and Communication Technologies as an educational tool is a point of conceptual and methodological inflection in the way that institutions, educational or not, face educational processes and learning management, especially with regard to the concept of distance education, which evolves, in a more or less significant way, by adopting the Internet as a medium, resulting in the term e-Learning.

E-Learning is the abbreviated English term for electronic learning, which refers to online teaching and learning through the Internet and technology. This system contributes to improving interactivity and collaboration between those who learn, and/or between them and those who teach. It also allows the customization of learning programs to the particular characteristics of each student, as well as self-assessment. Evaluation is a fundamental piece of the teaching-learning process, since, beyond providing a grade to the student, it gives teachers information about the student's strengths and weaknesses, so that actions can be taken to improve the quality of teaching. Over time, tests have generally been the most common and effective way of evaluating the student's knowledge or ability. While traditional evaluation is commonly conducted physically on paper and is the same for all students belonging to a group, in the e-Learning system it is possible to customize electronic tests depending on the level of knowledge each student has, resulting in what is known as CAT (Computerized Adaptive Testing) (Chen et al. 2020).

CAT dynamically selects and administers the most appropriate questions depending on the previous answers given by those examined (i.e., the questions that actually provide useful information about their ability). Some advantages that CAT has over traditional exams are that the tests are independent, individual, and on-demand; scores are obtained immediately; administration takes less time; and costs are lower (Pan and Lin 2018). The basic elements that make up the general structure of this type of examination are:

• A bank of items with estimated parameters from a given model.
• A procedure that sets out how to start and end the test, as well as how to progressively select the best items, and
• A statistical method of estimating knowledge levels.

This chapter is focused on the item selection criterion (Miyazawa and Ueno 2020). Although the most widely used criterion is Fisher's Maximum Information (Albano et al. 2019), it presents several weaknesses that generate a certain degree of mistrust, for example, the bias in the item selection, estimation errors at the start of the exam, or the same question being displayed repeatedly to the tested one (Sheng et al. 2018; Ye
and Sun 2018; Du et al. 2019; Lin and Chang 2019; Yigit et al. 2019). Therefore, in this chapter, the development of a CAT system that uses association rules for the selection of items is proposed, focusing on using the potential of association rules to find relationships between the questions answered correctly or incorrectly and the questions answered correctly, and thus present the most appropriate questions (those most likely to be answered correctly) in the tests, according to the responses of the person evaluated, considering the best rules (stored in the database, from students who submitted the same test previously) with the greatest support, confidence, and lift.

Several research projects have used association rule mining (ARM) with different algorithms in their development. For example, in Rubio Delgado et al. (2018), the authors applied Apriori, FPGrowth, PredictiveApriori, and Tertius, grouping them according to their configuration characteristics to compare them. Apriori and FPGrowth were contrasted using different support and confidence values, whereas for PredictiveApriori and Tertius the number of rules had to be specified; the execution time, the number of generated rules, and the support and confidence values were taken into account in all cases. In contrast, Wang et al. (2018) worked with the Apriori algorithm, using the minimum support and confidence for the comparison process, and the set of generated rules was refined based on the minimum lift, the Chi-squared test, and the minimum improvement. In Prajapati et al. (2017), apart from support and confidence, all-confidence, cosine, the interestingness of a rule, lift, execution time, and conviction were used in the process of comparing the DFMP (Distributed Frequent Pattern Mining), CDA (Count Distribution Algorithm), and FDM (Fast Distributed Mining) algorithms.

The objective of this chapter is to present a comparative analysis of various ARM algorithms that allows selecting the most suitable one for implementation in the proposed CAT system. The remainder of this work is organized as follows: Sect. 2.2 shows the background. Section 2.3 describes the works related to this research. Section 2.4 presents the integration of ARM in the CAT process and the comparison method that was followed. Section 2.5 displays the results and their analysis. Section 2.6 shows some recommendations. Finally, Sect. 2.7 gives conclusions and future work.
2.2 Background

The traditional process followed by a CAT is shown in Fig. 2.1. It begins with the initial estimate of the knowledge of the person evaluated; then the first item is selected and shown to the student. Once the student's answer is obtained, a new knowledge estimate is made. The next step is to check whether the stop criterion is met. If it is not, the next item is selected and the cycle starts again, repeating until the stop criterion is met. Once this happens, the process ends.
Fig. 2.1 Traditional CAT process
Over the years, in different projects, various tools have been applied in the development of the phases that make up CATs. For example, the three-parameter logistic model (3PL) for item calibration (Lee et al. 2018); maximum likelihood estimation (MLE) for estimating the examinee's ability (Albano et al. 2019); and root mean square differences as an evaluation criterion (Stafford et al. 2019), among others. Specifically, for the item selection stage, work has been done to solve the problems presented by Fisher's Maximum Information (MFI), using other selection strategies, for example, Bayesian networks (Tokusada and Hirose 2016), the Greedy algorithm (Bengs and Kröhne 2018), Kullback–Leibler Information (KL) (Cheng et al. 2017), and Minimum Expected Subsequent Variance (Rodríguez-Cuadrado et al. 2020), to mention a few, which, while achieving favorable results, have mostly been evaluated only in simulation studies and not in a real application.
2.3 Related Work

This section includes related work that has focused on the item selection phase of computerized adaptive testing systems over the past ten years; these works are summarized in Tables 2.1, 2.2, and 2.3.
Table 2.1 Related works (A)

Work | Item selection strategy | Validation
Barla et al. (2010) | MI (Maximum information); Selection based on structure; Based on prioritization | Real application
Cheng (2010) | MMGDI | Simulation
Ueno and Songmuang (2010) | Decision trees | Simulation
Wang and Chang (2011) | KL; FI (Fisher information); Mutual information method; Continuous Entropy method | Simulation
Wang et al. (2011) | KL Simplified | Simulation
Han (2012) | Efficiency balanced information | Simulation
Huang et al. (2012) | FI; Progressive method; Proportional method | Simulation
Olea et al. (2012) | MI | Simulation
Moyer et al. (2012) | MI | Simulation
Frey et al. (2013) | Own algorithm with Bayesian focus | Simulation
Wang (2013) | Mutual information | Simulation
A new strategy for the item selection stage of CAT was proposed in Barla et al. (2010), which is a combination of three methods: the first method is based on the course structure and focuses on the selection of the most appropriate topic for learning; the second uses Item Response Theory to select the k-best questions with adequate difficulty for a particular learner; and the last is based on the usage history and prioritizes questions according to specific strategies. For its part, Cheng (2010) also proposed a new item selection method called MMGDI (Modified maximum global discrimination index method), which captures two aspects of the appeal of an item: (a) the amount of contribution it can make toward adequate coverage of every attribute, and (b) the amount of contribution it can make to recover the latent cognitive profile. In contrast, Ueno and Songmuang (2010) used decision trees as an item selection strategy, which was compared to traditional selection methods. Another related work is Wang et al. (2011), which focused on analyzing and comparing four methods of item selection: D-optimality, the KL information index, continuous entropy, and mutual information; the latter presented the best results in the simulation studies, as it not only improved the overall estimation accuracy but also yielded the smallest conditional mean squared error.
Table 2.2 Related works (B)

Work | Item selection strategy | Validation
Mao and Xin (2013) | Monte Carlo Method | Simulation
Kaplan et al. (2015) | MPWKL; DGI | Simulation
Finkelman et al. (2014) | PWKL per-time-unit; Mutual information per-time-unit | Simulation
Kröhne et al. (2014) | MPI | Simulation
Koedsri et al. (2014) | MI; Random selection; CW-ASTR | Simulation
Su and Huang (2015) | MPI modified; PI (priority index) | Simulation
Wei and Lin (2015) | MI | Simulation
Wang et al. (2015) | RHA; KL expected discrimination | Simulation
Veldkamp (2016) | MFI | Simulation
Cheng et al. (2017) | MIT; S-MIT; Random selection | Simulation
Choe et al. (2018) | BMIT; MIB; GMIT | Simulation
Tu et al. (2018) | MKB method (modified method of posterior expected KL information); MCEM (modified continuous entropy method) | Simulation
Bengs and Kröhne (2018) | Greedy Algorithm | Simulation
Sheng et al. (2018) | MFI progressive; Method of stratification of information | Simulation
Ye and Sun (2018) | D-Optimality; Bayesian D-Optimality; A-Optimality; Bayesian A-Optimality; Random selection | Simulation
Yigit et al. (2019) | JSD | Simulation
Chen et al. (2020) | SDC | Simulation
Lin and Chang (2019) | SWDGDI | Simulation
In the meantime, Wang et al. (2011) relied on the Kullback–Leibler method to propose a new strategy to select items called "KL Simplified". In Han (2012), the author introduced a new item selection method using the "efficiency balanced information" criterion, which chooses items with low discrimination values while eliminating the need for item pool stratification.
Table 2.3 Related works (C)

Work | Item selection strategy | Validation
Du et al. (2019) | ASBT; MIB; GMIT | Simulation
van der Linden and Ren (2020) | MFI | Simulation
Jatobá et al. (2020) | FI; KL; KLP; MLWI; MPWI | Simulation
In Huang et al. (2012), the authors carried out an analysis of three item selection methods: Fisher Information, the Proportional Method, and the Progressive Method; the latter achieved the best results in measurement precision. The problem of capitalization on chance in CAT and some of its effects on the precision of the ability estimations was addressed by Olea et al. (2012), whose solution proposal was to add two exposure control methods to the item selection process. Regarding Moyer et al. (2012), three constraint balancing systems were evaluated, CCAT (constrained CAT), FMCCAT (flexible modified constrained CAT), and WPM (weighted penalty model), under two scenarios for the item selection process: applying item exposure control methods and not applying control methods. On the other hand, an item selection algorithm with a Bayesian focus was developed by Frey et al. (2013) to carry out simulation studies to measure reliability in multidimensional adaptive testing. A new selection method called Mutual Information for CD-CAT (Cognitive diagnostic computerized adaptive testing) was proposed by Wang (2013); in the simulation studies, it was compared to the Kullback–Leibler Index, PWKL (Posterior weighted KL Information Index), and Shannon Entropy selection methods. The conclusion was that the proposed method consistently results in nearly the highest attribute and pattern recovery rate. On the other hand, the Monte Carlo method was proposed by Mao and Xin (2013) as an item selection strategy in cognitive diagnostic CAT, which showed better behavior in measurement accuracy and item exposure control compared to the MMGDI method. Equally important is the work of Kaplan et al. (2015), which proposed two new item selection methods called MPWKL (modified posterior-weighted KL index) and DGI (generalized deterministic inputs, noisy and gate model discrimination index), which were compared to the PWKL method in simulation studies; the results showed that MPWKL and GDI perform very similarly and have higher correct attribute classification rates or shorter mean test lengths compared with PWKL. At the same time, different versions of the PWKL and MI (mutual information) item selection methods were proposed by Finkelman et al. (2014), which were called "PWKL per-time-unit" and "MI per-time-unit". These were compared in simulation studies with their source versions to know their strengths against them. The results indicated that, on average, the new methods required more items but took less time than the standard procedures.
In Kröhne et al. (2014), the authors worked on a modified version of MAT (Multidimensional adaptive testing) called CMAT (Constrained MAT), which, unlike MAT, does not allow mixing items between dimensions when selecting, thus avoiding the invalidity of the properties of the items. Likewise, in Koedsri et al. (2014), a new method of item selection called CW-ASTR (constraint-weighted a-stratification method) was proposed, which was compared in simulation studies against MI and random selection in variable-length CAT. In Su and Huang (2015), a new version of the PI (priority index) item selection method was developed, called MPI (multidimensional PI), conducting simulation tests with two control conditions and four experimental conditions. At the same time, in Wei and Lin (2015), various simulation tests were carried out to evaluate the impact of "out-of-level" item management on those evaluated, i.e., selecting items in the tests that are outside the range of the examinee's estimated level. Similarly, in Wang et al. (2015), two new methods of item selection, RHA (randomization halving algorithm) and KL-ED (KL expected discrimination), were presented and tested to improve item-bank usage without sacrificing too much measurement precision, which is a deficiency of traditional methods for selecting items in CD-CAT. As for Veldkamp (2016), the item selection method used is MFI (maximum Fisher information), which also considers response times; simulation studies were carried out to compare the selection process with and without response times. Equally important are the results presented by Cheng et al. (2017), in which a new MIT-based (maximum information per time unit method) item selection method was proposed, called MIT-S (simplified MIT); MIT and MIT-S were tested in simulation studies with the 1PL, 2PL, and 3PL models in order to evaluate their behavior. The results indicated that when the underlying IRT model is the 2PL or 3PL, the MIT-S method maintains measurement precision and saves testing time, and if the underlying model is the 1PL model, the MIT-S method maintains the measurement precision and saves a considerable amount of testing time. The simulation studies carried out by Choe et al. (2018) compared three methods of selecting MI-based items that use response times, BMIT (b-partitioned MIT), MIB (MI with b-matching), and GMIT (Generalized MIT), as well as various item exposure control methods. For their part, Tu et al. (2018) presented two new methods for the selection of items, the MKB method (modified method of posterior expected KL information) and MCEM (modified continuous entropy method), which were compared to traditional item selection methods. The results showed that, when considering two dimensions, MCEM presented the lowest item exposure rate and relatively high accuracy, while when considering more than two dimensions, MCEM and MUI (mutual information) keep relatively high estimation accuracy, and the item exposure rates decrease as the correlation increases. Like the previous work, Bengs and Kröhne (2018) also focused their studies on the proposal of a new item selection method based on the Greedy Algorithm, which was tested in simulation studies with matroid constraints. In turn, Sheng et al. (2018) developed two new item selection methods.
The first one is a modified version of MFI called Progressive MFI, and the second one is known as the Information Stratification Method, whose effectiveness was validated and compared to traditional selection methods through simulation tests with the Monte Carlo algorithm.
Meanwhile, Ye and Sun (2018) compared the item selection methods D-Optimality, Bayesian D-Optimality, A-Optimality, Bayesian A-Optimality, and random selection in an MCAT system under various conditions for dichotomous and polytomous testing data, using two models for item calibration, the multidimensional 2PL and the MGRM (multidimensional graded response model), with Bayesian A-Optimality obtaining the best result. On the other hand, Yigit et al. (2019) presented a new item selection method called the JSD (Jensen–Shannon divergence) index for the MC-DINA (multiple-choice deterministic inputs, noisy "and" gate) model, which was compared to other selection methods, GDI (G-DINA model discrimination index) and random selection. The results showed that the proposed model improves the attribute classification accuracy significantly by considering the information from distractors, even with a very short test length. Similarly, Chen et al. (2020) proposed a new item selection method called SDC (dynamic stratification method based on dominance curves), which is aimed at improving trait estimation, because when CAT is under rigorous item exposure control, the precision of trait estimation decreases substantially. Of equal importance, Lin and Chang (2019) developed a new item selection method called SWDGDI (standardized weighted deviation global discrimination index) for CD-CAT, which balances attribute coverage and the exposure rate without severe loss in estimation accuracy. Another approach is Du et al. (2019), where three item selection methods, ASBT (a-stratification b-blocking with time), MIB (maximum information with beta matching), and GMIT (generalized maximum information with time), were applied in an OMST (on-the-fly multistage adaptive testing) system adding response times. Then, in van der Linden and Ren (2020), a new MCMC (Markov chain Monte Carlo) algorithm was proposed, which accounts for the effects of errors in ability and item parameters in adaptive tests through a joint real-time posterior distribution of all parameters, allowing the examinee's ability to be scored and the items to be optimally selected. Finally, the studies of Jatobá et al. (2020) focused on the creation of a custom item selection method called ALICAT (personALIzed CAT), which is composed of a mixture of several traditional selection methods, among which are FI, KL, MLWI (maximum likelihood weighted information), and MPWI (maximum posterior weighted information). As the main result, these studies obtained a considerable reduction in the number of items necessary to achieve a correct estimate of the skill of the person evaluated.

Tables 2.1, 2.2, and 2.3 summarize the related work, specifying the item selection methods used, as well as their validation process, that is, whether it was carried out in simulation studies or in a real application.

We propose using ARM as an item selection criterion because we can exploit its ability to find associations or correlations between the elements or objects of a database (in this case, test answers given by other students in the past). It has many advantages, among which are: (1) associations can occur between correct/incorrect answers and correct ones; (2) they determine the suitable item according to the answer of the person evaluated; and (3) the items presented to the examinees are selected considering interestingness metrics widely used in related works. ARM has been used in various areas, among which are recommendation systems
(Dahdouh et al. 2019) and online learning (Gu et al. 2018), offering positive results in each case; however, to the best of our knowledge, its use as a selection strategy for CAT has not been reported. Therefore, this project contemplates the integration of ARM in the item selection stage of CAT. The expected outcome at the end of the project is a system that exploits the benefits of both CAT and association rules in the educational evaluation process, resulting in a final product that is not only adaptive but also learns and evolves according to the experience it accumulates over time.
2.4 Methods and Analysis

The following subsections specify the integration of ARM in the CAT process and the method followed for comparing ARM algorithms. The first subsection shows the proposed CAT process. The second subsection describes the data bank used for the comparison. The third subsection details the algorithms used.
2.4.1 CAT Process with ARM as Item Selection Criterion

The proposed CAT follows the process shown in Fig. 2.2. It begins with an initial estimate of student knowledge to select and present the first item. Once the student answers a question, the CAT makes a new knowledge estimate. If the answer to the previous question is correct, a question with a higher level of complexity is chosen using association rules, as long as the stop criterion is not met. Otherwise, an item with a lower level of complexity is chosen, also selected according to association rules. Then, the CAT presents the item to the student and recalculates the estimate of his/her level of knowledge. This cycle repeats itself until the stop criterion is met. When this happens, the CAT saves all the information, makes a final estimate of knowledge, displays the grade to the student, who logs out, and automatically saves new association rules, which will serve the next time a student submits the exam.
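A compact sketch of the selection loop just described is shown below; the rule store, grading routine, and stop criterion are hypothetical placeholders rather than the system under development.

```python
# Compact sketch of the proposed loop: after each answer, the next item is taken
# from stored association rules whose antecedent matches the last response.
# rule_store maps (item_id, answer) -> candidate next items ranked by
# confidence, support, and lift; all interfaces here are hypothetical.

def run_adaptive_test(first_item, rule_store, grade_answer, stop_criterion):
    history = []                       # (item, correct) pairs
    item = first_item
    while not stop_criterion(history):
        correct = grade_answer(item)   # present the item, obtain 1 or 0
        history.append((item, correct))
        candidates = rule_store.get((item, correct), [])
        asked = {i for i, _ in history}
        candidates = [c for c in candidates if c not in asked]
        if not candidates:
            break                      # no applicable rule: end the test
        item = candidates[0]           # best-ranked rule consequent
    return history
```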
2.4.2 Collection and Preparation of Data

Information from pencil-and-paper tests corresponding to three units of the Databases course of the Master's program in Computer Systems was used to create a database in MySQL. From the database records, three binary matrices were created to serve as the basis for applying the ARM algorithms. In each binary matrix, questions are represented by the columns and examinees by the rows, where 1 corresponds to a correct answer and 0 corresponds to an incorrect answer.
Fig. 2.2 Integration of association rule mining in the item selection phase of the CAT process
The first binary matrix, called Exa1, corresponds to the answers of the first unit and includes thirty questions from twenty-five students. The second binary matrix, called Exa2, corresponds to the answers of the second unit and covers thirty questions from twenty-five students. The third binary matrix, called Exa3, corresponds to the answers of the third unit and covers ten questions from twenty-five students. According to the WEKA (Waikato Environment for Knowledge Analysis) tool specifications, the three data sets were analyzed based on their characteristics, and it was observed that they did not need any other processing, so they were ready for the next step of the process.
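As an illustration, the sketch below shows how answer records pulled from the MySQL database could be pivoted into a binary matrix such as Exa1; the record layout and field names are hypothetical.

```python
# Illustrative sketch: pivot raw answer records into a binary matrix such as
# Exa1 (rows = examinees, columns = questions, 1 = correct, 0 = incorrect).
# The records and their field names are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "student": ["s01", "s01", "s02", "s02"],
    "question": ["Q1", "Q2", "Q1", "Q2"],
    "correct": [1, 0, 1, 1],
})

exa1 = (records.pivot(index="student", columns="question", values="correct")
               .fillna(0)
               .astype(int))
print(exa1)
```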
2.4.3 Evaluation of Algorithms

There are several metrics to evaluate association rules, such as the interest factor, support, confidence, lift, rule interest, conviction, the Laplace measure, the certainty factor, the odds ratio, and cosine similarity (Yan et al. 2009). There are also some newer metrics such as bi-lift, bi-improve, bi-support (Ju et al. 2015), ID (Items-based Distance), and Data Rows-based Distance (Djenouri et al. 2014). However, the most utilized
are support, confidence, and lift, which are used in this project, adding the time factor and the number of rules as well. For the comparative analysis of the association rule algorithms, the following criteria were evaluated:

• Confidence: it assesses the degree of certainty of the detected association.
• Support: it represents the percentage of transactions from the database that the given rule satisfies.
• Time: the number of milliseconds that the construction of a model takes.
• Rules: it represents the number of interesting rules obtained.
• Lift: it indicates the ratio of the observed support of a set of items to the theoretical support of that set under the assumption of independence.

The purpose of this comparison process is to identify the algorithm that provides the rules that meet the following search criteria: (1) rules with one antecedent and one consequent, and (2) rules with a consequent value equal to 1 (correct answer). For example:

Item5 = 1 ⇒ Item6 = 1 or Item3 = 0 ⇒ Item4 = 1

where value 1 means the question had a correct answer and 0 means it had a wrong answer. All of them should have the highest levels of confidence and support and be found in the shortest possible time.
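For reference, a standard formulation of the three interestingness measures listed above (not quoted from the chapter) is, for a rule X ⇒ Y over a set of transactions T:

```latex
\mathrm{support}(X \Rightarrow Y) = \frac{|\{t \in T : X \cup Y \subseteq t\}|}{|T|}, \qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}, \qquad
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{confidence}(X \Rightarrow Y)}{\mathrm{support}(Y)}
```

A lift greater than 1 indicates that the antecedent and consequent occur together more often than expected under independence.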
2.
3.
Apriori (Agrawal et al. 1993). It is a classic algorithm for association rule mining. It generates rules through an incremental process that searches for frequent relationships between attributes bounded by a minimum confidence threshold. The algorithm can be configured to run under certain criteria, such as upper and lower coverage limits, and to accept sets of items that meet the constraint, the minimum confidence, and order criteria to display the rules, as well as a parameter to indicate the specific number of rules we want to show. FPGrowth (Han et al. 2012a). It is based on Apriori to perform the first exploration of the data, in which it identifies the sets of frequent items and their support, a value that allows us to organize the sets in a descending way. The method proposes good selectivity and substantially reduces the cost of the search, given that it starts by looking for the shortest frequent patterns and then concatenating them with the less frequent ones (suffixes), and thus identifying the longest frequent patterns. PredictiveApriori (Scheffer 2001). The algorithm achieves a favorable computational performance due to its dynamic pruning technique that uses the upper bound of all rules of the supersets of a given set of elements. In addition, through a backward bias of the rules, it manages to eliminate redundant ones that derive from the more general ones. For this algorithm, it is necessary to specify the number of rules that are required.
2 Towards Association Rule-Based Item Selection Strategy …
4.
39
Tertius (Flach and Lachiche 2001). It performs an optimal search based on finding the most confirmed hypotheses using a no redundant refinement operator to eliminate duplicate results. The algorithm has a series of configuration parameters that allow its application to multiple domains.
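As a concrete, minimal example of the kind of rules being sought, the sketch below mines a toy binary matrix and keeps only rules with one antecedent and one consequent whose value is 1, mirroring the search criteria stated above. It uses the open-source mlxtend library as a stand-in for the WEKA implementations actually compared in the study, so the thresholds and output are purely illustrative; antecedents with a value of 0 (incorrect answers) would require additional complementary columns, omitted here for brevity.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Toy binary matrix in the Exa1 format: rows = examinees, columns = items,
# True = correct answer (the real matrices are 25 students x 30 questions).
exam = pd.DataFrame({
    "Item5": [1, 1, 1, 0, 1],
    "Item6": [1, 1, 1, 0, 1],
    "Item7": [1, 0, 1, 1, 1],
}).astype(bool)

frequent = apriori(exam, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)

# Keep rules with exactly one antecedent and one consequent; since only correct
# answers are encoded as True, every antecedent/consequent here means "item = 1".
rules = rules[(rules["antecedents"].apply(len) == 1) &
              (rules["consequents"].apply(len) == 1)]
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```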
For a better understanding of the comparison process, the algorithms were grouped based on their characteristics. A comparison was first carried out between Apriori and FPGrowth, since both allow setting different values for the confidence (min_conf), support (min_sup), and lift, to obtain four different groups of rules (15, 20, 25, and 50) with one antecedent and one consequent, where the value of the latter is equal to 1. For each case, the response recorded was the time in milliseconds consumed in the execution of the algorithm, the confidence, the support, and the lift. For the comparison of PredictiveApriori and Tertius, it was also necessary to specify the number of rules required, obtaining as a response for each case the time in milliseconds used, as well as the confidence and support. To compare the four algorithms, the number of rules generated, the time spent, the support, and the confidence were taken into account. Each evaluation was executed 100 times to estimate the average time for the construction of the models. Also, the average values of support and confidence were considered. It is important to mention that all tests were run on a computer with the following technical data: AMD A6 RADEON R4 2.56 GHz processor, 8 GB RAM, and a 64-bit Windows operating system.
2.5 Results and Discussion
Tables 2.4, 2.5, and 2.6 show the comparison between Apriori and FPGrowth for the Exa1, Exa2, and Exa3 data sets. As seen in Table 2.4, Apriori obtained 15 and 25 rules faster than FPGrowth in more cases. Although the latter discovered 20 rules with higher support, the rules found by Apriori had greater confidence. Moreover, Apriori was the only algorithm that obtained 15 and 20 rules considering a value of 0.9 for both min_conf and min_sup, and the only one that obtained 50 rules, which it did in eight cases. Therefore, based on the Time, Confidence, and Rules criteria, Apriori is better than FPGrowth for the Exa1 data set. Likewise, Table 2.5 shows that Apriori is faster than FPGrowth for 15, 20, and 25 rules. For the group of 25 rules, although FPGrowth has a higher level of confidence in all cases, Apriori has a higher level of support. Besides, for the group of 50 rules, Apriori was the only algorithm that obtained rules, doing so in eight cases. Therefore, based on the Time, Support, and Rules criteria, Apriori is also better than FPGrowth for the Exa2 data set. Table 2.6 shows that for the Exa3 data set, Apriori obtained 15, 20, and 25 rules faster than FPGrowth in more cases. Also, for the group of 15 rules, the former algorithm found rules with more confidence three times. For the group of 20 rules, even though FPGrowth has higher confidence in three cases, Apriori was able to discover rules with 0.9/0.6 and 0.9/0.7 for min_conf and min_sup.
Table 2.4 Test results for Apriori and FPGrowth for the Exa1 data set. For each min_conf/min_sup setting (from 0.7/0.5 up to 0.9/0.9) and for each algorithm, the table reports the confidence (Conf.), support (Sup.), and time (ms) obtained for the groups of 15, 20, 25, and 50 rules; a dash (–) indicates that the requested number of rules could not be obtained under that setting.
Table 2.5 Test results for Apriori and FPGrowth for the Exa2 data set. The layout is the same as in Table 2.4: confidence, support, and time (ms) per min_conf/min_sup setting and per rule group (15, 20, 25, and 50 rules).
Table 2.6 Test results for Apriori and FPGrowth for the Exa3 data set. The layout is the same as in Table 2.4: confidence, support, and time (ms) per min_conf/min_sup setting and per rule group (15, 20, 25, and 50 rules); a dash (–) indicates that the requested number of rules could not be obtained under that setting.
For the group of 25 rules, FPGrowth had greater support in three cases. Despite that, Apriori obtained rules with more confidence three times. Moreover, Apriori was the only algorithm that managed to obtain rules in two cases for the group of 50 rules; in the same way, it was the only algorithm that found 15 rules considering a value of 0.9/0.7 for min_conf and min_sup, respectively. Therefore, based on the Time and Rules criteria, Apriori is better than FPGrowth for the Exa3 data set.
The comparisons between PredictiveApriori and Tertius for the Exa1, Exa2, and Exa3 data sets are shown in Tables 2.7, 2.8, and 2.9, respectively. As observed, although PredictiveApriori's support and confidence are higher, the time it spends on rule creation is a sufficient factor to discard it, because the system should use the shortest time possible in generating the rules that are the basis for selecting the next item. Additionally, Table 2.10 shows the comparison of Apriori and FPGrowth in terms of lift for each of the three data sets, with the different rule groups and support and confidence values. It is observed that Apriori has a higher lift value in the groups of 20 and 50 rules for each of the data sets, as well as for the groups of 15 rules of Exa2 and Exa3 and the group of 25 rules of the latter, while FPGrowth only has a higher lift value in the group of 25 rules for the Exa1 and Exa2 data sets. Therefore, based on the lift value, Apriori is the best algorithm for the Exa1, Exa2, and Exa3 data sets.

Table 2.7 Test results for PredictiveApriori and Tertius for the Exa1 data set

Algorithms          Rules   Confidence   Support   Time (ms)
PredictiveApriori   15      1            0.90      16959
Tertius             15      0.81         0.41      22
PredictiveApriori   20      1            0.87      48558
Tertius             20      0.79         0.40      29
PredictiveApriori   25      1            0.84      22674
Tertius             25      0.79         0.41      25
PredictiveApriori   50      1            0.73      58364
Tertius             50      0.79         0.43      47
Table 2.8 Test results for PredictiveApriori and Tertius for the Exa2 data set

Algorithms          Rules   Confidence   Support   Time (ms)
PredictiveApriori   15      1            0.79      18223
Tertius             15      0.82         0.44      33
PredictiveApriori   20      1            0.73      8539
Tertius             20      0.82         0.45      34
PredictiveApriori   25      1            0.68      8528
Tertius             25      0.79         0.44      26
PredictiveApriori   50      0.99         0.69      13706
Tertius             50      0.79         0.41      52
Table 2.9 Test results for PredictiveApriori and Tertius for the Exa3 data set

Algorithms          Rules   Confidence   Support   Time (ms)
PredictiveApriori   15      0.99         0.68      929
Tertius             15      0.73         0.34      10
PredictiveApriori   20      0.99         0.57      1097
Tertius             20      0.73         0.31      10
PredictiveApriori   25      0.99         0.57      1090
Tertius             25      0.74         0.31      12
PredictiveApriori   50      0.99         0.41      4920
Tertius             50      0.70         0.34      22
Figures 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10 and 2.11 show the comparison between the four algorithms concerning support, confidence, and time, respectively. The results indicate that Apriori is the algorithm that generates rules with better support and confidence within the Exa1 and Exa2 data sets, and in less time. In the Exa3 data set, FPGrowth found 20 and 25 rules with more support. Nevertheless, Apriori obtained 15 and 25 rules with the highest confidence and was also faster than FPGrowth for 15, 20, and 25 rules. Therefore, Apriori is the best algorithm for the data sets.
The analyses carried out in this section allow determining that the Apriori algorithm is the one that presents the best results in each of the data sets. For example, in the Exa1 data set, one of the rules generated with a confidence equal to 1 and support equal to 1 is as follows:

Item13 = 1 =⇒ Item7 = 1

This indicates that all twenty-five students who answered question 13 correctly also answered question 7 correctly. Another example is found in the Exa2 data set, where the algorithm generated a rule with a confidence of 1 and support of 0.88:

Item48 = 1 =⇒ Item54 = 1

This means that the pattern appears in 88% of the transactions, that is, in 22 of the 25 tests, and that every time the students correctly answered question 48, their answer to question 54 was also correct. An example where the antecedent takes the value of 0 is found in Exa3, where the algorithm generated a rule with confidence equal to 1 and support equal to 0.72:

Item67 = 0 =⇒ Item68 = 1
Table 2.10 Test results for Apriori and FPGrowth in terms of lift for the Exa1, Exa2 and Exa3 data sets. For each data set, the table reports the lift obtained by each algorithm for the groups of 15, 20, 25, and 50 rules under each min_conf/min_sup setting; a dash (–) indicates that no rules were obtained in that case.
Fig. 2.3 Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of support for Exa1 data set
Fig. 2.4 Comparison of Apriori FPGrowth, PredictiveApriori and Tertius algorithms in terms of confidence for Exa1 data set
This means that the pattern appears in 72% of the transactions, that is, in 18 of the 25 tests, and that every time the students answered question 67 incorrectly, their answer to question 68 was correct.
Fig. 2.5 Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of time for Exa1 data set
Fig. 2.6 Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of support for Exa2 data set
The method of item selection in the CAT system under development will use these patterns to determine which items to present to students, according to their previous answer and based on the past experiences of other students who have taken the same exam.
Fig. 2.7 Comparison of Apriori FPGrowth, PredictiveApriori and Tertius algorithms in terms of confidence for Exa2 data set
Fig. 2.8 Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of time for Exa2 data set
2.6 Recommendations
Like many other areas, education has evolved and adapted to the emergence of new tools, especially those driven by information technologies. This has given way to new methods in the teaching-learning process, for example, distance learning through digital platforms.
Fig. 2.9 Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of support for Exa3 data set
Fig. 2.10 Comparison of Apriori FPGrowth, PredictiveApriori and Tertius algorithms in terms of confidence for Exa3 data set
Fig. 2.11 Comparison of Apriori, FPGrowth, PredictiveApriori and Tertius algorithms in terms of time for Exa3 data set
These platforms allow a person to study regardless of place or time. Important universities in countries such as the United States and Spain have made significant progress in developing this new modality, for example, Harvard and MIT (Massachusetts Institute of Technology) with the edX platform (https://www.edx.org/), La Rioja with the Proeduca4Schools platform (https://www.proeduca4schools.org/), and Yale University with courses on the online platform Coursera (https://www.coursera.org/yale). These are great strides towards a more evolved education and an example for countries that retain their traditional educational systems. CATs are the evaluation link in this new era of education. Switching from traditional paper exams to computerized adaptive testing is a mandatory process for all institutions that do not wish to stagnate; however, there is much work to be done, including the search for new tools that strengthen the opportunity areas presented by this new type of evaluation.
2.7 Conclusions and Future Work
This chapter shows the complete process of the proposed CAT system and the comparison of four association rule mining algorithms applied to three data sets to find the most suitable one to implement as the selection method in a CAT system. With the results obtained, the Apriori algorithm has the greatest advantages compared to FPGrowth, PredictiveApriori, and Tertius, since it obtained rules with good support, confidence, and lift in less time. Although the first three criteria are very important for selecting interesting rules, the last criterion is critical for this work because the system
must use the shortest time possible in generating the rules that will serve for the selection of the next item in the test being taken by the examinee. In the future, these results will serve to develop and implement a CAT system that uses association rule mining as an item selection criterion, which will be tested with master's-level students to compare the knowledge estimates obtained when taking paper-based tests against electronic and adaptive examinations, all to verify the effectiveness of the developed system. Acknowledgments The authors are very grateful to Tecnológico Nacional de México for supporting this work. Also, this research chapter was sponsored by the National Council of Science and Technology (CONACYT).
References Agrawal R, Imieli´nski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on Management of data-SIGMOD’93. Association for Computing Machinery (ACM), New York, USA, pp 207– 216 Albano AD, Cai L, Lease EM, McConnell SR (2019) Computerized adaptive testing in early education: exploring the impact of item position effects on ability estimation. J Educ Meas 56:437–451. https://doi.org/10.1111/jedm.12215 Barla M, Bieliková M, Ezzeddinne AB et al (2010) On the impact of adaptive test question selection for learning efficiency. Comput Educ 55:846–857. https://doi.org/10.1016/j.compedu.2010. 03.016 Bengs D, Kröhne U (2018) Adaptive item selection under Matroid constraints. J Comput Adapt Test 6:15–36. https://doi.org/10.7333/1808-0602015 Chen JH, Chao HY, Chen SY (2020) A dynamic stratification method for improving trait estimation in computerized adaptive testing under item exposure control. Appl Psychol Meas 44:182–196. https://doi.org/10.1177/0146621619843820 Cheng Y (2010) Improving cognitive diagnostic computerized adaptive testing by balancing attribute coverage: the modified maximum global discrimination index method. Educ Psychol Meas 70:902–913. https://doi.org/10.1177/0013164410366693 Cheng Y, Diao Q, Behrens JT (2017) A simplified version of the maximum information per time unit method in computerized adaptive testing. Behav Res Methods 49:502–512. https://doi.org/ 10.3758/s13428-016-0712-6 Choe EM, Kern JL, Chang H-H (2018) Optimizing the use of response times for item selection in computerized adaptive testing. J Educ Behav Stat 43:135–158. https://doi.org/10.3102/107699 8617723642 Dahdouh K, Dakkak A, Oughdir L, Ibriz A (2019) Association rules mining method of big data for e-learning recommendation engine. In: Advances in intelligent systems and computing. Springer, pp 477–491 Djenouri Y, Gheraibia Y, Mehdi M et al (2014) An efficient measure for evaluating association rules. In: 6th international conference on soft computing and pattern recognition, SoCPaR 2014. Institute of Electrical and Electronics Engineers Inc., pp 406–410 Du Y, Li A, Chang HH (2019) Utilizing response time in on-the-fly multistage adaptive testing. Springer Proceedings in Mathematics and Statistics. Springer, New York LLC, pp 107–117
Finkelman MD, Kim W, Weissman A et al (2014) Cognitive diagnostic models and computerized adaptive testing: two new item-selection methods that incorporate response times. J Comput Adapt Test 2:59–76. https://doi.org/10.7333/1412-0204059 Flach PA, Lachiche N (2001) Confirmation-guided discovery of first-order rules with Tertius. Mach Learn 42:61–95. https://doi.org/10.1023/A:1007656703224 Frey A, Seitz N-N, Kröhne U (2013) Reporting differentiated literacy results in PISA by using multidimensional adaptive testing. Research on PISA. Springer, Netherlands, pp 103–120 García-Peñalvo FJ, Seoane Pardo AM (2015) Una revisión actualizada del concepto de eLearning. Décimo Aniversario. Educ Knowl Soc 16:119. https://doi.org/10.14201/eks2015161119144 Gu J, Zhou X, Yan X (2018) Design and implementation of students’ score correlation analysis system. In: ACM international conference proceeding series. Association for Computing Machinery, New York, USA, pp 90–94 Han J, Kamber M, Pei J (2012a) Data mining: concepts and techniques. Morgan Kau, USA Han J, Kamber M, Pei J (2012b) Data mining: concepts and techniques. Elsevier Inc Han KT (2012) An efficiency balanced information criterion for item selection in computerized adaptive testing. J Educ Meas 49:225–246. https://doi.org/10.1111/j.1745-3984.2012.00173.x Huang H-Y, Chen P-H, Wang W-C (2012) Computerized adaptive testing using a class of high-order item response theory models. Appl Psychol Meas 36:689–706. https://doi.org/10.1177/014662 1612459552 Jatobá VM, Farias JS, Freire V et al (2020) ALICAT: a customized approach to item selection process in computerized adaptive testing. J Braz Comput Soc 26:4. https://doi.org/10.1186/s13 173-020-00098-z Ju C, Bao F, Xu C, Fu X (2015) A novel method of interestingness measures for association rules mining based on profit. Discret Dyn Nat Soc 2015. https://doi.org/10.1155/2015/868634 Kaplan M, de la Torre J, Barrada JR (2015) New item selection methods for cognitive diagnosis computerized adaptive testing. Appl Psychol Meas 39:167–188. https://doi.org/10.1177/014662 1614554650 Koedsri A, Lawthong N, Ngudgratoke S (2014) Efficiency of item selection method in variablelength computerized adaptive testing for the testlet response model: constraint-weighted astratification method. Procedia Soc Behav Sci 116:1890–1895. https://doi.org/10.1016/j.sbspro. 2014.01.490 Kröhne U, Goldhammer F, Partchev I (2014) Constrained multidimensional adaptive testing without intermixing items from different dimensions. Undefined Lee CS, Wang MH, Wang CS et al (2018) PSO-based fuzzy markup language for student learning performance evaluation and educational application. IEEE Trans Fuzzy Syst 26:2618–2633. https://doi.org/10.1109/TFUZZ.2018.2810814 Lin CJ, Chang HH (2019) Item selection criteria with practical constraints in cognitive diagnostic computerized adaptive testing. Educ Psychol Meas 79:335–357. https://doi.org/10.1177/001316 4418790634 Mao X, Xin T (2013) The application of the monte carlo approach to cognitive diagnostic computerized adaptive testing with content constraints. Appl Psychol Meas 37:482–496. https://doi.org/ 10.1177/0146621613486015 Miyazawa Y, Ueno M (2020) Computerized adaptive testing method using integer programming to minimize item exposure. In: Advances in intelligent systems and computing. Springer, pp 105–113 Moyer EL, Galindo JL, Dodd BG (2012) Balancing flexible constraints and measurement precision in computerized adaptive testing. Educ Psychol Meas 72:629–648. 
https://doi.org/10.1177/001 3164411431838 Olea J, Barrada JR, Abad FJ et al (2012) Computerized adaptive testing: the capitalization on chance problem. Span J Psychol 15:424–441. https://doi.org/10.5209/rev_sjop.2012.v15.n1.37348 Pan CC, Lin CC (2018) Designing and implementing a computerized adaptive testing system with an MVC framework: a case study of the IEEE floating-point standard. In: Proceedings of 4th IEEE
international conference on applied system innovation 2018, ICASI 2018. Institute of Electrical and Electronics Engineers Inc., pp 609–612 Prajapati DJ, Garg S, Chauhan NC (2017) Interesting association rule mining with consistent and inconsistent rule detection from big sales data in distributed environment. Futur Comput Inform J 2:19–30. https://doi.org/10.1016/j.fcij.2017.04.003 Rodríguez-Cuadrado J, Delgado-Gómez D, Laria JC, Rodríguez-Cuadrado S (2020) Merged TreeCAT: a fast method for building precise computerized adaptive tests based on decision trees. Expert Syst Appl 143:113066. https://doi.org/10.1016/j.eswa.2019.113066 Rubio Delgado E, Rodríguez-Mazahua L, Palet Guzmán JA et al (2018) Analysis of medical opinions about the nonrealization of autopsies in a Mexican hospital using association rules and Bayesian networks. Sci Program 2018:1–21. https://doi.org/10.1155/2018/4304017 Scheffer T (2001) Finding association rules that trade support optimally against confidence. In: Lecture notes in computer science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, pp 424–435 Sheng C, Bingwei B, Jiecheng Z (2018) An adaptive online learning testing system. In: ACM international conference proceeding series. Association for Computing Machinery, New York, USA, pp 18–24 Stafford RE, Runyon CR, Casabianca JM, Dodd BG (2019) Comparing computer adaptive testing stopping rules under the generalized partial-credit model. Behav Res Methods 51:1305–1320. https://doi.org/10.3758/s13428-018-1068-x Su YH, Huang YL (2015) Using a modified multidimensional priority index for item selection underwithin-item multidimensional computerized: adaptive testing. Springer proceedings in mathematics and statistics. Springer, New York LLC, pp 227–242 Tokusada Y, Hirose H (2016) Evaluation of abilities by grouping for small IRT testing systems. In: Proceedings-2016 5th IIAI international congress on advanced applied informatics, IIAI-AAI 2016. Institute of Electrical and Electronics Engineers Inc., pp 445–449 Tu D, Han Y, Cai Y, Gao X (2018) Item selection methods in multidimensional computerized adaptive testing With Polytomously scored items. Appl Psychol Meas 42:677–694. https://doi. org/10.1177/0146621618762748 Ueno M, Songmuang P (2010) Computerized adaptive testing based on decision tree. In: Proceedings-10th IEEE international conference on advanced learning technologies, ICALT 2010, pp 191–193 van der Linden WJ, Ren H (2020) A fast and simple algorithm for Bayesian adaptive testing. J Educ Behav Stat 45:58–85. https://doi.org/10.3102/1076998619858970 Veldkamp BP (2016) On the issue of item selection in computerized adaptive testing with response times. J Educ Meas 53:212–228. https://doi.org/10.1111/jedm.12110 Wang C (2013) Mutual information item selection method in cognitive diagnostic computerized adaptive testing with short test length. Educ Psychol Meas 73:1017–1035. https://doi.org/10. 1177/0013164413498256 Wang C, Chang HH (2011) Item selection in multidimensional computerized adaptive testinggaining information from different angles. Psychometrika 76:363–384. https://doi.org/10.1007/ s11336-011-9215-7 Wang C, Chang HH, Boughton KA (2011) Kullback-Leibler information and its applications in multi-dimensional adaptive testing. Psychometrika 76:13–39. https://doi.org/10.1007/s11336010-9186-0 Wang W, Ding S, Song L (2015) New item-selection methods for balancing test efficiency against item-bank usage efficiency in CD-CAT. 
Springer proceedings in mathematics and statistics. Springer, New York LLC, pp 133–151 Wang F, Li K, Dui´c N et al (2018) Association rule mining based quantitative analysis approach of household characteristics impacts on residential electricity consumption patterns. Energy Convers Manag 171:839–854. https://doi.org/10.1016/j.enconman.2018.06.017 Wei H, Lin J (2015) Using out-of-level items in computerized adaptive testing. Int J Test 15:50–70. https://doi.org/10.1080/15305058.2014.979492
Yan X, Zhang C, Zhang S (2009) Confidence metrics for association rule mining. Appl Artif Intell 23:713–737. https://doi.org/10.1080/08839510903208062 Ye Z, Sun J (2018) Comparing item selection criteria in multidimensional computerized adaptive testing for two item response theory models. In: Proceedings-3rd international conference on computational intelligence and applications, ICCIA 2018. Institute of Electrical and Electronics Engineers Inc., pp 1–5 Yigit HD, Sorrel MA, de la Torre J (2019) Computerized adaptive testing for cognitively based multiple-choice data. Appl Psychol Meas 43:388–401. https://doi.org/10.1177/014662161879 8665
Chapter 3
Uncertainty Linguistic Summarizer to Evaluate the Performance of Investment Funds
Carlos Alexander Grajales and Santiago Medina Hurtado
Abstract This chapter proposes a methodology to implement the uncertain linguistic summarizer posed in Liu’s uncertain logic to measure the performance of investment funds in the Colombian capital market. The algorithm extracts a truth value for a set of linguistic summaries, written as propositions in predicate logic, where the terms for the quantifier, subject, and predicate are unsharp. The linguistic summarizer proves to be autonomous, successful, efficient, and close to human language. Furthermore, the implementation has a general scope and could become a data mining tool under uncertainty. The propositions found characterize with plenty of sense the investment funds data. Finally, a Corollary that allows accelerating the obtention of the summaries is presented.
3.1 Introduction
In the study of indeterminacy phenomena, the concept of a measure assigned to the 'likelihood' of the occurrence of an event has evolved dramatically. Probability measure was axiomatized by Kolmogorov (1933) in 1933; the capacity measure was founded by Choquet (1954) in 1954; fuzzy measure was proposed by Sugeno (1974) in 1974; possibility measure was developed by Zadeh (1978) in 1978; and, more recently, uncertain measure was proposed by Liu (2007) in 2007. Consequently, uncertainty theory was founded by Liu (2007) in 2007 and perfected by Liu (2010a) in 2010. One of its purposes is the modeling of human uncertainty. The mentioned uncertain measure satisfies the normality, duality, subadditivity, and product axioms. Such a measure represents the belief degree we have that an event happens. Research works on uncertain measure have been well
developed, among others, in Liu (2015), Peng and Iwamura (2012), Zhang (2011), and Gao (2009). Uncertain variable and uncertain set are concepts in the core of uncertainty theory. The former is intended to represent uncertain quantities, and the second one focuses on modeling unsharp concepts. Fundamental advances in the study of uncertain sets are found in Liu (2010b, 2012, 2018). These works cover the concept of membership function of an uncertain set, introduce the operational law of uncertain sets, propose a methodological schema for calibration of membership functions, and establish conditions for the existence of membership functions. Based on uncertain sets, uncertain logic is proposed by Liu (2011) for modeling human language. There, the truth value of a proposition is defined via uncertain measure. As an application of this type of logic, a linguistic summarizer is proposed to extract linguistic summaries of a universe of raw data through the use of uncertain propositions in predicate logic. Other definitions of truth value for propositions with unsharp terms, and subsequent studies of linguistic summarizers, have been found in Fuzzy logic by authors such as Zadeh (1965, 1975, 1983, 1996), Mamdani and Assilian (1999), Yager (1982), and Kacprzyk and Yager (2001). In these seminal works, the possibility measure is taken into account to establish a set of heuristic rules in a logic basic form if— then. The work of these researchers consolidated the fuzzy theory, which is currently mixed with many artificial intelligence techniques such as Neural Networks or expert systems, opening a wide range of possibilities for use on a practical level. The interest to explore more deeply the behaviour of uncertain linguistic summaries underlies at least the following criteria. First, uncertain logic is consistent with classical logic in the sense that it obeys the law of true conservation (truth value of a proposition p plus truth value of ∼ p is the unit), and in the extreme truth values of 0 and 1, it is consistent with the law of excluded middle and the law of contradiction. Meanwhile, all of these statements are not true for the Fuzzy logic under the possibility measure of Zadeh (Liu 2010a). Second, uncertain linguistic summarizer operates under a new definition of the truth value of a predicate proposition of the form Q of S are P, where the uncertain sets Q, S, and P represent, respectively, an uncertain quantifier, subject, and predicate. Meanwhile, as it was mentioned, the measure and the linguistic forms in Zadeh are different. Third, such exploration associates a practical level with needs posed since the twentieth century, framed in the modeling of human language to extract accurate and efficient descriptions of textual or numerical data. Such a need finds in uncertain logic a natural and apprehensible way for modeling. Today this kind of logic goes beyond applications in economics, as well as covering a wide range of areas of knowledge. This chapter proposes a methodology to implement an uncertain linguistic summarizer, theoretically proposed by Liu, oriented towards measuring the performance of investment funds in the Colombian capital market. The algorithm extracts uncertain propositions in predicate logic over a data set, in which there are three vague concepts, the quantifier, the subject, and the predicate, under a required truth value. The linguistic summarizer proves to be autonomous, successful, efficient, and close
to human language. Also, the data linguistic summaries obtained show a clear differentiation with results that would be obtained from the probability theory. Finally, the implementation has a general scope so that it can become a tool to make data mining under uncertainty.
3.2 Preliminaries This section introduces the concepts of uncertain sets and membership functions that are in the core of the modelling of unsharp concepts. These kinds of concepts are recurrent in human language when we state, for example, propositions like most young students are tall. Here the terms most, young, and tall are vague concepts rather than deterministic. Next, it proceeds to establish some fundamentals of uncertain logic aimed to model human language. A linguistic summarizer is then presented to look for machine linguistic summaries that describe especial types of data sets. Definition 3.1 (uncertain variable (Liu 2007)) An uncertain variable is a function ξ from an uncertainty space (, L, M) to the set of real numbers such {ξ ∈ B} is an event for any Borel set B. Definition 3.2 (uncertain set (Liu 2010b)) An uncertain set is a function ξ from an uncertainty space (, L, M) to a collection of sets of real numbers such that both {B ∈ ξ } and {ξ ∈ B} are events for any Borel set B. The operations union, intersection, and complement between uncertain sets are defined in the usual way as in the set theory. Further, a given measurable function f over uncertain sets ξ1 , . . . , ξn results in a new uncertain set (Liu 2015). Membership function (m.f.) μ(x) is a concept related to an uncertain set ξ , and when it does exist, it represents the membership degree that x ∈ ξ . The definition is as follows, Definition 3.3 (membership function (Liu 2012)) An uncertain set ξ has a membership function μ if for any Borel set B, the following measure inversion formulas are satisfied: M{B ∈ ξ } = inf μ(x)
(3.1)
M{ξ ⊂ B} = 1 − sup μ(x)
(3.2)
x∈B
x∈B c
The next propositions are about membership functions. A more in-depth discussion is found in (Liu 2015). Theorem 3.1 If an uncertain set ξ has a membership function μ(x), then for each number x ∈ R,
μ(x) = M{x ∈ ξ}    (3.3)
Proof From the condition given in Eq. (3.1), let the Borel set B = {x}. Then, M{B ⊂ ξ} = M{{x} ⊂ ξ} = inf_{z ∈ {x}} μ(z) = μ(x). Now, by noting the equivalence of the events {{x} ⊂ ξ} = {γ ∈ Γ | {x} ⊂ ξ(γ)} and {x ∈ ξ}, the relation (3.3) holds.
Theorem 3.2 (existence of membership function (Liu 2018)) If ξ is a totally ordered uncertain set defined on a continuous uncertainty space, then its membership function μ(x) always exists and μ(x) = M{x ∈ ξ}.
Illustration: uncertain sets and membership functions.
As an illustration, take the uncertainty space (Γ, L, M) to be (0, 1) with the Borel σ-algebra and M the Lebesgue measure. Consider the uncertain sets ξ1 = [−γ, γ] and ξ2 = [−(1 + b)γ + b, (1 − c)γ + c], γ ∈ Γ, −1 < b < c < 1. In the case of the uncertain set ξ1, to find its membership function μ1(x), note that if 0 ≤ x < 1, then M{x ∈ ξ1} = M{γ ∈ [x, 1)} = 1 − x. On the other hand, if −1 < x < 0, then M{x ∈ ξ1} = M{γ ∈ [−x, 1)} = 1 − (−x). In a similar way, we can obtain the membership function μ2(x) for the set ξ2. Consequently, the membership functions μ1 and μ2 associated with ξ1 and ξ2 have triangular and trapezoidal forms, respectively, and are given by:
μ1(x) =
  1 − |x|            if −1 < x < 1
  0                  elsewhere

μ2(x) =
  (x + 1)/(b + 1)    if −1 < x < b
  1                  if b ≤ x ≤ c
  (x − 1)/(c − 1)    if c < x < 1
  0                  elsewhere
To complement this illustration, Fig. 3.1 shows the uncertain sets ξ1 and ξ2 with their respective membership functions, μ1 and μ2 . Illustration: representation of unsharp concepts.
Suppose ξ1, ξ2, and ξ3 represent the uncertain sets (also referred to here as unsharp concepts or linguistic terms) most, young, and tall, respectively.
Fig. 3.1 Uncertain sets and their membership functions: (a) ξ1 and μ1; (b) ξ2 and μ2
Then, all three sets can be considered as totally ordered uncertain sets on a continuous uncertainty space, and consequently, such concepts have membership functions. These could be established as:

Set ξ1: most (%), trapezoidal m.f. λ(x) = (70%, 75%, 90%, 95%)
Set ξ2: young (years), right ramp m.f. ν(y) = (15, 25, 45)
Set ξ3: tall (meters), left ramp m.f. μ(z) = (1.75, 1.85, 2.00)

Figure 3.2 shows the representation of these uncertain concepts.

Fig. 3.2 Membership functions for the unsharp concepts most, young, and tall are represented, respectively, through λ(x), ν(y), and μ(z)

In general, the trapezoidal m.f. λ(x) = (a, b, c, d), left ramp m.f. μ(x) = (a, b, c), and right ramp m.f. ν(x) = (a, b, c) are defined as

λ(x) =
  (x − a)/(b − a)    if a < x < b
  1                  if b ≤ x ≤ c
  (d − x)/(d − c)    if c < x < d
  0                  elsewhere

μ(x) =
  (x − a)/(b − a)    if a < x < b
  1                  if b ≤ x ≤ c
  0                  elsewhere

ν(x) =
  1                  if a ≤ x ≤ b
  (c − x)/(c − b)    if b < x < c
  0                  elsewhere
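As a small aid to the reader, the three parametric families can be coded directly from these definitions. The following Python sketch is only an illustration (the function names are ours, and this is not the implementation used later in the chapter):

```python
def trapezoidal(x, a, b, c, d):
    """Trapezoidal m.f. (a, b, c, d): rises on (a, b), equals 1 on [b, c], falls on (c, d)."""
    if a < x < b:
        return (x - a) / (b - a)
    if b <= x <= c:
        return 1.0
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

def left_ramp(x, a, b, c):
    """Left ramp m.f. (a, b, c): rises on (a, b) and equals 1 on [b, c]."""
    if a < x < b:
        return (x - a) / (b - a)
    if b <= x <= c:
        return 1.0
    return 0.0

def right_ramp(x, a, b, c):
    """Right ramp m.f. (a, b, c): equals 1 on [a, b] and falls on (b, c)."""
    if a <= x <= b:
        return 1.0
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

# Membership of a 30-year-old in 'young' and of a 1.80 m person in 'tall'.
print(right_ramp(30, 15, 25, 45))          # 0.75
print(left_ramp(1.80, 1.75, 1.85, 2.00))   # ~0.5
```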
3.2.1 Uncertain Logic Based on uncertain sets, uncertain Logic was proposed by Liu (2011) in 2011, aimed to model human language. Let us assume we have the universe of discourse A = {(a1 , b1 ), (a2 , b2 ), . . . , (an , bn )}
(3.4)
that contains some raw data for n individuals. This means that for individual i, two feature data have been measured: ai and bi. For example, if the universe refers to a teachers' community, then the i-th teacher has the data (ai, bi) associated, which in turn may represent a pair of unsharp concepts, like age and height, of the i-th teacher. Now, consider that S ⊂ A. One of the purposes of uncertain logic is to provide a methodology to calculate the truth value of a proposition of the form Q of S are P, denoted as the vector of linguistic terms (uncertain sets) (Q, S, P). Here, Q is an uncertain quantifier, e.g., most, S is an uncertain subject, e.g., young, and P is an uncertain predicate, e.g., tall. Consequently, if universe A refers to some teachers' community, then (Q, S, P) could represent the uncertain proposition: most young teachers are tall. Another purpose is to apply such a methodology to extract a linguistic summary (Q̄, S̄, P̄) from a set of raw data. Such a linguistic summary constitutes a tool to perform data mining in a natural language comprehensible to humans.
Definition 3.4 (truth value (Liu 2011)) Let (Q, S, P) be an uncertain proposition with membership functions given by (λ, ν, μ), respectively, and let λ be an unimodal function. The truth value of (Q, S, P) with respect to A is
T(Q, S, P) = sup_{0 ≤ ω ≤ 1} { ω ∧ sup_{K ∈ Kω} inf_{a ∈ K} μ(a) ∧ sup_{K ∈ K*ω} inf_{a ∈ K} ∼μ(a) }    (3.5)

where

Kω = {K ⊂ Sω | λ(|K|) ≥ ω},    (3.6)

K*ω = {K ⊂ Sω | λ(|Sω| − |K|) ≥ ω},    (3.7)

Sω = {a ∈ A | ν(a) ≥ ω}.    (3.8)
Remarks:
• |K| represents the cardinality of the set K and |Sω| represents the cardinality of the set Sω,
• ∼μ(a) = 1 − μ(a),
• if K = ∅, then inf_{a ∈ ∅} μ(a) = inf_{a ∈ ∅} ∼μ(a) = 1,
• if Q is a percentage quantifier, then λ(|K|) in Eq. (3.6) is changed by λ(|K|/|Sω|), and λ(|Sω| − |K|) in Eq. (3.7) is changed by λ(1 − |K|/|Sω|),
• if S = A, then T(Q, A, P) is obtained by letting Sω = A and |Sω| = |A|.

Theorem 3.3 (truth value theorem (Liu 2011)) Let (Q, S, P) be an uncertain proposition with membership functions given by (λ, ν, μ), respectively, and let λ be an unimodal function. The truth value of (Q, S, P) is
T(Q, S, P) = sup_{0 ≤ ω ≤ 1} { ω ∧ Δ(kω) ∧ Δ*(k*ω) }    (3.9)

where

kω = min{x | λ(x) ≥ ω}    (3.10)

Δ(kω) = kω-max{μ(ai) | ai ∈ Sω}    (3.11)

k*ω = |Sω| − max{x | λ(x) ≥ ω}    (3.12)

Δ*(k*ω) = k*ω-max{1 − μ(ai) | ai ∈ Sω}    (3.13)
Remarks:
• Δ(0) = Δ*(0) = 1,
• if Q is a percentage quantifier, then λ(x) in Eqs. (3.10) and (3.12) is changed by λ(x/|Sω|),
• if S = A, then Sω in Eqs. (3.11) and (3.13) is changed by A, and |Sω| in Eq. (3.12) is substituted by |A| = n,
• k-max{·} indicates the k-th largest value of the elements in the respective set.
3.2.2 Linguistic Summarizer
Let A be the universe of discourse described in Eq. (3.4), consisting of pairs of raw data for n individuals. Also consider a collection of human linguistic terms for the uncertain quantifier Q, subject S, and predicate P:

Q = {Q1, Q2, . . . , Qm}
S = {S1, S2, . . . , Sn}
P = {P1, P2, . . . , Pk}.
(3.14)
The uncertain data mining problem consists in getting a linguistic summary (Q̄, S̄, P̄), with a truth value of at least α, which solves the linguistic summarizer (Liu 2011):

Find   Q, S, P
s.t.   Q ∈ Q, S ∈ S, P ∈ P,
       T(Q, S, P) ≥ α,    (3.15)
where the truth value T is evaluated for the universe of discourse A and making use of the relations (3.9) to (3.13).
3.3 Problem and Methodology
This section designs a methodology to computationally implement the linguistic summarizer given in (3.15) as a tool for data mining under uncertainty. The implementation is posed through a financial application concerning the performance appraisal of investment funds in Colombia. The appraisal is based on the truthfulness of a set of linguistic summaries of the form Q of S are P, denoted (Q, S, P), which in turn characterizes a dataset A defined by pairs of risk and profitability of investment funds. According to Sect. 3.2, (Q, S, P) is a predicate logic proposition in which Q is an uncertain quantifier, S is an uncertain subject, and P is an uncertain predicate. The machine summaries, written in human language, would give support to market agents in making investment decisions and even in drawing up new strategies and projections to improve profits and attract new investors.
3.3.1 Data Set
Raw data A represent the financial performance of a set of investment funds managed by trust companies in the Colombian capital market, which invest basically in fixed income securities. Data are taken from the Financial Superintendence of Colombia (SFC). Similar to Eq. (3.4), data A is a monthly financial data series with n = 1095 samples given by

A = {(ai, bi)}, i = 1, 2, . . . , 1095,
(3.16)
where ai is the Equity Loss Ratio of 23 Colombian investment funds during the period from January 2002 to August 2006, and bi is the respective Annual Profitability Index of the enterprises. The Equity Loss Ratio (Equity / Social Capital) is an indicator that measures the dissolution risk of the company. A ratio less than one indicates that losses are consuming the entity's capital stock, increasing the risk of dissolution, whereas a higher indicator means that the enterprise has capitalized profits for its shareholders. The Annual Profitability Index is the annual percentage change of the fund value. For the sake of clarity, we refer to both variables, respectively, as risk and profitability. Figure 3.3 shows, from left to right, the histograms of both variables: risk and profitability. As can be seen, most investment funds have an Equity Loss Ratio between 1.4 and 2.0, while profitability is between 0 and 8% per year; when compared to the risk-free interest rate (between 7 and 11% in the period), we can say that most of the funds performed poorly (weak performance). On the range of both variables, the next section defines linguistic values (uncertain sets) that collect the investor's perception of performance.
Fig. 3.3 Histograms, from left to right, of the Equity Loss Ratio and Annual profitability of the investment funds
3.3.2 Linguistic Summaries According to relation (3.14) and based on the expert opinion, a group of uncertain sets is defined for the risk and profitability variables, as well as for linguistic quantifiers. An uncertain quantifier Q, subject S, and predicate P will determine an uncertain proposition (Q, S, P) that allows describing the behavior of the performance of the investment funds. Specifically, the quantifiers are represented as percentages in the space (0, 1). Value 0 is associated with the concept ‘none’ and value 1 with the concept ‘all’. We define the linguistic terms as Q = {most, hal f, a f ew} S = {sever e, middle, low, unr eal} P = {catastr ophic, bad, middle, high, ver y high}
(3.17)
Consequently, the membership functions associated with the uncertain sets (3.17) are restricted to be either trapezoidal, left ramp, or right ramp. The functions are shown in Tables 3.1, 3.2, and 3.3.

Table 3.1 Linguistics and m.f. of the uncertain quantifier

Quantifier (Q)   Type of m.f.   Membership function (λ, %)
most             Trapezoidal    (0.70, 0.75, 0.90, 0.95)
half             Trapezoidal    (0.30, 0.45, 0.55, 0.70)
a few            Trapezoidal    (0.05, 0.10, 0.25, 0.30)

Table 3.2 Linguistics and m.f. of the uncertain subject

Subject (S)a     Type of m.f.   Membership function (ν, %)
severe           Right ramp     (−0.50, 1.00, 1.50)
middle           Trapezoidal    (1.00, 1.50, 1.80, 2.00)
low              Trapezoidal    (1.80, 2.00, 3.00, 4.00)
unreal           Left ramp      (3.00, 4.00, 8.00)

a Risk (Equity Loss Ratio)

Table 3.3 Linguistics and m.f. of the uncertain predicate

Predicate (P)b   Type of m.f.   Membership function (μ, %)
catastrophic     Right ramp     (−1.00, −0.80, 0.00)
bad              Right ramp     (0.00, 0.05, 0.08)
middle           Trapezoidal    (0.07, 0.10, 0.12, 0.15)
high             Trapezoidal    (0.12, 0.15, 0.20, 0.30)
very high        Left ramp      (0.25, 0.30, 3.00)

b Profitability (Annual Profitability Index)

Next, we pose the linguistic summarizer problem similarly to relation (3.15),

Find   Q, S, P
s.t.   Q ∈ Q, S ∈ S, P ∈ P,
       T(Q, S, P) > 0,    (3.18)
where the calculation of T is made with respect to the financial data A, and then we rank the solutions found in descending order. The higher the value of T(Q̄, S̄, P̄), the better the characterization of the data through the linguistic summary (Q̄, S̄, P̄). The problem is solved through the following steps.
Methodology for obtaining the linguistic summaries
1. Derive the number of possible combinations of the uncertain sets Q, S, and P defined in Tables 3.1, 3.2, and 3.3, and consider the respective uncertain propositions (Q, S, P). In our case, the number of statements will be 3 × 4 × 5 = 60. Consequently, each proposition takes the comprehensible human language form
   (Q, S, P): Q investment funds with S risk have P rentability    (3.19)
2. Associate each proposition (Q, S, P) with the respective membership function arrangement (λ(x), ν(y), μ(z)). In the case of the study under consideration, the domains of the m.f. are, respectively, x ∈ (0%, 100%), y ∈ [−50%, 800%] (risk, i.e., equity loss ratio), and z ∈ [−100%, 300%] (profitability, i.e., annual profitability index). In this work we do not take the quantifiers all or none, which is why neither x = 0% nor x = 100% is considered.
3. Solve the problem (3.18) by calculating the truth value T(Q, S, P) for each proposition through Theorem 3.3, making use of relations (3.9)–(3.13). Finally, the solutions found, (Q̄, S̄, P̄), are ranked in descending order. To accelerate the computation, use relations (3.20) and (3.21), which we propose in the next Corollary.
To accelerate some calculations of T(Q, S, P) in step 3 of the above methodology, we propose the following Corollary of Liu's Theorem 3.3.

Corollary 3.1 Let (Q, S, P) be an uncertain proposition with membership functions given by (λ, ν, μ), respectively, and let λ be an unimodal function. Assume that λ(x) is a trapezoidal membership function (a, b, c, d). Then, for a non-empty set Sω, the quantities kω and k*ω in the Liu formula

T(Q, S, P) = sup_{0 ≤ ω ≤ 1} { ω ∧ Δ(kω) ∧ Δ*(k*ω) }

of Theorem 3.3 take the values:

kω = ⌈a + ω(b − a)⌉    (3.20)

k*ω = |Sω| − ⌊d − ω(d − c)⌋    (3.21)

where ⌈x⌉ and ⌊x⌋ are the ceiling and floor functions of x.

Proof Relations (3.20) and (3.21) follow straightforwardly from Eqs. (3.10) and (3.12), respectively.

Figure 3.4 displays an overview of the main points discussed up to here in order to characterize, under uncertainty, dataset A. The characterization provides insight to evaluate the performance of the funds.
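The chapter's function f was written in Matlab and is not reproduced here; as a loose sketch of how Theorem 3.3 and Corollary 3.1 can be combined for a percentage quantifier, the following Python function approximates T(Q, S, P) on a grid of ω values. The helper ramp functions, the grid size, and the toy data are our own illustrative assumptions.

```python
import math

def left_ramp(x, a, b, c):
    return 1.0 if b <= x <= c else ((x - a) / (b - a) if a < x < b else 0.0)

def right_ramp(x, a, b, c):
    return 1.0 if a <= x <= b else ((c - x) / (c - b) if b < x < c else 0.0)

def truth_value(data, lam_pct, nu, mu, n_grid=200):
    """Approximate T(Q, S, P) via Theorem 3.3 and Corollary 3.1 on a grid of omega values.
    data    : list of (risk, profitability) pairs,
    lam_pct : trapezoidal quantifier (a, b, c, d) expressed as fractions in (0, 1),
    nu, mu  : subject (risk) and predicate (profitability) membership functions."""
    a, b, c, d = lam_pct
    best = 0.0
    for step in range(1, n_grid + 1):
        w = step / n_grid                               # candidate omega in (0, 1]
        s_w = [mu(z) for (y, z) in data if nu(y) >= w]  # mu evaluated over S_omega
        if not s_w:                                     # Corollary assumes S_omega non-empty
            continue
        n_w = len(s_w)
        # Percentage-quantifier version of (3.20)-(3.21): counts from the trapezoid cut.
        k = math.ceil(n_w * (a + w * (b - a)))
        k_star = n_w - math.floor(n_w * (d - w * (d - c)))
        desc = sorted(s_w, reverse=True)
        comp = sorted((1.0 - v for v in s_w), reverse=True)
        delta = 1.0 if k <= 0 else (0.0 if k > n_w else desc[k - 1])              # Delta(k_omega)
        delta_star = 1.0 if k_star <= 0 else (0.0 if k_star > n_w else comp[k_star - 1])
        best = max(best, min(w, delta, delta_star))
    return best

# Toy sample of (Equity Loss Ratio, Annual Profitability) pairs, only to exercise
# the sketch; the study uses the 1095 SFC observations instead.
toy = [(3.5, 0.03), (4.2, 0.06), (5.0, 0.02), (1.2, 0.10), (4.8, 0.04), (3.9, 0.07)]
t = truth_value(toy,
                lam_pct=(0.30, 0.45, 0.55, 0.70),              # 'half'  (Table 3.1)
                nu=lambda y: left_ramp(y, 3.00, 4.00, 8.00),   # 'unreal' risk (Table 3.2)
                mu=lambda z: right_ramp(z, 0.00, 0.05, 0.08))  # 'bad' rentability (Table 3.3)
print(round(t, 4))
```

Sweeping such a function over the 60 combinations of Tables 3.1, 3.2, and 3.3 and keeping the propositions with a positive truth value would correspond to step 3 of the methodology above.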
3.4 Results
The methodology outlined in Sect. 3.3 was implemented in Matlab (R2020a) through one main function f that evaluates the truth value T for given inputs Q, S, P, and A. The function was executed on a 1.6 GHz dual-core Intel Core i5 processor (Turbo Boost 3.6 GHz), each execution lasting about 0.1125 s. The complete experiment finished after approximately 6.10 s. The computational algorithm implemented in f has a general scope, so it is functional for any dataset A with characteristics similar to the ones discussed in this work.
Fig. 3.4 The diagram is read clockwise starting at Financial Data. Liu’s uncertain data mining poses a linguistic summarizer on dataset A to evaluate the performance of the investment funds. Implementation of the summarizer is general in scope to ease the fit toward other problems
Consequently, f is considered a tool for performing data mining under uncertainty. Twenty linguistic summaries were found that jointly describe the performance of investment funds in Colombia in the period of study. Such summaries are denoted, according to the notation used, as (Q̄, S̄, P̄), and their truth values satisfy T(Q̄, S̄, P̄) > 0. Table 3.4 shows the first 10 results, whose truth values were greater than 0.5, and Table 3.5 reports those with truth values between 0 and 0.5.
Table 3.4 Results: linguistic summaries—part I (Colombian investment funds)

Linguistic summary (Q, S, P)                                         T(Q, S, P)
HALF investment funds with UNREAL risk have BAD rentability         0.8415
A FEW investment funds with MIDDLE risk have MIDDLE rentability     0.7225
A FEW investment funds with LOW risk have MIDDLE rentability        0.6901
HALF investment funds with MIDDLE risk have BAD rentability         0.6702
HALF investment funds with LOW risk have BAD rentability            0.6424
HALF investment funds with SEVERE risk have BAD rentability         0.6199
A FEW investment funds with SEVERE risk have HIGH rentability       0.5871
A FEW investment funds with SEVERE risk have MIDDLE rentability     0.5744
A FEW investment funds with UNREAL risk have MIDDLE rentability     0.5509
A FEW investment funds with MIDDLE risk have HIGH rentability       0.5100
Table 3.5 Results: linguistic summaries—part II (Colombian investment funds)

Linguistic summary (Q, S, P)                                              T(Q, S, P)
A FEW investment funds with UNREAL risk have VERY HIGH rentability       0.4126
A FEW investment funds with LOW risk have HIGH rentability               0.2651
A FEW investment funds with UNREAL risk have HIGH rentability            0.2545
A FEW investment funds with MIDDLE risk have VERY HIGH rentability       0.1529
A FEW investment funds with LOW risk have VERY HIGH rentability          0.1045
A FEW investment funds with SEVERE risk have CATASTROPHIC rentability    0.0926
A FEW investment funds with MIDDLE risk have CATASTROPHIC rentability    0.0815
A FEW investment funds with LOW risk have CATASTROPHIC rentability       0.0652
A FEW investment funds with UNREAL risk have CATASTROPHIC rentability    0.0651
A FEW investment funds with LOW risk have BAD rentability                0.0228
The linguistic summaries found, for which T(Q̄, S̄, P̄) > 0, were all verified carefully with respect to data A to determine the sense of the uncertain propositions produced by the linguistic summarizer (3.18). The verification was done by using two data filters in a worksheet. The procedure for the first summary is as follows.
3.4.1 Checking the First Uncertain Proposition
Let us take the first linguistic summary of data A in Table 3.4 together with its truth value,

T(HALF investment funds with UNREAL risk have BAD rentability) = 0.8415    (3.22)

Fig. 3.5 Equity Loss Ratio vs Annual Profitability on investment funds. A posteriori human verification through filter 1 of the logical sense for the first linguistic summary found by the summarizer
The first filter focuses on the risk values y, so that y takes only values greater than or equal to 3.00, in coherence with the left ramp membership function (3.00, 4.00, 8.00) indicated in the fourth row of Table 3.2. The first filter reduces the samples to n1 = 389. The second filter is applied to the profitability values z, so that z takes only values in [0.00, 0.08], in coherence with the right ramp membership function (0.00, 0.05, 0.08) indicated in the second row of Table 3.3. The first and second filters together reduce the samples to n2 = 224. Note that both filters, which constitute two events in the language of probability theory, define a rectangular region on the risk–profitability plane given by R1 = {(y, z) | 3 ≤ y ∧ 0 ≤ z ≤ 0.08}. Such a region does not account for any unsharp concepts related to unreal risk or bad profitability; it only matters whether an observation (ai, bi) ∈ A falls in R1 or not. All the above is illustrated in Figs. 3.5 and 3.6. Figure 3.5 applies the first filter to data A and hence suggests the effectiveness of the machine summarizer answer (3.22) to humans. The argumentation is as follows. From the uncertainty perspective, region R1 encloses a data cluster in which each observation possesses degrees of membership ν_unreal(y) and μ_bad(z) (Tables 3.2 and 3.3). Now, if 4 ≤ y and 0 ≤ z ≤ 0.05, the membership functions leave no doubt about which investment funds satisfy the statement given in (3.22). Meanwhile, in the complementary region, i.e., if 3 ≤ y ≤ 4 and 0.05 ≤ z ≤ 0.08, it makes sense to assert that there is a human belief degree (an uncertain measure between 0 and 1) that the respective companies' performance is still validly described by both unsharp concepts, unreal risk and bad profitability. Consequently, among all the investment funds with unreal risk, it makes sense to assert that approximately half of them have bad profitability. Such a linguistic summary is not precise, and its uncertain measure therefore reflects a truthfulness of 0.8415.
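For readers who want to replicate this check programmatically rather than with worksheet filters, a minimal sketch is given below; the small data sample is invented so the snippet runs, whereas the study applies the filters to the 1095 observations and obtains n1 = 389 and n2 = 224.

```python
# Sketch of the two filters used to verify the first linguistic summary.
# `data` holds (Equity Loss Ratio, Annual Profitability) pairs; this sample is made up.
data = [(3.5, 0.03), (4.2, 0.06), (5.0, 0.02), (1.2, 0.10), (4.8, 0.04), (3.9, 0.12)]

unreal = [(y, z) for (y, z) in data if y >= 3.00]                      # filter 1: unreal risk
bad_among_unreal = [(y, z) for (y, z) in unreal if 0.00 <= z <= 0.08]  # filter 2: bad rentability

n1, n2 = len(unreal), len(bad_among_unreal)
p = n2 / n1 if n1 else float("nan")
print(n1, n2, round(p, 4))   # with the full data set: 389, 224, and p = 0.5758
```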
Fig. 3.6 Equity Loss Ratio versus Annual Profitability on investment funds. A posteriori human verification through filters 1 and 2 of the logical sense for the first linguistic summary found by the summarizer
On the other hand, Fig. 3.6 applies both filters to A, and this allows us to compare the uncertain measure with the probability measure. While the corresponding truth value is T = 0.8415, the probability would be p = n2/n1 = 0.5758. This proportion is not far from the quantifier defined as half. The probability and uncertain measures are not competitors. However, a human may be more comfortable with the measure T, given that it is assigned to linguistic summaries written in the usual form of predicate propositions, which embed unsharp terms for quantifiers, subjects, and predicates. This kind of language expression is easily understood by humans and is of practical use to find patterns in the initial data set, evaluate the performance of the investment funds, and later support informed decision making. Consequently, the first linguistic summary (3.22) has been meaningfully extracted from data A by the linguistic summarizer (3.18), with a truth value of 0.8415.
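To make the verification procedure concrete, the following sketch shows how the two crisp filters and the probability p = n2/n1 described above could be computed from raw risk-profitability pairs. The variable names and the five example observations are hypothetical; the truth value T = 0.8415 comes from the uncertain summarizer itself, not from this calculation.

```python
# Hypothetical sketch of the two crisp filters used in the verification procedure.
# data: list of observations (a_i, b_i) = (risk, profitability); values are illustrative only.
data = [(3.5, 0.04), (4.2, 0.06), (2.1, 0.10), (5.0, 0.02), (3.8, 0.09)]

# Filter 1: keep funds whose risk y is at least 3.00 (support of the 'unreal risk' concept).
filter1 = [(y, z) for (y, z) in data if y >= 3.00]
n1 = len(filter1)

# Filter 2, applied on top of filter 1: keep profitability z in [0.00, 0.08] ('bad profitability').
filter2 = [(y, z) for (y, z) in filter1 if 0.00 <= z <= 0.08]
n2 = len(filter2)

# Probability of the rectangular region R1, to be compared with the truth value T.
p = n2 / n1 if n1 > 0 else 0.0
print(f"n1 = {n1}, n2 = {n2}, p = {p:.4f}")
```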
3.4.2 Truth Value Versus Probability Measure

The full set of linguistic summaries shown in Tables 3.4 and 3.5, found by the summarizer (3.18) and ranked according to their truth values T, was checked against data A, showing that the linguistic summarizer characterizes the data effectively and efficiently. A related checking routine, based on a comparison between the measure T and the counterpart probability p, is detailed in what follows. The logical sense of the n-th linguistic response (Q, S, P) can be examined by comparing its truth value Tn with the counterpart probability measure pn, n = 1, …, 20, following the same procedure outlined in the previous section. More precisely, suppose the membership function ν(y) of S takes the form (νl, …, νr) and the membership function μ(z) of P is (μl, …, μr).
Table 3.6 Linguistic summaries - Truth value T vs. probability measure p

Q        S        P             T        p
half     unreal   bad           0.8415   0.5758
a few    middle   middle        0.7225   0.1883
a few    low      middle        0.6901   0.2444
half     middle   bad           0.6702   0.4848
half     low      bad           0.6424   0.5093
half     severe   bad           0.6199   0.5022
a few    severe   high          0.5871   0.0649
a few    severe   middle        0.5744   0.1732
a few    unreal   middle        0.5509   0.2391
a few    middle   high          0.5100   0.1190
a few    unreal   very high     0.4126   0.0617
a few    low      high          0.2651   0.0944
a few    unreal   high          0.2545   0.0771
a few    middle   very high     0.1529   0.0606
a few    low      very high     0.1045   0.0593
a few    severe   catastrophic  0.0926   0.2944
a few    middle   catastrophic  0.0815   0.2468
a few    low      catastrophic  0.0652   0.2370
a few    unreal   catastrophic  0.0651   0.1902
a few    low      bad           0.0228   0.5093
Remember that both ν and μ have either trapezoidal, left ramp, or right ramp membership functions. Consequently, if the pair (y, z) represents the risk-profitability of any investment fund in data set A, then pn is the probability of the event νl ≤ y ≤ νr and μl ≤ z ≤ μr. In this way, Table 3.6 shows the unsharp terms Q, S, and P linked to the summaries, as well as the respective measures T(Q, S, P) and p. On the other hand, it is noticeable that the quantifier Q in the linguistic summaries never adopts the unsharp term most, suggesting that data A are highly dispersed. Also, Fig. 3.7 provides a view of the results reported in Table 3.6. The quantifiers a few and half are displayed in areas that correspond to their trapezoidal membership functions, given by (5%, 10%, 25%, 30%) and (30%, 45%, 55%, 70%), respectively. The parts of those regions where the membership function is less than one are represented diffusely. We see that the probability p associated with the n-th linguistic summary belongs to the quantifier domain of that summary; therefore, the measure p falls into the respective region linked to the unsharp quantifier. The only exception to this observation is the 20th summary. Note that the probability framework does not use uncertainty terms for the subject S (risk) and predicate P (profitability) but deterministic intervals instead. Also, the measure T represents the truthfulness of each summary found by the machine summarizer.
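As a minimal illustration of how the crisp event behind pn is obtained from a membership function, the sketch below evaluates a trapezoidal membership (the parameters are illustrative, not the ones fitted in this study) and extracts the support bounds that define the interval used by the probability measure.

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership (a, b, c, d): 0 outside [a, d], 1 on [b, c], linear ramps in between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a) if b > a else 1.0
    return (d - x) / (d - c) if d > c else 1.0

# Illustrative quantifier 'half' over the percentage of funds, cf. (30%, 45%, 55%, 70%).
membership_of_5758 = trapezoid(57.58, 30.0, 45.0, 55.0, 70.0)

# The crisp event used for p_n keeps only the support [a, d] of each membership function.
support = (30.0, 70.0)
print(membership_of_5758, support)
```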
Fig. 3.7 Estimation of the truth values T and probability p for the linguistic summaries found. The summaries describe the performance of investment funds in Colombia in the period of study. Calculations of T are made under Liu's uncertainty framework, and the summaries, written in human language, are found by implementing the linguistic summarizer in (3.18)
Equivalently, we can say that the imprecision committed in each summary, when describing the behaviour of the data, can be measured by 1 − T. From the measure of the truth value T, it can be deduced that the first ten summaries confidently describe data A using human language. All the above argumentation suggests that the linguistic summarizer (3.18) can describe the behaviour of the raw data A autonomously through uncertain propositions of the form Q of S are P. Moreover, such a summarizer could become a valuable tool for data mining under uncertainty, in which the machine and human languages are close to each other.
3.4.3 Scope and Other Possibilities

Truthfulness T(Q, S, P) works on linguistic summaries in the predicate logic form Q of S are P, where Q, S, and P are unsharp terms for quantifier, subject, and predicate, respectively. Such terms are modelled through uncertain sets. More complex uncertain propositions and their truth values are beyond the scope of this work, but they appear to be interesting approaches. On the other hand, a data set A of n individuals for whom two attributes are measured could be a candidate for characterization by means of the linguistic summarizer. For example, apart from finance, medical services could find in the implementation of the linguistic summarizer an uncertain data mining tool for making diagnoses.
3.5 Conclusions

• In the case study, probability theory defines fixed limits for the proposed propositions, from which it calculates proportions on the data based on principles of classical logic. On the other hand, uncertain propositions do not have fixed limits; they are vague, and their membership functions overlap, so it is necessary to assign a degree of truth.
• The linguistic summaries obtained are uncertain statements that may represent clusters of the data in terms of degrees of truth, different from clusters based on distance functions. In this study, 60 linguistic summaries of the data were introduced, of which 20 were found with a degree of truth higher than zero, and 10 with a degree greater than 0.5. These uncertain propositions characterized the behaviour of the data.
• It was found that the behaviour of the performance of the Colombian investment funds, in the time under study, is best represented by the linguistic statement 'HALF investment funds with UNREAL risk have BAD profitability', which obtained a truth value equal to 0.8415. The best ten linguistic summaries were ranked according to their truth value to characterize the risk-return dataset of the investment funds. The summaries are written in human language and, in turn, make up a set of reliable evaluative descriptors of their performance.
• The proposed methodology and the implemented algorithm exceeded expectations in efficacy and efficiency regarding the characterization of the data through linguistic summaries.
• Data mining to extract information in the form of linguistic summaries is currently a field of in-depth exploration, since it opens up many possibilities for practical work and applications that support decision-making in areas such as investment, medicine, information mining in social networks, transportation systems, services, and businesses, among others.
Acknowledgements Universidad de Antioquia, Colombia and Universidad Nacional de Colombia, Colombia.
References

Choquet G (1954) Theory of capacities. Ann l'Institute Fourier 5:131–295
Gao X (2009) Some properties of continuous uncertain measure. Int J Uncertain Fuzziness Knowledge-Based Syst 17:419–426. https://doi.org/10.1142/S0218488509005954
Kacprzyk J, Yager RR (2001) Linguistic summaries of data using fuzzy logic. Int J Gen Syst 30:133–154. https://doi.org/10.1080/03081070108960702
Kolmogorov AN (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung. Julius Springer, Berlin
Liu B (2007) Uncertainty theory, 2nd edn. Springer-Verlag, Berlin Heidelberg
Liu B (2010a) Uncertainty theory: a branch of mathematics for modeling human uncertainty. Springer-Verlag, Berlin
Liu B (2010b) Uncertain set theory and uncertain inference rule with application to uncertain control. J Uncertain Syst 4:83–98
Liu B (2011) Uncertain logic for modeling human language. J Uncertain Syst 5:3–20
Liu B (2012) Membership functions and operational law of uncertain sets. Fuzzy Optim Decis Mak 11:387–410. https://doi.org/10.1007/s10700-012-9128-7
Liu B (2015) Uncertainty theory, 4th edn. Springer Uncertainty Research, Beijing
Liu B (2018) Totally ordered uncertain sets. Fuzzy Optim Decis Mak 17. https://doi.org/10.1007/s10700-016-9264-6
Mamdani E, Assilian S (1999) An experiment in linguistic synthesis with a fuzzy logic controller. Int J Hum Comput Stud 51:135–147. https://doi.org/10.1006/ijhc.1973.0303
Peng Z, Iwamura K (2012) Some properties of product uncertain measure. J Uncertain Syst 6:263–269
Sugeno M (1974) Theory of fuzzy integrals and its applications. Tokyo Institute of Technology
Yager RR (1982) A new approach to the summarization of data. Inf Sci (NY) 28:69–86. https://doi.org/10.1016/0020-0255(82)90033-0
Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353. https://doi.org/10.1016/S0019-9958(65)90241-X
Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning-I. Inf Sci (NY) 8:199–249. https://doi.org/10.1016/0020-0255(75)90036-5
Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1:3–28. https://doi.org/10.1016/0165-0114(78)90029-5
Zadeh LA (1983) A computational approach to fuzzy quantifiers in natural languages. Comput Math with Appl 9:149–184. https://doi.org/10.1016/0898-1221(83)90013-5
Zadeh LA (1996) Fuzzy logic = computing with words. IEEE Trans Fuzzy Syst 4:103–111. https://doi.org/10.1109/91.493904
Zhang Z (2011) Some discussions on uncertain measure. Fuzzy Optim Decis Mak 10:31–43. https://doi.org/10.1007/s10700-010-9091-0
Chapter 4
Map-Bot: Mapping Model of Indoor Work Environments in Mobile Robotics

Gustavo Alonso Acosta-Amaya, Andrés Felipe Acosta-Gil, Julián López-Velásquez, and Jovani Alberto Jiménez-Builes
Abstract This work presents a mapping model of indoor work environments in mobile robotics, called Map-Bot. The model integrates hardware and software modules for navigation, data acquisition and transfer, and mapping. Additionally, the model incorporates a computer that runs the software responsible for the construction of two-dimensional representations of the environment (Vespucci module), a mobile robot that collects sensory information from the workplace, and a wireless communications module for data transfer between the computer and the robot. The results obtained allow the implementation of the reactive behavior "follow walls", tracking walls located on the robot's right side over paths of 560 cm. The model achieved safe and stable navigation in indoor work environments using this distributed approach.
4.1 Introduction

One of the fields of mobile robotics that has generated huge interest in recent years within the international scientific community is planned navigation to accomplish cooperative tasks in structured environments. The success in the execution of the assigned tasks depends, to a great extent, on the availability and reliability of a priori models of these environments. Although a large variety of methodologies for surveying and building models of different types of work environments have been proposed in the scientific literature, the task remains challenging, mainly when reliable and accurate models must be built in a quick and economical way (Acosta 2010; Ajeil et al. 2020). The main issues regarding mobile robotics that the literature reports are (Labidi and Lajouad 2004): perception, location, navigation, intelligence, autonomy and cooperation. These problems have traditionally been approached from a perspective that favors the concentration of functions in a single robot or robotic agent, which performs the exploration and collection of data from the environment with the purpose of preparing the model (Betskov et al. 2019). For this purpose, the fusion of odometric, infrared (IR) and ultrasound (sonar) measurements is used. The principal disadvantage of this approach is that the computational capacity, as well as the perception and locomotion resources, are concentrated in a single agent (Acosta 2010). In recent years, a new approach has gained acceptance that consists in the use of Robotic Multi-Agent Systems (RMAS), which work in a collaborative and distributed way to solve the problem of systematic exploration and data collection from the environment (Souza 2002). This task is accomplished more efficiently when different types of agents cooperate with each other to carry out a specific goal (Minelli et al. 2020). An RMAS is not necessarily configured with several physical agents; it can be just one agent equipped with a series of complementary hardware and software elements (Acosta 2010). Thus, more efficient solutions to the problem of creating models of the environment have been presented, which have been used in applications such as the transportation of supplies and raw materials in industrial complexes, security and surveillance of facilities, and search and rescue of victims, among others (Darintsev et al. 2019). The two main advantages of an RMAS-based approach lie in the performance improvement (less time is required to complete a task) and in distributed action and perception (which improves fault tolerance and provides redundancy). Disadvantages include the interference between sensory systems when implementing multiple robots, uncertainty regarding other agents, and the complexity of the communication system (Acosta 2010). Although the RMAS approach can facilitate the problem of distributing the tasks needed to achieve a complex goal, controlling such a system can still be a challenge, due to the nonlinearities of the system and the high number of perturbations found in the environment. Moreover, building cost-effective robotic platforms entails higher measurement errors and a higher sensitivity to measurement noise. For these reasons, Artificial Intelligence (AI)-based control constitutes an important alternative in the robotics area, with techniques such as Neural Networks and Fuzzy Logic, which exploit large amounts of available data and the knowledge of experts to build a system able to represent the nonlinearities and complexities of the system for control purposes. These techniques have been used successfully to predict market revenues (Jian et al. 2020), to control induction machines in electrical power system applications (Bouhoune et al. 2017), and, in the field of mobile robotics, to implement wall-following behaviors (Budianto et al. 2017).
This chapter presents a model that allows systematic exploration tasks of structured scenarios to be carried out using the RMAS paradigm and based on reactive navigation algorithms, which allow the development of digital models of these scenarios. The model incorporates a robotic agent, a software agent, a software module for the computational representation of environments, and a communications station. The robotic agent, called Walling, has its own perception and exteroceptive systems, control, effectors and communication. This agent was assigned the job of capturing data from the environment as it explores it. Control of the movements of the robot is the task of the software agent Magalhães, which ensures the safe navigation of Walling in its environment based on two emerging behaviors: a high-priority one for obstacle avoidance based on brute-force (hard) control, and a low-priority behavior for contour tracking, which was implemented based on a Mamdani-type fuzzy controller. The model allowed the construction of test environments such as sections of corridors with columns, walls and access doors; sections of office areas delimited by walls and access doors were also considered. The acquired maps can be used in navigation, location and trajectory planning tasks. The chapter is organized as follows: the next section describes the materials and methods used in this work; then, in section three, the results and the discussion of the proposed model are presented, emphasizing the mobile robotic agent known as Walling, the communication interface, and the navigation and mapping modules. Lastly, conclusions and the bibliography are presented.
4.2 Materials and Methods

Mapping of unknown and dynamic spaces constitutes today one of the essential problems of mobile robotics (Li and Savkin 2018). It is a complex issue, not easy to solve, in part because it is highly correlated with other problems such as navigation, location and perception (Thrun et al. 2000; Habib 2007). Indeed, complex tasks such as route tracing and deliberative navigation require accurate and reliable models of the environment, which can be easily updated when the environment is volatile (objects and people in motion) (Acosta 2010). Mapping is the problem of incorporating the data acquired by one or more agents into a computational model of the environment (Bozhinoski et al. 2019). There are two crucial aspects in mapping (Acosta 2010): the available techniques for the representation of physical environments and the models for the correct estimation of the data provided by the sensory systems. In most cases a priori maps of the environment are not available; therefore, new skills and abilities of mobile robotics are required, for instance the ability to accomplish secure exploration, data acquisition and information processing, and above all, the capacity to create maps in situ. As the robotic agent develops a representation of the environment, it must also be able to simultaneously determine its location. In the scientific literature, SLAM (Simultaneous Localization and Mapping) is known as the problem of creating a map and at the same time localizing the robot within it (Acosta 2019; Tiwari and Chong 2020).
A good model of the environment allows the robot to establish safe and optimal routes that enable it to go to the places where it must perform its objectives (Al-Taharwa et al. 2008; McGuire et al. 2019). Building and updating maps turns out to be a complex problem due to factors such as (Andrade-Cetto and Sanfeliu 2001; Aboshosha and Zell 2003; Shiguemi 2004): sensory uncertainty, uncertainty while moving, unfavorable conditions within the area, and range limitations. In the last two decades, there has been a considerable preference for probabilistic techniques, which is due to the fact that they allow modeling the uncertainty associated with commonly used perception systems (Habib 2007). Devices such as sonar, lasers, infrared modules, compasses and GPS allow robots to perceive and acquire data from the environment. However, it is also necessary to devise navigation strategies for the exploration of the surroundings, making it possible to compensate for the noise in the measurements and the limitations of the observation ranges (Shiguemi 2004). The dominant paradigms in the representation of work environments are topological maps and metric maps (Dufourd 2005; Acosta 2010, 2019; Islam et al. 2020). Sonar (Sound Navigation and Ranging) is probably the perception device that shows the best cost/benefit ratio. The device is used for obstacle detection and distance calculation. It offers a number of advantages over other laser-, vision- and radar-based telemetric devices, such as (Acosta 2010): low cost, bandwidth, and insensitivity to smoke and light. Among the disadvantages are (Acosta 2010): low angular resolution, the dependence of the speed of sound propagation on temperature and humidity, and the specular behavior exhibited by the sound signals. Ultrasonic perception systems are used to detect and measure the distance to objects in the working area. The principle of operation is fairly simple and known as echo-detection. An ultrasound-transmitting element emits an acoustic pulse to the surrounding medium; in the presence of an obstacle, an echo is received at the sonar receiving element. The distance to the obstacle that returns the echo is calculated from the measurement of the Time of Flight (TOF) elapsed from the emission of the ultrasonic signal to the reception of the first echo. Equation (4.1) calculates the distance to the object from the TOF. Figure 4.1 explains the echo-location operating principle associated with sonar systems.
Fig. 4.1 Measurements of distances by Time of Flight (TOF) of the ultrasonic sensors. Source Authors
D = (Vs × TOF) / 2    (4.1)
In the preceding equation:
D: Distance from the sonar to the surface of the nearest detected object.
Vs: Speed of sound in the propagation medium, 340 m/s at sea level.
TOF: Time of Flight of the ultrasonic pulse from its emission to the reception of the echo.
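Equation (4.1) translates directly into code. The sketch below is only illustrative: the function name is made up, and the speed of sound is taken as the 340 m/s sea-level value quoted above.

```python
SPEED_OF_SOUND_CM_PER_US = 340.0 * 100 / 1_000_000  # 340 m/s expressed in cm/us

def distance_from_tof(tof_us: float) -> float:
    # Distance in cm to the nearest object, from the Time of Flight in microseconds, Eq. (4.1).
    return (SPEED_OF_SOUND_CM_PER_US * tof_us) / 2

# Example: an echo received 5880 us after emission corresponds to roughly 100 cm.
print(distance_from_tof(5880))
```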
4.3 Results and Discussion

For the mapping of indoor environments, the Map-Bot model was developed, which integrates hardware and software modules for navigation, data acquisition, transfer of information, and mapping. The model incorporates a computer that runs the software responsible for the creation of the two-dimensional representations of the work areas (Vespucci module), a mobile robotic agent that collects the sensory information of the environment, and a wireless communication module for data transfer between the computer and the robot, using the RMAS paradigm (Fig. 4.2). Each of the components is defined below.
4.3.1 Mobile Robotic Agent

The exploration of the work environment is achieved using the mobile robot Walling. Its name derives from the navigation algorithm known as Wall Following, a reactive navigation technique widely used in robotic navigation in indoor environments.
Fig. 4.2 Map-Bot architecture for mapping of indoor environments. Source Authors
Walling is a robotic agent equipped with control, perception and actuation subsystems that allow it to safely explore the environment, avoiding obstacles around it. During the exploration, the robot collects and transmits the information on the distances to the surrounding objects, such as walls, chairs and tables, as well as the length of the paths covered (Acosta 2010). Walling has two acrylic bases to attach the effectors and the control and perception modules (Fig. 4.3). The lower one is a circular base of 12.9 cm diameter; all devices that are part of the effector system are attached to it: a double gearbox with two DC electric motors, two drive wheels, a caster ball, and an odometry system. The locomotion corresponds to a centered differential configuration with a third point of contact. In this way, the mechanical structure provides maneuverability and stability to the robot. The second acrylic base holds the effector control circuits, the sensors and the communication module, fixed modularly. The effector control system allows controlling the speed and direction of rotation of the motors attached to the gearbox. Two TA7291SG H-bridges are used; each one provides an average current of up to 400 mA. The control system generates PWM (Pulse-Width Modulation) signals, which activate the H-bridges. Navigation control and data processing functions are accomplished by a DEMOQE-128 development system, whose main control device is a 32-bit Coldfire MCF51QE128 microcontroller running at an internal bus frequency of 26.6 MHz. The exteroceptive perception and communications system incorporates three SRF02 ultrasonic modules for distance measurement. These sensors provide the distances to obstacles near the front and right side of the robot. The paths traveled and the range measurements are transferred to the Vespucci mapping module that runs on the computer, through an XBee wireless communications module that supports the IEEE 802.15.4 protocol.
Fig. 4.3 Four-tier architecture of the mobile robot Walling. Source Authors
The distances traveled by Walling during the exploration of the environment are determined by an odometric measurement system. The robot's navigation algorithm maintains the distance to walls located on its right side, which helps to preserve the initial orientation (angular position) of the robot; in this way, almost rectilinear navigation paths are obtained. The odometric system of Walling allows determining the number of complete revolutions, or fractions of a revolution, that each of its wheels has made. Therefore, it is possible to know the length of the movements carried out by the robot. The hardware incorporates a reflective (encoded) disk with absorption and reflection bands for infrared light, which is attached to each drive wheel of the robot. An infrared light-emitting LED projects a beam onto the disc, which reflects the beam (or not) onto a photodetector element. This photodetector generates a train of electrical pulses as the wheel, and the reflective disc attached to it, rotate. The disc has thirty-two reflective stripes and an equal number of non-reflective stripes, which implies that one complete revolution of the wheel generates thirty-two electrical pulses in the photodetector. Considering that the perimeter of the drive wheels is 11.62 cm, it is easy to determine the linear distance the robot travels by counting the pulses generated by the perception system. This measurement technique is known as odometry and can be implemented based on an electronic device called an encoder. Walling has two quadrature encoders to measure the linear displacement of each wheel. The odometry of the robot is implemented with the device WW-01 manufactured by Neotic Design Inc. Pulse counting and distance calculation are performed by the robot's central processor through interrupts of the Timer/Counter modules TPM1 and TPM3. An interrupt is generated every thirty-two pulses, and the corresponding interrupt routines update the variables that store the robot's movements. Walling has a sonar system arranged orthogonally on the SenCom_Card circuit board. This arrangement allows the robot to obtain distance measurements to obstacles located in its immediate environment (Acosta 2019). From these data, the Magalhães navigation module generates the control signals that grant the safe navigation of the robot in its environment. The readings provided by the S0 and S1 sensors are used to implement a contour-following navigation control; in this kind of navigation, the robot borders the contour of the objects around it and maintains a distance from them. The S2 sensor is used to avoid obstacles located in front of the robot. The exteroceptive sensors of the Walling robot were implemented based on the SRF02 sonar, a ranging ultrasonic device that operates at a frequency of 40 kHz. Commands and data are transferred between the sonar and the central processor through a UART interface, with frame format 9600, 8, N, 1 and TTL levels (Acosta-Amaya et al. 2020). This transducer allows the simultaneous operation of up to sixteen SRF02 modules connected to a serial communication bus, it being necessary to configure a different eight-bit address for each sonar. The range of available addresses is from 0x00 to 0x0F. Table 4.1 gives the technical specifications of the SRF02 modules.
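The odometric calculation just described can be sketched as follows. The constants (32 pulses per wheel revolution and a wheel perimeter of 11.62 cm) come from the text; the function name and the example pulse count are illustrative.

```python
PULSES_PER_REVOLUTION = 32    # reflective stripes on the encoded disk
WHEEL_PERIMETER_CM = 11.62    # perimeter of each drive wheel

def distance_travelled_cm(pulse_count: int) -> float:
    # Linear distance covered by one wheel, given the encoder pulse count.
    revolutions = pulse_count / PULSES_PER_REVOLUTION
    return revolutions * WHEEL_PERIMETER_CM

# Example: 1540 pulses correspond to about 48 wheel revolutions, i.e. roughly 559 cm.
print(distance_travelled_cm(1540))
```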
Table 4.1 Technical specifications of the SRF02 sensor. Source (Acosta 2010)

Voltage: 5 V
Current: 4 mA (typical)
Frequency: 40 kHz
Maximum range: 600 cm
Minimum range: 15 cm
Gain: 64-step automatic control
Connectivity: I2C bus, serial UART
Adjustment (calibration): Automatic on start
Measurement units: µs, cm, inch
Weight: 0.16226 oz
Dimensions: 24 mm (w) × 20 mm (d) × 17 mm (h)
Although the navigation control module takes readings from three sensors, two on the right side (S0, S1) and one at the front (S2), the SenCom_Card supports up to five sensors, with two additional ones located on the left side of the platform.
4.3.2 Communication Interface

A wireless communication base station was designed and built in order to allow the bidirectional transfer of data between the computer and the robotic agent Walling; this station was given the name SCoI (Subsistema de Comunicaciones Inalámbricas). The SCoI module consists of:
– USB to serial conversion cable.
– MAX-232 integrated circuit.
– Power jack.
– Voltage regulation circuit.
– XBee wireless communications module.
– Power-on LED.
The main component of the SCoI system is the XBee module. With it, it is possible to implement a point-to-point wireless communications link that operates at a frequency of 2.4 GHz in the ISM (Industrial, Scientific and Medical) band. The module complies with the IEEE 802.15.4 standard and preserves data integrity up to distances of 30 m in indoor and urban environments. Figures 4.4 and 4.5 show the electronic diagram and the distribution of components on the PCB (printed circuit board) of the wireless communications subsystem. The SCoI system supports a bidirectional data flow compatible with UART-type interfaces, at a 9600 bit/s transfer rate without parity checking. Each frame is formed by a start bit (active low), 8 data bits (least significant bit first), and a stop bit (active high).
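On the computer side, a UART link with the 9600, 8, N, 1 frame format described above could be opened as in the following sketch, which uses the pyserial library; the port name and the command byte are hypothetical and depend on the host platform.

```python
import serial  # pyserial

# 9600 bit/s, 8 data bits, no parity, 1 stop bit, as used by the SCoI link.
link = serial.Serial(
    port="/dev/ttyUSB0",          # hypothetical port name
    baudrate=9600,
    bytesize=serial.EIGHTBITS,
    parity=serial.PARITY_NONE,
    stopbits=serial.STOPBITS_ONE,
    timeout=1.0,
)

link.write(b"o")        # send one illustrative command byte to the robot
frame = link.read(1)    # read one byte (one UART frame payload) from the robot
link.close()
```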
Fig. 4.4 Electronic diagram of the SCoI wireless communications subsystem. Source Authors
Fig. 4.5 Distribution of components on the PCB. Source Authors
4.3.3 Navigation Module Magalhães

The Magalhães navigation module ensures the navigation of the mobile robotic agent Walling in its environment. This software module implements two reactive behaviors for navigation. The first one corresponds to a control algorithm for tracking contours, such as walls. Navigation based on contour tracking is supported by the S0 and S1 ultrasonic sensors; the implemented fuzzy controller keeps a distance of about 50 cm to objects located on the right side of the robot. The second behavior is to avoid obstacles located in front of the robot, and it is established from the readings provided by the sonar S2.
Once an obstacle is detected at a distance less than or equal to 50 cm, the control module processes the measurement and generates the control actions required to avoid the detected obstacle. These two actions are executed sequentially and consist of stopping both motors and then activating the right one to achieve a quarter-turn displacement of the right wheel, which makes it possible to perform a 90° turn to the left and avoid the obstacle. Once the corresponding turn has been accomplished, the fuzzy wall-tracking control retakes control of the navigation of the mobile robotic agent. Figure 4.6 illustrates a typical navigation scenario and one of the possible paths taken by the robot. The SenCom_Card PCB, located on the upper part of Walling, incorporates the exteroceptive sensors necessary for navigation, as well as the wireless communication module for data transfer between the computer and the robot. Figure 4.7 shows the electronic diagram of the card and the final arrangement of components on the PCB. Two input variables were considered in the fuzzy contour-tracking algorithm, based on the identification and characterization of the linguistic variables involved in the process. They were named e-dilade0 and e-dilade1, corresponding to the error in the lateral distances provided by the ultrasonic sensors S0 and S1, respectively. A reference or set-point of 50 cm was established for this controller; that is, the robot must maintain this separation from the wall located on its right side. Figures 4.8 and 4.9 present the navigation control structure of the Walling robot; they show how the PWM control signal is applied to the right motor through an H-bridge. In this way, it is possible to decrease or increase the angular speed of the right wheel relative to the left one, so that the robot is able to move away from or approach the wall as required.
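The priority scheme between the two reactive behaviors can be summarized with the following sketch. The function and threshold names are placeholders, and the simple proportional law stands in for the actual Mamdani controller, whose design is detailed below.

```python
SETPOINT_CM = 50.0            # desired lateral distance to the right wall
OBSTACLE_THRESHOLD_CM = 50.0  # frontal distance that triggers the avoidance behavior

def fuzzy_wall_following(error0: float, error1: float) -> float:
    # Stand-in for the Mamdani controller: proportional law clipped to the 20-80 % duty-cycle range.
    duty = 50.0 + 2.0 * (error0 + error1) / 2.0
    return max(20.0, min(80.0, duty))

def control_step(s0_cm: float, s1_cm: float, s2_cm: float) -> str:
    # One control cycle: obstacle avoidance has priority over wall following.
    if s2_cm <= OBSTACLE_THRESHOLD_CM:
        # Stop both motors, then quarter-turn the right wheel so the robot rotates about 90° to the left.
        return "avoid: stop motors, quarter-turn of the right wheel"
    error0 = SETPOINT_CM - s0_cm
    error1 = SETPOINT_CM - s1_cm
    return f"follow wall: PWM duty cycle {fuzzy_wall_following(error0, error1):.1f} %"

print(control_step(55.0, 52.0, 120.0))  # no frontal obstacle -> wall following
print(control_step(55.0, 52.0, 40.0))   # obstacle at 40 cm -> avoidance maneuver
```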
Fig. 4.6 Typical navigation path of the robotic agent Walling in a structured environment. Source Authors
Fig. 4.7 Electronic diagram for the exteroceptive perception and wireless communications of the Walling robot. Source Authors
Fig. 4.8 Walling robot fuzzy navigation controller block diagram. Source Authors
Fig. 4.9 Sketch of the Walling robot control structure. Source Authors
Navigation tests were initially executed using a single sensor; however, the trajectories turned out to be quite oscillatory around the set-point. In order to improve the performance of the controller, an additional sensor was added, achieving better performance in navigation. This measurement redundancy scheme made it possible to reduce the uncertainties in the readings due to specular reflections: while one of the sensors may be giving a wrong reading, the other sensor may be giving a correct one, thus maintaining control of the navigation. To further reduce the negative impact caused by the specularity phenomenon, associated with every ultrasonic sensor, a software saturation control was incorporated into the navigation module: for measurements equal to zero or greater than 100 cm, the value of the variable that stores the sensor reading is adjusted to 50 cm. Experimental navigation curves are shown in Figs. 4.10 and 4.11; the tests were accomplished in a classroom on a 560 cm section of a wall. The wall is located on the right side of the robot, taking the direction of its linear movement as a reference. The readings were taken at a sampling rate of 10 Sa/s, or 10 Hz. In total, 500 samples were acquired per route, with a navigation time given by:

t = 500 Sa / (10 Sa/s) = 50 s    (4.2)
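The software saturation control mentioned above admits a very small implementation. The 0 cm and 100 cm limits and the 50 cm substitution value come from the text; the raw readings in the example are invented.

```python
def saturate_reading(distance_cm: float) -> float:
    # Replace readings that are clearly spurious (specular reflections) by the 50 cm set-point.
    if distance_cm == 0.0 or distance_cm > 100.0:
        return 50.0
    return distance_cm

# Five illustrative raw S0 samples out of the 500 acquired per 50 s run (10 Sa/s, Eq. (4.2)).
readings = [48.0, 0.0, 52.0, 135.0, 51.0]
filtered = [saturate_reading(r) for r in readings]  # -> [48.0, 50.0, 52.0, 50.0, 51.0]
print(filtered)
```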
The tests were performed at an average linear speed of 11.2 cm/s. The linear distance traveled by Walling during the navigation tests is calculated as follows:
Fig. 4.10 Navigation tests on the Walling robot, a path of 560 cm with a spurious reading of zero centimeters from the S0 sensor after traveling 400 cm. Source Authors
Fig. 4.11 Navigation tests on the Walling robot, a path of 560 cm with a false reading greater than 100 cm after 420 cm of travel. Source Authors
dnav = v × t = 50 s × 11.20 cm/s = 560 cm    (4.3)
Referring again to Figs. 4.10 and 4.11, the results of the tests in the first figure were obtained by positioning the robot at an initial distance equal to the set-point, that is, 50 cm from the wall. A navigation trajectory such as this one is suitable, allowing the construction of maps adjusted to the characteristics of indoor work environments. A spurious reading of the S0 sensor can be identified, giving a false measurement of distance from the wall of zero centimeters. Despite this, the robot completes the circuit without any problems; in this case, the saturation control and a correct distance reading supplied by the S1 sensor allowed the robot to accomplish a good navigation path. In the second figure, the robot is located, at the beginning of the circuit, at a distance of 55 cm, that is, five centimeters above the set-point. In this case the fuzzy navigation controller corrects and maintains the distance to the wall according to the established design criteria. A false distance reading is also observed due to specular reflections of the emitted signal; this time the S0 sensor gives a distance greater than 100 cm. Once again, the controller keeps the robot on a navigation path that is safe and appropriate to the established specifications. Figure 4.12 presents a specular reflection-free navigation test on both sensors. The design of the navigation controller through fuzzy logic started by identifying the input and output linguistic variables required for the control process. Two input variables were considered: the error, or difference, between the 50 cm set-point and the distance measurements supplied by the sensors S0 and S1.
Fig. 4.12 Specular reflection-free navigation in the Walling robot. Source Authors
The PWM signal applied to the robot's right effector was considered as the output variable. The variables were characterized as a tuple of the form:

⟨x, T(x), U, G, M⟩    (4.4)
where:
x: Name of the linguistic variable, for example temperature, speed, or angular position.
T(x): Set of terms or labels that allow the assignment of values to x. The values correspond to fuzzy sets defined in the universe of discourse U.
U: Universe of discourse.
G: Syntactic or grammatical rule for the generation of the terms in T(x).
M: Semantic rule that allows associating each element of T(x) with a fuzzy set in U.
The detailed characterization of the linguistic variables involved in the navigation control is presented in Table 4.2 and in Fig. 4.13. Table 4.3 shows the matrix of rules where the fuzzy interaction between the inputs is established to generate the output; these interactions are specified through a set of IF-THEN rules. Figure 4.13 shows the partitions into fuzzy sets of the linguistic variables e-dilade0, e-dilade1 and PWM, alongside the control surface, which indicates, for specific values adopted by the input variables, the value assigned by the controller to the output variable. For example, for each linguistic variable in Fig. 4.13 a fuzzy set has been highlighted; the indicated selection gives rise to the rule: IF e-dilade0 IS E-NME0 AND e-dilade1 IS E-PPEQ1 THEN PWM IS PWM-ag.
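Operationally, the highlighted rule can be read as in the sketch below, which applies Mamdani min-inference with made-up triangular membership functions; the actual partitions used on the robot are those of Fig. 4.13, not these illustrative shapes.

```python
def tri(x: float, a: float, b: float, c: float) -> float:
    # Triangular membership with vertex b and support [a, c] (illustrative shapes only).
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Rule: IF e-dilade0 IS E-NME0 AND e-dilade1 IS E-PPEQ1 THEN PWM IS PWM-ag
e_dilade0, e_dilade1 = -3.2, 1.1            # example errors in cm, universe [-6, 6]
mu_nme0 = tri(e_dilade0, -5.0, -3.0, -1.0)  # medium negative error, sensor S0
mu_ppeq1 = tri(e_dilade1, 0.0, 1.5, 3.0)    # small positive error, sensor S1

firing_strength = min(mu_nme0, mu_ppeq1)    # Mamdani AND = minimum
print(f"rule fires with strength {firing_strength:.2f}; PWM-ag is clipped at this level")
```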
4.3.4 Vespucci: Mapping Module for Indoor Environments

The three functions of the Vespucci mapping module are (Acosta 2010):
– Establishing bi-directional communications with the robot to transfer commands and data.
– Storing and processing the data transferred by the robot.
– Creating a graphic model of the environment.
Table 4.2 Characterization of linguistic variables for the fuzzy navigation controller of the robot Walling

Input linguistic variables

e-dilade0: Lateral distance error, right wall, sensor S0. Distances taken in centimeters.
Set of terms:
– E-ZER0: Error equal to zero (S0). The robot is at a very close distance to the set-point
– E-PPEQ0: Small positive error (S0)
– E-PME0: Medium positive error (S0)
– E-PGR0: Large positive error (S0). The robot is at a very close distance to the right wall
– E-NPEQ0: Small negative error (S0)
– E-NME0: Medium negative error (S0)
– E-NGR0: Large negative error (S0). The robot is too far from the right wall
Universe of discourse: U = [−6, 6], in cm
Syntactic rules: the expression "Error of lateral distance" is used, together with the modifiers "small", "medium" and "large"
Semantic rules: according to the UART data package related to the communications module SCoI (transmission of the ASCII character "o", 111 in base 10, 0x6F in hexadecimal)

e-dilade1: Lateral distance error, right wall, sensor S1. Distances taken in centimeters.
Set of terms:
– E-ZER1: Error equal to zero (S1). The robot is at a very close distance to the set-point
– E-PPEQ1: Small positive error (S1)
– E-PME1: Medium positive error (S1)
– E-PGR1: Large positive error (S1). The robot is at a very close distance to the right wall
– E-NPEQ1: Small negative error (S1)
– E-NME1: Medium negative error (S1)
– E-NGR1: Large negative error (S1). The robot is too far from the right wall
Universe of discourse: V = [−6, 6], in cm
Syntactic rules: the expression "Error of lateral distance" is used, together with the modifiers "small", "medium" and "large"
Semantic rules: according to the UART data package related to the communications module SCoI (transmission of the ASCII character "o", 111 in base 10, 0x6F in hexadecimal)

Output linguistic variables

PWM: Duty cycle applied to the H bridge of the right motor
Set of terms:
– PWM-z: Correct PWM duty cycle (the robot does not move away from or approach the wall)
– PWM-fp: The PWM duty cycle decreases very little (the robot approaches the wall very slowly)
– PWM-fm: The PWM duty cycle decreases moderately (the robot approaches the wall)
– PWM-fg: The PWM duty cycle decreases largely (the robot quickly approaches the wall)
– PWM-ap: The PWM duty cycle increases very little (the robot moves away from the wall very slowly)
– PWM-am: The PWM duty cycle increases moderately (the robot moves away from the wall)
– PWM-ag: The PWM duty cycle increases largely (the robot quickly moves away from the wall)
Universe of discourse: Z = [20, 80]
Syntactic rules: the expression "PWM duty cycle" is used, together with the modifiers "correct", "decreases/increases very little", "decreases/increases moderately" and "decreases/increases largely"
Semantic rules: according to the UART data package related to the communications module SCoI (transmission of the ASCII character "o", 111 in base 10, 0x6F in hexadecimal)
Fig. 4.13 Partition of linguistic variables in fuzzy sets and control surface of the fuzzy navigation system of the Walling robot. Source Authors
The software was developed in MATLAB using the visual programming environment GUIDE, which makes it possible to develop applications in which the user needs to configure software operating parameters and/or enter initial data. Once the Vespucci module is executed, a graphical interface with an initial presentation window is presented to the user. When the "continuar" button is clicked, a pop-up window offers the possibility of selecting between two modes of operation: one for data acquisition and another for mapping (Fig. 4.14).
4.4 Conclusions and Future Work

The chapter described a model for the autonomous construction of computational representations of the environment. Each hardware and software module that is part of the system was described, showing the results of the experimental tests accomplished for the characterization of the self-perception and exteroceptive modules of the mobile robotic agent Walling. Furthermore, the design and characterization process of the fuzzy controller for contour tracking was presented. This controller is part of the Magalhães navigation agent and implements the reactive behavior "follow walls". The controller went through navigation tests, allowing the robot to track walls located on its right side over 560 cm paths, and achieved safe and stable navigation in indoor work environments. The tests carried out in real office environments allowed us to obtain computational representations that largely adjust to the characteristics of the environment, reflecting its most relevant geometric properties.
e-dilade1
PWM-ag
PWM-fp
PWM-fm
PWM-fg
E-NPEQ1
E-NME1
E-NGR1
PWM-am
E-PME1
E-PGR1
PWM-z
PWM-ap
E-PPEQ1
E-ZER0
E-ZER1
e-dilade0
PWM-fm
PWM-fp
PWM-ag
PWM-am
PWM-ap
PWM-fp
E-PPEQ0
PWM-fm
PWM-ag
PWM-am
PWM-fp
PWM-fm
E-PME1
Table 4.3 Simplified FAM for the fuzzy navigation controller of the robot walling
PWM-ag
PWM-z
PWM-fp
PWM-fg
E-PGR1
PWM-fg
PWM-fm
PWN-fp
PWM-ag
PWM-ap
PWM-ap
E-NPRQ1
PWM-fm
PWM-fp
PWM-fm
PWM-ag
PWM-am
E-NME1
PWN-fg
PWMfm
PWNfm
PWM-ag
N-GNGR1
Fig. 4.14 Visual environment and “Adquisicion” mode of operation of the Vespucci mapping module. Source Authors
An index was calculated for the maps with a cell resolution of 5 cm, which corresponds to the average value of the maps produced during the tests. This value turned out to be quite close to unity, indicating that the computational representation approximates the real characteristics of the environment by 93.5%, thus verifying the consistency, reliability and robustness of the developed model. The model is not intended for the definition of metrics and costs of individual systems (such as the use of certain sensors) and interactive components (such as those related to communication). Learning and conflict resolution were not used. A starting point for addressing uncertainty management would be three prototypes: a functional one, where the characteristics of the work environment are specified; a modular description, where the nature of the required sensors is described; and finally the design, where the choice of a given sensor must depend on the nature of the required information. Future work will address the incorporation of new tasks and their required sensors, including "move to target" and "avoid obstacle" tasks, which will be combined through a suitable architecture. Through these tasks, more complex behaviors can be achieved and the mobile robotic platform can be used in service robotics and in search and rescue applications, where a priori maps of the working environment are not available.
References

Aboshosha A, Zell A (2003) Robust mapping and path planning for indoor robots based on sensor integration of sonar and a 2D laser range finder. In: IEEE 7th international conference on intelligent engineering systems
Acosta-Amaya GA, Acosta-Gil AF, Jimenez Builes JA (2020) Sistema robótico autónomo para la exploración y construcción de mapas en entornos estructurados. Investig e Innovación en Ing 8:69–84. https://doi.org/10.17081/invinno.8.1.3593
Acosta GA (2019) SLAM Monocular en tiempo real. Univ Nacional de Colombia, Medellín, Colombia
Acosta GA (2010) Ambiente multi-agente robótico para la navegación colaborativa en escenarios estructurados. Univ Nacional de Colombia, Medellín, Colombia
Ajeil FH, Ibraheem IK, Sahib MA, Humaidi AJ (2020) Multi-objective path planning of an autonomous mobile robot using hybrid PSO-MFB optimization algorithm. Appl Soft Comput J 89:106076. https://doi.org/10.1016/j.asoc.2020.106076
Al-Taharwa I, Sheta A, Al-Weshah M (2008) A mobile robot path planning using genetic algorithm in static environment. J Comput Sci 4:341–344. https://doi.org/10.3844/jcssp.2008.341.344
Andrade-Cetto J, Sanfeliu A (2001) Learning of dynamic environments by a mobile robot from stereo cues. In: IEEE international conference on multisensor fusion and integration for intelligent systems, pp 305–310
Betskov AV, Prokopyev IV, Ilinbaev AE (2019) Problem of cost function synthesis for mobile robot's trajectory and the network operator method for its solution. In: Procedia computer science. Elsevier B.V., pp 695–701
Bouhoune K, Yazid K, Boucherit MS, Chériti A (2017) Hybrid control of the three phase induction machine using artificial neural networks and fuzzy logic. Appl Soft Comput J 55:289–301. https://doi.org/10.1016/j.asoc.2017.01.048
Bozhinoski D, Di Ruscio D, Malavolta I et al (2019) Safety for mobile robotic system: a systematic mapping study from a software engineering perspective. J Syst Softw 151:150–179. https://doi.org/10.1016/j.jss.2019.02.021
Budianto A, Pangabidin R, Syai'In M, et al (2017) Analysis of artificial intelligence application using back propagation neural network and fuzzy logic controller on wall-following autonomous mobile robot. In: 2017 international symposium on electronics and smart devices, ISESD 2017. Institute of Electrical and Electronics Engineers Inc., pp 62–66
Darintsev OV, Yudintsev BS, Alekseev AY, et al (2019) Methods of a heterogeneous multi-agent robotic system group control. In: Procedia computer science. Elsevier B.V., pp 687–694
Dufourd D (2005) Des cartes combinatoires pour la construction automatique de modèles d'environnement par un robot mobile. Institut National Polytechnique de Toulouse, France
Habib MK (2007) Real time mapping and dynamic navigation for mobile robots. Int J Adv Robot Syst 4:35. https://doi.org/10.5772/5681
Islam N, Haseeb K, Almogren A et al (2020) A framework for topological based map building: a solution to autonomous robot navigation in smart cities. Futur Gener Comput Syst 111:644–653. https://doi.org/10.1016/j.future.2019.10.036
Jian Z, Qingyuan Z, Liying T (2020) Market revenue prediction and error analysis of products based on fuzzy logic and artificial intelligence algorithms. J Ambient Intell Humaniz Comput, 1–8
Labidi S, Lajouad W (2004) De l'intelligence artificielle distribuée aux systèmes multi-agents. INRIA, France
Li H, Savkin AV (2018) An algorithm for safe navigation of mobile robots by a sensor network in dynamic cluttered industrial environments. Robot Comput Integr Manuf 54:65–82. https://doi.org/10.1016/j.rcim.2018.05.008
McGuire KN, de Croon GCHE, Tuyls K (2019) A comparative study of bug algorithms for robot navigation. Rob Auton Syst 121:103261. https://doi.org/10.1016/j.robot.2019.103261
Minelli M, Panerati J, Kaufmann M et al (2020) Self-optimization of resilient topologies for fallible multi-robots. Rob Auton Syst 124:103384. https://doi.org/10.1016/j.robot.2019.103384
Shiguemi et al (2004) Simultaneous localization and map building by a mobile robot using sonar sensors. ABCM Symp Ser Mechatronics 1:115–223
Souza J (2002) Cooperação entre robôs aéreos e terrestres em tarefas baseadas em visão. In: Proceedings of SPG, Manila, Philippines, pp 135–147
Thrun S, Burgard W, Fox D (2000) Real-time algorithm for mobile robot mapping with applications to multi-robot and 3D mapping. In: Proceedings of the IEEE international conference on robotics and automation, pp 321–328
Tiwari K, Chong Y (2020) Simultaneous localization and mapping (SLAM). In: Multi-robot exploration for environmental monitoring. Elsevier, pp 31–38
Chapter 5
Production Analysis of the Beekeeping Chain in Vichada, Colombia. A System Dynamics Approach

Lizeth Castro-Mercado, Juan Carlos Osorio-Gómez, and Juan José Bravo-Bastidas

Abstract The purpose of this work is to model beekeeping production in the region of Vichada in Colombia. The beekeeping chain was chosen because it is a sector of great economic importance in this region, which has the highest indices of multidimensional poverty in Colombia but is also one of the places with the greatest conservation of its biodiversity. A system dynamics approach is used, starting from a causal diagram, to explain the interactions among bee rearing, wax production, honey production, and transformation; simulations were then performed to determine the behavior of inventories with respect to production and demand. This model highlights the dynamics of the system and the management of the supply chain, and it is presented as a useful tool to predict production-demand scenarios in the beekeeping sector, where similar studies are scarce. As future research, it is recommended to include the economic nature of the products in this kind of model, so that scenarios can be proposed to help beekeepers make production decisions according to demand and develop inventory policies.

Keywords Honey · Wax · System dynamics · Inventory
5.1 Introduction

The beekeeping supply chain involves the rearing of bees, the products derived from the hive, and their relationship with the demand. The science of caring for bees considers the production environment, with the availability of honey flora and the agents that can cause mortality in bees, as well as the productive purpose of the beekeeper, for which hives are used, which are made of wooden boxes with frames.
They contain wax cells built by the bees and are used to store food made of honey and pollen. All these elements combined within the hive are extracted and marketed by the beekeeper, and even the bees themselves are sold as biological packages, although a minimum population of bees is kept within the hive to avoid an imbalance. Vichada has about 95,940 forest hectares, within which plantations of Acacia mangium that provide nectar for bees can be found. In this region, the river Bita basin, with a Ramsar wetland declaration that preserves a natural landscape, contains 1,474 plant specimens belonging to 103 families, 278 genera, and 424 plant species, and at the same time it provides clean water and food for bees and an adequate environment for the development of the beekeeping chain. For regions with a tropical climate such as Vichada, honey is the main product of the hive. In particular, Vichada has an approximate production of 140,000 kg of honey with yields of 40 kg/hive-year. Beekeeping is prioritized as one of the productive bets of the region due to the apicultural potential given by the environmental conditions and botanical resources for the development of the activity (Castro 2018). However, nowadays this sector needs to analyze the bee chain to a greater extent, together with its products and innovative uses to generate value (Ministerio de Agricultura y Desarrollo Rural 2019). This coincides with the low competitiveness of bee honey production in Vichada, which is reflected in inappropriate existing inventories. Therefore, the intention is to evaluate honey production in terms of its dynamic demand, and it seems convenient to establish production systems that consider other kinds of products, such as bees for breeding and by-products derived from honey and wax. Wax has an interesting demand from beekeepers and mainly from the pharmaceutical industry, and bees themselves are in demand to recover degraded agroecological systems or to support agricultural production that requires the pollination service they provide. In this sense, system dynamics is presented as a useful tool to represent problems and provide solutions to a given situation, through a relationship of variables and the simulation of the built model. The aim is to develop a dynamic simulation model to evaluate beekeeping production in Vichada.
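To give a flavor of the stock-and-flow simulation pursued in this chapter, the sketch below integrates a single honey-inventory stock with monthly Euler steps. Only the 40 kg/hive-year yield comes from the text; the number of hives (chosen to be consistent with the approximate 140,000 kg/year regional production), the demand and the time step are illustrative assumptions.

```python
YIELD_KG_PER_HIVE_YEAR = 40.0   # from the text
HIVES = 3500                    # illustrative: 3500 hives * 40 kg/year = 140,000 kg/year
MONTHLY_DEMAND_KG = 10_000.0    # illustrative demand
DT_MONTHS = 1.0

inventory_kg = 0.0
for month in range(1, 13):
    production = HIVES * YIELD_KG_PER_HIVE_YEAR / 12.0          # inflow to the stock
    sales = min(MONTHLY_DEMAND_KG, inventory_kg + production)   # outflow limited by availability
    inventory_kg += (production - sales) * DT_MONTHS
    print(f"month {month:2d}: inventory = {inventory_kg:8.1f} kg")
```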
5.2 Literature Review

When evaluating the evolution of system dynamics (SD) as a scientific support tool, 10,988 scientific papers were found in Scopus in the period 1960–2020, and Fig. 5.1a shows the profile of the number of publications per year. The results show a growing trend, with an average of 587 documents per year in the last ten years, which indicates that system dynamics is an emerging and active research field used in various applications such as management, medicine, economics, government policies, energy and environment, and software engineering, among others. This demonstrates the importance of understanding complex systems through the feedback structures conceived in systems thinking (Angerhofer and Angelides 2000; Forrester 2007).
Fig. 5.1 Publication trend in the area of system dynamics (a), system dynamics in the supply chain (b), and (c) system dynamics in the food supply chain
In the period 2000–2020, 180 papers written in English were found associated with the use of system dynamics for the analysis of supply chains. Fig. 5.1b shows the profile of publications per year; the results show an exponential growth trend, with an annual average of 23 publications in that period. The modeling of supply chain management focuses on inventory decisions and policy development, time compression, demand amplification, supply chain design and integration, and international supply chain management (Angerhofer and Angelides 2000). When restricting the search to “food supply chain” (FSC), 15 papers were found (Fig. 5.1c), a very low number of annual publications, which shows the novelty of applying system dynamics to food supply chains. In addition to the traditional elements that make up a chain, these chains involve operational factors such as harvest seasons, food perishability, processing, climatic variability, product geometry, geography, transportation, food waste, location of facilities, governance problems, and unfair competition, which make them complex (Jan van der Goot et al. 2016; Jonkman et al. 2019). For a more detailed analysis of the last search, the data were exported in .CSV format from Scopus and processed with the VOSviewer 1.6.15 software. Three clusters are observed in Fig. 5.2: the green cluster associates SD with profitability and simulation, while the purple cluster associates SD with agriculture, costs, stochastic demand, systems theory, and stochastic systems in general. The foregoing is consistent with the need for food supply chains to improve their processes and to propose scenarios that allow decisions that improve their competitiveness (Mutanov et al. 2020). As for the leading countries publishing SD studies on food supply chains, Fig. 5.3 shows that China ranks first with five papers, Indonesia ranks second with three publications, and India ranks third with two. Regarding Latin America, publications are registered in Colombia. In general, FSCs face situations that require
Fig. 5.2 Visualization of network clusters of research topics in publications related to system dynamics with application in the food supply chain, period 2007–2020. Note The minimum number of occurrences of a keyword is two
Fig. 5.3 Countries with reported research on system dynamics in the food supply chain
analysis either through mathematical tools or through simulations, and these merit an evaluation to improve the performance of the chain. However, FSC diversity has not yet been fully addressed in the state of the art, which reveals interesting research opportunities in some areas. According to Table 5.1, SD is a method applied in the areas of risk management, logistics, and economic factors, and it mainly contributes to proposing scenarios involving inventories, transportation policies, prices, supply, and demand. Ramos-Hernández et al. (2016), with the help of a sensitivity analysis, found the values that would allow a company to improve its order fulfillment indicator and increase profits, assuming an expected demand generated by the introduction of a new product, vinasse, into the market. Susanty et al. (2020) applied the causal conceptualization of SD as a qualitative method and combined it with DEMATEL as a quantitative method to analyze the performance of the dairy chain. Wu et al. (2015) used the SISP model (Subjects, Indices, Standards and Phases of performance evaluation) and the ACSSN model (Aquatic Product, Customer, Supply Chain, Society and Node of companies in the supply chain) to evaluate the logistics performance of the aquatic products cold chain and then applied a system dynamics model to simulate the impact of temperature on profits in the sales section of the same chain. It is observed that the use of SD depends on the conceptualization of the problem and on the researcher's strategy when combining one or more methods that allow an adequate analysis of the problem. Regarding the beekeeping chain, only 12 articles were found, and some limitations were identified in the way authors approach system dynamics modeling in
Table 5.1 State of the art on system dynamics in the food supply chain

Author(s) | Product | Method | Key contributions | Application | Country
Rathore et al. (2020) | Foodgrains | System dynamics | Impact of risks in the foodgrains transportation system | Inventory and transportation policies | India
Napitupulu (2014) | Rice | System dynamics | Useful in predicting the economic and non-economic implications of both rice stock and price | Inventory and price | Indonesia
Ramos-Hernández et al. (2016) | Vinasse | System dynamics | Assessing the impact of a vinasse pilot plant scale-up on the key processes of the ethanol supply chain | Profits, demand, storage capacity | Mexico
Orjuela-Castro et al. (2017) | Mango | System dynamics | The model allows the study of the logistic performance, quality, costs and responsiveness of the mango supply chain | Supply, demand, transportation policies | Colombia
Rahmayanti et al. (2020) | Patchouli oil | System dynamics | Identify the inputs required to develop the patchouli oil agroindustry using dynamic systems | Demand, inventory | Indonesia
Susanty et al. (2020) | Dairy | System dynamics–DEMATEL | Formulate the right policies for improving the performance of the chain based on the success factor | Demand, dairy farmers | Indonesia
Wu et al. (2015) | Aquatic product | SISP model, ACSSN model and system dynamics | Simulate the impact of temperature on the profits in the aquatic products cold-chain sales section | Performance evaluation | China
Mota-López et al. (2019) | Water | System dynamics | Simulate the impact of water supply disruptions in bioethanol production | Demand, production policies | Mexico
Rendon et al. (2014) | Sugar cane, sorghum/bioethanol | System dynamics | The model explores scenarios for evaluating the availability of area for sowing sugarcane and grain sorghum crops and the production capacity for ethanol and fuel | Demand, production | Mexico
a beekeeping context. Carlevaro et al. (2004) present a conceptualization and simulation with system dynamics of the behavior of the Argentine honey chain with an export profile and analyze climatic, economic, and technological scenarios; however, they do not present a Forrester model that would allow readers to replicate it. Ward and Boynton (2009) used an econometric model to analyze honey demand and showed the impact of its generic promotion, and Russell et al. (2013) evaluated the factors that may have the greatest influence on the growth and survival of the colonies, but the products of the hive were not considered. Therefore, few system dynamics studies of bees or honey production are known in the Colombian and Vichada context that lead to a model considering the productive factors of the hive and its by-products, which would open the possibility of evaluating a wider range of decisions in the beekeeping chain. This aspect reflects the innovative component of this chapter.
5.3 Methodological Approach The proposed methodological approach is based on Aracil (1995) and Sterman (2000) and comprises a sequence of steps in which one can return to a previous step to fine-tune the model, as shown in Fig. 5.4. Feedback is one of the core concepts of system dynamics, yet our mental models often fail to include the critical feedbacks determining the dynamics of our systems. In system dynamics, several diagramming tools are used to capture the structure of systems, including causal loop diagrams and stock-and-flow maps (Sterman 2000). The case study is the beekeeping production system in the region of Vichada, Colombia. The conception and construction of the model were carried out in Vensim PLE.
5.3.1 Causal Diagram Causal loop diagrams (CLDs) are an important tool for representing the feedback structure of systems. Long used in academic work, and increasingly common in business, CLDs are excellent for quickly capturing the hypotheses about the causes of dynamics, eliciting and capturing the mental models of individuals or teams, and communicating the important feedbacks we believe are responsible for a problem (Sterman 2000). A causal diagram consists of variables connected by arrows denoting the causal influences among the variables (Sterman 2000). A positive link means that if the cause increases, the effect increases above what it would otherwise have been, and if the cause decreases, the effect decreases below what it would otherwise have been (Sterman 2000). A negative link means that if the cause increases, the effect decreases below what it would otherwise have been, and if the cause decreases, the effect increases above what it would otherwise have been (Sterman 2000).
Fig. 5.4 Methodological approach (Aracil 1995)
Fig. 5.5 presents the causal diagram that describes the cause-and-effect processes that originate the dynamic behavior of beekeeping contexts; it is based on the literature review and on the knowledge of beekeeping experts in Vichada. The arrows in the diagram link the causal elements, and the signs indicate the direction of the effect generated by the cause. Feedback loops, of a reinforcing or compensating nature, are also employed: Fig. 5.5 contains two reinforcement loops and three compensation loops. In the R1 loop, greater availability of honey plants increases the number of bees, which coincides with what was expressed by Martell et al. (2019). During the bees' visits to the plants in search of nectar, they carry out pollination, which in turn promotes the growth of more honey and nectariferous flora; that is, R1 has a double delay (Fig. 5.6). For the R2 loop, as wax production increases, honey production decreases.
Fig. 5.5 Causal diagram of beekeeping production
Fig. 5.6 Reinforcement loop R1 and R2
In addition to the nectar supply, the reproduction of bees is influenced by mortality (loop B1); if bee mortality is high, the bee population decreases significantly. The balancing loop B2 means that if the number of bees in a hive increases significantly, the beekeeper can add boxes vertically in the same hive for honey production, but at the same time this reduces the possibility of dividing colonies of bees for breeding (Jimenez 2017). However, the number of bees continues to increase (Fig. 5.7). Regarding the B3 loop, as the diversification of products made from honey increases, the inventory of honey decreases (Fig. 5.8). By having honey-producing hives, the hive yield per year increases, expressed in kg of honey per hive, which is mainly influenced by the harvest season (Medina et al. 2014). If the honey yield increases, so will the honey production in the hive.
Fig. 5.7 Balancing loops B1 and B2
Fig. 5.8 Balancing loop B3
Part of the honey may also be converted by the bees into wax; if there is demand for wax, wax sales increase and, therefore, the inventory of honey decreases. The honey produced can be extracted from the hive, and the resulting inventory is affected by sales, which are influenced by demand. The beekeeper can then follow a line of product diversification based on honey. In the context of Vichada, honey is used as raw material for the production of mead, an alcoholic beverage (Hernández et al. 2016); it is also used as a coating for cashew nuts and is marketed in doypack bags and in glass containers with dispensers, as a strategy to reduce the honey inventory.
5.3.2 Forrester Diagram Causal loop diagrams (CLDs) are clearly useful in many situations. However, CLDs suffer from a number of limitations and can easily be used in an inadequate way. One of the most important limitations of causal diagrams is their inability to capture the stock and flow structure of systems. Stocks and flows, along with feedback, are the two central concepts of dynamic systems theory (Sterman 2000). In this sense, Forrester diagrams allow the representation of stocks and flows. Stocks are accumulations. They characterize the state of the system and generate the information upon which decisions and actions are based. Stocks give systems inertia and provide them with memory. Stocks create delays by accumulating the difference between the inflow to a process and its outflow. By decoupling rates of flow, stocks are the source of disequilibrium dynamics in systems (Sterman 2000). The system dynamics model includes stock variables, flow variables, and auxiliary variables, represented in Fig. 5.9.
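To make the stock-and-flow logic concrete before the Forrester diagram is detailed, the following minimal Python sketch advances a single stock with Euler integration, the same numerical scheme Vensim uses by default. The flow names echo Table 5.2, but the initial value, the number of hives, and the extraction and wax-diversion rules are assumptions introduced only for illustration.

```python
DT = 1                      # time step, in months
MONTHS = 50                 # simulation horizon used in this chapter

honey_hive_inventory = 0.0  # stock, kg (assumed initial value)
number_of_hives = 100       # assumed apiary size
honey_hive_yield = 40 / 12  # kg of honey per hive-month, from 40 kg/hive-year

for _ in range(MONTHS):
    # Flows, in kg/month
    honey_production = honey_hive_yield * number_of_hives
    extraction = 0.8 * honey_production       # assumed extraction policy
    wax_production = 0.05 * honey_production  # assumed share diverted to wax

    # Stock update: stock(t + dt) = stock(t) + dt * (inflows - outflows)
    honey_hive_inventory += DT * (honey_production - extraction - wax_production)

print(round(honey_hive_inventory, 1))
```

Every stock in Fig. 5.9 is updated in exactly this way; the full model simply has more stocks and flows coupled through the equations in Table 5.2.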
5.3.2.1 Initial Operating Condition
We set an initial value for each variable (for example, the initial inventory level, bees for breeding, bees for honey, and others). In this work, the length of the simulation period is set to 50 months, and the time step is one month (Table 5.2).
Fig. 5.9 Forrester diagram of the beekeeping production in Vichada. a Bees for breeding and bees for honey; b honey hive inventory, virgin honey collection center, pasteurized honey, and diversified product
Table 5.2 Description and role of major variables used to model the beekeeping supply chain

Variable name | Type | Unit | Equation
Birth rate | Flow | Eggs/month | Eggs laid per month/incubation time
Sale of bees | Flow | Bee/month | IF THEN ELSE(decision >= 1, 0, IF THEN ELSE(demand bees < (Bees for breeding/half-life), MIN(demand bees, (Bees for breeding/half-life) - minimum number of bees for breeding), (Bees for breeding/half-life) - minimum number of bees for breeding))
Decision | Flow | Bee/month | (demand bees + discrepancy number of bees honey)/Bees for breeding
Bee feedback | Flow | Bee/month | IF THEN ELSE(decision > 1, 0, IF THEN ELSE(discrepancy number of bees honey < Bees for breeding, MIN(discrepancy number of bees honey, Bees for breeding), 0))
Discrepancy number of bees honey | Flow | Bee/month | ABS(minimum quantity bees strong honey colony - Bees for honey)
Death | Flow | Bee/month | Bees for honey/half-life
Bees for breeding | Stock | Bee | Reproduction - bees feedback - sale of bees
Bees for honey | Stock | Bee | Bees feedback + birth - death
Honey hive inventory | Stock | Kg | Honey production - extraction - wax production
Wax inventory | Stock | Kg | Wax production - wax delivery
Honey production | Flow | Kg/month | Honey hive yield*Number of hives
Virgin honey collection center inventory | Stock | Kg | Extraction - diversification - pasteurized honey
Diversified product inventory | Stock | Kg | Diversification - delivery of finished product
Pasteurized honey inventory | Stock | Kg | Pasteurized honey - delivery
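As an illustration of how the Vensim expressions in Table 5.2 translate into ordinary code, the sketch below transcribes the "Sale of bees" flow into a Python function. It is a direct reading of the published equation (IF THEN ELSE(cond, a, b) returns a when the condition holds and b otherwise); the argument names are ours.

```python
def sale_of_bees(decision, demand_bees, bees_for_breeding,
                 half_life, minimum_breeding_bees):
    """'Sale of bees' flow from Table 5.2, in bees/month."""
    available = bees_for_breeding / half_life
    if decision >= 1:
        return 0.0
    if demand_bees < available:
        return min(demand_bees, available - minimum_breeding_bees)
    return available - minimum_breeding_bees
```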
5.3.2.2 Data for the Model
In addition to the loops already presented, relevant information was gathered to test the model: a minimum number of bees for breeding was considered, with a weight of one kg of bees corresponding to between 10,000 and 12,000 bees; however, a strong and healthy colony has at least 5 kg of bees. The average incubation time of a bee in the breeding cell is 21 days, and it is estimated that, on average, a queen bee lays 500 eggs per day in the winter season and up to 1,500 eggs per day in summer. The half-life of the queen bee is three years, and that of the worker bees is up to 25 days in winter and 19 days in summer, which is the time when the flowers expel
nectar and provide pollen; this is why bees work harder in summer to collect and store food. In the harvest season, the beekeepers visit the hive and extract hive boxes with honey, but they should leave a minimum amount of honey as internal food for the bees, which is about five hive boxes with honey (six kg of honey), according to the Technical Director of Apícola de Inverbosques S.A.S, Manuel Bernal (Puerto Carreño-Vichada 2020). As for the conversion rate of honey into wax, eight kg of honey are needed to produce one kg of wax, and this conversion occurs in the glands of the abdomen of the bees (Monreal 2019). Although wax is an indispensable input for developing the beekeeping activity and therefore has a commercial demand, the demand for wax is not known with certainty, and in this chapter it is considered to follow a uniform distribution between 10 and 1,000 kg. For the demand for pasteurized honey, the Ministerio de Agricultura y Desarrollo Rural (2019) in Colombia has established a consumption of 0.83 g of honey per person, and we assume a normal distribution behavior of 6,500 to 7,778 kg of pasteurized honey per month. It is also taken into account that the quality of the extrafloral honey of Vichada is supported by studies of its physicochemical composition and its known sensory and bioactive attributes, and that the extraction process is carried out in compliance with good handling practices (Castro 2018). Concerning the diversification of products that use honey as raw material or input, we assume that a qualified workforce is available in the region, that equipment and inputs are available to transform the products, and that storage capacity is available for the processed products. The demand for diversified product is considered to follow a normal distribution behavior of 2,500 to 2,121 kg per month.
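The stochastic demands described above can be fed to the model as random draws. The sketch below shows one hedged way to do this in Python; because the chapter reports the normal-distribution demands only as a pair of values, reading that pair as an approximate range and deriving a mean and standard deviation from it is our assumption, not the authors' parameterization.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Monthly wax demand: uniform between 10 and 1,000 kg, as stated in the text.
wax_demand = rng.uniform(10, 1_000)

def normal_from_range(low, high, rng):
    """Assumed reading: the reported pair spans roughly +/- 2 standard deviations."""
    mean = (low + high) / 2
    std = abs(high - low) / 4
    return max(0.0, rng.normal(mean, std))

pasteurized_demand = normal_from_range(6_500, 7_778, rng)   # kg/month
diversified_demand = normal_from_range(2_500, 2_121, rng)   # kg/month, values as reported
print(round(wax_demand), round(pasteurized_demand), round(diversified_demand))
```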
5.4 Results The interpretation of the model expressed in the Forrester diagram is presented from left to right. A first scenario was analyzed that corresponds to the breeding of bees and to hives for honey in the summer season, when the half-life of the bees is 19 days because the plants are flowering and the worker bees leave the hive to perform “pecoreo” (foraging) work, a process carried out approximately from 4:30 am to 6:30 pm in Vichada. Likewise, the queen bee is considered to increase its egg laying. The demand for pasteurized honey per month is considered to follow a normal distribution behavior between 8,000 and 10,000 kg, and from the stock called “virgin honey collection center inventory” there is a flow of honey to be pasteurized and of honey to be transformed into other products (that is, diversification) in a ratio of 80:20. Scenario 2 is based on beekeeping behavior in the winter season, when the queen's laying decreases to 500 eggs per day and the half-life of the bees increases to 25 days. The demand for pasteurized honey per month is considered to increase, following a normal distribution behavior between 15,000 and 20,000 kg, and in this case the above-mentioned ratio changes to 95:5.
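For readers who want to reproduce the two runs, the seasonal settings described above can be collected into a small configuration structure like the one below. The dictionary keys are ours; the numerical values come directly from the scenario descriptions in this section.

```python
# Illustrative scenario configuration; key names are assumptions.
scenarios = {
    "scenario_1_summer": {
        "worker_half_life_days": 19,
        "eggs_per_day": 1_500,                           # "up to 1,500" per the text
        "pasteurized_demand_range_kg": (8_000, 10_000),
        "pasteurize_to_diversify_ratio": (80, 20),
    },
    "scenario_2_winter": {
        "worker_half_life_days": 25,
        "eggs_per_day": 500,
        "pasteurized_demand_range_kg": (15_000, 20_000),
        "pasteurize_to_diversify_ratio": (95, 5),
    },
}
```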
Fig. 5.10 Behavior at the stock of bees for breeding in two scenarios
Fig. 5.10 compares scenarios 1 and 2 for the “Bees for breeding” stock: there is a slight increase in bee reproduction until the eighth month, followed by a fluctuation that resembles typical inventory behavior. The population of breeding bees and honey bees tends to be lower in the winter (blue line) than in the summer (red line). In addition, because the worker bees have a longer half-life, the queen bee decreases its egg laying to avoid overpopulation in the hive, which is observed in the blue line as a more widely spaced fluctuation. Fig. 5.11 presents the inventory of bees used to produce honey. It is observed that honey hives have a stable bee population because they continually receive feedback from the inventory of bees for breeding. However, in scenario 1 the bee population fluctuates because bees are continually being born and dying. The death of bees can be due to attacks on the hives by animals such as the plain ocarro (armadillo), which consume bee larvae and honey, as well as to the burning of the Vichada savannahs, which reaches the forest plantations and causes the bees to escape (Asociación de Apicultores y Meliponicultores de Vichada, comunicación personal 2020). Other
Fig. 5.11 Behavior at the stock of bees for honey in two scenarios
Fig. 5.12 Behavior at the flow bee feedback in two scenarios
factors, considered in a lesser proportion based on the literature, are exposure to pesticides (Klein et al. 2017), malnutrition of the colony (Naug 2009; Montoya et al. 2016), and diseases and parasites (Potts et al. 2010), which are also important causes of bee mortality. In contrast, in scenario 2 the population of bees fluctuates less, preserving its status as a strong and healthy hive. Fig. 5.12 depicts the bee feedback flow coming out of the inventory of bees for breeding. In scenario 1 there is continuous feedback of bees in similar amounts, although there are some seasons when it is not necessary, while in scenario 2 there are maximum peaks of bee feedback. It is observed that the feedback occurs without affecting the minimum number of bees that must remain in a breeding hive. Fig. 5.13 shows the flow of bee sales: in the summer season there is a growing population that allows meeting the demand for bees requested by clients, while in winter it would not be possible to sell in certain months without affecting the minimum quantity of bees. In this flow of bee sales, the demand remains constant.
Fig. 5.13 Behavior at the flow sales of bees in the two scenarios
Fig. 5.14 Behavior in the wax inventory in the two scenarios
Fig. 5.14 presents the wax inventory for the two scenarios. In summer (scenario 1) the wax inventory is high and supplies the requested demand, while in winter the wax inventory decreases, possibly because the demand for pasteurized honey is higher; the beekeeper therefore prefers to sell honey rather than wax and also requires wax for the bees to build cells and prepare for the honey harvest season in summer. Fig. 5.15 shows, for the two scenarios, the behavior of the inventories of collected honey, pasteurized honey, and diversified product. In both scenarios, the virgin honey inventory fluctuates each month because there is an input from the apiaries and an output to the pasteurization and diversification processes. In scenario 1 (Fig. 5.15a) the pasteurization inventory builds up because the demand is not enough to keep it stable, and the same is true for diversification, where the product diversification order depends on the storage capacity of the inventory. In Fig. 5.15b, in contrast, an ideal behavior of all inventories is observed: as the demand for pasteurized honey increases, this inventory maintains lower values, and the need to diversify products is enough to supply its demand. When examining the inventory of virgin honey collected, it is observed that the graphs follow a typical inventory behavior, where the levels of scenario 1 are higher than those of scenario 2; this coincides with the harvest season and with a demand that is low compared with the high supply of honey. The opposite occurs in scenario 2, where the supply of honey decreases, activating or increasing the demand for this product. From Fig. 5.17, scenario 1 presents an accumulation of honey inventory that can have negative effects on the performance of the chain. Although honey is considered a non-perishable food, it can change its physicochemical composition if it is stored under inappropriate conditions, and it generates storage costs. Regarding scenario 2, an ideal inventory of pasteurized honey is observed.
Fig. 5.15 Inventory behavior of pasteurized honey, collection center honey, and diversified product inventory in scenario 1 (a) and scenario 2 (b)
Fig. 5.18 shows that the diversified product inventory behaves similarly to the inventory in Fig. 5.16; however, the limited storage capacity restricts the production of these honey derivatives.
5.5 Conclusions The production of bee honey and wax depends directly on the performance of the hive, which in turn is influenced by the half-life of the bees; this half-life is strongly related to the degree of health, the bee population, and the cleanliness of the environment where the bees are located.
Fig. 5.16 Virgin honey collection center inventory
Fig. 5.17 Pasteurized honey inventory
Fig. 5.18 Diversified product inventory
Product diversification is a strategy that helps reduce the virgin honey inventory when the demand for honey for direct consumption is low. As the demand for pasteurized honey varies, inventory changes occur in the collected honey center and in the diversified product, which implies changes in the production scheme according to the harvest season, the product supply, and the quantities required by the client. It is recommended to introduce into the analysis the production costs, sales prices, and profitability of each product analyzed in this beekeeping production model, in order to have a sustainable approach and to be able to make decisions regarding production.
Acknowledgements Lizeth Castro thanks the Ministerio de Ciencia, Tecnología e Innovación (Minciencias) for the funding received for doctoral training in Engineering, with emphasis in Industrial Engineering, at Universidad del Valle.
References
Angerhofer BJ, Angelides MC (2000) System dynamics modelling in supply chain management: research review. In: Winter simulation conference proceedings, pp 342–351
Aracil J (1995) Dinámica de Sistemas. Madrid, España
Carlevaro M, Quagllano J, Fernandez S CH (2004) Honey agri-food chain in Argentina: model and simulation
Castro L (2018) Evaluación de la composición, calidad y generación de valor de miel de abejas originaria de zonas forestales en la altillanura del departamento de Vichada. Universidad Nacional de Colombia
Forrester JW (2007) System dynamics—the next fifty years. Syst Dyn Rev 23:359–370. https://doi.org/10.1002/sdr.381
Hernández C, Blanco A, Quicazán M (2016) Establecimiento de las condiciones de la elaboración de hidromiel mediante diseño de experimentos. In: Memorias Encuentro Nac. de Investig. y Desarro
Jan van der Goot A, Pelgrom PJ, Berghout JA et al (2016) Concepts for further sustainable production of foods. https://doi.org/10.1016/j.jfoodeng.2015.07.010
Jimenez E (2017) Manejo y mantenimiento de colmenas. Mundi-Prensa, Madrid, España
Jonkman J, Barbosa-Póvoa AP, Bloemhof JM (2019) Integrating harvesting decisions in the design of agro-food supply chains. Eur J Oper Res 276:247–258. https://doi.org/10.1016/j.ejor.2018.12.024
Klein S, Cabirol A, Devaud JM, Barron AB, Lihoreau M (2017) Why bees are so vulnerable to environmental stressors. Trends Ecol Evol 32:268–278. https://doi.org/10.1016/j.tree.2016.12.009
Martell A, Lobato F, Landa M, Luna G, García L, Fernández G (2019) Variables de influencia para la producción de miel utilizando abejas Apis mellifera en la región de Misantla. Rev Mex Ciencias Agrícolas 10
Medina S, Portillo M, García C, Terrazas G, Nevárez A (2014) Influencia del ambiente sobre la productividad de la segunda cosecha de miel de abeja en Aguascalientes de 1998 a 2010. Rev Chapingo Ser Ciencias For Y Del Ambient 20:159–165
Ministerio de Agricultura y Desarrollo Rural (2019) Cifras sectoriales Cadena Apícola. Tercer Trimestre 2019. Bogotá, Colombia
Monreal B (2019) Productos de la colmena: cera, jalea real y veneno de abejas. BIOZ Rev Divulg UACB, p 4
Montoya P, Chamorro F, Nates G (2016) Apis mellifera como polinizador de cultivos en Colombia. In: Nates-Parra G (ed) Iniciativa Colombiana de Polinizadores: Abejas ICPA. Universida., Bogotá, Colombia, pp 95–110
Mota-López DR, Sánchez-Ramírez C, Alor-Hernández G et al (2019) Evaluation of the impact of water supply disruptions in bioethanol production. Comput Ind Eng 127:1068–1088. https://doi.org/10.1016/j.cie.2018.11.041
Mutanov G, Ziyadin S, Serikbekuly A (2020) View of application of system-dynamic modeling to improve distribution logistics processes in the supply chain. Communications 22
Napitupulu TA (2014) Agent based solution of system dynamics simulation modeling: a case of rice stock by the national logistic agency of Indonesia. J Theor Appl Inf Technol 62:762–768
Naug D (2009) Nutritional stress due to habitat loss may explain recent honeybee colony collapses. Biol Conserv 142:2369–2372. https://doi.org/10.1016/j.biocon.2009.04.007
Orjuela-Castro JA, Diaz Gamez GL, Bernal Celemín MP (2017) Model for logistics capacity in the perishable food supply chain. In: Communications in computer and information science. Springer, pp 225–237
Potts S, Biesmeijer J, Kremen C, Neumann P, Schweiger O, Kunin W (2010) Global pollinator declines: trends, impacts and drivers. Trends Ecol Evol 25:345–353. https://doi.org/10.1016/j.tree.2010.01.007
Rahmayanti D, Hadiguna RA, Santosa S, Nazir N (2020) Conceptualization of system dynamic for patchouli oil agroindustry development. Bus Strateg Dev 3:156–164. https://doi.org/10.1002/bsd2.85
Ramos-Hernández R, Mota-López DR, Sánchez-Ramírez C et al (2016) Assessing the impact of a vinasse pilot plant scale-up on the key processes of the ethanol supply chain. Math Probl Eng 2016. https://doi.org/10.1155/2016/3504682
Rathore R, Thakkar JJ, Jha JK (2020) Impact of risks in foodgrains transportation system: a system dynamics approach. Int J Prod Res 1–20. https://doi.org/10.1080/00207543.2020.1725683
Rendon M, Sanchez C, Cortes G, Alor G, Cedillo M (2014) Dynamic analysis of feasibility in ethanol supply chain for biofuel production in Mexico. Appl Energy 123:358–367
Russell S, Barron AB, Harris D (2013) Dynamic modelling of honey bee (Apis mellifera) colony growth and failure. Ecol Modell 265:158–169. https://doi.org/10.1016/j.ecolmodel.2013.06.005
Sterman J (2000) Business dynamics: systems thinking and modeling for a complex world. McGraw-Hill
Susanty A, Puspitasari NB, Prastawa H, Renaldi SV (2020) Exploring the best policy scenario plan for the dairy supply chain: a DEMATEL approach. J Model Manag. https://doi.org/10.1108/JM2-08-2019-0185
Ward R, Boynton B (2009) U.S. honey supply chain: structural change, promotions and the China connection. Int J Food Syst Dyn. https://doi.org/10.22004/ag.econ.91137
Wu W, Deng Y, Zhang M, Zhang Y (2015) Performance evaluation on aquatic product cold-chain logistics. J Ind Eng Manag 8:1746–1768. https://doi.org/10.3926/jiem.1784
Chapter 6
Effect of TPM and OEE on the Social Performance of Companies Adrián Salvador Morales-García, José Roberto Díaz-Reza, and Jorge Luis García-Alcaraz
Abstract This chapter reports a structural equation model that integrates three independent variables—total productive maintenance (TPM), just in time (JIT), and overall equipment effectiveness (OEE)—and the relationships they have with social sustainability as a dependent variable. The four variables are related through six hypotheses that are validated with information gathered from 239 questionnaires answered by executives working in the Mexican maquiladora industry. The partial least squares technique is used to statistically validate the relationships among the variables. Findings indicate that total productive maintenance has a strong impact on overall equipment effectiveness and just in time, and that the variables that most influence social sustainability are total productive maintenance and just in time. It is concluded that social sustainability can be obtained through the proper use and maintenance of the machines and with the timely fulfillment of production orders.
A. S. Morales-García · J. L. García-Alcaraz (B)
Department of Industrial Engineering and Manufacturing, Universidad Autónoma de Ciudad Juárez, Ciudad Juárez, Chihuahua, Mexico
e-mail: [email protected]
J. R. Díaz-Reza
Department of Electric Engineering and Computation, Universidad Autónoma de Ciudad Juárez, Ciudad Juárez, Chihuahua, Mexico
6.1 Introduction High competitiveness in the market drives industries to look for alternatives that make their processes more effective, both in terms of material savings and shorter cycle times; therefore, the lean manufacturing methodology strengthens companies through the implementation of its tools, with multiple benefits throughout the business structure. Lean manufacturing provides tools to identify the activities that generate value in processes, as well as those that do not add any value, which supports decision-making to optimize production processes and increase efficiency (Garza-Reyes et al. 2018); however, care must be taken when
deciding to remove some of these activities, since even though they do not add value, some are necessary to preserve the requirements and/or standards required for product development (Jimenez et al. 2019). The lean manufacturing methodology aims to meet the business objectives of the companies that choose to adopt its tools and provides multiple competitive advantages in the planning and management of production processes (Singh et al. 2020). These allow a better use of the available resources, since a large percentage of the waste generated in the transformation stages is eliminated and inventory levels—raw material, work in process, and finished product—are reduced; such inventories are considered by experts to be one of the largest types of waste (Purushothaman et al. 2020). One of the main precursors of this ideology was Mr. Sakichi Toyoda, who at the end of the nineteenth century sought to solve a problem that constantly occurred in his loom factory: the threads needed for manufacturing continuously tore apart without the operator being able to notice the event in time, which had serious consequences for product quality. Consequently, he designed a device that sends an alert when one of those threads breaks so that the operator can fix it and defects are not generated. Through this procedure he changed work that was originally performed manually into work done in a more automated way, a technique known as “Jidoka” (Romero et al. 2019).
6.1.1 JIT After Mr. Toyoda realized the usefulness of the Jidoka tool, he continued to look for alternatives to improve the productivity of his industry. As a result, he focused his attention on new ways of carrying out the processes, seeking to bring machines, people, and infrastructure together, with the main goal that, by working in the same direction, more value would be added to the processes, which would run flawlessly and without overproduction, since only the required and requested amount of demanded parts was manufactured. This new technique was called JIT (Just in Time) (Kim and Shin 2019). Carrying out a manufacturing process controlled by the JIT system involves several aspects that must be considered for it to succeed. This methodology is focused on the flow of materials, allowing low inventory levels of raw material, work in process, and finished product. This helps keep handling costs low and cash flowing constantly, as the time items remain in stock is significantly short, because it is determined by demand and delivery; the cycle from raw material, through the process, to the finished product delivered to the customer is short (Green et al. 2019). The JIT philosophy focuses on simplifying the complex aspects of the process, so it adopts a system that is simple to apply in order to eliminate activities that are not vital to the transformation of the product, thereby achieving efficiency in a more focused way. Its methodology ensures that the required materials are delivered to the right place for product development, at the right time and in the right volume (Alcaraz et al. 2014).
6.1.2 Overall Equipment Effectiveness (OEE) The overall equipment effectiveness (OEE) system has a significant impact on production planning because it indicates the quantity that can be manufactured in a desired period: this tool keeps track of the equipment availability needed to complete the transformation tasks and, in addition, provides information about the available performance, so that planning can be designed according to the delivery dates of the production orders and started on the right day to avoid delays (Heng et al. 2019).
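OEE is conventionally computed as the product of the availability, performance, and quality rates, the three factors this chapter returns to in Sect. 6.2. The snippet below only illustrates that standard formula; the numeric rates are assumed values, not data from the study.

```python
def overall_equipment_effectiveness(availability, performance, quality):
    """Standard OEE: product of the availability, performance and quality rates."""
    return availability * performance * quality

# Assumed example rates: 90% availability, 95% performance, 98% quality -> OEE ~ 0.84
print(overall_equipment_effectiveness(0.90, 0.95, 0.98))
```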
6.1.3 Total Productive Maintenance (TPM) Another aspect covered in this chapter is the impact of the total productive maintenance (TPM) methodology on business performance. Its main objective is to lower the likelihood of unforeseen failures and shutdowns of the machinery; consequently, a plan is established that contemplates preventive maintenance carried out by the operator, in order to avoid the time wasted by the department responsible for solving breakdowns, idle equipment time, and delays in delivery (Shen 2015). For this purpose, machine operators receive training to perform these activities appropriately and to be aware of the times at which the adjustments established by the manufacturer must be made. The relationship of TPM and JIT with the OEE system is very strong, since their activities are scheduled within the time periods considered by the other two methodologies: for JIT, TPM helps take into account the times actually available once the scheduled stoppages are subtracted, and with the OEE system the time periods that have to be met in JIT can be acknowledged (Sahoo and Yadav 2020; Guo and Huang 2020).
6.1.4 Social Benefits One of the main goals that companies pursue with the implementation of new strategies to make their operations leaner is to obtain economic, environmental, and social sustainability. Since the 1960s, the consequences of poor resource management were noticed, and the Organisation for Economic Co-operation and Development (OECD) was created to promote policies for economic and social preservation and growth (Mckenzie 2004). However, the social impact on industries was neglected for years, despite being very important for the performance of the company (Milanesi et al. 2020). Nevertheless, in recent years social sustainability has received much more consideration within industries, which have changed their systems in order to improve working conditions,
also helping to achieve a better and more effective use of resources. One of the main social aspects considered is the safety and health care of employees (Mani et al. 2015); Apple is one of the principal companies that has achieved the best results with the implementation of its program, taking into account the two aspects previously mentioned, and it has also become a good influence for other companies to give greater importance to social sustainability.
6.1.5 Chapter Objective Given the issue described above, the objective of this research focuses on the social sustainability benefits that can be achieved with three lean manufacturing tools (TPM, JIT, and OEE). This issue is rarely given the necessary importance within companies because its measurement is qualitative, with working conditions being one of its main goals; it has been shown that an improvement in the working environment leads to higher performance in manufacturing operations (Digalwar et al. 2019).
6.2 Hypothesis Once the objective of finding the relationships between the implementation of JIT, TPM, and OEE and social sustainability was defined, the analysis was divided into sections that allow each relationship to be examined individually, and six working hypotheses were proposed, one for each relationship among the factors analyzed. The following is the basis for the relationships established among the previously defined variables. Implementing the TPM and JIT systems has been shown to have a significant impact on the productivity of companies, because the relationship between them is strong: both tools focus on controlling time management. Specifically, TPM seeks to ensure that machines are always updated and running properly, without defects that may cause a large waste of time (Heravi et al. 2020), while JIT assumes in its planning that the machinery will be in good condition and available to carry out the scheduled production and to finish orders within the established time period (Abdallah and Matsui 2007; Iqbal et al. 2018). Another important aspect to emphasize is that, over time, the performance of each tool has been analyzed separately; however, it has been shown that there is a strong relationship between them due to their similar approaches, since one of their main objectives is to reduce time losses and lower operating costs (TPM regarding serious breakdowns and JIT regarding inventory management) (Heravi et al. 2019); both also focus on achieving a high delivery performance while always considering the quality of the product (Cua et al. 2001; Wang et al. 2017).
Another advantage obtained within the work environment with the implementation of both tools is that labor costs are reduced: the JIT methodology keeps low volumes of materials in the inventories, requiring fewer employees for deliveries, and, similarly, TPM lowers the number of corrective maintenance personnel, because the operators are already trained to perform regular maintenance and their philosophy helps them avoid failures (Al Mannai et al. 2017). After analyzing the behavior that both tools can have within the production processes, the following hypothesis can be proposed: H1
The TPM methodology implementation has a significant impact on the JIT performance
The relationship between TPM and the OEE system is strong because, when they are implemented, the availability achieved by the machinery can be very close to the operators' available time, since the machines are kept in the optimal conditions required to perform the product transformation processes. The main goal is to obtain the maximum performance in operations, but to achieve it the equipment has to be in perfect operational condition, and this is where the relationship of TPM with OEE matters most (Suryaprakash et al. 2020). An important aspect of this relationship is reflected in the formula used to estimate effectiveness: the aspects considered in the estimation are, first, the availability for production; second, the performance, used to determine how much of the total available time was actually converted into produced volume; and, finally, the production quality (Heng et al. 2019). Therefore, OEE analyzes the losses generated by the process; similarly, TPM contemplates the losses in a study known as “the six large losses”, the main theme in both tools being the pursuit of the highest productivity with the company's already installed capacity (Chikwendu et al. 2020). However, the implementation of both tools involves a set of important tasks, such as staff cooperation, a vital element for the success of the proposal; this cooperation is sometimes incomplete due to the employees' fear of losing their jobs. Despite this, once the training process has been carried out, positive results have been observed (Thorat and Mahesha 2020), both in the management, which achieves better planning based on machinery effectiveness, and in the operators' skills, which improve over time as they carry out the tasks (Gupta and Garg 2012). Therefore, once the relationship between the tools has been observed, the following hypothesis can be proposed: H2
The TPM methodology implementation has a significant impact on the OEE performance
Moreover, by combining the JIT system with other lean manufacturing tools, some benefits can be acquired. This section will address the relationship with the OEE system, which helps maintain an adequate rate of production (Muñoz-Villamizar
et al. 2018) and with TPM, which makes leaner operating times possible; together, their impact on the social sustainability of companies will be examined. The JIT and OEE systems together help keep production and inventory control at low levels, as the time products remain in each of the warehouses is short, since quantities are combined and used according to demand. OEE estimates the availability of the equipment which, according to the performance shown, allows estimating the time required to complete orders (Baghbani et al. 2019). Similarly, JIT, knowing the quantity demanded, requests the materials and manages the logistics of deliveries to each of the work areas, with the coordination of times being a priority, since only the quantities needed at the requested time are delivered (Chen and Bidanda 2019). One aspect that must be considered is that the OEE system is focused on keeping machinery effectiveness at a high level; therefore, the materials necessary for the transformation of the product have to be in stock. However, its methodology is not focused on that aspect, so the JIT system is required as a very important pillar in the logistics management of the available resources, bringing the necessary items to the machines so that the budgeted performance is achieved; otherwise, delays can have several consequences. This tool also helps control production and avoids overproduction by handling quantities according to demand, so it is crucial to implement both methodologies properly (Ng et al. 2014; Stamatis 2017). Once the theoretical relationship between the OEE and JIT tools has been observed, the following working hypothesis can be proposed: H3
The OEE methodology implementation has a direct and positive effect on the JIT performance
The relationship identified between the use of the TPM philosophy and social sustainability has been an important point for finding opportunities for employee improvement and well-being. The great technological advances recently made and their integration into production processes have generated a considerable reduction of breakdowns in machinery, and a reduction of potential injuries to operators has also been achieved, reflecting the importance of an appropriate implementation of this tool. On the organizational side, it has been shown to help identify production problems before they occur, increasing the well-being of operational staff (Longoni and Cagliano 2015). Therefore, after reviewing and analyzing the relationship of the TPM tool with social sustainability, the following working hypothesis is proposed: H4
The TPM methodology implementation has a direct and positive effect on the company Social Sustainability performance
Although the JIT system is focused on production processes and lower costs among other benefits, it also has a significant impact on social sustainability, both positively
and negatively, so this philosophy must be managed carefully for the impact to be positive: progress in reducing the process cycle time has been accompanied by increases in operators' workloads, which can lead to stress from overtime (Longoni and Cagliano 2015). Furthermore, a suitable way to combine the JIT philosophy with social sustainability is to schedule the production orders that operators must complete, as mentioned above, because large orders can bring stressful consequences. It has also been shown that this philosophy helps avoid manufacturing more than what is demanded; therefore, when orders do not require much production, work rhythms are low, and employees are in a better state of mind when they know the number of products that need to be manufactured and how much time it may take to do so, reducing the risk of extreme fatigue (Ciccullo et al. 2018). Once it has been observed how the JIT system operates within social sustainability, the following working hypothesis can be raised: H5
The JIT methodology implementation has a significant impact on the company Social Sustainability performance
Achieving sustainability within industries is an important challenge faced by those who have decided to improve their production processes, with social sustainability being an essential area for keeping employees in optimal working conditions. For this purpose, the OEE has been shown to bring good results by increasing the safety of operators, because it focuses on always keeping the equipment in good operating condition, which decreases the risk of stoppages and breakdowns that would force the staff to take action to solve the problems, thus significantly lowering the chances of work accidents (Wan Mahmood et al. 2015). Another benefit of combining both tools is that it creates among operators the peace of mind of having the machines in good operation to complete the production orders, which generates a better, stress-free work environment and the satisfaction of accomplishing high productivity, consistent with the results that can be achieved with a high performance of equipment and labor. In addition to the benefits mentioned previously, an important point for companies is that high competitiveness in the business market makes it easier to subsist over time; therefore, having a program that keeps the productivity of equipment and operators at high levels is a good way to achieve that goal. This combination keeps the equipment available and, in the same way, preserves the specialized workforce, since in case of accidents the employee must be replaced with another operator who probably does not have the same skills and experience as the one who performs the operations daily, which would result in a decrease in the effectiveness of the activities (Anderson et al. 2020). Once the relationship and behavior of implementing both philosophies in the productivity of industries has been observed, the following working hypothesis can be formulated:
Fig. 6.1 Proposed model and relationships among variables
H6
The OEE methodology implementation has a significant impact on the company Social Sustainability performance
Figure 6.1 graphically illustrates the previously proposed hypotheses and the relationships among the variables.
6.3 Methodology The methodology for validating the relationships among variables proposed in Fig. 6.1 is described.
6.3.1 Questionnaire Development In order to obtain information to statistically validate the hypotheses proposed in Fig. 6.1, a questionnaire was designed to capture empirical information about company executives' perspectives. For the development of the questionnaire, a literature review was made in different databases (ScienceDirect, Springer, Emerald Insight) with keywords for each of the lean manufacturing (LM) tools, such as: TPM, cellular manufacturing, Hoshin Kanri, Gemba, one piece flow, Jidoka, Poka-Yoke, Kanban, Heijunka, JIT, takt time, bottleneck analysis, OEE, and Kaizen. Similarly, a search was made for benefits such as economic, social, and environmental
sustainability benefits. From this literature review, a total of 87 items were obtained, distributed among 17 LM tools, together with 20 items addressing the three different types of benefits. Moreover, to answer each of the questions, a 5-point Likert scale was included, where 1 means totally disagree, 2 disagree, 3 neutral, 4 agree, and 5 totally agree; this makes it possible to know the extent to which the activities are carried out and the extent to which the benefits are obtained. In addition, a demographic section was added to the questionnaire asking about the type of company participants work at, years of experience, job position, and gender.
6.3.2 Questionnaire Application Once the questionnaire was designed, it was applied to company managers in different fields (automotive, electrical, electronic, and medical, among others) of the maquiladora industry in northern Mexico. It was decided to focus on this industrial sector because of its economic importance as one of the main industrial sectors in the region (García-Alcaraz et al. 2017). The participants who answered this questionnaire were personnel (managers, engineers, supervisors) who are involved in the production area and have been involved in the implementation of these tools. To contact people who could potentially answer the questionnaire, a personalized e-mail was sent to each participant to schedule an appointment to conduct the interview. If the established appointment was cancelled, a second e-mail was sent, and if after three e-mails there was no response, the case was excluded. To be eligible to answer the questionnaire, individuals had to have at least two years working in the company in the same job position, as well as have participated or be participating in the implementation of the previously mentioned tools. The questionnaire was applied from January to June 2020, and stratified sampling was used to identify the type of respondent required. In addition, the method known as the “snowball technique” was used, as participants were encouraged to share the survey with anyone they knew who could be a potential participant (Zhicun et al. 2020).
6.3.3 Registration and Data Debugging For the analysis and debugging of the collected data, a database was created in the SPSS 21® software, in which the latent variables and their respective items were integrated. After the database was created, the following was done (Hair et al. 2014): • The standard deviation of each questionnaire is calculated, with the aim of identifying the respondents' commitment when answering the questions. The minimum acceptable value of the standard deviation is 0.50; any questionnaire under this value was removed from the study.
Table 6.1 Questionnaire validation indexes

Index | Measurement | Suggested value
R2 | Predictive parametric validation | ≥0.02
Adjusted R2 | Predictive parametric validation | ≥0.02
Cronbach's alpha | Internal consistency | ≥0.70
Composite reliability | Internal consistency | ≥0.70
Average variance extracted | Discriminant validity | ≥0.50
Full collinearity VIF | Collinearity | ≤3.30
Q2 | Predictive nonparametric validation | ≥0.00 and similar to R2
• The missing values of each questionnaire are identified; if the percentage of missing values is over 10%, the questionnaire is eliminated; otherwise, those values are replaced by the median of the item. • Extreme values are identified by standardizing the item values; standardized values below −4 or above 4 are removed and replaced by the median.
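A compact way to express the three screening steps above is sketched below with pandas, assuming each row of the data frame is a questionnaire and each column an item. The thresholds (0.50 standard deviation, 10% missing values, |z| > 4) come from the text; everything else, including the function name, is illustrative.

```python
import pandas as pd

def debug_questionnaires(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. Drop uncommitted respondents: standard deviation of answers below 0.50.
    df = df[df.std(axis=1, skipna=True) >= 0.50]

    # 2. Drop questionnaires with more than 10% missing values; impute the rest
    #    with the median of each item.
    df = df[df.isna().mean(axis=1) <= 0.10]
    df = df.fillna(df.median())

    # 3. Replace extreme values (|z| > 4 after standardization) with the item median.
    z = (df - df.mean()) / df.std()
    medians = df.median()
    for col in df.columns:
        df.loc[z[col].abs() > 4, col] = medians[col]
    return df
```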
6.3.4 Questionnaire Validation Once the database has been debugged, the questionnaire is validated, that is, each of the latent variables is validated. To do this, the tests proposed by Kock (2017) were conducted. Predictive validity tests were performed from a parametric (R2 and adjusted R2) and a nonparametric (Q2) perspective. Internal consistency was assessed with the Cronbach's alpha (Cronbach and Meehl 1955) and the composite reliability, discriminant validity with the average variance extracted, and collinearity with the full collinearity VIF. Accordingly, the indexes shown in Table 6.1 were estimated: column one shows the different indexes, column two the validation type for each index, and column three the suggested value.
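Two of the indexes in Table 6.1 are simple enough to compute by hand, and the sketch below shows them for a single latent variable, taking a respondents-by-items matrix of Likert scores and a vector of standardized loadings. This is only a reference implementation of the textbook formulas, not the exact routine used by the authors' software.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha (rows = respondents, columns = items); compare with >= 0.70."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

def average_variance_extracted(loadings: np.ndarray) -> float:
    """AVE as the mean of the squared standardized loadings; compare with >= 0.50."""
    return float(np.mean(np.asarray(loadings) ** 2))
```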
6.3.5 Structural Equation Model If the latent variables have been successfully validated, the following step is to relate them using a structural equation model (SEM) to test the hypotheses proposed in
Fig. 6.1. SEM is chosen because it provides explicit estimations of error variance parameters, incorporating unobserved (that is, latent) and observed variables; in addition, it is able to model multivariate relationships and to estimate direct and indirect effects of the variables under study (Blach et al. 2017). SEM methods provide a statistical adjustment to causal model data consisting of unobserved variables (Westland 2015; Mendoza-Fong et al. 2019), integrating factorial analysis and path analysis to establish, estimate, and test the causality of the model (Blunch 2012). The model presented in this chapter was evaluated in the WarpPLS 7.0® software (ScriptWarp Systems, Laredo, TX, USA), which is based on partial least squares (PLS) and recommended in cases lacking normality or where values are expressed in ordinal scales (Kock 2018, 2019). According to Hair et al. (2014), partial least squares can be calculated from Eqs. 6.1 and 6.2:

X = T Pᵀ + E    (6.1)
Y = U Qᵀ + F    (6.2)

where:
• X is an n×m matrix of independent latent variables.
• Y is a matrix of dependent latent variables.
• T is a projection matrix of X (X scores).
• U is a projection matrix of Y (Y scores).
• P is an m×l matrix of orthogonal loadings.
• Q is a p×l matrix of orthogonal loadings.
• E and F are the errors after the calculation.

PLS is ultimately about generating a regression equation for Y based on X, that is,

Y = βX    (6.3)

A dependent variable may be explained by several independent variables, which is why the previous equation is generalized into Eq. 6.4:

Y = β₁X₁ + β₂X₂ + … + βₙXₙ    (6.4)
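As a minimal, hedged illustration of the decomposition in Eqs. 6.1–6.4, the sketch below fits a partial least squares regression with scikit-learn on synthetic data. It is not the WarpPLS estimation itself; the data, dimensions, and variable names are hypothetical.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(239, 7))                               # e.g., standardized TPM items (hypothetical)
Y = 0.5 * X[:, :1] + rng.normal(scale=0.5, size=(239, 1))   # a dependent construct (hypothetical)

pls = PLSRegression(n_components=2)
pls.fit(X, Y)

T = pls.x_scores_     # projections of X (Eq. 6.1)
U = pls.y_scores_     # projections of Y (Eq. 6.2)
P = pls.x_loadings_   # loadings of X
Q = pls.y_loadings_   # loadings of Y
beta = pls.coef_      # regression coefficients linking X to Y (Eqs. 6.3-6.4)

print(T.shape, P.shape, beta.shape)
```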
The model was tested with a 95% confidence level, indicating that the p-value must be under 0.05 for the hypotheses to be statistically significant. Before interpreting the model, it is important to analyze some quality and efficiency indexes shown in Table 6.2, which have been proposed by Kock (2017).
Table 6.2 Model fit and quality indexes

Index                              Measurement                Validation
Average path coefficient (APC)     Average path coefficient   P < 0.05
Average R2 (ARS)                   Predictive validity        P < 0.05
Average adjusted R2 (AARS)         Predictive validity        P < 0.05
Average block VIF (AVIF)           Multicollinearity          Acceptable if ≤ 5

6.3.5.1 Sensitivity Analysis
In addition, a sensitivity analysis is reported, based on the standardized values of the latent variables, estimating:
• The probability that a latent variable occurs at a high level, P(Z > 1), that is, that the activities within a latent variable are performed above the average, and the probability that a latent variable occurs at a low level, P(Z < −1), that is, that the activities within a latent variable are performed below the average.
• The probability of finding both latent variables of each hypothesis simultaneously at their high or low levels (joint probability, represented by the & symbol), that is, the following combinations:
a. P(Zd > 1) & P(Zi > 1);
b. P(Zd > 1) & P(Zi < −1);
c. P(Zd < −1) & P(Zi > 1);
d. P(Zi < −1) & P(Zd < −1).
• The conditional probability of finding the dependent latent variable (Zd) at a high or low level given that the independent latent variable (Zi) has occurred at a high or low level (represented by the | symbol):
a. P(Zd > 1) | P(Zi > 1);
b. P(Zd > 1) | P(Zi < −1);
c. P(Zd < −1) | P(Zi > 1);
d. P(Zi < −1) | P(Zd < −1).
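A minimal sketch of how these probabilities can be estimated from standardized latent variable scores is shown below. The scores and variable names are hypothetical; the thresholds follow the ±1 standard deviation rule described above.

```python
import numpy as np

def sensitivity_probabilities(zi: np.ndarray, zd: np.ndarray) -> dict:
    """zi, zd: standardized scores of the independent and dependent latent variables."""
    hi_i, lo_i = zi > 1, zi < -1
    hi_d, lo_d = zd > 1, zd < -1
    return {
        "P(Zi+)": hi_i.mean(),
        "P(Zd+)": hi_d.mean(),
        "P(Zd+ & Zi+)": (hi_d & hi_i).mean(),               # joint probability
        "P(Zd+ | Zi+)": (hi_d & hi_i).sum() / hi_i.sum(),   # conditional probabilities
        "P(Zd+ | Zi-)": (hi_d & lo_i).sum() / lo_i.sum(),
        "P(Zd- | Zi-)": (lo_d & lo_i).sum() / lo_i.sum(),
    }

# Hypothetical standardized scores, e.g., TPM (independent) and OEE (dependent):
rng = np.random.default_rng(1)
tpm = rng.normal(size=239)
oee = 0.7 * tpm + rng.normal(scale=0.7, size=239)
print(sensitivity_probabilities(tpm, oee))
```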
6.4 Results 6.4.1 Description of the Sample The questionnaire was applied from January to June 2020, and a total of 239 analyzable cases were collected, since the observed variables were answered in full, although in some cases questions from the demographic section were left unanswered. Table 6.3 shows a cross-table that analyzes gender and years of experience in the job position held by each of the respondents. It is observed that only 211 respondents provided a complete answer to both variables, while 28 participants omitted at least one of them. Of the participants who answered the questionnaire, 69 were women and 142 were men. Regarding the years of experience, the highest frequency is in the range from one to two years, with a total of 92 cases, followed by those with two to five years, with a total of 54 cases. This shows that the respondents have enough experience to answer the survey given the time they have been working in their current job position. Table 6.4 shows the number of employees in the company and the position held by the respondent. A total of 227 people answered both questions, while 12 omitted at least one of them. It is observed that only 59 respondents come from companies with fewer than 300 employees; therefore, the information comes mainly from medium and large companies. Likewise, 126 of the respondents are engineers who are in charge of several supervisors, who in turn are in charge of production-line technicians and operators. Consequently, it is concluded that all respondents hold positions directly involved in managing the production system.

Table 6.3 Gender versus years of experience

Years of experience   Female   Male   Total
1–2                   30       62     92
2–5                   17       37     54
5–10                  10       22     32
>10                   12       21     33
Total                 69       142    211
Table 6.4 Number of employees versus job position

Job position   0–49   50–299   300–999   1000–4999   5000–9999   >10,000   Total
Manager        2      2        6         15          1           4         30
Engineer       14     18       30        42          12          10        126
Supervisor     12     11       11        28          6           3         71
Total          28     31       47        85          19          17        227
Table 6.5 Validation of latent variables

Index                        TPM     OEE     JIT     SS
R2                                   0.498   0.491   0.560
Adj. R2                              0.496   0.486   0.554
Composite reliability        0.920   0.874   0.882   0.953
Cronbach's alpha             0.899   0.820   0.831   0.941
Average variance extracted   0.623   0.582   0.601   0.773
Full Collin. VIF             2.429   2.432   2.101   2.149
Q2                                   0.499   0.492   0.559
6.4.2 Validation of Latent Variables As illustrated in Fig. 6.1, the model integrates four latent variables, and Table 6.5 reports the validation indexes obtained. The values obtained for R2 and adjusted R2 allow us to conclude that the model has sufficient parametric predictive validity, while the composite reliability index and Cronbach's alpha indicate sufficient internal consistency; likewise, the values of the variance inflation factors allow us to conclude that there are no collinearity problems among the observed variables in each latent variable. Finally, the Q2 values are very similar to R2 and above zero, which indicates sufficient non-parametric predictive validity.
6.4.3 Descriptive Analysis of Variables Table 6.6 reports the median as a measure of central tendency and the interquartile range as a measure of data dispersion. Regarding total productive maintenance, the most important aspect for the respondents is dedicating periodic inspections to keep machines in operation (TPM2), since it has the highest median and the lowest interquartile range. Concerning just-in-time (JIT), the most important variable is that the machinery works uniformly according to schedule (JIT4). Regarding the overall equipment efficiency index, the most relevant variable is that the personnel of the production system are highly efficient and trained on each of their machines to carry out their tasks. Finally, regarding the social benefits that can be obtained from implementing TPM and JIT while seeking high equipment efficiency rates, the most important variable is the improvement of operators' safety in their workplace.
Table 6.6 Descriptive analysis of the items (median / interquartile range)

TPM1. We ensure machines are always in a high state of readiness for production: 3.757 / 1.755
TPM2. We dedicate regular inspections to keep machines running appropriately: 3.851 / 1.627
TPM3. We have a sound daily maintenance system to prevent machine failure: 3.328 / 2.041
TPM4. We scrupulously clean workspaces (including machines and equipment) so that unusual events become noticeable: 3.700 / 1.940
TPM5. We have a time set aside each day for maintenance activities: 3.653 / 2.124
TPM6. Operators are trained to keep their own machines running appropriately: 3.462 / 2.181
TPM7. We highlight a good maintenance system as a strategy to achieve quality compliance: 3.709 / 1.802
JIT1. Is the finished product inventory constantly rotating?: 3.691 / 1.821
JIT2. Does raw material inventory have strictly what is necessary?: 3.560 / 1.940
JIT3. Are production orders delivered in the estimated time?: 3.703 / 1.717
JIT4. Does the machinery work uniformly on schedule?: 3.762 / 1.624
JIT5. Is there no waste in the production process?: 3.191 / 1.652
OEE1. Is the production department highly productive?: 3.750 / 1.626
OEE2. Do you have knowledge about the percentage of productivity of each production area?: 3.702 / 1.802
OEE3. Is the production goal achieved in the stipulated time and does it always have the required quality?: 3.662 / 1.709
OEE4. Is there no waste in the production process?: 3.017 / 2.078
OEE5. Are there no major stoppages in the production process?: 3.331 / 1.779
SS1. Improvement in working conditions: 3.848 / 1.605
SS2. Improved safety in the workplace: 4.018 / 1.605
SS3. Employees' health improved: 3.892 / 1.629
SS4. Improvement of labor relations: 3.769 / 1.727
SS5. Morale improved: 3.703 / 1.854
SS6. Decrease in working pressure: 3.401 / 1.793
6.4.4 Structural Equation Model Since the latent variables have sufficient validity, they are integrated into the model and evaluated. Figure 6.2 illustrates the evaluated model, which presents the following efficiency indexes:
Fig. 6.2 Evaluated model
• Average path coefficient (APC) = 0.386, P < 0.001
• Average R-squared (ARS) = 0.517, P < 0.001
• Average adjusted R-squared (AARS) = 0.512, P < 0.001
• Average block VIF (AVIF) = 2.052, acceptable if ≤ 5
• Average full collinearity VIF (AFVIF) = 2.278, acceptable if ≤ 5
• Tenenhaus GoF (GoF) = 0.577 (small ≥ 0.1, medium ≥ 0.25, large ≥ 0.36)
The previous information shows that the evaluated model has sufficient predictive validity, since the average R2 and average adjusted R2 have associated p-values under 0.05, and the average path coefficient is acceptable. Similarly, there are no collinearity problems within the latent variables or among them, because the corresponding indexes are under 5. Finally, the model has an adequate fit, since the Tenenhaus GoF index is over 0.36.
6.4.4.1 Direct Effects
The direct effects and their coefficients presented in Fig. 6.2 allow conclusions to be drawn regarding the proposed hypotheses, which are expressed as follows: H1. There is enough statistical evidence to declare that TPM has a direct and positive effect on JIT, since when the first variable increases its standard deviation by one unit, the second variable increases by 0.33 units.
Table 6.7 Direct effects contribution

       TPM     OEE     JIT     R2
OEE    0.498                   0.498
JIT    0.212   0.270           0.482
SS     0.210   0.151   0.199   0.560
H2. There is enough statistical evidence to declare that TPM has a direct and positive effect on the OEE index, since when the first variable increases its standard deviation by one unit, the second variable increases by 0.71 units.
H3. There is enough statistical evidence to declare that the OEE index has a direct and positive effect on JIT, since when the first variable increases its standard deviation by one unit, the second variable increases by 0.42 units.
H4. There is enough statistical evidence to declare that TPM has a direct and positive effect on the company's social sustainability, since when the first variable increases its standard deviation by one unit, the second variable increases by 0.32 units.
H5. There is enough statistical evidence to declare that JIT has a direct and positive effect on the company's social sustainability, because when the first variable increases its standard deviation by one unit, the second variable increases by 0.31 units.
H6. There is enough statistical evidence to declare that the OEE index has a direct and positive effect on the company's social sustainability, since when the first variable increases its standard deviation by one unit, the second variable increases by 0.23 units.
6.4.4.2 Contributions of Direct Effects to Variability
The latent dependent variables illustrated in Fig. 6.2 are associated with their R2 value, while Table 6.7 shows the contribution that each of the direct effects from the independent variables makes to that R2 value. For example, for SS, the R2 value is equal to 0.560, of which TPM contributes 0.210, OEE 0.151, and JIT 0.199. Thus, TPM and JIT are two of the most important variables for explaining the benefits of the company's social sustainability.
6.4.4.3 Sum of Indirect Effects
According to Fig. 6.2, TPM has several indirect effects on the company's social sustainability through the mediating variables OEE and JIT. In this case, three indirect effects are identified among the latent variables, which contribute to explaining the variability of the dependent variable and denote the importance of the mediating variables. The indirect effects are the following: TPM has an indirect effect on JIT of 0.299, which occurs through OEE; TPM has an indirect effect on the company's social sustainability of 0.358, which occurs through three paths or segments; and OEE has an indirect effect on the company's social sustainability of 0.129, which occurs through JIT.
Table 6.8 Total effects

       TPM                             OEE                             JIT
OEE    0.706 (P < 0.001) ES = 0.498
JIT    0.633 (P < 0.001) ES = 0.401    0.424 (P < 0.001) ES = 0.279
SS     0.674 (P < 0.001) ES = 0.449    0.362 (P < 0.001) ES = 0.235    0.305 (P < 0.001) ES = 0.199
6.4.4.4 Total Effects
The sum of the direct and indirect effects gives the total effects illustrated in Table 6.8 for the model in Fig. 6.2, together with the associated p-values and the effect sizes (ES) that measure the explained variability. The most important total effects are the relationship between TPM and OEE, with 0.706, which corresponds to a direct effect only, and the relationship between TPM and SS, which includes a direct effect plus the sum of indirect effects, giving a total of 0.674. In the case of the relationships that have indirect effects, since these were all positive, the value of the total effects confirms the strong relationship among the variables and strengthens the hypotheses, since in all cases the total effect increased.
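As a rough numerical check, the indirect and total effects can be recomputed from the direct path coefficients reported in Fig. 6.2. The short sketch below does this; the rounded coefficients are taken from the figure as reported, so small rounding differences with Table 6.8 are expected.

```python
# Direct (standardized) path coefficients as reported in Fig. 6.2
b_tpm_oee, b_tpm_jit, b_tpm_ss = 0.706, 0.330, 0.320
b_oee_jit, b_oee_ss = 0.424, 0.230
b_jit_ss = 0.305

# Indirect effects = products of the coefficients along each mediating path
ind_tpm_jit = b_tpm_oee * b_oee_jit                 # TPM -> OEE -> JIT
ind_oee_ss = b_oee_jit * b_jit_ss                   # OEE -> JIT -> SS
ind_tpm_ss = (b_tpm_oee * b_oee_ss                  # TPM -> OEE -> SS
              + b_tpm_jit * b_jit_ss                # TPM -> JIT -> SS
              + b_tpm_oee * b_oee_jit * b_jit_ss)   # TPM -> OEE -> JIT -> SS

# Total effects = direct + indirect
print(round(ind_tpm_jit, 3), round(b_tpm_jit + ind_tpm_jit, 3))  # ~0.299 and ~0.629
print(round(ind_tpm_ss, 3), round(b_tpm_ss + ind_tpm_ss, 3))     # ~0.354 and ~0.674
print(round(ind_oee_ss, 3), round(b_oee_ss + ind_oee_ss, 3))     # ~0.129 and ~0.359
```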
6.4.5 Sensitivity Analysis Table 6.9 illustrates the probabilities of certain scenarios occurring when the latent variables are at high or low levels. Typically, two situations are analyzed: the probability that the analyzed variables occur simultaneously at given levels, and the conditional probability that the dependent variable occurs at a certain level given that the independent variable has occurred at another level. For example, the probability that TPM and OEE occur simultaneously at their high levels is 0.086, but the probability that OEE occurs at its high level given that TPM has occurred at its high level is 0.576, a high probability. In other words, executing the TPM plans and programs greatly increases the likelihood of a high OEE index. Likewise, the probability of obtaining high levels of OEE given low levels of TPM is only 0.086, a relatively low value.
Table 6.9 Sensitivity analysis (joint probability "&" and conditional probability "If")

                          TPM + (high)   TPM − (low)   OEE + (high)   OEE − (low)   JIT + (high)   JIT − (low)
                          P = 0.150      P = 0.850     P = 0.159      P = 0.841     P = 0.168      P = 0.832
OEE + (high) P = 0.159    & = 0.086      & = 0.073
                          If = 0.576     If = 0.086
OEE − (low)  P = 0.841    & = 0.064      & = 0.777
                          If = 0.424     If = 0.914
JIT + (high) P = 0.168    & = 0.086      & = 0.082     & = 0.086      & = 0.082
                          If = 0.576     If = 0.096    If = 0.543     If = 0.097
JIT − (low)  P = 0.832    & = 0.082      & = 0.768     & = 0.073      & = 0.759
                          If = 0.096     If = 0.904    If = 0.457     If = 0.903
SS + (high)  P = 0.155    & = 0.064      & = 0.091     & = 0.055      & = 0.100     & = 0.050      & = 0.105
                          If = 0.424     If = 0.107    If = 0.343     If = 0.074    If = 0.297     If = 0.126
SS − (low)   P = 0.845    & = 0.086      & = 0.759     & = 0.105      & = 0.741     & = 0.118      & = 0.727
                          If = 0.576     If = 0.893    If = 0.657     If = 0.881    If = 0.703     If = 0.874
Similar interpretations can be made for the other relationships among the latent variables analyzed in the model in Fig. 6.2.
6.5 Conclusions Four latent variables related to total productive maintenance, just-in-time systems, the overall efficiency index of the available equipment, and the company's social sustainability have been analyzed. They have been related through six hypotheses, and it has been concluded that all of them are accepted. The results indicate that total productive maintenance can be the basis for achieving greater social sustainability in companies, having as mediating variables the just-in-time methodology and the efficiency and availability rates of the equipment. This is mainly because TPM has the benefit not only of increasing the availability of tools and equipment, but also of preserving operators' health and improving their work environment. From the analysis of Fig. 6.2, the following is concluded:
• The strongest relationship is the one between total productive maintenance and equipment efficiency, with a value of 0.71 that is almost double all the others. This indicates that an adequate maintenance program will always allow equipment to be available with high efficiency.
• The just-in-time variable is explained by the efficiency of the equipment and by total productive maintenance; however, the efficiency of the equipment has a greater effect in this case and is therefore the variable with the greatest explanatory
power. That is, for just-in-time it is more important that there is high availability of equipment, since this allows a smoother flow of the production system and meeting delivery times.
• For the social sustainability variable, which is explained by the other three variables, total productive maintenance and just-in-time are the most important variables, with the greatest explanatory capacity. This is a result of the total productive maintenance programs, which lead to a decrease in employees' accidents when handling equipment and machinery, while employees feel satisfied with the on-time fulfillment of production plans and programs.
References Abdallah AB, Matsui Y (2007) JIT and TPM: Their relationship and impact on JIT and competitive performances. In: Conference of the International Decision Sciences Institute (DSI). Bangkok, Thailand Al Mannai B, Suliman S, Al Alawai Y (2017) An investigation into the effects of the application of TQM, TPM and JIT on performance of industry in Bahrain. Int J Ind Eng Res Dev 8. https:// doi.org/10.34218/ijierd.8.1.2017.002 Alcaraz JLG, Maldonado AA, Iniesta AA et al (2014) A systematic review/survey for JIT implementation: Mexican maquiladoras as case study. Comput Ind 65:761–773. https://doi.org/10.1016/j. compind.2014.02.013 Anderson JA, Glaser J, Glotzer SC (2020) HOOMD-blue: a Python package for high-performance molecular dynamics and hard particle Monte Carlo simulations. Comput Mater Sci 173:109363. https://doi.org/10.1016/j.commatsci.2019.109363 Baghbani M, Iranzadeh S, Bagherzadeh khajeh M (2019) Investigating the relationship between RPN parameters in fuzzy PFMEA and OEE in a sugar factory. J Loss Prev Process Ind 60:221–232. https://doi.org/10.1016/j.jlp.2019.05.003 Blach S, Zeuzem S, Manns M et al (2017) Global prevalence and genotype distribution of hepatitis C virus infection in 2015: a modelling study. Lancet Gastroenterol Hepatol 2:161–176. https:// doi.org/10.1016/S2468-1253(16)30181-9 Blunch N (2012) Introduction to structural equation modeling using IBM SPSS statistics and AMOS. Sage Chen Z, Bidanda B (2019) Sustainable manufacturing production-inventory decision of multiple factories with JIT logistics, component recovery and emission control. Transp Res Part E Logist Transp Rev 128:356–383. https://doi.org/10.1016/j.tre.2019.06.013 Chikwendu OC, Chima AS, Edith MC (2020) The optimization of overall equipment effectiveness factors in a pharmaceutical company. Heliyon 6:e03796. https://doi.org/10.1016/j.heliyon.2020. e03796 Ciccullo F, Pero M, Caridi M et al (2018) Integrating the environmental and social sustainability pillars into the lean and agile supply chain management paradigms: a literature review and future research directions. J Clean Prod 172:2336–2350 Cronbach LJ, Meehl PE (1955) Construct validity in psychological tests. Psychol Bull 52:281–302. https://doi.org/10.1037/h0040957 Cua KO, McKone KE, Schroeder RG (2001) Relationships between implementation of TQM, JIT, and TPM and manufacturing performance. J Oper Manag 19:675–694. https://doi.org/10.1016/ S0272-6963(01)00066-3 Digalwar AK, Dambhare S, Saraswat S (2019) Social sustainability assessment framework for indian manufacturing industry. In: Materials today: proceedings. Elsevier, pp 591–598
Farooq MS, Salam M, Fayolle A et al (2018) Impact of service quality on customer satisfaction in Malaysia airlines: a PLS-SEM approach. J Air Transp Manag 67:169–180. https://doi.org/10. 1016/j.jairtraman.2017.12.008 García-Alcaraz J, Maldonado-Macías A, Alor-Hernández G, Sánchez-Ramírez C (2017) The impact of information and communication technologies (ICT) on agility, operating, and economical performance of supply chain. Adv Prod Eng Manag 12:29–40 Garza-Reyes JA, Kumar V, Chaikittisilp S, Tan KH (2018) The effect of lean methods and tools on the environmental performance of manufacturing organisations. Int J Prod Econ 200:170–180. https://doi.org/10.1016/j.ijpe.2018.03.030 Green KW, Inman RA, Sower VE, Zelbst PJ (2019) Impact of JIT, TQM and green supply chain practices on environmental sustainability. J Manuf Technol Manag 30:26–47. https://doi.org/10. 1108/JMTM-01-2018-0015 Guo F, Huang B (2020) A mutual information-based Variational Autoencoder for robust JIT soft sensing with abnormal observations. Chemom Intell Lab Syst 204:104118. https://doi.org/10. 1016/j.chemolab.2020.104118 Gupta A, Garg RK (2012) OEE improvement by TPM implementation: a case study. Undefined Hair JF, Ringle CM, Sarstedt M (2014) Corrigendum to “Editorial partial least squares structural equation modeling: Rigorous applications, better results and higher acceptance” [LRP, 46, 1–2, (2013), 1–12], https://doi.org/10.1016/j.lrp.2013.01.001. Long Range Plann. 47:392 Heng Z, Aiping L, Liyun X, Moroni G (2019) Automatic estimate of OEE considering uncertainty. In: Procedia CIRP. Elsevier, pp 630–635 Heravi G, Kebria MF, Rostami M (2019) Integrating the production and the erection processes of pre-fabricated steel frames in building projects using phased lean management. Eng Constr Archit Manag. https://doi.org/10.1108/ECAM-03-2019-0133 Heravi G, Rostami M, Kebria MF (2020) Energy consumption and carbon emissions assessment of integrated production and erection of buildings’ pre-fabricated steel frames using lean techniques. J Clean Prod 253:120045. https://doi.org/10.1016/j.jclepro.2020.120045 Iqbal T, Huq F, Bhutta MKS (2018) Agile manufacturing relationship building with TQM, JIT, and firm performance: an exploratory study in apparel export industry of Pakistan. Int J Prod Econ 203:24–37. https://doi.org/10.1016/j.ijpe.2018.05.033 Jimenez G, Santos G, Sá JC, et al (2019) Improvement of productivity and quality in the value chain through lean manufacturing—a case study. In: Procedia manufacturing. Elsevier, pp 882–889 Kim SC, Shin KS (2019) Negotiation model for optimal replenishment planning considering defects under the VMI and JIT environment. Asian J Shipp Logist 35:147–153. https://doi.org/10.1016/ j.ajsl.2019.09.003 Kock N (2017) WarpPLS user manual: version 6.0 Kock N (2018) WarpPLS 6.0 user manual. Laredo, TX Kock N (2019) WITHDRAWN: factor-based structural equation modeling with WarpPLS. Australas Mark J. https://doi.org/10.1016/j.ausmj.2018.12.002 Kock N (2020) WarpPLS user manual: version 7.0 Longoni A, Cagliano R (2015) Cross-functional executive involvement and worker involvement in lean manufacturing and sustainability alignment. Int J Oper Prod Manag 35:1332–1358. https:// doi.org/10.1108/IJOPM-02-2015-0113 Mani V, Agrawal R, Sharma V (2015) Supply chain social sustainability: a comparative case analysis in Indian manufacturing industries. Procedia Soc Behav Sci 189:234–251. https://doi.org/10. 
1016/j.sbspro.2015.03.219 Mckenzie S (2004) Social sustainability: towards some definitions Mendoza-Fong JR, García-Alcaraz JL, Avelar Sosa L, Díaz Reza JR (2019) Effect of green attributes in obtaining benefits in the manufacturing and marketing process, pp 46–72 Milanesi M, Runfola A, Guercini S (2020) Pharmaceutical industry riding the wave of sustainability: review and opportunities for future research. J Clean Prod 261:121204 Mueller RO (1996) Linear regression and classical path analysis. Springer, New York, pp 1–61
Muñoz-Villamizar A, Santos J, Montoya-Torres JR, Jaca C (2018) Using OEE to evaluate the effectiveness of urban freight transportation systems: a case study. Int J Prod Econ 197:232–242. https://doi.org/10.1016/j.ijpe.2018.01.011 Ng KC, Chong KE, Goh GGG (2014) Improving overall equipment effectiveness (OEE) through the six sigma methodology in a semiconductor firm: a case study. In: IEEE international conference on industrial engineering and engineering management. In: IEEE computer society, pp 833–837 Purushothaman M babu, Seadon J, Moore D (2020) Waste reduction using lean tools in a multicultural environment. J Clean Prod 265:121681. https://doi.org/10.1016/j.jclepro.2020. 121681 Romero D, Gaiardelli P, Powell D et al (2019) Rethinking jidoka systems under automation & learning perspectives in the digital lean manufacturing world. IFAC-PapersOnLine 52:899–903. https://doi.org/10.1016/j.ifacol.2019.11.309 Sahoo S, Yadav S (2020) Influences of TPM and TQM practices on performance of engineering product and component manufacturers. In: Procedia manufacturing. Elsevier, pp 728–735 Shen CC (2015) Discussion on key successful factors of TPM in enterprises. J Appl Res Technol 13:425–427. https://doi.org/10.1016/j.jart.2015.05.002 Singh C, Singh D, Khamba JS (2020) Understanding the key performance parameters of green lean performance in manufacturing industries. Mater Today Proc. https://doi.org/10.1016/j.matpr. 2020.06.328 Stamatis DH (2017) The OEE primer: understanding overall equipment effectiveness, reliability, and maintainability. CRC Press Suryaprakash M, Gomathi Prabha M, Yuvaraja M, Rishi Revanth RV (2020) Improvement of overall equipment effectiveness of machining centre using TPM. Mater Today Proc. https://doi.org/10. 1016/j.matpr.2020.02.820 Thorat R, Mahesha GT (2020) Improvement in productivity through TPM implementation. In: Materials today: proceedings. Elsevier, pp 1508–1517 Tofighi D, MacKinnon DP (2016) Monte Carlo confidence intervals for complex functions of indirect effects. Struct Equ Model 23:194–205. https://doi.org/10.1080/10705511.2015.1057284 Ubøe J (2017) Conditional probability, pp 55–74 Wan Mahmood WH, Abdullah I, Md Fauadi MHF (2015) Translating OEE measure into manufacturing sustainability. Appl Mech Mater 761:555–559. https://doi.org/10.4028/www.scientific. net/amm.761.555 Wang H, Gong Q, Wang S (2017) Information processing structures and decision making delays in MRP and JIT. Int J Prod Econ 188:41–49. https://doi.org/10.1016/j.ijpe.2017.03.016 Westland JC (2015) A brief history of structural equation models. In: Studies in systems, decision and control. Springer, pp 9–22 Zhicun X, Meng D, Lifeng W (2020) Evaluating the effect of sample length on forecasting validity of FGM(1,1). Alexandria Eng J. https://doi.org/10.1016/j.aej.2020.08.026
Chapter 7
ENERMONGRID: Intelligent Energy Monitoring, Visualization and Fraud Detection for Smart Grids
Miguel Lagares-Lemos, Yuliana Perez-Gallardo, Angel Lagares-Lemos, and Juan Miguel Gómez-Berbís
Abstract The current obsolete electricity network is being transformed into an advanced, digitalized and more efficient one known as the Smart Grid. The deployment of an Automatic Metering Infrastructure (AMI) will make an unseen quantity of rich information available in near real-time, processed to make decisions for the optimal energy production, generation, distribution, and consumption. This document presents an analysis of the ENERMONGRID tool, a tool used for intelligent energy monitoring, data visualization and fraud detection in electric networks.
Keywords ENERMONGRID: intelligent energy monitoring · Visualization and fraud detection for smart grids
7.1 Introduction Today's electrical grids managed by utility companies are complex systems with many different aspects that require expertise to operate successfully, such as grid management, data visualization, load prediction, loss prediction, and fraud prevention. The complexity of this domain is not only architectural, due to the many different devices communicating and interoperating across the network; there is also a logical complexity, given that vast amounts of data must be properly managed in order to optimize the efficiency of the electric network.
M. Lagares-Lemos (B) · Y. Perez-Gallardo · A. Lagares-Lemos · J. M. Gómez-Berbís Department of Computer Science, Universidad Carlos III de Madrid, Madrid, Spain e-mail: [email protected] Y. Perez-Gallardo e-mail: [email protected] A. Lagares-Lemos e-mail: [email protected] J. M. Gómez-Berbís e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. A. Zapata-Cortes et al. (eds.), New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Studies in Computational Intelligence 966, https://doi.org/10.1007/978-3-030-71115-3_7
Due to this difficulty and the many problems that can arise when dealing with energy load estimates, loss estimates, as well as fraud detection and prevention, the ENERMONGRID tool was developed to aid the entities in charge of managing an electric network, with a focus on Smart Grids. This document is divided into several parts. The following section describes previous work related to this project. After that, the ENERMONGRID system is analyzed, detailing specific parts such as its architecture and the elements involved, the restrictions and considerations that were taken into account when developing and testing the system, the anomalies that were found in the electric grid once the system was put into operation, the KPIs used to measure efficiency, as well as the number of alerts that were fired during the analysis. Once the system has been explained, the findings are summarized in the final section.
7.2 Related Work According to studies by the Galvin Electricity Initiative, in the United States Smart Grid technologies will lower the costs of power supply and reduce the need for massive infrastructure investment for at least the next twenty years by providing a larger-capacity electric grid. On the environmental side, countries show great interest in developing policies and regulations that encourage social awareness of the consequences of greenhouse gases. Part of the problem lies in the fuel used by traditional power generation plants during demand peaks, which force the activation of special plants to supply those additional energy requirements (Basso et al. 2009). These plants are used only during these periods, with the resulting cost overruns, which have a direct impact on bills. A very significant fact: in the United States, a developed country, 40% of carbon dioxide emissions come from electricity generation, while only 20% are caused by transport. This presents a huge challenge for the electricity industry in terms of global climate change. According to the National Renewable Energy Laboratory (NREL), "The utilities are pressured on many fronts to adopt business practices that respond to environmental concerns in the world" (Energy 2020). There are currently many parallel activities related to the standardization of Smart Grid networks. Since these activities are relevant to the same topic, some overlap and duplication among them is inevitable (Andrés et al. 2011). There are several development and standardization agencies, among them: IEC Smart Grid Strategy Group. The International Electrotechnical Commission (IEC) is the natural focal point for the electrical industry. It aims to provide a unique reference source for the many Smart Grid projects being implemented around the world. It has developed a framework for standardization that includes protocols and reference standards to achieve interoperability of Smart Grid systems and devices (Czarnecki and Dietze 2017).
National Institute of Standards and Technology (NIST). It is not a standardization body, but it has been designated by the government of the United States to manage the project of selecting a set of standards for the Smart Grid network in that country (Mowbray 2013). EU Commission Task Force for Smart Grids. Its mission is to assist the Commission with guidelines for European regulation and to coordinate the first steps towards the implementation of Smart Grids under the provisions of the third energy package (García Jiménez et al. 2010). IEEE P2030. It is an IEEE working group developing a guide for the interoperability of Smart Grids in the operation of energy technologies and information technology with the electrical power system (EPS) and the loads and end-user applications. Many demonstration projects are currently underway, and some results are available. The most representative initiatives in the field of Smart Grids are found in the United States, Europe, Japan and China.
7.3 ENERMONGRID The ENERMONGRID tool is used for intelligent energy monitoring, data visualization and fraud detection in electric networks. It is responsible for collecting data from the meters and transformation centers of the measurement system deployed in the metering infrastructure, also known as AMI, which stands for Advanced Metering Infrastructure (Cleveland 2008). This information is processed through the MDM (Meter Data Management) module, which collects, consolidates, and manages it. The MDM module, among other tasks, also provides security through the anonymization of data, as well as advanced security measures; offers information to external systems through the communications module, via REST web services; manages the topological, cartographic and electrical information of the network; and preprocesses and filters data relevant to the system. The ENERMONGRID tool, via algorithms developed by the different project members, calculates estimates, energy losses in the network, predictions, as well as energy balances. These results are treated as new reports, which have been designed for the project following the STG standard.
7.3.1 Architecture The system architecture is depicted in Fig. 7.1. The information flow of the architecture involves three main elements: the meters, the transformation centers, and the backoffice. The role of each of these elements will be briefly described.
Fig. 7.1 System architecture
7.3.1.1 Meters
A meter is the device typically used to gather information regarding the energy consumption inside a household or another location that consumes electrical energy. Most meters are analog, but companies and technological progress have made it possible for these meters to become electronic devices in their own right, coining the term "Smart Meter". Smart Meters are especially important in the context and domain of Smart Grids and Smart Cities. The role of the meter is to record the consumption of a location at fixed time intervals. Smart Meters can perform this data-gathering duty over non-traditional methods such as OTA (over the air) by establishing a wireless connection with smart appliances. If this is not the case, however, a more standard approach is followed, which is simply to take measurements at the power lines feeding the household or location. Once the meter has the information regarding energy consumption, it sends this information to a Transformation Center. This communication can be achieved via several different communication protocols, such as PLC, Ethernet, WiFi, or Zigbee.
7.3.1.2 Transformation Centers
The transformation center (referred to as TC in the rest of this document) is the element in charge of aggregating the data obtained from a group of meters and then relaying it downwards towards the backoffice. A single transformation center obtains information from many meters, so the relationship is a 1-N relationship. Transformation centers can be seen as the intermediate step tasked with grouping many small measurements, taken at meters and sent over low-throughput communication protocols, and then sending large amounts of data through higher-throughput communication protocols, allowing the data to reach its ultimate destination.
Fig. 7.2 Energy flow of transformation centers
The communication protocols used to relay the data from the transformation center to the backoffice are typically BPC, Ethernet, GPRS/UMTS, WiMAX, or ADSL (Gordon and de Bucs 2000). The ENERMONGRID tool allows the energy flow between transformation centers to be visualized, as shown in Fig. 7.2.
7.3.1.3 Backoffice
In this system architecture, the backoffice is the destination of the data gathered from the meters. Once the data flows from the meters to the transformation centers, where it is aggregated, and finally from the transformation centers to the backoffice, the utility company is free to process the consumption data from each meter.
7.3.2 Types of Data and Reports For ENERMONGRID to work the way it does and allow data prediction, visualization, and fraud prevention, it must first have access to the original reports obtained from electric utility services. A flow chart of the different types of reports is presented in the following Fig. 7.3, and will be explained in detail. In the first level of the data tree, we have the original reports. These are reports provided by the utility companies. These reports contain hourly data with information
Fig. 7.3 Types of reports and data dependencies
regarding the consumption observed by a specific meter. Each company has its own way of reporting this information, so two different formats are considered: the S02 file and the S05 file. On the next level, we have the estimation layer. This layer takes as input the original reports provided by the utility companies and processes the information in a way that allows estimates of future values to be created. This process provides a preliminary view of how the data might evolve. On the third level, we have the process in charge of predicting the load curves of the system as well as potential losses. This sets the thresholds for sampled values, allowing the detection of possible fraud-like activities in the network. Finally, the last process involves the calculation of energy balances, using the previous layer's output as a starting point. It should also be noted that the tool internally processes real data (neither estimated nor predicted) through its algorithms in order to obtain real energy balance measurements, which can be compared with the estimates and predictions, and it includes algorithms for the generation of alerts when certain conditions are met. Through its web interface the tool gives access to all the information it handles: the monitoring and control of the network, the status of the various reports, the information of each report, cartographic information, consumption, voltages and currents in nodes and stretches, power flows in sections, energy balances, alerts generated by the system, etc. In Fig. 7.4, we can observe an interface which analyzes different metrics and KPIs of the various types of reports detailed above. In this case, the S02, S352, S332 and S362 files are analyzed.
Fig. 7.4 Analysis of energy balances
7.3.3 Considerations and Restrictions In this section, some considerations and restrictions that have been present during the energy analysis process using the ENERMONGRID tool will be detailed.
7.3.3.1 Samples with a Quality Bit Other Than '00' Have Not Been Considered
Samples with a quality bit other than '00' have, however, been counted when calculating the reading rate. This choice has a side effect: in certain scenarios, when many of the meter samples arrive without the quality bit at '00' at a certain moment and are therefore not taken into account, the total sum can produce very low values, yielding losses greater than those that actually exist in the network and greater than those that would be obtained if the quality bit were taken into account. An example of this can be seen in Fig. 7.5. The opposite situation can also occur, as in the example in which one of the supervisors of TC6, having values of Bc other than '00', is not taken into account, giving lower values for the sum of the supervisors compared to the sum of the meters and resulting in negative losses, or in other words, apparent profits (Fig. 7.6).
Fig. 7.5 Side effect of discarding measurements with quality bit other than ‘00’
Fig. 7.6 Apparent profits
7.3.3.2 Meters that Did Not Have a Registered Location in the System Have Not Been Considered in the Analysis
The values obtained from meters not associated with a location were only used to adjust the reading rate value. This choice arises because the estimator and the loss calculation algorithm do not take them into account; therefore, in order to compare with the real values, they have also been discarded inside the algorithms. Given this condition, the consumption of the sum of meters is lower, which apparently increases the difference with the value of the supervisor and therefore yields higher values of total losses.
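A minimal sketch of these two filtering rules is shown below. It assumes hourly samples are loaded into a pandas DataFrame with hypothetical column names (`meter_id`, `quality_bit`, `location_id`); the real S02 parsing and MDM logic are not reproduced here.

```python
import pandas as pd

def filter_samples(samples: pd.DataFrame, expected_samples: int) -> tuple[pd.DataFrame, float]:
    """Apply the restrictions of Sect. 7.3.3 and return the usable samples and the reading rate."""
    # The reading rate counts every received sample, whatever its quality bit.
    reading_rate = len(samples) / expected_samples if expected_samples else 0.0

    # 1. Discard samples whose quality bit is different from '00' for the energy calculations.
    usable = samples[samples["quality_bit"] == "00"]

    # 2. Discard meters without a registered location in the system.
    usable = usable[usable["location_id"].notna()]
    return usable, reading_rate

# Hypothetical usage:
# hourly = pd.read_csv("s02_samples.csv")
# usable, rate = filter_samples(hourly, expected_samples=24 * hourly["meter_id"].nunique())
```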
7.3.4 Real-Time Data Anomalies In the following section, the real-time data anomalies that were observed during the analysis period of the ENERMONGRID tool will be detailed.
7.3.4.1 Supervisor Values Set to 0
There are certain scenarios (11.99% of cases in the UTLA network and 1.58% for UTLB for the entire project) in which the supervisor shows consumptions equal to
0, even though there is consumption in the meters associated with them. All these values are associated with quality bits other than '00' for UTLB, but not for UTLA, where in some cases, though not the majority, the quality bit is set to '00'. If we analyze this per TC for the UTLA network, in TC3 and TC4, whenever the supervisor gives a null value, the quality bit is marked other than '00'. This is not the case for TC1 and TC2. It should also be noted that this situation has not occurred since April 2, 2014.
7.3.4.2 Abnormally High Values
In some cases, abnormally high values have been detected in some meters, which caused the balance to produce erroneous calculations and prevented the detailed rendering of the data visualization graphs containing the balances of the TCs. Twenty-nine abnormally high values have been detected over the entire duration of the project for the UTLA network.
7.3.4.3 Negative Losses at Specific Moments
There are certain times when the total losses are negative, that is, the consumption recorded by the supervisor minus the sum of the consumption recorded by all the meters associated with that supervisor (or TC) is less than 0. The following tables show the number of moments (hourly moments, since they have been calculated through S02) where this circumstance occurs for the TCs of the UTLA and UTLB networks (Figs. 7.7 and 7.8).

                                                              CT1    CT2    CT3   CT4
Negative losses moments                                       3587   3182   33    2657
Moments negative losses without counting values Bc != '00'    3588   3085   31    1722
Moments negative losses with all the perfect data             0      2988   0     953

Fig. 7.7 Data from TC1 to TC4

                                                              CT5    CT6     CT7
Negative losses moments                                       158    11575   433
Moments negative losses without counting values Bc != '00'    158    11564   425
Moments negative losses with all the perfect data             0      10024   105

Fig. 7.8 Data from TC5 to TC7
The first row represents the totality of moments where total negative losses are recorded, which considering the number of records taken, represent 0.20% of cases for UTLA and 0.18% for UTLB. The second row shows the moments with negative losses when values with a quality bit other than ‘00’ are not taken into account. Its value is slightly lower, which does not seem to have a relevant effect and simply reduces the number of moments due to the fact that these data are not taken into account in the sum. The last row represents those moments with negative losses when all the data coming from the network have the quality bit at ‘00’. This anomaly is still occurring (except in TC5), which should be studied further to determine the cause (meters or supervisors that measure incorrectly, meters that do not belong to this TC and are transmitting values to it).
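A small sketch of how these negative-loss moments can be detected from hourly series is shown below. It assumes two aligned pandas Series (supervisor consumption and summed meter consumption per hour) and a parallel quality-bit mask, all with hypothetical names.

```python
import pandas as pd

def negative_loss_moments(supervisor_kwh: pd.Series,
                          meters_sum_kwh: pd.Series,
                          all_bits_ok: pd.Series) -> dict:
    """Count hourly moments with negative total losses for one TC (illustrative sketch)."""
    losses = supervisor_kwh - meters_sum_kwh        # total losses per hour
    negative = losses < 0
    return {
        "negative_loss_moments": int(negative.sum()),
        # Moments where losses are negative even though every sample has quality bit '00'
        "negative_with_all_perfect_data": int((negative & all_bits_ok).sum()),
        "share_of_records": float(negative.mean()),
    }

# Hypothetical usage with hourly indexed series:
# result = negative_loss_moments(sup_series, meters_series, bits_ok_series)
```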
7.3.5 Reading Rates of Original Reports In this section, the different reading rates obtained from analyzing reports across the different layers are detailed.
7.3.5.1 UTLA
The reading rates obtained for the hourly reports (S02) are shown in the following table for the different TCs during the month of September (Fig. 7.9). As can be observed, the reading rate is around 80% in September. The lowest rate is found in TC3, given that it experienced technical communication problems during these dates, and the maximum values belong to TC4, which coincides with the TC that does not have any meter of industrial or commercial type (type 3 or 4), although this should not have a direct relationship, because these are not taken into account to establish the reading rate. If we do not consider TC3 due to its breakdown, the total rate is around 89%. Considering that TC3 has suffered a breakdown, Figs. 7.10 and 7.11 show what the ENERMONGRID tool provides in terms of data visualization for a fully operational TC, namely TC1, and for the faulty TC3.
             CT1       CT2       CT3       CT4
September    88.16 %   79.73 %   10.42 %   95.52 %

Fig. 7.9 Reading rates of September
Fig. 7.10 Reading rate of TC1
Fig. 7.11 Reading rate of TC3
7.3.5.2 UTLB
The reading rates obtained for the hourly reports (S02) are shown in the following table for the different TCs during the months of August and September (Fig. 7.12). As can be observed, the reading rate is around 47% in September, while in August it was 63%. Analyzing the availability of reports in September (Fig. 7.13), we see that the lower percentage during September is due to the lack of availability of reports on certain days.
             CT5       CT6       CT7
August       95.10 %   88.41 %   45.07 %
September    69.91 %   65.29 %   34.61 %

Fig. 7.12 UTLB readings
Fig. 7.13 Availability of reports
7.3.6 KPIs of Energy Balances
7.3.6.1 UTLA
The UTLA values corresponding to the energy balances for September 2014 are shown below (since the reading rate is better in that month) (van der Meijden et al. 2010). The KPIs are based on the real hourly data (S02), except for the technical losses, which are obtained from the estimated data. These data can be consulted through the ENERMONGRID tool (Fig. 7.14). As can be observed, the non-technical losses are high, especially in TC1 and TC2, which have type 3 and type 4 meters; since these meters are not tele-managed and registered in the system, they virtually increase the non-technical losses. This can be seen in the following figure, which shows the supervisor's curve (in purple) and the sum of the meters (blue), in addition to the technical losses (red). The curve of the supervisor is far above the consumption reported by the meters.
Fig. 7.14 UTLA analysis
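The non-technical loss KPI discussed here can be sketched as a simple hourly balance: supervisor energy minus the sum of the meters minus the estimated technical losses. The example below is illustrative only; the series names are hypothetical and the real tool computes these balances from the S02/S352-type reports.

```python
import pandas as pd

def energy_balance(supervisor_kwh: pd.Series,
                   meters_sum_kwh: pd.Series,
                   technical_losses_kwh: pd.Series) -> pd.DataFrame:
    """Hourly energy balance of one TC (illustrative sketch)."""
    total_losses = supervisor_kwh - meters_sum_kwh
    non_technical = total_losses - technical_losses_kwh
    return pd.DataFrame({
        "supervisor": supervisor_kwh,
        "meters_sum": meters_sum_kwh,
        "technical_losses": technical_losses_kwh,
        "non_technical_losses": non_technical,
        # Share of the supervised energy that remains unaccounted for
        "non_technical_pct": 100 * non_technical / supervisor_kwh,
    })
```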
7.3.6.2 UTLB
The UTLB values corresponding to the energy balances for August 2014 are shown below (since the reading rate is better in that month). The KPIs are based on the real hourly data (S02), except for the technical losses, which are obtained from the estimated data. These data can be consulted through the ENERMONGRID tool (Fig. 7.15). As can be observed, the non-technical losses are more moderate than in the UTLA network, since there are no type 3 or type 4 meters. TC7 offers the best performance values, while TC6 registers apparently high values for this type of TC. The following figures show these two cases, where the supervisor's curve (in purple), the sum of the meters (blue), and the technical (red) and non-technical (white) losses can be seen. It is relevant that TC7, which presented the lowest reading rates, offers the best results. This suggests that some of the meters the system considers are probably decommissioned or not yet operating.
7.3.7 Alerts The following section details the alerts set off by ENERMONGRID; specifically, the UTLA and UTLB alert values are analyzed.
Fig. 7.15 UTLB analysis
7.3.7.1 UTLA
The system monitors the voltages at the nodes and the saturation in the sections to generate alerts when the defined thresholds are exceeded:
• 7% deviation in voltage at nodes with respect to the 230 V reference voltage
• 90% saturation limit in sections
Since September 2014, 3917 alerts have been registered, 13 regarding the saturation in sections and the rest due to voltage deviations.
7.3.7.2 UTLB
The system monitors the voltages at the nodes and the saturation in sections to generate alerts when the defined thresholds are exceeded:
• 7% deviation in voltage at nodes with respect to the 230 V reference voltage
• 90% saturation limit in sections
Since November 22, 2014, 737 alerts have been registered, all due to the voltage at nodes.
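A compact sketch of these two alert rules is given below; node voltages and section loadings are assumed to be available as pandas Series, and all names are hypothetical.

```python
import pandas as pd

V_REF = 230.0            # reference voltage at the nodes (V)
V_DEV_LIMIT = 0.07       # 7% allowed deviation
SATURATION_LIMIT = 0.90  # 90% saturation limit in sections (0-1 scale)

def voltage_alerts(node_voltages: pd.Series) -> pd.Series:
    """Return the nodes whose voltage deviates more than 7% from 230 V."""
    deviation = (node_voltages - V_REF).abs() / V_REF
    return node_voltages[deviation > V_DEV_LIMIT]

def saturation_alerts(section_loading: pd.Series) -> pd.Series:
    """Return the sections whose loading exceeds 90% of their capacity."""
    return section_loading[section_loading > SATURATION_LIMIT]

# Hypothetical usage:
# print(voltage_alerts(pd.Series({"node_12": 249.5, "node_13": 231.0})))
# print(saturation_alerts(pd.Series({"section_A": 0.95, "section_B": 0.40})))
```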
7.4 Work Approach The project will be validated through several procedural challenges, detailed below:
(a) Testing with various stakeholders to see how the system is affected by different people or businesses. It is understood that not all participants need to understand and address the needs and functionality of the system.
(b) Checking the complexity of the system, in addition to checking its good performance. Some aspects will be sensitive to human response and interaction, while others will require instant and automated responses.
(c) Checking the security of the cyber systems to validate that the system is safe. Information security technologies are not sufficient to achieve secure operations if policies, risk assessment and training are not applied.
(d) Studying all the capabilities of the intelligent network, since it is an evolving goal.
As discussed in this chapter, there are many areas of work, especially for ICTs, that are necessarily involved in this transition. There are numerous factors to consider and steps that can be taken now to get ready. The interesting point is that there is a lot to do and, at the moment, no country holds a very marked lead in the definitive development of a system like ENERMONGRID.
7.5 Related Projects: INDIGO The ENERMONGRID project is strongly related to the industrial data management system called INDIGO. This system was created because critical information sources can be found in Smart Factories, the cornerstone of the upcoming industrial breed codenamed "Industry 4.0". Interoperability between the different industrial components can be hampered by the different data sources and formats, which traditionally have been proprietary systems or have been contained in isolated "data silos". INDIGO is an IIoT data management platform built on a semantics-based software platform which can overcome some of the caveats of this "data chaos" in the Industrial Internet of Things. The architecture of the system is represented in Fig. 7.16. The components of the reference architecture are described below.
Fig. 7.16 IoT reference architecture
1. Sensor: A Sensor is a hardware component which is used to measure parameters of its physical environment and to translate them into electrical signals. Typically, Sensors are connected to or are integrated into a Device to which the gathered data is sent.
2. Actuator: An Actuator is a hardware component which can act upon, control, or manipulate the physical environment, for example, by giving an optic or acoustic signal. Actuators receive commands from their connected Device. They translate electrical signals into physical action.
3. Device: A Device is a hardware component which is connected to Sensors and/or Actuators via wires or wirelessly, or even integrates these components. To process data from Sensors and to control Actuators, software in the form of Drivers is typically required. A Driver in our architecture enables other software on the Device to access Sensors and Actuators. It represents the first possibility to use software to process data produced by Sensors and to control Actuators influencing the physical environment. Thus, Devices are the entry point of the physical environment to the digital world.
4. Gateway: Devices are often connected to a Gateway in cases when the Device is not capable of directly connecting to further systems, e.g., if the Device cannot communicate via a particular protocol or because of other technical limitations. To solve these problems, a Gateway is used to compensate for such limitations by providing the required technologies and functionalities to translate between different protocols and by forwarding communication between Devices and other systems.
5. IoT Integration Middleware: The IoT Integration Middleware is responsible for receiving data from the connected Devices and processing the received data. A Device can communicate directly with the IoT Integration Middleware if it supports an appropriate communication technology, such as WiFi, a corresponding transport protocol, such as HTTP or MQTT, and a compatible payload format, such as JSON or XML. Otherwise the Device communicates with the IoT Integration Middleware over a Gateway.
6. Application: The Application component represents software that uses the IoT Integration Middleware to gain insight into the physical environment by requesting Sensor data or to control physical actions using Actuators.
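To make the Device-to-Middleware step concrete, the sketch below shows a device publishing a JSON-encoded sensor reading to an IoT integration middleware over MQTT with the paho-mqtt client. The broker address, topic and payload fields are hypothetical and are not part of the INDIGO specification.

```python
import json
import time
import paho.mqtt.client as mqtt

BROKER_HOST = "middleware.example.local"            # hypothetical middleware broker
TOPIC = "factory/line1/robot-arm/temperature"       # hypothetical topic

client = mqtt.Client()
client.connect(BROKER_HOST, 1883)

# A Device periodically forwards the values gathered by its Sensors.
payload = {
    "device_id": "robot-arm-01",
    "sensor": "temperature",
    "value": 71.3,                  # hypothetical reading, degrees Celsius
    "timestamp": int(time.time()),
}
client.publish(TOPIC, json.dumps(payload), qos=1)
client.disconnect()
```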
INDIGO uses a specific protocol called OPC-UA. OPC Unified Architecture (OPC-UA) is an M2M communication protocol specifically developed by the OPC Foundation for industrial automation with the goal of combining all previous and/or existing protocols into a common (unified) data model in order to facilitate interoperability at different levels. The protocol is specified by the IEC 62541 norm. Its main feature is to enable the exchange of information models of any complexity, both instances and types (metadata), thus realizing interoperability between machines at the semantic level. It was designed to support a wide range of systems, ranging from PLCs in production to enterprise servers, which are characterized by their diversity in terms of size, performance, platforms and functional capabilities. To this aim, OPC-UA was specifically developed with the following features explicitly in mind:
• Being independent from the underlying Physical Layer, i.e. solutions like CAN, Profinet IO or industrial WiFi can be used for the actual data transfer.
• Being independent from the functionalities of the available operating system.
7 ENERMONGRID: Intelligent Energy Monitoring, Visualization …
159
• Considering explicitly secure communication aspects, such that data manipulation can be prevented.
• Describing data semantically, such that complex data types rather than simple bits or bytes can be easily exchanged.
The protocol supports two different types of communication paradigms: client/server as well as publish and subscribe, where the OPC-UA server is configured such that specific information is automatically delivered to OPC-UA clients that are interested in receiving certain information. Both solutions are also independent from the underlying Transport Layer and, depending on the application and performance to be realized, can be implemented over TCP, HTTP, HTTPS as well as UDP, AMQP and MQTT, by implementing different transport profiles. The standard mainly specifies two different paradigms: OPC-UA for Services (client/server) and OPC-UA for the Message Model (publish and subscribe). As the name suggests, the first paradigm is specifically intended for the realization of (web) services, where information is exchanged in XML or JSON format. This particular encoding makes the exchanged data easy to read and to process, but it can perform poorly for industrial applications that have restricted resources. The second paradigm, on the other hand, is specifically intended for industrial automation systems. In this case, data are represented in a binary way such that the exchanged messages require less overhead and fewer resources in order to achieve higher system performance. The OPC-UA standard is still evolving and subject to current and future standardization, in particular for industrial applications.
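As a minimal, hedged illustration of the client/server paradigm, the sketch below reads a node value from an OPC-UA server using the open-source python-opcua (FreeOpcUa) package; the endpoint URL and node identifier are hypothetical and do not refer to any specific INDIGO deployment.

```python
from opcua import Client  # pip install opcua

ENDPOINT = "opc.tcp://machine-01.example.local:4840"  # hypothetical OPC-UA server endpoint

client = Client(ENDPOINT)
client.connect()
try:
    # Node identifiers depend on the server's address space; this one is hypothetical.
    temperature_node = client.get_node("ns=2;s=RobotArm.Temperature")
    value = temperature_node.get_value()
    print("Current temperature:", value)
finally:
    client.disconnect()
```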
Fig. 7.17 Main concepts of the SSN ontology
(a) it can describe and enact most sensor- and IIoT-related actions and functionality; (b) it has a comprehensive and efficient RDF syntax; and (c) SSN follows a horizontal and vertical modularization architecture by including a lightweight but self-contained core ontology called SOSA (Sensor, Observation, Sample, and Actuator) for its elementary classes and properties. An excerpt of this ontology is shown in Fig. 7.17. The INDIGO software architecture builds on the principles of modularity and seamless integration. Each software component has a particular functionality that relates to the overall system functionality (Fig. 7.18). Firstly, the Machine elements are the different systems found in the assembly line of the industrial environment we are trying to model. These could be mechanical arms, robots, cranes, or any other kind of device operated by the users. Secondly, the INDIGO platform interoperates with them through the communication protocols specified previously. The interaction is based on the Data Model of the different Machines, which is represented through a particular Domain Ontology. In this case, since our Domain Ontology is the Semantic Sensor Network Ontology, the different Machine Data Models are reconciled through ontology alignment and mapping. This is the functionality of the Rule-based Semantic Policies component. Finally, the Rule-based Semantic Engine mediates between the different data semantics to ensure correct persistence in the Semantic Storage component. The persistence layer follows an INDIGO Ontology Schema, which, for this proof-of-concept implementation, is the SSN ontology. For a summary of how SSN works, we depict its major parts (Fig. 7.19). Regarding the implementation of INDIGO, we have defined three software architecture layers. The presentation layer deals with the different elements of the assembly line in the Smart Factory. The second layer, the so-called Business or Application Layer, maps the Rule-based Semantic Policies to actual elements of the Semantic Sensor Network Ontology, which, for this implementation, serves as the Domain Ontology (Fig. 7.20).
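To make the role of the SSN/SOSA vocabulary more concrete, the following minimal sketch shows how a single machine observation could be expressed with the Python rdflib library. The device names, the observed property, and the values are hypothetical illustrations; only the SOSA namespace and class names come from the W3C recommendation.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

SOSA = Namespace("http://www.w3.org/ns/sosa/")   # SSN core (SOSA) vocabulary
EX = Namespace("http://example.org/indigo/")      # hypothetical factory namespace

g = Graph()
g.bind("sosa", SOSA)

sensor = EX["cobot-arm-1/speed-sensor"]           # hypothetical device
observation = EX["obs/2019-06-27T10-15-00"]       # hypothetical observation id

# A sosa:Observation made by a sosa:Sensor about an observable property
g.add((sensor, RDF.type, SOSA.Sensor))
g.add((observation, RDF.type, SOSA.Observation))
g.add((observation, SOSA.madeBySensor, sensor))
g.add((observation, SOSA.observedProperty, EX.jointSpeed))
g.add((observation, SOSA.hasSimpleResult, Literal(0.42, datatype=XSD.double)))
g.add((observation, SOSA.resultTime,
       Literal("2019-06-27T10:15:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```

A Rule-based Semantic Policy would then only have to map a machine's native data model onto triples of this shape before they are persisted in the Semantic Storage component.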
Fig. 7.18 INDIGO software architecture
Fig. 7.19 Major parts of SSN
Fig. 7.20 Software architecture layers
We now show the different screens that INDIGO can display in a factory. INDIGO is a production planning and control interface module. The following figure shows the main screen of INDIGO, where the connected devices are listed, each with its name and a representative picture. In addition, four icons indicate the operation, security, connection, and data-access status of each device. Accessing a device displays all its attributes and custom dashboards. Figure 7.21 shows both a brief description of the previously mentioned values (operation, connection, and data access) for each device and the Current Data section, which displays the values collected by the sensors in the latest report. Each complex type is shown with a different type of graph (Fig. 7.22). Each device also has a History section, where representative graphics of each chosen complex type are shown. Figure 7.23 shows the general dashboards that all devices have, displaying the number of reports made by the sensors and the percentage of successful insertions. Depending on the attributes of each device, the system shows a particular type of graph.
Fig. 7.21 Indigo home
Fig. 7.22 Indigo current data
Figure 7.24 shows, for example, graphs of the last speeds and last accelerations, a graph with the percentage of occurrence of the different modes of a cobot, and a map showing the location of the device. With this kind of visualization, we have tried to make the dashboards as detailed as possible, even though the system is not fully customizable. The INDIGO tool also allows new devices with customized sensors to be created from the system itself. INDIGO is also a middleware module, capable of connecting devices through a language standardized for all the systems integrated via their APIs. Figure 7.25 shows the Data Model screen, where the different devices initialized in the system and their reports can be identified. The system also has a button to create new devices. INDIGO was designed and implemented so that a factory worker can check the availability of the system without being able to make any changes to the database, thus avoiding possible errors.
Fig. 7.23 Indigo read rate
Fig. 7.24 Indigo multiple dashboards
Fig. 7.25 Indigo datamodel menu
7.6 Conclusions, Related Work and Future Work
The ENERMONGRID management tool complies very closely with the objectives set for the project. It does not manage the network itself, because it was never intended to operate on the network, access to which was restricted. In other respects, however, the delivered functionality exceeds the initial goals: in addition to collecting, pre-processing, and analyzing all the information safely through the MDM module, the tool also serves to establish communications with external systems, and its visualization features have proved useful for many aspects not considered at first. As future work, we will seek to increase the completeness of ENERMONGRID to help more entities in charge of managing an electricity network, with the focus on creating a large smart grid. We will analyze the ENERMONGRID system, considering the elements involved and the restrictions taken into account when developing and testing it, resolve the anomalies it presented, and develop a competitive system that can be used by any entity.
Acknowledgements This work is supported by the ITEA3 OPTIMUM project, the ITEA3 SECUREGRID project, and the ITEA3 SCRATCH project, all of them funded by the Centro para el Desarrollo Tecnológico Industrial (CDTI).
Part II
Decision-Making Systems for Industry
Chapter 8
Measuring Violence Levels in Mexico Through Tweets
Manuel Suárez-Gutiérrez, José Luis Sánchez-Cervantes, Mario Andrés Paredes-Valverde, Erasto Alfonso Marín-Lozano, Héctor Guzmán-Coutiño, and Luis Rolando Guarneros-Nolasco
Abstract Measuring Mexico's violence through tweets can play a crucial role in decision-making at the political level. In this chapter, we propose a novel way to evaluate what people say and how they feel about violence by analyzing Twitter data. To do this, we describe a methodology to create an indicator of social perception. This methodology uses technologies like Big Data, Twitter analytics, web mining, and the Semantic Web, supported by software such as the ELK stack (Elasticsearch for data storage, Logstash for collecting data, and Kibana for data visualization); SPSS and R for statistical data analysis; and Atlas.ti, Gephi, and Wordle for semantic analysis. At the end of the chapter, we show our results with a word cloud, a social graph, and the indicator of social perception of violence in Mexico at the level of federal entities and metropolitan zones.
Keywords Violence · Social perception · Twitter analytics · Big Data
M. Suárez-Gutiérrez · E. A. Marín-Lozano · H. Guzmán-Coutiño Universidad Veracruzana, Xalapa, Veracruz, México e-mail: [email protected] E. A. Marín-Lozano e-mail: [email protected] H. Guzmán-Coutiño e-mail: [email protected] J. L. Sánchez-Cervantes (B) CONACYT-Tecnológico Nacional de México/I.T. Orizaba, Orizaba, Veracruz, México e-mail: [email protected] M. A. Paredes-Valverde Instituto Tecnológico Superior de Teziutlán, Teziutlán, Puebla, México e-mail: [email protected] L. R. Guarneros-Nolasco Tecnológico Nacional de México/I.T. Orizaba, Orizaba, Veracruz, México © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. A. Zapata-Cortes et al. (eds.), New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Studies in Computational Intelligence 966, https://doi.org/10.1007/978-3-030-71115-3_8
8.1 Introduction
Numbers do not lie: in 2018, the crime incidence on citizens was estimated at a rate of 28.3 crimes per 100,000 inhabitants, compared to 27.3 in 2012. In 2018, unreported crimes amounted to 93.2%, compared to 92.1% in 2012, and the distribution of crimes by type of occurrence was 28.5% theft, 17.3% extortion, 14.3% fraud, 11.5% car theft, and other crimes. On average, 78.9% of Mexicans do not feel safe in their federal states. Insecurity is thus the most critical problem for Mexicans, followed by unemployment (32.8%), price inflation (28.1%), and health (27.6%) (Instituto Nacional de Estadística y Geográfia 2019a). These statistics are not only alarming but also disturbing. Violence in Mexico has become a severe concern for society. Some victims of criminal violence show repercussions associated with post-traumatic stress disorder (Georgina et al. 2013). Moreover, Mexican society has become accustomed to coexisting with high levels of violence; in other words, it has become usual to live with those crime rates. Thus, we focus on the phenomena revealed by the daily interaction of users on Twitter. These interactions on the online social network enable research on sentiment analysis (Mehta et al. 2012), opinion mining (Nahili and Rezeg 2018), classification of language comments (Amali 2020), hate speech detection (Watanabe et al. 2018), gender classification (Vashisth and Meehan 2020), sexual crimes (Khatua et al. 2018), cyberbullying detection (Dalvi et al. 2020), and other topics. Online social networks (OSN) attract internet users more than any other kind of web service. The leading online activities in Mexico are 82% OSN, 78% chats or messenger services, 77% email, 76% information searches, and 68% mapping services (Asociación de Internet.mx 2019). An OSN is a group of services and applications built on Web 2.0 that allow the exchange and creation of collaborative content (Filho et al. 2016). Web 2.0 establishes a more significant interaction within society, creating virtual social links where users play, tag, work, and socialize online (Mrabet 2010). On an OSN, people establish forms of collaboration, communication, and intelligence that were unimaginable until a few years ago. Twitter is an online social network based on microblogging services that enables users to send and read short text messages called tweets, each with a maximum of 280 characters (Molla et al. 2014; Nahili and Rezeg 2018). Twitter is one of the top five online social networks in Mexico, reaching 39% of internet users (Asociación de Internet.mx 2019). That is why Twitter is an indispensable source of information for research on sentiment analysis and social perception. Compared to traditional research methods based on paper surveys (Instituto Nacional de Estadística y Geográfia 2019a), online social networks provide real-time information with relatively reliable results (Yang et al. 2015). Twitter allows government planners to access real-time data on many topics, including the perception of security (Camargo et al. 2016). Online social networks are revolutionizing how people interact, even with the government. Citizens can interact with other users of online social networks by freely expressing their ideas, sentiments, needs, and discomforts.
We gathered social media messages, namely tweets carrying sentiment information, to analyze users' perceptions of violence. We believe that online social networks can be a powerful tool to inform; however, they can also misinform society if misused. We must therefore account for the existence of users dedicated to misinformation, which can influence the results; this is something to be dealt with when working with this type of data source. This chapter presents a method that uses Big Data, Twitter analytics, web mining, and the Semantic Web to measure the social perception of violence in Mexico. We analyze each tweet's metadata to identify its origin, language, hashtags, text, and user interactions. Using this metadata, we computed social graphs and the degree of social perception of violence at the level of federal entities and metropolitan zones based on qualitative information. The chapter consists of five sections. Section 1 is the introduction. Section 2 presents a review of related works on perception analysis through Twitter and on violence in social media. Section 3 describes the overall design of the methodology. Section 4 shows the analysis of hashtags and the level of social perception of violence at the scale of federal entities and metropolitan areas. Finally, Sect. 5 presents our conclusions and remarks for future work.
8.2 Related Works
Perception analysis through Twitter has become a benchmark for identifying analysis patterns. The Twitter site has become a giant database of textual information on various topics like medicine (Devraj and Chary 2015; Mahata et al. 2018), sentiment analysis (Patankar et al. 2016; Garg et al. 2017; Bustos López et al. 2018; Singh et al. 2018; del Pilar Salas-Zárate et al. 2020), and digital marketing (Al-Hajjar and Syed 2015; Nahili and Rezeg 2018), to name a few. Advances in data science motivated researchers over the past decade to explore textual sentiment analysis. Semantic technologies allow data to be kept interrelated (Sánchez-Cervantes et al. 2018). We can create data flows to exploit multiple connections and extract relationships between hashtags and users' interactions. The exploration of the information contained in these datasets is related to the principles of Linked Data (Flores-Flores et al. 2019; López-Ochoa et al. 2019), where data from Twitter are used to create domain-specific applications. Social media, Twitter, and Big Data can help identify violence indicators. Multiple authors consider different impact domains and techniques (see Table 8.1). Monroy-Hernández et al. (2015) describe how citizens affected by the drug war in Mexico navigate the complex tensions among users interacting on the OSN. In Khatua et al. (2018), the authors present an innovative model based on the #MeToo movement to classify sexual assaults. In Saha and De Choudhury (2017), a model analyzes students' stress levels after violent events on a university campus. In Ottoni et al. (2018), a model is used to expose how right-wing politics is less tolerant toward specific topics.
Table 8.1 Selection of related works about violence in social media
| Research | OSN | Impact domain | Technique |
| Monroy-Hernández et al. (2015) | Twitter | Drug war | Twitter data analytics |
| Khatua et al. (2018) | Twitter | Sexual violence | Deep learning |
| Saha and De Choudhury (2017) | Reddit | Stress on violence events | Machine learning |
| Ottoni et al. (2018) | YouTube | Violence aggression from videos | Semantic analysis, natural language |
| Ristea et al. (2017) | Twitter | Mass events violence | Pearson's correlation |
In Ristea et al. (2017), the authors analyze the correlation between tweets and violence and crime near stadiums in England. In summary, information coming from Twitter can be relevant in many domains related to violence, as shown in Table 8.1. Moreover, perception analysis through Twitter is an opportunity to establish preventive activities that mitigate or decrease feelings of vulnerability. Furthermore, the related works consulted helped us recognize the importance of tweets alongside other social, economic, education, and safety indicators. Our contribution describes the procedure carried out, which can be replicated for other studies by changing the analysis keywords.
8.3 Model for Knowledge Acquisition
Acquiring knowledge from an online social network requires understanding how its data are organized around people. To acquire information that can be considered knowledge from tweets, we propose applying the Big Data model using 4 of the 10 Vs (Volume, Value, Veracity, and Velocity) (Khan et al. 2018). Volume corresponds to the entire universe of available tweets. Value refers to the knowledge acquired from Twitter. Veracity describes the filters that give quality to the information. Velocity is the server cluster's capacity to process and analyze the data (Lugmayr et al. 2017; Nguyen and Jung 2018). The relation between data and knowledge is shown in Fig. 8.1. We propose to filter the tweet universe (data) with a search pattern, generating information and knowledge. This model can be applied to many topics because the information of interest can be defined and delimited by the search pattern applied to the tweet universe. We use the Twitter API to gather the data for this study. The universe of available tweets corresponds to public accounts (excluding private ones) and 1% of the tweets published globally per second. Therefore, the knowledge (K) acquired from the online social network Twitter (T) can be established through a content-analysis relationship.
Fig. 8.1 Model for knowledge acquisition from Twitter
A set of tweets (t) creates knowledge by complying with an existing search pattern (P), validated by a group of keywords (p):

K = \{t_1, t_2, t_3, \ldots, t_n\} \mid \exists P,\ P = (p_1, p_2, p_3, \ldots, p_n)

Simply expressed:

K = \Big\{\lim_{t \to n} T,\ \exists P \lim_{p \to n} P\Big\}
In other words, the knowledge acquired over Twitter directly depends on the relationships between the semantic associations of the search pattern, each tweet, and the origin of the tweet (Nguyen and Jung 2018).
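Read operationally, the expression above is simply a filter over the tweet universe: a tweet contributes to K when its text matches at least one keyword of the search pattern P. A minimal sketch of that reading, using made-up tweets and keywords, could look as follows.

```python
# Hypothetical tweet universe T and search pattern P (keyword set)
tweets = [
    "Balean a un comerciante en el centro de Monterrey",
    "Hoy se inaugura la feria del libro",
    "Denunciaron un atraco en la colonia Roma",
]
pattern = {"balean", "atraco", "denunciaron", "sicarios"}

def matches(text: str, keywords: set) -> bool:
    """A tweet belongs to K if any keyword of P appears in its text."""
    words = text.lower().split()
    return any(k in words for k in keywords)

knowledge = [t for t in tweets if matches(t, pattern)]
print(knowledge)   # the two violence-related tweets form the knowledge set K
```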
8.4 Methodology
This section presents the method used to understand the social perception of violence from the online social network Twitter. Figure 8.2 shows an overview of this model. There are four layers: the keyword selection layer, the data pre-processing layer, the data processing layer, and the data analytics layer. The keyword selection layer consists of three components: (1) keyword selection; (2) keyword validation; and (3) configuration of the server cluster. The data pre-processing layer has five components: (1) data crawling; (2) data backup; (3) data cleaning; (4) data normalization; and (5) data validation.
Fig. 8.2 Methodology model applied
There are two components in the data processing layer: (1) the descriptive analysis of frequencies and (2) the classification by quartiles. Finally, the data analytics layer shows the results through a tag cloud, a social graph, and a map visualization at the level of Mexican federal entities and metropolitan zones.
Fig. 8.3 The cluster of keywords
8.4.1 Keyword Selection Layer
This layer is related to keyword selection and validation. To this end, we identified key users who disseminate information about violence. A user on Twitter can be a person, a company, an organization, or a government institution; for this research, we restricted our user group to government institutions because their profiles can be verified. To validate the keywords, we used a Twitter analytics tool called TweetReach to generate a report for each keyword, and from these reports we calculated the relevance of every keyword.
8.4.1.1 Keyword Selection
To create our keyword dataset, we selected the Bag-of-Words (BOW) method (Ceh-Varela et al. 2018) to classify tweets by selecting key users. The BOW method represents the frequency of the words used by the selected users. We searched the websites of the Public Security Institutions, the Prosecutor's Offices, and the 911 emergency number of the 32 federal entities to identify whether they had an active Twitter account. After this search, we defined a list of key users from which to collect data on Twitter. This list contains 64 Twitter user accounts linked to government institutions at all levels: sixty related to federal states and municipalities and four at the national level (see Table 8.2). We then extracted keywords from tweets in which the selected user accounts posted about violent events. For keyword extraction, we analyzed each selected user account with the TweetReach tool, which returned a report with the top keywords published by the account. We kept the keywords most published by these selected users, which gave us 150 keywords.
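As a rough illustration of the Bag-of-Words step, the sketch below counts word frequencies over the timelines of the selected accounts and keeps the most frequent terms as candidate keywords. The account names are real handles from Table 8.2, but the sample tweet texts, the stop-word list, and the tokenizer are simplifications assumed for the example.

```python
from collections import Counter
import re

# Placeholder timelines of two of the selected government accounts
timelines = {
    "@SSP_CDMX": ["Detenido por robo con violencia en Iztapalapa",
                  "Prevención del delito en escuelas"],
    "@FiscaliaJal": ["Aseguran armas de fuego tras cateo",
                     "Detienen a presunto responsable de extorsión"],
}

stopwords = {"de", "en", "por", "con", "a", "del", "la", "el", "tras"}

bag = Counter()
for account, tweets in timelines.items():
    for text in tweets:
        tokens = re.findall(r"[a-záéíóúñü]+", text.lower())   # simple Spanish-friendly tokenizer
        bag.update(t for t in tokens if t not in stopwords)

# Keep the most frequent terms as candidate keywords (150 in the study)
candidate_keywords = [word for word, _ in bag.most_common(150)]
print(candidate_keywords[:10])
```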
8.4.1.2 Keyword Validation
We used TweetReach as a Twitter analytics tool to analyze each keyword through reports. These reports contain indicators of the estimated reach and the exposure of a tweet, computed from a random sample of 100 tweets from the previous days (Union Metrics 2020). Based on these indicators, we propose a relevance variable that estimates a keyword's likelihood of appearing in a tweet's body. Table 8.3 shows a sample of the top 20 keywords with the highest relevance. For this, we used the following equation:

\text{Relevance of the keyword} = \frac{\text{Exposure} - \text{Estimated Reach}}{\text{Exposure}}
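Under this definition, computing the relevance for each TweetReach report reduces to a single ratio; the following sketch reproduces it for two of the keywords reported in Table 8.3.

```python
def keyword_relevance(exposure: int, estimated_reach: int) -> float:
    """Relevance = (Exposure - Estimated Reach) / Exposure, expressed as a percentage."""
    return 100 * (exposure - estimated_reach) / exposure

# Values taken from Table 8.3
print(round(keyword_relevance(3_838_922, 1_074_307), 2))  # 'Prevención del delito' -> 72.02
print(round(keyword_relevance(5_242_496, 1_514_997), 2))  # 'Aprehendimos'          -> 71.1
```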
Table 8.2 List of user key Twitter accounts
| User account | Level | User account | Level | User account | Level |
| @911_C4BC | State | @PGR_Coah | State | @ssp_sonora | State |
| @911Jalisco | State | @PGR_Durango | State | @UCS_CDMX | State |
| @911Puebla | State | @PGR_mx | State | @Fiscalia_Chih | State |
| @AIC_Guanajuato | State | @PGR_Guerrero | State | @PGJDF_CDMX | State |
| @Amicsonora | State | @PGR_Hgo | State | @CNSeguridadmx | Federal |
| @C5_CDMX | State | @PGR_Mor | State | @PGR_AIC | Federal |
| @C5Edomex | State | @PGR_NL | State | @PoliciaFedMx | Federal |
| @DgoFiscalia | State | @PGR_Qro | State | @SEDENAmx | Federal |
| @FGECoahuila | State | @PGR_QRoo | State | @911chetumal | Municipal |
| @FiscaliaEdomex | State | @PGR_Sin | State | @911SanPedro | Municipal |
| @FiscaliaJal | State | @PGR_Sonora | State | @c4_ver | Municipal |
| @FiscaliaNayarit | State | @PGR_Tab | State | @C4Nauc | Municipal |
| @FiscaliaPuebla | State | @PGR_Tamps | State | @FSPE_Gto | Municipal |
| @PespSonora | State | @PGR_Ver | State | @MexicaliDspm | Municipal |
| @PFNuevoLeon | State | @PGR_Yuc | State | @Policia_HMO | Municipal |
| @PGJEGuanajuato | State | @poesqro | State | @Policia_PNegras | Municipal |
| @PGR_Ags | State | @policiachih | State | @PoliciaGDL | Municipal |
| @PGR_BC | State | @PrevencionSSP | State | @policiamazatlan | Municipal |
| @PGR_BCSur | State | @SeguridadGto | State | @PoliciaZapopan | Municipal |
| @PGR_Camp | State | @SS_Edomex | State | @SPM_SLP | Municipal |
| @PGR_Chih | State | @SSP_CDMX | State | | |
| @PGR_Chis | State | @SSP_SLP | State | | |
Subsequently, we determined and measured the social perception of violence in Mexico through the selected keywords and a multidimensional analysis cluster. In doing so, we identified the need for a model that quantifies old problems from an innovative perspective using emerging technologies (Vergara Ibarra 2018). We adopted a simplification of reality based on the interaction that the selected keywords have among themselves. They involve five analysis clusters: Events, Weapons, Crime, Security, and Gender/Sexual. Each of them has a sub-classification by keyword categories, as detailed in Fig. 8.3.
8.4.1.3 Configuration of the Server Cluster
It is essential to consider that we worked with large amounts of data. Therefore, the server's storage capacity and availability were factors that we considered when implementing the cloud servers on the Google Cloud™ platform.
Table 8.3 Top 20 keyword validation list
| # | Keyword | Estimated reach | Exposure | Relevance of the keyword (%) |
| 1 | Prevención del delito | 1,074,307 | 3,838,922 | 72.02 |
| 2 | Aprehendimos | 1,514,997 | 5,242,496 | 71.10 |
| 3 | Amenazada | 729,112 | 2,226,465 | 67.25 |
| 4 | Amagó | 738,810 | 1,607,486 | 54.04 |
| 5 | Sicarios | 2,507,355 | 5,328,253 | 52.94 |
| 6 | Atraco | 2,838,436 | 6,019,241 | 52.84 |
| 7 | Denunciaron | 3,354,910 | 7,071,119 | 52.55 |
| 8 | Corrupción | 886,412 | 1,535,690 | 42.28 |
| 9 | Atracarlo | 2,146,801 | 3,651,734 | 41.21 |
| 10 | Huachicolero | 117,113 | 194,986 | 39.94 |
| 11 | Balean | 2,698,616 | 4,128,677 | 34.64 |
| 12 | Flagrancia | 1,195,124 | 1,797,536 | 33.51 |
| 13 | Fallecidos | 564,291 | 831,395 | 32.13 |
| 14 | Narcofosas | 64,038 | 91,502 | 30.01 |
| 15 | Delitos Cibernéticos | 126,875 | 177,050 | 28.34 |
| 16 | Narcomanta | 1,391,236 | 1,900,877 | 26.81 |
| 17 | Mafia | 50,418 | 67,358 | 25.15 |
| 18 | Defensa de las mujeres | 221,692 | 296,036 | 25.11 |
| 19 | Acoso Sexual | 2,110,298 | 2,808,787 | 24.87 |
| 20 | Puñaladas | 375,286 | 497,175 | 24.52 |
This platform provides accessibility twenty-four hours a day, seven days a week, with information encryption, VPN-based access security, and dynamic storage (Google Cloud Plattform 2017). The server cluster configuration consisted of four nodes: three were configured virtually on the Google Cloud™ platform (Google Cloud Plattform 2017), and one node was configured locally (Fig. 8.4). See Table 8.4 for the details of the node configuration, where: Node 1 corresponds to the cluster's central server; it stores the database (DB), is the cluster's central nucleus, and has Elasticsearch (Kononenko et al. 2014) installed. Node 2 visualizes the data stored in Node 1 through the Kibana server. Node 3 manages the connection between the cluster and the Twitter API through Logstash (Union Metrics 2020) using a configuration file in JSON format. This file contains three sections: input (containing the access codes, list of keywords, geographic location, and language), filter, and output (establishing the connection to Node 1 and indicating the output format).
Node 4 is the local server that performs the cluster's remote administration, stores the backup of the database using Elasticdump, and uses Oracle® MySQL to manage the DB. Furthermore, this node must have the necessary tools to carry out the DB analysis.
Table 8.4 Detail of node configuration
| | Node 1 | Node 2 | Node 3 | Node 4 |
| Platform | Virtual | Virtual | Virtual | Local |
| Operating system | Linux Debian 9 | Linux Debian 9 | Linux Debian 9 | macOS® 10.13.6 |
| Server | Elasticsearch | Kibana | Logstash | Gcloud compute; MySQL |
| CPU | 1 CPU | 1 CPU | 1 CPU | 2.4 GHz Intel Core i5 |
| Memory | 3.75 Gb | 3.75 Gb | 3.75 Gb | 16 Gb |
| HDD | 100 Gb | 10 Gb | 10 Gb | 240 Gb |
| Network interface | 10.128.0.2 | 10.128.0.3 | 10.128.0.4 | – |
| External IP | – | 34.68.248.158 | – | – |
Fig. 8.4 Server cluster configuration
8.4.2 Pre-processing Data Layer
The pre-processing data layer identifies the tweets that satisfy the requirements for data analysis. In this layer, we standardized and normalized the database of captured tweets so that indicators could be built from homologated data. It is relevant to consider that users have different writing styles: they use special characters and emoticons, and they also restrict their vocabulary because of the limit of 280 characters per tweet (Nahili and Rezeg 2018); hence the relevance of normalizing the data. Figure 8.5 summarizes the phases of the data pre-processing and the main results obtained in each of them.
8.4.2.1 Data Crawling
Data crawling began when we started the Elasticsearch server and sent a query through the Logstash server to the Twitter API. This process lasted 45 days, from May 14 to June 27, 2019. During this period, we collected 20,736,887 tweets, an average of 460,819 tweets per day. Figure 8.6 shows the performance of the tweet capture during this period and highlights the days with the most significant activity about violence; for example, the most active day was May 15, with 676,099 tweets. The days with the least activity, between June 11 and 13, resulted from maintenance of the server cluster.
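The capture itself was performed with the Logstash Twitter input described in the node configuration; purely as an illustration of the same idea, a keyword-filtered stream restricted to Spanish could also be opened with the tweepy library. The credentials and the keyword list below are placeholders, and the sketch assumes the tweepy 4.x streaming interface.

```python
import tweepy  # assumes tweepy 4.x, where Stream exposes on_status / filter

class ViolenceStream(tweepy.Stream):
    def on_status(self, status):
        # In the real pipeline each matching tweet is forwarded to Elasticsearch
        print(status.id, status.text[:80])

# Placeholder OAuth credentials; the study used Logstash instead of this client
stream = ViolenceStream("CONSUMER_KEY", "CONSUMER_SECRET",
                        "ACCESS_TOKEN", "ACCESS_SECRET")
stream.filter(track=["balean", "atraco", "acoso sexual"], languages=["es"])
```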
Fig. 8.5 Summary of the pre-processing data layer
Fig. 8.6 Tweets captured between June 11 and 13 of 2019
8.4.2.2 Data Backup
Data storage and backups become more and more critical with Big Data, and we needed a flexible method to keep the content safe, secure, and reliable. The choice of a backup method should be oriented to the analytics functions that will be applied to the data; thus, in this phase the database management system is implemented (Zhao and Lu 2018). It was impractical to analyze the data directly on the remote Kibana server configured on the cluster because of the number of tweets stored on the Elasticsearch server: analyzing the 20 million tweets from the Kibana server resulted in prolonged response times and server crashes. As a result of these difficulties, we backed up the information and analyzed it from the cluster's local node. We explain the statistical analysis functions in more detail in the data processing layer section; because of these functions, we chose Oracle® MySQL as our database management system. We selected the Node.js tool to create a stable and safe connection to the database stored on the Elasticsearch server mounted on the Google Cloud™. Node.js uses an event-driven, non-blocking I/O model and suits data-intensive, real-time applications (Chitra and Satapathy 2017). It creates an event loop at runtime that handles many simultaneous connections: every connection is activated only when there is an answer to the call; if there is no answer, the connection is suspended (Node.js 2019). This behavior is helpful when we want to download several tweets simultaneously by making multiple connections to the Elasticsearch server. We then selected the elasticdump tool to export the database of tweets stored on Elasticsearch. Elasticdump allows splitting the database into many indexed packages (NPM 2019). We decided to create 100 Mb packages.
This means that in one connection created with Node.js, elasticdump issues a query that downloads the equivalent of 100 Mb of tweets. In total, we downloaded 1,033 packages, equivalent to 103.3 Gb of data, which means that a tweet weighs approximately 4.98 kb.
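The backup itself relied on Node.js and elasticdump; an equivalent chunked export can be sketched with the official Elasticsearch Python client, scrolling through the index and writing fixed-size JSON-lines packages. The index name, host address, and chunk size below follow the cluster description but are assumptions for this sketch.

```python
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://10.128.0.2:9200")   # node 1 of the cluster
CHUNK_BYTES = 100 * 1024 * 1024                # ~100 Mb per package, as in the study

part, size = 0, 0
out = open("tweets_0000.jsonl", "w", encoding="utf-8")
for hit in scan(es, index="tweets", query={"query": {"match_all": {}}}):
    line = json.dumps(hit["_source"], ensure_ascii=False) + "\n"
    if size + len(line.encode("utf-8")) > CHUNK_BYTES:   # roll over to the next package
        out.close()
        part, size = part + 1, 0
        out = open(f"tweets_{part:04d}.jsonl", "w", encoding="utf-8")
    out.write(line)
    size += len(line.encode("utf-8"))
out.close()
```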
8.4.2.3 Data Cleaning
Data cleaning consisted of excluding the misclassified tweets stored in the database. Firstly, we imported the data from the backup database into the local Elasticsearch server (Node 4); the Kibana server includes an import module to upload data. In Kibana, we created an index pattern to indicate how the tweets are related. After that, we created queries, graphs, and metadata analyses of the tweets. With Kibana, we designed a dashboard that shows the performance during the tweet capture, the total number of tweets captured, the classification of tweets by language, the location of tweets by country of origin, and the top hashtags (see Fig. 8.7). Figure 8.7 shows that the tweets' language was mostly Spanish, with 86.56%, compared to 11.43% in Portuguese and the rest in other languages, which means that we excluded 14.43% of the tweets (2,983,421 tweets) for not being in Spanish. The dashboard shows that metadata related to a tweet's origin is available in only 1.95% of all the tweets (406,425 tweets), and of those, just 10.58% came from Mexico (42,994 tweets); the rest came from Brazil (18.89%), Colombia (18.77%), Argentina (17.38%), and Spain (16.30%). The dashboard also indicates that the trending hashtags in this universe of tweets corresponded to Venezuela, Cuba, EEUU, CDMX, Nicaragua, Colombia, and others.
Fig. 8.7 Kibana Dashboard
After this analysis, we found that it was possible to obtain an accurate location of a tweet's origin from the text metadata. We therefore applied a total of 32 filters, one per Mexican state, over the language, user location, and text metadata. As each filter is relative to one state, it uses the names of the main cities and municipalities identified inside the tweet's text. These filters excluded 17,753,466 tweets (85.61%) for not coming from any Mexican state, leaving 983,336 tweets; most of the excluded tweets arose from homonymous words or cities. From this universe, we applied a filter to determine the total number of tweets with geographic information at the municipal level, leaving only 507,023 tweets.
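A simplified version of one of these per-state filters is sketched below: keep Spanish tweets and tag them with a state when either the user-declared location or the text mentions one of that state's municipalities. The municipality lists and the tweet fields are illustrative assumptions, not the full INEGI catalogue or the exact filter definitions used in the study.

```python
# Illustrative excerpt of the per-state place lists (the study used all 32 states)
STATE_PLACES = {
    "Nuevo León": {"monterrey", "san pedro", "apodaca", "guadalupe"},
    "Jalisco": {"guadalajara", "zapopan", "tlaquepaque", "puerto vallarta"},
}

def assign_state(tweet: dict):
    """Return the matching state, or None if the tweet should be excluded."""
    if tweet.get("lang") != "es":
        return None
    haystack = f"{tweet.get('user_location', '')} {tweet.get('text', '')}".lower()
    for state, places in STATE_PLACES.items():
        if any(place in haystack for place in places):
            return state
    return None

tweet = {"lang": "es", "user_location": "MTY", "text": "Balacera en Monterrey esta tarde"}
print(assign_state(tweet))   # -> Nuevo León
```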
8.4.2.4 Data Normalization
In this phase, we standardized the names of the federal entities and municipalities at the national level. To carry out this standardization, we used the Unique Catalog of State, Municipal and Local Geostatistical Areas published by the National Institute of Statistics and Geography (INEGI, for its Spanish acronym) (Instituto Nacional de Estadística y Geográfia 2019b). We also corrected the text metadata, removing symbols, special characters, duplicated blank spaces, line breaks, and emoticons: in total, we deleted 42,249 emoticons and 2,944,765 special characters. Table 8.5 shows the top symbols deleted, and a sketch of this clean-up follows the table.
8 Measuring Violence Levels in Mexico Through Tweets
8.4.2.5
183
Data Validation
This phase is related to the integrity of the datasets acquired from the Twitter site. However, some authors mention different techniques about the datasets’ validation and veracity (Bodnar et al. 2015; Ashwin et al. 2016; Agarwal et al. 2017; Devi and Karthika 2018; Senapati et al. 2019). For example, in Agarwal et al. (2017), crowdsourcing semantics techniques categorize the data as positive, negative, and neutral; In Senapati et al. (2019) they applied the divide and conquer method, graphbased modeling, and parallel data processing during data capture to improve the certainty integrity of the data. We applied the veracity index technique (a combination of geographic spread index, spam rate, and diffusion index) mentioned by Ashwin et al. (2016). We consider the variables mentioned by Ashwin et al. (2016) to generate the veracity degree (VD) of a set of tweets. We combined indicators like dissemination index (ID), geographical extension index (IEG), and the relevant tweets index (ITR). Dissemination Index (ID) corresponds to identifying the speed in which the information spreads on Twitter on issues of violence and insecurity in Mexico. It shows the calculation among all unique users’ universe about the tweets’ diffusion and penetration. ID = 1−
T otal U nique U ser s T otal T weets
=1−
94, 960 507, 023
= 0.8127
The Geographical Extension Index (IEG) identifies the average spread of information over the federal entities at the municipal level, where the total is 2,457 municipalities spread over 32 states; there are also 74 Metropolitan Zones (MZ), which include 417 municipalities (Consejo Nacional de Población 2018; Instituto Nacional de Estadística y Geográfia 2019b):

IEG = \frac{1}{3}\left(\frac{\text{States Reached}}{\text{Total States}} + \frac{\text{Mun Reached}}{\text{Total Mun}} + \frac{\text{Mun Covered in MZ}}{\text{Total Mun of MZ}}\right) = \frac{1}{3}\left(\frac{32}{32} + \frac{1{,}121}{2{,}457} + \frac{312}{417}\right) = 0.7348
The Relevant Tweets Index (ITR) shows the weight of unique tweets over the total number of tweets spread on the social network. The closer this index is to 0, the more repeated tweets there are and, therefore, the larger the number of tweets that can be considered spam:

ITR = \frac{\text{Unique Tweets}}{\text{Total Tweets}} = \frac{217{,}853}{507{,}023} = 0.4296
Veracity Degree (DV) weighs the diffusion rates, geographic extension, and relevant tweets. In other words, it generates a weighted average of the three indicators, giving certainty about the dataset acquired through the independence of the opinions, scope, and impact of each of the Twitter users.
DV = \frac{ID + IEG + ITR}{3} = \frac{0.8127 + 0.7348 + 0.4296}{3} = 0.6590
Each of these indexes takes values between 0 and 1. An index closer to 1 indicates a stronger correlation and dependence among the tweets, which means that the dataset contains information relevant to the searched topic; an index closer to 0 implies that the tweets came from multiple unrelated users and have no correlation with the searched topic. The ID had a value of 0.8127, indicating that users were very active as soon as impactful news about violence occurred. The IEG had a value of 0.7348, derived from covering 100% of the federal entities, 45% of the municipalities, and 75% of the municipalities that belong to a metropolitan zone. The ITR shows a value of 0.4296, reflecting that many users forward tweets they perceive to be relevant. Finally, the DV (with a value of 0.6590) is obtained from the mean of these three indicators; since it is higher than 0.5, it validates the set of acquired tweets and shows a significant correlation between the three indicators. Figure 8.8 illustrates the veracity degree (represented by the circle in the figure) as a point in a three-dimensional cube whose axes are the dissemination index ("x"), the geographical extension index ("y"), and the relevant tweets index ("z"). This representation allows the three indicators to be compared and shows that the values on the "x" and "y" axes are close to 1, while the value on the "z" axis lies near the middle. Figure 8.8 thus validates the correlation between the three calculated indexes.
Fig. 8.8 Veracity degree
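The three indexes and the resulting veracity degree reduce to a few ratios over counts that are already available after cleaning. The sketch below is a direct transcription of the formulas with the values reported above; small discrepancies with the chapter's figures are due to rounding.

```python
def veracity_degree(unique_users, total_tweets, unique_tweets,
                    states, total_states, mun, total_mun, mun_mz, total_mun_mz):
    id_ = 1 - unique_users / total_tweets                                   # dissemination index
    ieg = (states / total_states + mun / total_mun + mun_mz / total_mun_mz) / 3  # geographic extension
    itr = unique_tweets / total_tweets                                      # relevant tweets index
    return id_, ieg, itr, (id_ + ieg + itr) / 3                             # veracity degree

id_, ieg, itr, dv = veracity_degree(94_960, 507_023, 217_853,
                                    32, 32, 1_121, 2_457, 312, 417)
print(round(id_, 4), round(ieg, 4), round(itr, 4), round(dv, 4))
# -> 0.8127 0.7348 0.4297 0.6591 (minor rounding differences from the chapter)
```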
8.4.3 Data Processing Layer This layer consisted of two phases. The first one is a descriptive analysis of frequency. The second is a classification of the quartiles to generate the two indicators: Degree of Rate per 100,000 inhabitants and Degree of Total Tweets.
8.4.3.1 Descriptive Analysis of Frequencies
With the descriptive statistical analysis of frequencies, we classified the tweets into two sections: the first corresponded to the set of data at the federal states’ level. The second corresponded to the data set obtained at a geographical level of metropolitan zones. In Mexico, the National Population Council (CONAPO for its acronym in Spanish) delimited the metropolitan zones (Consejo Nacional de Población 2018). In 2015, 74 metropolitan zones covered 17% of the total municipalities (417 municipalities out of 2,456), and these zones agglomerated 63% of the population nationwide (75,082,458 inhabitants of the 119,530,753).
Table 8.6 Descriptive analysis of frequencies at the federal states' level
| Description | Total tweets | Rate per 100 thousand inhabitants |
| Valid | 32 | 32 |
| Loss | 0 | 0 |
| Average | 15,844.47 | 447.30 |
| Minimum | 1,955 | 111.41 |
| Maximum | 58,416 | 1,140.05 |
| Quartile Q1(25) | 7,193 | 280.733 |
| Quartile Q2(50) | 11,673 | 352.391 |
| Quartile Q3(75) | 17,250 | 547.226 |

Table 8.7 Descriptive analysis of frequencies at the metropolitan zones level
| Description | Total tweets | Rate per 100 thousand inhabitants |
| Valid | 74 | 74 |
| Loss | 0 | 0 |
| Average | 6,268.30 | 581.6059 |
| Minimum | 1 | 0.8300 |
| Maximum | 70,699 | 2,817.5400 |
| Quartile Q1(25) | 1,021.25 | 254.8325 |
| Quartile Q2(50) | 2,487.00 | 544.0500 |
| Quartile Q3(75) | 7,323.00 | 760.9800 |
According to the information in Tables 8.6 and 8.7, the data classification relies on two indicators: the total number of tweets and the rate per 100,000 inhabitants. The descriptive analysis of frequencies was calculated with the SPSS software; Table 8.6 reports it at the federal states' level and Table 8.7 at the metropolitan zones' level.
8.4.3.2 Classification by Quartiles
To divide the total tweets into quartiles, we considered all the federal states' tweets with valid data. There is an average of 15,844 tweets, the lowest value is 1,955, and the highest 58,416. We took the quartile Q1 (7,193) as the first cutoff point, with four cutoff numbers and a width of 5,238 (obtained by subtracting the lowest value from the value of Q1, that is, 7,193 − 1,955 = 5,238), which resulted in the five ranges shown in Table 8.8. Regarding the rate per 100,000 inhabitants, we found an average of 447.30 tweets, the lowest being 111.41 and the highest 1,140.05. We took the quartile Q1 (280.733) as the first cutoff point, with four cutoff numbers and a width of 169.323 (obtained by subtracting the lowest value from the value of Q1, that is, 280.733 − 111.41 = 169.323), resulting in the five ranges shown in Table 8.8. We also considered two indicators in the descriptive statistical analysis at the metropolitan zone level (see Table 8.7): the first refers to the total number of tweets, and the second to the rate per 100,000 inhabitants.
• Total tweets: all the metropolitan zones have valid data. There is an average of 6,268.30 tweets, the lowest value being 1 and the highest 70,699. We took the quartile Q1 (1,021) as the first cutoff point, with four cutoff numbers and a width of 1,020, generating the five ranges shown in Table 8.9.
Table 8.8 Classification range of analysis intervals at a state level
| Total tweets | Rate per 100,000 inhabitants | Rate |
| 1,955–7,193 | 111.410–280.733 | Very low |
| 7,194–12,431 | 280.734–450.056 | Low |
| 12,432–17,669 | 450.057–619.379 | Medium |
| 17,670–22,907 | 619.380–788.702 | High |
| 22,908 and more | 788.703 and more | Very high |
Table 8.9 Classification of analysis intervals at metropolitan zones
| Total tweets | Rate per 100,000 inhabitants | Rate |
| 1–1,021 | 0.8300–254.8325 | Very low |
| 1,022–2,041 | 254.8326–508.8350 | Low |
| 2,042–3,061 | 508.8351–762.8375 | Medium |
| 3,062–4,081 | 762.8375–1,026.8400 | High |
| 4,082 and more | 1,026.8401 and more | Very high |
• Rate per 100,000 inhabitants: there is an average of 581.6059, the lowest value being 0.8300 and the maximum 2,817.5400. We took the quartile Q1 (254.8325) as the first cutoff point, with four cutoff numbers and a width of 254.0025, obtaining the five ranges shown in Table 8.9. A sketch of this cutoff computation is given below.
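The cutoff scheme used in Tables 8.8 and 8.9 (first cutoff at Q1, then equal steps of width Q1 minus the minimum) can be reproduced directly from the raw values. In the sketch below, the five summary statistics of Table 8.6 stand in for the full 32-state vector, which is enough to recover Q1 and the resulting ranges; with the complete data the call would be identical.

```python
import numpy as np

def cutoff_ranges(values, labels=("Very low", "Low", "Medium", "High", "Very high")):
    """First cutoff at Q1; the remaining cutoffs are equal steps of width Q1 - min."""
    q1, lo = np.percentile(values, 25), min(values)
    width = q1 - lo
    cuts = [lo + (i + 1) * width for i in range(4)]          # four cutoff points
    bounds = [lo] + cuts + [float("inf")]                    # last range is open-ended
    return list(zip(labels, zip(bounds[:-1], bounds[1:])))

# Stand-in for the state-level total-tweet values (min, Q1, Q2, Q3, max from Table 8.6)
state_totals = [1_955, 7_193, 11_673, 17_250, 58_416]
for label, (lower, upper) in cutoff_ranges(state_totals):
    print(f"{label:>9}: {lower:,.0f} - {upper:,.0f}")
```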
8.5 Results
8.5.1 Hashtag Analysis
By analyzing the hashtags of the tweets stored in the database, we determined that there are 8,600 different unique hashtags, spread over 78,871 tweets and 23,324 users. The top 50 hashtags account for 24,773 tweets (see Fig. 8.9), representing 31.40% of the tweets carrying hashtags. The hashtag "#DíaNaranja" ("Orange Day") is the most widely used keyword. According to the National Commission to Prevent and Eradicate Violence Against Women (CONAVIM, for its acronym in Spanish), Orange Day is commemorated on the 25th of every month with the purpose of "acting, raising awareness, and preventing violence against women and girls" (Comisión Nacional para Prevenir y Erradicar la Violencia Contra las Mujeres 2019). We identified significant terms based on hashtags and represented them in a vector-space model, which is how the social graph is generated (Azam et al. 2016). We considered each hashtag as a node and the hashtag interaction among users as a weighted edge. A social graph shows a variety of clusters through the grouping of nodes based on the users' interaction. To create the social graph, we used the Gephi tool, applying the ForceAtlas technique with 15 threads, overlap avoidance, and a weight influence of 0.4. This allowed us to understand the online social analysis by interweaving an algebraic, numerical model with a taxonomic study of hashtags. In other words, we measured the frequency with which a term appeared in each tweet to obtain the weights of the graph nodes, where each node corresponds to a published hashtag, while the edges correspond to the interactions among the nodes. First, we queried the database to obtain the list of hashtags and the users who published them (the origin of each edge).
Fig. 8.9 Word Cloud for most used hashtags
Fig. 8.10 Hashtag correlation
Second, we selected the users identified by the hashtags that they disseminated (the destination of each edge). We also deleted hashtags that pointed to themselves (circular edges). The resulting graph, shown in Fig. 8.10, has 1,343 nodes and 152,316 edges, representing all the tweets that have a weighted attraction between them. Edges are identified by color and line thickness (weight), that is, the number of times a node is related to another node. The graph also makes it possible to recognize the hashtags or nodes with greater relevance and importance based on node size, which is proportional to the number of times users cited them.
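The same user–hashtag weighting can be assembled outside Gephi; the sketch below builds a weighted hashtag graph with the networkx library from per-user hashtag lists, which could then be exported (e.g., as GEXF) and laid out with ForceAtlas in Gephi. The sample records are placeholders, not data from the study.

```python
import itertools
import networkx as nx

# Placeholder (user -> hashtags) records queried from the tweet database
user_hashtags = {
    "user_a": ["DíaNaranja", "CDMX", "Inseguridad"],
    "user_b": ["CDMX", "Inseguridad"],
    "user_c": ["DíaNaranja", "Feminicidio"],
}

g = nx.Graph()
for user, tags in user_hashtags.items():
    for a, b in itertools.combinations(sorted(set(tags)), 2):
        # Hashtags are nodes; the edge weight counts how many users link them
        w = g.get_edge_data(a, b, default={"weight": 0})["weight"]
        g.add_edge(a, b, weight=w + 1)

print(g.number_of_nodes(), g.number_of_edges())
print(g["CDMX"]["Inseguridad"]["weight"])        # two users connect these hashtags
# nx.write_gexf(g, "hashtags.gexf") would export the graph for layout in Gephi
```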
8.5.2 State-Level Analysis Table 8.10 shows the statistical summary of the results obtained from the analysis of intervals applied at a state level. In the case of the indicator of total tweets, the federal entities that tweet the most are Nuevo León (11.52%), Jalisco (10.78%), Mexico City (9.97%), Mexico (7.57%), and Veracruz (with 5.57%); being the most active entities in this online social network concerning violence. We show the correlation of tweets for each federal entity’s population at the rate per 100,000 inhabitants. The entities with the highest dissemination of events about violence on Twitter for every 100,000 inhabitants correspond to Nuevo León with 1,141; Quintana Roo with 1,074; Sonora with 814; Morelos with 716; and Jalisco with 696.
Table 8.10 Intervals for the indicators created at the federal entity level
| Federal entity | Total tweets | Rate per 100 thousand inhabitants | Grade by the rate per 100 thousand inhabitants | Grade by total tweets |
| Aguascalientes | 7,230 | 550.8387 | Medium | Low |
| Baja California | 17,481 | 527.2085 | Medium | Medium |
| Baja California Sur | 2,737 | 384.3945 | Low | Very low |
| Campeche | 4,765 | 529.4850 | Medium | Very low |
| Coahuila de Zaragoza | 16,556 | 560.2868 | Medium | Medium |
| Colima | 3,500 | 492.1018 | Medium | Very low |
| Chiapas | 7,180 | 137.6030 | Very low | Very low |
| Chihuahua | 12,199 | 342.9986 | Low | Low |
| Ciudad de México | 45,965 | 515.3805 | Medium | Very high |
| Durango | 1,955 | 111.4116 | Very low | Very low |
| Guanajuato | 22,838 | 390.1479 | Low | High |
| Guerrero | 9,812 | 277.7046 | Very low | Low |
| Hidalgo | 8,284 | 289.8166 | Low | Low |
| Jalisco | 54,635 | 696.4459 | High | Very high |
| México | 38,372 | 237.0455 | Very low | Very high |
| Michoacán de Ocampo | 10,384 | 226.5038 | Very low | Low |
| Morelos | 13,636 | 716.2476 | High | Medium |
| Nayarit | 2,454 | 207.7812 | Very low | Very low |
| Nuevo León | 58,416 | 1141.0480 | Very high | Very high |
| Oaxaca | 13,162 | 331.7129 | Low | Medium |
| Puebla | 15,414 | 249.8670 | Very low | Medium |
| Querétaro | 13,881 | 680.9846 | High | Medium |
| Quintana Roo | 16,137 | 1074.6809 | Very high | Medium |
| San Luis Potosí | 8,606 | 316.6508 | Low | Low |
| Sinaloa | 10,433 | 351.7151 | Low | Low |
| Sonora | 23,229 | 814.9583 | Very high | Very high |
| Tabasco | 7,883 | 329.1067 | Low | Low |
| Tamaulipas | 12,097 | 351.4835 | Low | Low |
| Tlaxcala | 4,494 | 353.0668 | Low | Very low |
| Veracruz | 28,240 | 348.1046 | Low | Very high |
| Yucatán | 11,249 | 536.3882 | Medium | Low |
| Zacatecas | 3,799 | 240.5635 | Very low | Very low |
The rate per 100,000 inhabitants is classified into five ranges (see Table 8.8), as explained in the quartile classification section. Based on this classification, we observe 3 entities with a Very High rate grade, 3 with a High grade, 7 with a Medium grade, 11 with a Low grade, and 8 with a Very Low grade. This indicates that the higher the grade of the rate per 100,000 inhabitants, the higher the perception of violence reflected in the disseminated tweets. The total tweets grade is also classified into five ranges (see Table 8.8), as explained in the quartile classification section. Thus, we found 6 entities with a Very High grade, 1 with a High grade, 7 with a Medium grade, 10 with a Low grade, and 8 with a Very Low grade. This indicates that the higher the grade of total tweets, the greater the usage of this online social network to share tweets related to the subject of violence.
8.5.3 Metropolitan Zone Level Analysis
Analyzing the tweets captured at the level of Metropolitan Zones makes it possible to identify geographical areas that exceed municipal geopolitical limits, covering two or more municipalities. The definition of these zones considers characteristics like the urbanization index, having more than 250,000 inhabitants, the relation with neighboring cities, socioeconomic integration, and others (Consejo Nacional de Población 2018). Table 8.11 shows the statistical summary of the indicators obtained at the Metropolitan Zone (MZ) level of Mexico. At this level of analysis, we established the Twitter impact on each of the 74 MZ. Of these zones, 7 (Valle de México, Monterrey, Guadalajara, Hermosillo, Toluca, Querétaro, and Puebla-Tlaxcala) generated 50% of the total tweets captured, Valle de México being the most relevant with 15.24%, while 17 MZ published fewer than 1,000 tweets. We classified the grade of social perception of violence at the Metropolitan Zone level into five ranges (see Table 8.9). From this classification, we highlight the following results from Table 8.11: 29 MZ (39.19% of the total MZ) have a Very High grade of total tweets, which represents 86.91% of the total tweets captured; 9 of these 29 MZ (Guanajuato, Hermosillo, Oaxaca, Cancún, Xalapa, Monterrey, Cuernavaca, Guadalajara, and Pachuca) also have a Very High grade for the rate per 100 thousand inhabitants, which means these metropolitan zones publish a lot on the subject of violence and also feel insecure in their cities. We detected 4 MZ with a High grade of total tweets (Tlaxcala-Apizaco, Juárez, Zacatecas-Guadalupe, and Campeche); these group 13,911 tweets (2.99% of the total tweets), with an impact on 2,590,106 inhabitants. Finally, we detected the zones with a higher incidence of violence disseminated through Twitter; the indicator establishes specific geographic areas to analyze. We identified the following grades for the metropolitan zones: 24% have a Very High or High grade, 28% a Medium grade, and 47% a Low or Very Low grade.
Table 8.11 Intervals for the indicators created at metropolitan zones
| Metropolitan zone | Total tweets | Rate per 100 thousand inhabitants | Grade by the rate per 100 thousand inhabitants | Grade by total tweets |
| Aguascalientes | 7,182 | 687.9000 | Medium | Very high |
| Ensenada | 2,717 | 558.3200 | Medium | Medium |
| Mexicali | 4,329 | 437.9700 | Low | Very high |
| Tijuana | 10,435 | 566.9000 | Medium | Very high |
| La Paz | 1,583 | 580.4700 | Medium | Low |
| Campeche | 3,162 | 1,117.2200 | Very high | High |
| La Laguna | 7,279 | 542.3200 | Medium | Very high |
| Monclova-Frontera | 1,130 | 310.6500 | Low | Low |
| Piedras Negras | 1,474 | 758.6500 | Medium | Low |
| Saltillo | 6,070 | 657.1900 | Medium | Very high |
| Colima-Villa de Álvarez | 2,648 | 736.8000 | Medium | Medium |
| Tecomán | 146 | 95.5600 | Very low | Very low |
| Tapachula | 1,001 | 287.5100 | Low | Very low |
| Tuxtla Gutiérrez | 4,445 | 545.7800 | Medium | Very high |
| Chihuahua | 7,455 | 811.7900 | High | Very high |
| Delicias | 157 | 81.4300 | Very low | Very low |
| Hidalgo del Parral | 475 | 414.5000 | Low | Very low |
| Juárez | 3,580 | 257.3400 | Low | High |
| Valle de México | 70,699 | 338.3900 | Low | Very high |
| Durango | 1,098 | 167.6700 | Very low | Low |
| Celaya | 2,326 | 317.9000 | Low | Medium |
| Guanajuato | 5,191 | 2,817.5400 | Very high | Very high |
| León | 9,933 | 561.7600 | Medium | Very high |
| Moroleón-Uringato | 200 | 176.7800 | Very low | Very low |
| San Francisco del Rincón | 3 | 1.5100 | Very low | Very low |
| Acapulco | 6,913 | 779.3900 | High | Very high |
| Chilpancingo | 1,888 | 581.9600 | Medium | Low |
| Pachuca | 5,709 | 1,024.7800 | Very high | Very high |
| Tula | 557 | 247.3100 | Very low | Very low |
| Tulancingo | 465 | 181.1700 | Very low | Very low |
| Guadalajara | 50,633 | 1,035.9900 | Very high | Very high |
| Ocotlán | 131 | 74.3700 | Very low | Very low |
| Puerto Vallarta | 2,171 | 509.7600 | Medium | Medium |
| Tlanguistenco | 206 | 120.8500 | Very low | Very low |
| Toluca | 13,491 | 612.4200 | Medium | Very high |
| La Piedad-Pénjamo | 318 | 125.0600 | Very low | Very low |
| Morelia | 7,504 | 822.8400 | High | Very high |
| Zamora | 552 | 207.5600 | Very low | Very low |
| Cuautla | 1,405 | 295.5200 | Low | Low |
| Cuernavaca | 11,235 | 1,142.5100 | Very high | Very high |
| Tepic | 2,087 | 443.0800 | Low | Medium |
| Monterrey | 57,504 | 1,226.2000 | Very high | Very high |
| Oaxaca | 10,478 | 1,560.5100 | Very high | Very high |
| Tehuantepec | 745 | 413.9900 | Low | Very low |
| Puebla-Tlaxcala | 12,020 | 408.5700 | Low | Very high |
| Tehuacán | 1,141 | 331.1100 | Low | Low |
| Teziutlán | 145 | 110.0300 | Very low | Very low |
| Querétaro | 12,779 | 965.4400 | High | Very high |
| Cancún | 11,309 | 1,481.9400 | Very high | Very high |
| Chetumal | 1,245 | 555.6100 | Medium | Low |
| Rioverde | 134 | 96.0100 | Very low | Very low |
| San Luis Potosí | 7,708 | 664.5900 | Medium | Very high |
| Culiacán | 5,658 | 625.0100 | Medium | Very high |
| Mazatlán | 2,876 | 572.2800 | Medium | Medium |
| Guaymas | 1,175 | 548.4900 | Medium | Low |
| Hermosillo | 16,361 | 1,850.2200 | Very high | Very high |
| Nogales | 1,028 | 439.4100 | Low | Low |
| Villahermosa | 6,322 | 767.9700 | High | Very high |
| Ciudad Victoria | 1,483 | 428.5800 | Low | Low |
| Matamoros | 1,032 | 198.3200 | Very low | Low |
| Nuevo Laredo | 825 | 206.5400 | Very low | Very low |
| Reynosa | 2,919 | 377.5800 | Low | Medium |
| Tampico | 5,544 | 604.6800 | Medium | Very high |
| Tlaxcala-Apizaco | 3,937 | 728.7100 | Medium | High |
| Acayucan | 1 | 0.8300 | Very low | Very low |
| Coatzacoalcos | 2,317 | 634.7500 | Medium | Medium |
| Córdoba | 1,666 | 479.2200 | Low | Low |
| Minatitlán | 440 | 118.1600 | Very low | Very low |
| Orizaba | 1,854 | 405.5500 | Low | Low |
| Poza Rica | 1,037 | 192.6800 | Very low | Low |
| Veracruz | 8,065 | 881.2200 | High | Very high |
| Xalapa | 10,360 | 1,348.4800 | Very high | Very high |
| Mérida | 10,531 | 921.3100 | High | Very high |
| Zacatecas-Guadalupe | 3,232 | 860.4300 | High | High |
8.6 Conclusions
This chapter analyzes a dataset collected from Twitter between May 14 and June 27, 2019. The data crawling captured 20,736,887 tweets; after the data cleaning phase, the dataset was reduced to 507,023 tweets geolocated at the state and municipal level in Mexico. With this dataset, we evaluate violence in Mexico by measuring its social perception at the level of federal entities and metropolitan zones, and we present a hashtag analysis illustrated by a social graph and a word cloud. The use of Big Data, Twitter analytics, web mining, and the Semantic Web validates the proposed methodology for measuring the perception of violence in Mexico through tweet publications. The acquired datasets can be useful for other studies, such as word correlation, the influence of applied public policies, and sentiment analysis, to name a few. The indicator can also reveal socio-territorial inequalities, considering that data can only be collected from zones with an Internet connection. With this approach, we can infer whether acts of violence limit or motivate public and private investment in specific territorial spaces. Hence the importance of analyzing metropolitan zones: only the largest cities in Mexico were incorporated into the study, leaving aside populations that lack the technological infrastructure. For future research, we propose working in a multidisciplinary way to contrast the indicators of this study with other social and economic indicators, with the purpose of understanding the social environment in order to explain and justify the exceptionally high or low degree of violence perception obtained by a state or municipality.
Acknowledgements The authors are grateful to the Universidad Veracruzana as well as Tecnológico Nacional de México/I.T. Orizaba for supporting this work. This research chapter was also sponsored by Mexico's National Council of Science and Technology (CONACYT) and the Secretariat of Public Education (SEP) through the PRODEP program.
Chapter 9
Technology Transfer from a Tacit Knowledge Conservation Model into Explicit Knowledge in the Field of Data Envelopment Analysis

Diana María Montoya-Quintero, Olga Lucia Larrea-Serna, and Jovani Alberto Jiménez-Builes

Abstract This work presents a model for preserving production engineers' tacit knowledge of Data Envelopment Analysis (DEA). Their expertise was explicitly coded into a computer system, and the model was developed by applying techniques and procedures from the fields of engineering and knowledge management. Technology transfer makes it possible to solve the problem of selecting criteria and interpreting results with DEA techniques when the efficiency of similar organizations is compared using an efficient frontier derived from the non-parametric approximations of such techniques; misunderstanding the techniques leads to misinterpretations of DEA results. The model was created by applying Knowledge Engineering, which makes it possible to preserve and extend specific experiences and expertise over time by means of computer solutions. The model had an efficient and positive impact on strategic self-learning processes for the community interested in production engineering, knowledge transfer, and management.

Keywords Data envelopment analysis · Knowledge extraction · Knowledge management · Industry 4.0 · Artificial intelligence
D. M. Montoya-Quintero · O. L. Larrea-Serna
Departamento de Calidad y Producción, Facultad de Ciencias Económicas y Administrativas, Instituto Tecnológico Metropolitano, Campus Robledo, Medellín, Antioquia, Colombia
e-mail: [email protected]
O. L. Larrea-Serna
e-mail: [email protected]
J. A. Jiménez-Builes (B)
Departamento de Ciencias de la Computación y de la Decisión, Facultad de Minas, Universidad Nacional de Colombia, Medellín, Antioquia, Colombia
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
J. A. Zapata-Cortes et al. (eds.), New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Studies in Computational Intelligence 966, https://doi.org/10.1007/978-3-030-71115-3_9
9.1 Introduction

This chapter presents the results of studies carried out on students of industrial, production, and quality engineering. Those results reveal a lack of learning and knowledge among students regarding concepts and techniques related to Data Envelopment Analysis (DEA). Therefore, this chapter reports a technology transfer that turns a theoretical knowledge management (KM) proposal, a tacit knowledge conservation model, into explicit knowledge of the DEA methodology.

A model is designed based on the identification of tacit knowledge elements, which Nonaka and Takeuchi (1996) describe as informal, personal, or social knowledge that is difficult to express systematically, little visible, and difficult to share through traditional means, possessed by the actors of the context in which human activity is carried out, even inside organizations. These elements are turned into explicit knowledge in a tangible way by means of dialog and the use of metaphors, analogies, or existing models. "The explicit is an essential activity in the creation of knowledge, and it is more frequently present during the phase of creation of new products" (Davies et al. 2003).

By means of the model shown in Fig. 9.1, the essential knowledge of the topics used in DEA is obtained, and the computational system SEUMOD (a Spanish acronym for a help-and-tutorial system for finding the efficiency of measurement units inside an organization) is generated as an intangible product (Giraldo-Jaramillo and Montoya-Quintero 2015). SEUMOD implements several mathematical models to ease the analysis and optimize the resources of measurement units in this field. DEA comprises a nonparametric procedure that uses linear programming to evaluate the relative efficiency of a set of homogeneous productive units in a field of interest.
Fig. 9.1 Layer model for the conservation of tacit knowledge for DEA methodology. Source The authors
DEA would allow the professionals of the field to provide their services and reach organizational goals with fewer resources (reducing resources to the minimum). This research shows the importance of DEA for creating more competent engineering profiles; that is why it was chosen, among the methods for treating information and data in this field, based on the tacit knowledge it involves. The aim is to provide a solution to the obstacles to its learning, as well as tools that help optimize processes in data envelopment analysis (finding efficiency). The strengths and competitive advantages that its instruction can generate are also discussed.

A model is created from the learning and knowledge needs of this field and designed based on KM, obtaining a body of know-how that allows the knowledge of this field to be transformed into a computational tool, called SEUMOD (Giraldo-Jaramillo and Montoya-Quintero 2015). Tacit and explicit factors of the SEUMOD system are also presented, and its scope is described in terms of its support for learning and for corporate capital and knowledge. Using knowledge management as a method in this research also yielded a set of lessons learned, which includes DEA techniques and tools; these findings come from the literature and from professionals who bring this knowledge manually to organizations. A further contribution of this research is the technological coding of human knowledge, preserved in the SEUMOD tool, which can instruct engineering students and professionals in a practical way and with examples. An additional goal is to find the efficiency of measurement units relative to other units.
9.2 Literature Review

The knowledge owned in each area of activity plays an important role in human beings' daily activities. Nowadays, in a globalized and interconnected world, the knowledge society has acquired a relevance never seen before from the scientific and technological points of view, and it has become an essential factor for the development of societies in their social, economic, and personal dimensions. In particular, knowledge has become the resource leading to higher levels of added value in the production of goods and services, the sustainable insertion of growing economies in a global world, and the improvement of people's living conditions (Dayan et al. 2017).

Knowledge is a human and dynamic process oriented towards a goal, with intention and perception (Nonaka and Takeuchi 1996). It is specific and responds to the context where it is generated; it is individual rather than collective, and it is associated with individual experience and with the ability and skills to act. Moreover, knowledge can be tacit, dynamic, delimited, and transferable (Depres and Chauvel 2000; Beckman 1997; Sveiby 1997; Ordóñez de Pablos 2001; Holsapple and Joshi 2002). Knowledge is susceptible to being lost if its great value to society is not taken into account, which is why it is so important to preserve it (Biron and Hanuka 2015). It can be managed and processed by means of knowledge management.
Knowledge conservation is a part of knowledge management. It is important because conservation protects and manages the value of knowledge; it guides actions to prevent and remove damage when information is gathered (Al-Emran et al. 2018). Venkitachalam and Willmott (2017) affirm that when knowledge is lost, it is very expensive to create it again or obtain it from other sources, hence the importance of preserving it in technological media. Ogiela (2015) defines conservation of knowledge as the selection and storage of knowledge to be effectively reused. Wang et al. (2018) define it as "a maintenance process of an organizational system of knowledge and skills that preserves and collects perceptions, actions and experiences over time, and guarantees the possibility of recovery for the future". In other words, knowledge conservation captures, understands, stores, recovers, and protects tacit and explicit knowledge, and it guarantees accessibility through technological developments so that knowledge can be used in the future.

Cerchione and Esposito (2017) consider knowledge management relevant and necessary, an essential part of organizations' ability to survive and stay competitive. Thus, managers and executives must keep it in mind as a prerequisite for high productivity and flexibility in both the public and private sectors. When preserved, knowledge generates strategies for decision making. Jayawickrama (2014) worked on the implementation of Enterprise Resource Planning (ERP) from a knowledge management perspective. An ERP implementation needs a wide range of different kinds of knowledge, so the right amount of knowledge among individuals during the implementation phase is very important. Knowledge transfer allows implementations to draw on empirical findings taken from people's experiences and takes into account the strategic decisions to be made during the implementation. In that research, elements of ERP-specific knowledge as well as company knowledge are classified, thus allowing transfer between those performing the implementation and the users. Furthermore, its key findings inform industry professionals on how, why, and with which types of knowledge transfer should be promoted during the execution of projects.

Xu et al. (2016) consider that engineers in the quality, industrial, and productivity domains must know the efficiency and productivity measurement techniques used in different scenarios, such as data envelopment analysis (DEA). This method allows such professionals to deal with complex structures and recover the information needed to solve organizational problems; furthermore, it supports convenient decisions to improve productivity in the organization concerned. DEA is considered a linear programming technique in the field of operations research (Kao 2016). DEA is a practice for measuring efficiency based on obtaining an efficiency frontier from a set of observations, without having to assume any functional form between "inputs" (supplies) and "outputs" (products). It is, in short, an alternative to parametric methods for extracting information from a set of observations (Mardani et al. 2017). DEA tries to optimize the efficiency measure of each analyzed unit to create an efficient frontier based on the Pareto principle. One of the most widespread efficiency ideas is the Pareto principle, according to which a resource allocation A is preferred over a resource allocation B if and only if, with the first, at least one individual improves and nobody worsens (Cook and Seiford 2009).
Kuah et al. (2012) aim to design a Knowledge Management (KM) measurement model for stochastic Data Envelopment Analysis based on Monte Carlo simulation and a genetic algorithm. The proposed model evaluates KM by using a set of measures correlated with the main KM processes. Additional data are generated and analyzed using a DEA model to obtain the overall KM efficiency, and the model is applied to evaluate KM performance in higher education institutions. Compared with a conventional deterministic DEA model, the results of the proposed model help managers determine future strategies to improve their knowledge management. Chen et al. (2009) used Data Envelopment Analysis to examine the performance of electricity distribution in Taiwan in 2004; their study also explores the relationship between a knowledge management (KM) system and the variations in efficiency at the Taiwan Power Company from 2000 to 2004. The findings show good performance in terms of overall efficiency in 2004, and 75% of the districts showed growing performance when a cross-period analysis was applied to that span. The authors demonstrated a positive relation between KM and the variations in the organization's efficiency. Kuah and Wong (2013) propose a new DEA-based measurement model of Knowledge Management (KM) performance. They measure KM using 17 measurements correlated with the main KM processes. Data from 19 higher education institutions in Malaysia were collected by means of a survey and analyzed with the proposed model; for each institution, the model provides a single KM performance score together with the performance scores of the main KM processes.

Professional knowledge of DEA techniques allows firms to measure organizational efficiency and productivity, which is very important from different perspectives, economic theory and economic policy among them (Mardani et al. 2017). If theoretical constructs such as the relative efficiency of economic systems remain subjective and are not evaluated, there will be no solid arguments on which to base decisions. That is why it is important to know the appropriate techniques for this type of measurement, techniques that allow an approximation to reality and a better use of the variables. Data Envelopment Analysis brings together different techniques that become the models which give origin to the DEA methodology (Zerafat Angiz et al. 2015). DEA models measure the ability to obtain results by means of a desired relation between the input and output variables; in other words, they seek to obtain maximum productivity with the optimal management of the input variables used (Adler et al. 2002). The models that yield the highest benefits are presented here.
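As a point of reference for the models in the next subsections, the ratio that DEA maximizes for the unit under evaluation (DMU o) can be written compactly. The following is only a sketch using the same symbols that appear later in Table 9.1 (μ_r and δ_i are the output and input weights, y_rj and x_ij the observed data); it is not a formulation taken verbatim from the chapter.

```latex
% Standard CCR ratio form (sketch; notation follows Table 9.1)
\[
  \max_{\mu,\;\delta}\; h_o \;=\;
  \frac{\sum_{r=1}^{s}\mu_r\,y_{ro}}{\sum_{i=1}^{m}\delta_i\,x_{io}}
  \quad\text{s.t.}\quad
  \frac{\sum_{r=1}^{s}\mu_r\,y_{rj}}{\sum_{i=1}^{m}\delta_i\,x_{ij}}\;\le\;1,
  \;\; j=1,\dots,n; \qquad \mu_r,\ \delta_i\ \ge\ \varepsilon .
\]
```

Fixing the denominator (or, in the output orientation, the numerator) to 1, as the normalization restrictions in Table 9.1 do, turns this fractional program into the linear multiplicative forms discussed below.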
9.2.1 CCR-I Model—Multiplicative Form

Given some inputs and some outputs, this model searches for the proportional reduction in the inputs without altering the output variables. It seeks to show how much supply (inputs) has been used to obtain a given production (outputs). The model is identified by the following characteristics: the objective function seeks to maximize the weighted outputs; the first restriction keeps the inputs constant while varying the linear combination of the outputs; the second restriction indicates that the sum of the weighted outputs cannot be higher than the sum of the weighted inputs; the weights of the inputs and outputs are values to be calculated by means of the conditioned maximization problem; the weights obtained are assigned to each input and output, provide the highest possible efficiency index to each DMU, and meet the restriction that, when the weight combination is applied to the rest of the units, it generates an efficiency index between zero and one, with weights equal to or greater than zero; and a Decision Making Unit (DMU) will give higher weight to the inputs it uses the least and to the outputs it produces in the greatest amount, since the model tries to obtain the efficiency value that most favors the DMU being evaluated.

Envelopment form: in linear programming, every linear program has an associated linear program called the dual, which can be used to determine the solution of the primal problem. These two problems are obtained from the same matrix of technological coefficients, the same cost vector, and the same resource vector; nevertheless, one of the problems is a maximization and the other a minimization. If one of the problems has a solution, so does its dual, and the optimal values of the objective functions are the same. In most DEA applications, the model used in the efficiency evaluation is the dual form (Coelli et al. 2005). The reason is clear: the primal linear program of the input-oriented DEA-CCR model is defined by a number of restrictions equal to n + 1, whereas the dual linear program of the input-oriented DEA-CCR model is subject to s + m restrictions, and the number of DMUs is usually higher than the total number of inputs and outputs. Some reasons to prefer the dual model over the primal one are the following: technical efficiency can be calculated directly, in the sense that it measures the maximal equiproportional reduction in the input vector compatible with the observed level of outputs; and the interpretation of the dual problem is more direct than that of the primal problem, since its solutions are characterized as inputs and outputs that correspond to the original data taken directly from the source.
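To make the envelopment (dual) form concrete, the following is a minimal sketch in Python of the radial input-oriented CCR model solved with SciPy's linear programming routine. It is not the chapter's SEUMOD implementation (which is written in PHP): the function and variable names (ccr_input_efficiency, X, Y) are ours, and the non-Archimedean ε and the slack-maximizing terms of Eqs. (9.5)-(9.7) are omitted, so only the radial efficiency θ* is computed.

```python
# Minimal sketch (not the chapter's SEUMOD tool): the radial input-oriented CCR
# envelopment model as a linear program. Decision vector: [theta, lambda_1..lambda_n].
import numpy as np
from scipy.optimize import linprog

def ccr_input_efficiency(X, Y, o):
    """Radial CCR-I efficiency of DMU `o`; X is (m inputs x n DMUs), Y is (s outputs x n DMUs)."""
    m, n = X.shape
    s = Y.shape[0]
    c = np.zeros(n + 1)
    c[0] = 1.0                                   # minimize theta
    A_in = np.hstack([-X[:, [o]], X])            # sum_j lambda_j x_ij - theta x_io <= 0
    A_out = np.hstack([np.zeros((s, 1)), -Y])    # -sum_j lambda_j y_rj <= -y_ro (outputs at least y_ro)
    res = linprog(c,
                  A_ub=np.vstack([A_in, A_out]),
                  b_ub=np.hstack([np.zeros(m), -Y[:, o]]),
                  bounds=[(0, None)] * (n + 1),
                  method="highs")
    return res.x[0]                              # theta* in (0, 1]; 1 means technically efficient

# Toy data: 2 inputs, 1 output, 4 DMUs (one DMU per column)
X = np.array([[2.0, 4.0, 4.0, 6.0],
              [4.0, 2.0, 6.0, 3.0]])
Y = np.array([[1.0, 1.0, 1.0, 1.0]])
print([round(ccr_input_efficiency(X, Y, o), 3) for o in range(4)])   # -> [1.0, 1.0, 0.6, 0.667]
```

For these data the two frontier units obtain θ* = 1 and the dominated ones values below 1; by linear programming duality, the multiplicative (primal) form discussed above returns the same efficiency scores.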
9.2.2 CCR-O Model—Multiplicative Form

Given some inputs and some outputs, this model seeks the proportional increase in the outputs without altering the input variables.
It seeks to show the production (outputs) that has been obtained from a given amount of supplies (inputs). The model is identified because the objective function seeks to minimize the weighted inputs. The first restriction keeps the outputs constant while varying the linear combination of the inputs. The second restriction indicates that the sum of the weighted inputs cannot be higher than the sum of the weighted outputs; the weights of the outputs and the inputs are values to be calculated by means of the conditioned minimization problem. The weights obtained are assigned to each input and output, provide the highest possible efficiency index to each DMU, and follow the restriction according to which, when the weight combination is applied to the rest of the units, it generates an efficiency index between zero and one; the weights must be greater than or equal to zero.

Envelopment form: when the input-oriented problem is compared to the equivalent output-oriented problem, the difference is that in the first the objective is to determine the maximum radial reduction that can be applied to the inputs of the analyzed DMU, whereas in the second the objective is to maximize the proportional increase of the outputs that could be reached by the evaluated DMU, given its levels of input.
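The output-oriented envelopment program described in the last paragraph can be sketched as follows; this is only the usual inequality equivalent of the equality-plus-slack form of Eqs. (9.12)-(9.14), with the slack variables and the ε terms omitted for brevity.

```latex
% Output-oriented CCR envelopment model (sketch, slacks omitted)
\[
  \max\; \varphi
  \quad\text{s.t.}\quad
  \sum_{j=1}^{n}\lambda_j x_{ij} \;\le\; x_{io}, \;\; i=1,\dots,m; \qquad
  \sum_{j=1}^{n}\lambda_j y_{rj} \;\ge\; \varphi\, y_{ro}, \;\; r=1,\dots,s; \qquad
  \lambda_j \ge 0 .
\]
```

Here φ* ≥ 1, and 1/φ* plays the role of the output technical efficiency, as the BCC-O section below also notes.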
9.2.3 BCC Model

This performance typology arose with the creation of the BCC model, in which Banker, Charnes and Cooper assumed that the DMUs could show production increases both lower and higher than the proportional increase of their inputs. The theoretical definitions presented in the section on constant-returns models, the DEA-CCR models, still apply here, since the DEA-CCR models were the basis for the DEA-BCC model; that is, DEA-BCC is an extension of DEA-CCR.
9.2.4 BCC-I Model—Multiplicative Form

Envelopment form: it indicates to what extent the input levels of the reference DMU0 can be reduced, given the level of outputs. This model contains an additional restriction, the sum of the λj set equal to one, which forces the reference DMU0 to be projected onto the more productive units, with the aim of reaching an efficiency equal to one. Finally, the reference DMU0 is rated as efficient if it achieves the optimal solution θ = 1 and the slack variables are null. The model designed for these characteristics is the DEA-BCC Output model and, as previously mentioned, the change in orientation amounts to switching the quotient between the virtual output and the virtual input.
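A compact way to state the envelopment form just described is sketched below; the slack variables and the ε weight used in the full model (as in the other envelopment forms of Table 9.1) are omitted.

```latex
% BCC-I envelopment form (sketch; the convexity constraint distinguishes it from CCR-I)
\[
  \min\; \theta
  \quad\text{s.t.}\quad
  \sum_{j=1}^{n}\lambda_j x_{ij} \le \theta\, x_{io},\;\; i=1,\dots,m; \quad
  \sum_{j=1}^{n}\lambda_j y_{rj} \ge y_{ro},\;\; r=1,\dots,s; \quad
  \sum_{j=1}^{n}\lambda_j = 1,\;\; \lambda_j \ge 0 .
\]
```

Dropping the convexity constraint on the λj recovers the CCR-I envelopment model of Sect. 9.2.1.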
9.2.5 BCC-O Model—Multiplicative Form

In the linearization of the model for this case, the pure technical efficiency of the reference DMU0 is given by 1/w0, so the DMU under evaluation is efficient if w0 = 1. Additionally, as explained for the fractional input-oriented BCC model, the sign taken by the k term (positive, negative or null) in the optimal solution indicates the type of scale performance that predominates for the evaluated DMU. Nevertheless, because the two models have different orientations and the sign is flipped, we have:
• k0 > 0 for all the optimal solutions: decreasing scale performances prevail.
• k0 = 0 for some optimal solution: constant scale performances prevail.
• k0 < 0 for all the optimal solutions: increasing scale performances prevail.

Envelopment form: this DEA-BCC envelopment model measures the output technical efficiency by means of the fraction 1/ϕ* and indicates to what extent the output levels of the reference DMU0 can be increased, given the input levels. The reference DMU0 is rated as efficient if ϕ* = 1 and the slacks are all null.

Scale performance indicates the production increases that result from increasing all production factors by the same percentage. It can be:
• Constant: the percentage increase of the output equals the percentage increase of the productive resources (inputs).
• Increasing: the percentage increase of the output is higher than the percentage increase of the inputs.
• Decreasing: the percentage increase of the output is lower than the percentage increase of the inputs.
9.2.6 Non-discretionary Models

As can be seen in the formulation, the variables θ and ϕ do not affect non-discretionary inputs. Also, the only slacks that are maximized or minimized in the objective function are those belonging to the discretionary sets. That means that no radial or rectangular projection is applied to the resources or products that cannot be modified. The classic DEA models do not consider variables that cannot be controlled or modified because of fixed production factors or external factors. There are uncontrollable factors that affect efficiency but do not belong to the production process; these are commonly called environmental variables, and they are not explicitly included in DEA models. To deal with these variables, the DEA model called multistage is usually required. Thus, a productive unit is efficient when, besides meeting the requirements previously mentioned, it faces equal or worse uncontrollable factors.
The NDCCR-I, NDCCR-O, NDBCC-I and NDBCC-O models in the envelopment form have the same basis as the models that precede them; the only difference is that, in NDCCR-I, NDCCR-O, NDBCC-I and NDBCC-O, the distinction between discretionary and non-discretionary variables is taken into account.
9.3 Results and Discussion

A model of five layers was proposed, each of them being the result of the elements identified in knowledge management for its design. From this model, tacit knowledge is obtained and then transformed into explicit knowledge. Layer programming was the technique used to design the model, a software development approach aimed mainly at separating (decoupling) the parts that form the system; it allowed structuring the elements with which the knowledge contained in the DEA methodology was obtained, according to the elements that form each layer. Each element is put into practice and analyzed to obtain knowledge through a step-by-step process. The technology transfer process to conserve the DEA tacit knowledge starts in Layer 1 and finishes in Layer 5. To that end, experiences, thoughts, cases, skills, and other elements of the essential knowledge of Data Envelopment Analysis (DEA) were taken into account and worked through the proposed model. The layers of the model are defined in Fig. 9.1 and are the following:
• Layer 1. Human knowledge.
• Layer 2. Procedural knowledge.
• Layer 3. Declarative knowledge.
• Layer 4. Cognitive knowledge.
• Layer 5. Transformation of knowledge into processing.
The layer model for the preservation of tacit knowledge in the field of DEA is in charge of managing the knowledge chain, identifying and prioritizing the crucial elements within the global chain that support decision making with the knowledge this methodology offers. To obtain its tacit knowledge, the first four layers are applied through the use of each element composing the model; as a result, a network of analytical processes for collecting the information is obtained. In Layer 5, the technology transfer of the mathematical models that find efficiency in the DEA methodology, the priority knowledge of this field, is carried out. If knowledge is not properly managed and shared, it easily erodes, in particular the tacit knowledge that lies in people's minds and is gathered there; thus it is suggested that it must be conserved and transformed into explicit knowledge. Among the knowledge management processes, exchange has been identified as the most essential, and knowledge management is identified as a key element for generating value for organizations in the competitiveness chain with technological tools (Charband and Jafari Navimipour 2016).
Fig. 9.2 Layer of human knowledge. Source The authors
9.3.1 Layer Model for the Conservation of Tacit Knowledge in the DEA

9.3.1.1 Layer 1: Human Knowledge
In this layer, knowledge is seen as the representative relationship between human experts in a particular field and the knowledge they have acquired within that field. The expert is the person capable of observing internal experiences or external phenomena in a field of knowledge or expertise. Knowledge emerges from human rationality, defined as the ability to obtain knowledge in a concrete, abstract, and organized way and then apply it suitably to solve theoretical and practical problems. It is in this layer that the knowledge to be transformed is acquired and then transferred (Fig. 9.2).
9.3.1.2 Layer 2: Procedural Knowledge
According to the elements used in Layer 1, such as action, activity and fact, an analysis of referents was carried out with individuals inside and outside an organization on how they apply each element to the knowledge contained in the DEA field; the survey technique was used for this purpose. Based on the findings, hypotheses are generated until the specification of the related solutions is reached, through objective knowledge that provides solutions to the elements' processes. In this way, skills were observed that lead to know-how success with the treated knowledge. For each layer, a context was used to obtain information related to the elements it needs; in each context, questions were asked to obtain answers and understand specific knowledge, some of which are: Why do the curricula of future industrial, production and quality engineers not include DEA as a subject? What happens with the application of DEA inside the organizations of the population of interest? Why is this field not applied even though several professionals are trained in it? All these questions lead to framing the
problem and to identifying the specifications of the situation to be faced. Therefore, essential elements were characterized which, when included, provided solution strategies to establish the requirements within the logical frame of the problem. A target population of 1,969 individuals, including students and professionals of the aforementioned engineering programs, was interviewed to validate their knowledge of DEA techniques; 1,676 valid answers were obtained from people who answered the questions asked for each layer, as follows.

Are you familiar with Data Envelopment Analysis and its definition? It was observed that 42% of people do not know the tools related to DEA, 32% are familiar with them, and 26% did not answer, which shows that students and professionals had little knowledge of DEA tools.

Do you think it is important to measure organizations' efficiency and productivity? Respondents expressed how important it is for a company to measure its efficiency, since it is a relevant indicator for decision making, as presented in the research by Giraldo-Jaramillo and Montoya-Quintero (2015): "the growth of a company is closely related to the knowledge that one has of it. Measuring processes is one of the best solutions to have constant control of what happens in one's company and to improve what is not doing well".

There are parametric and non-parametric instruments and tools to measure productivity. Which of the following instruments and tools do you identify? (Parametric, non-parametric, both, other tools, none of the above, or no answer.) Here, the instruments and indicators used to measure productivity are evaluated. It was observed that 47% of the respondents are not familiar with the parametric and non-parametric techniques to measure productivity, which is due to the fact that empirical processes are used inside the organizations.

Would you include tools to measure productivity in the curricula of industrial and production engineering academic programs? Respondents gave answers about the importance of including tools to measure productivity in the academic curricula; 89% answered affirmatively. Processes are currently measured in an empirical way and there are few tools in the academic field, as presented in the research by Haldar et al. (2016). This is because the curricula used are in some cases traditional and lack the new measurement techniques; it was also found that the practices used by professors are decontextualized from the institutional discourses.

In Layer 1, the main actions identified in the Data Envelopment Analysis (DEA) techniques are the Decision Making Units of analysis (DMU), oriented towards any unit that can be evaluated in terms of its ability to transform supplies (inputs) into products (outputs) (Cooper 2004). Finding the efficiency of the DMUs, to define which are efficient, must be one of the activities of the procedure given by DEA; an action is then proposed to increase the efficiency of the inefficient DMUs. It is also necessary to consider the analytic approach of DEA to generate good results in the data analysis. The chosen elements of this layer were very useful to continue with Layer 2. Figure 9.3 shows a standardization of the elements of Layer 1 into icons.
Fig. 9.3 Layer of elements to find DEA processes. Source The authors
Fig. 9.4 Layer of declarative knowledge elements. Source The authors
9.3.1.3 Layer 3: Declarative Knowledge
The elements used in this layer were the experience, cases, and reasoning of each individual regarding DEA. These elements capture specifications characterized by the experiences lived and the cases solved with this topic. In this layer, the logical and expressive knowledge obtained from the expert was configured, guided by the requirement analysis, to create a computing algorithm composed of linear programming models (DEA). Then, inference rules are created in a decision tree, which gives rise to step-by-step processes, activities, and tasks leading to the conclusions drawn by those models' techniques. The systematization of the model is done from the elements of declarative knowledge (Fig. 9.4). When some of the experiences obtained by DEA practitioners are analyzed, better declarative knowledge is reached, since the processes were observed from different perspectives aimed at finding the most frequent causes, consequences, and results to feed the system. This allows the conservation of this knowledge via lessons learned through the experiences, lived cases, and reasoning of the DEA professionals.
9.3.1.4 Layer 4: Cognitive Reasoning
This layer makes it possible to know the logic, functioning, and operability of the results of DEA theory and practice, with the help given through its language and logical thinking. Figure 9.5 shows the two elements used in this layer, language and logical thinking, which served as inspiration to create new solutions based on pre-existing ones and to reach the conclusions provided through the SEUMOD system. This
Fig. 9.5 Layer of cognitive knowledge. Source The authors
is related to the previous layers and is transformed into a language that spreads logical thinking among actors and users; such logical thinking was obtained to respond to the operability and functioning of different actions. Technical terms belonging to DEA appear again, such as the mathematical models applied to optimize and reach efficiency, together with the logic used to generate measurable and valid results. To preserve the technical knowledge of each model, the practices of Cerchione and Esposito (2017) were applied in this layer. In their theory of knowledge management, they describe how tacit knowledge conservation models refer to knowledge that is part of the mental model and the result of personal experience, involving intangible factors such as beliefs, principles, opinions, and intuitions, among others. Such knowledge can therefore become explicit knowledge (a transformation and transfer helped by technology in most cases), because tacit knowledge cannot be structured, stored, or distributed without being transformed.
9.3.1.5 Layer 5: Transformation of Knowledge into Processing
This layer uses the procedural, declarative, and cognitive knowledge of DEA practitioners. It carries out the transformation of procedures, rules, documentation, and associated data to form the operational part of the SEUMOD computing system. Layer 5, with its elements (inference rules, knowledge management, and technology transfer), resulted in a decision tree that generates a route for deciding on the most relevant techniques for DEA (Zerafat Angiz et al. 2015). The branches of the tree are formed by each model; each of them follows a hierarchical order guided by certain characteristics that allow decisions to be made about the option with the greatest analysis opportunity, according to the parameters needed for the database to be treated (Fig. 9.6).
Fig. 9.6 Layer of elements of knowledge transformation into processing. Source The authors
9.3.2 SEUMOD Model

The models in the tree are the basis of many of the emerging models in data envelopment analysis. The selection criteria for those models are based on current data that are not analyzed with efficient techniques or tools. These models also allow evaluating productivity and measuring efficiency, thus ensuring a proper resource distribution to reach results that are beneficial for the organizations (Galagedera and Silvapulle 2003). They also measure the ability to obtain results through a suitable relation between the input and output variables; in other words, they seek to reach maximum productivity with optimal control of the input variables (Adler et al. 2002).

The main root of the tree is composed of the characteristics (strategies) that determine the selection of the route to be taken by the final user; this route starts with one of the two models. No matter which option is selected, the agent always asks whether the variables of the problem are controllable; variables that are not under control, because they depend on the environment, are routed in a different direction. From that question, the exploration of the tree presents choices on the maximization of the outputs or the minimization of the inputs. The SEUMOD system also generates an explanation after analyzing the result yielded by the selected model, based on the calculation of the relative efficiency of the compared units and their orientation towards the input and output variables. Finally, the transfer system intervenes and generates a conclusion. The characteristics guarantee the proper selection of the model to be worked on and, mainly, they guide decision makers about the route to choose when deciding which model is the most suitable to use. The determined characteristics are the following (a sketch of how such rules could be encoded is given after the list):
• In the organization, the percentage increase of the output is equal to the percentage increase experienced by the inputs: CCR model.
• If two inputs (outputs) reach an amount of output (input), then any other linear combination of them can also do the same: BCC model.
• Depending on the availability of inputs and outputs, each organization can produce fewer (equal) outputs with the same (higher) level of resources: BCC model.
• Inputs vary from one period to another: BCC model.
• DMUs have different dimensions from the efficient DMU because they cannot reach the same efficiency: BCC model.
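The following is a hypothetical sketch, in Python rather than the PHP used by SEUMOD, of how the decision-tree logic just described could be encoded. The function name select_dea_model and the answer keys are ours, not taken from the system, and the rules shown cover only the controllability, returns-to-scale, and orientation questions mentioned above.

```python
# Hypothetical encoding of the model-selection rules listed above (not SEUMOD code).
def select_dea_model(answers):
    """answers: dict of yes/no flags for the characteristics discussed in the text."""
    if not answers.get("controllable_variables", True):
        family = "ND"          # non-discretionary variants (NDCCR / NDBCC)
    else:
        family = ""
    if answers.get("output_grows_proportionally_to_inputs", False):
        base = "CCR"           # constant returns to scale
    else:
        base = "BCC"           # variable returns to scale
    orientation = "I" if answers.get("minimize_inputs", True) else "O"
    return f"{family}{base}-{orientation}"

print(select_dea_model({"output_grows_proportionally_to_inputs": True,
                        "minimize_inputs": True}))          # CCR-I
print(select_dea_model({"controllable_variables": False,
                        "minimize_inputs": False}))         # NDBCC-O
```

In the chapter's tool, the analogous traversal ends with the selected model from Table 9.1 being solved and the resulting relative efficiencies explained to the user.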
The models codified in SEUMOD are shown in Table 9.1. Among its main functions, the system handles the entry of the input and output variables and the identification of the model to be run, until the relative efficiency of the organizational units is obtained. The models codified in the system measure the ability to obtain results by means of a desired relation between the input and output variables; that is, they obtain the maximum productivity with the optimal management of the input variables used. The tool was designed in the PHP Hypertext Preprocessor programming language, specifically for dynamic websites, with the scripts currently running on a test server. It has HTML embedded and relies on Linux servers (Fig. 9.7).
9.4 Conclusions and Future Work

Knowledge can be obtained from real conversations, expressions, and experiences of an individual, in a factual or explicit way, in the context under inquiry (which is necessary to reach a transition between knowledge and the creation of a computing tool). It is possible to identify the links among expressions, mood changes, and knowledge in order to associate them with arguments, activities, facts, experiences, and reasoning, allowing a subsequent analysis that produces a conscious explanation of that knowledge so it can be preserved in different digital or technological media.

It should be pointed out that the experience gained when assessing the need to incorporate DEA knowledge in engineering students makes their contribution to the doing processes visible in a cognitive way; this is accomplished through the proposal of a layer model whose elements must be studied step by step to find the essence of a given objective. Each layer of the model allowed the selected elements to create documents, relations, practices, and memories that play a relevant role in the generation of new knowledge. This knowledge connects with experiences that can explain, protect, and define information in every transfer, showing a real process of knowledge management. Knowledge management conducted in the DEA field, as a logical, organized, and systematic process, allows knowledge to be transferred and applied in specific situations through a harmonious combination of knowledge, experiences, values, contextual information, and expert assessments, which provided a framework for its evaluation and for the incorporation of foundations into the curricular profiles of engineering students.

Future research must explore several areas of knowledge and try out the possibilities of the model presented here for the extraction, transfer, and conservation of knowledge in the areas of interest.
Table 9.1 Multiplicative and envelopment forms of the models

CCR-I
Multiplicative form. Objective function: $\text{Maximize } Z = \sum_{r=1}^{s}\mu_r y_{ro}$ (9.1). Normalization restriction: $\sum_{i=1}^{m}\delta_i x_{io} = 1$ (9.2); the denominator is kept constant, assuming a value of 1. Restriction: $\sum_{r=1}^{s}\mu_r y_{rj} - \sum_{i=1}^{m}\delta_i x_{ij} \le 0,\ j=1,\dots,n$ (9.3). Decision variables: $\mu_r \ge \varepsilon;\ \delta_i \ge \varepsilon;\ \varepsilon \cong 0$ (9.4).
Envelopment form. Objective function: $\text{Minimize } Z = \theta - \varepsilon\left(\sum s^{+} + \sum s^{-}\right)$ (9.5). Restrictions: $\lambda Y = y_o + s^{+};\ \lambda X = \theta x_o - s^{-}$ (9.6). Decision variables: $\lambda, s^{+}, s^{-} \ge \varepsilon > 0$ (9.7).

CCR-O
Multiplicative form. Objective function: $\text{Minimize } Z = \sum_{i=1}^{m}\delta_i x_{io}$ (9.8). Normalization restriction: $\sum_{r=1}^{s}\mu_r y_{ro} = 1$ (9.9); the numerator is kept constant, assuming a value of 1. Restriction: $\sum_{i=1}^{m}\delta_i x_{ij} - \sum_{r=1}^{s}\mu_r y_{rj} \le 0,\ j=1,\dots,n$ (9.10). Decision variables: $\mu_r \ge \varepsilon;\ \delta_i \ge \varepsilon;\ \varepsilon \cong 0$ (9.11).
Envelopment form. Objective function: $\text{Maximize } Z = \varphi + \varepsilon\left(\sum s^{+} + \sum s^{-}\right)$ (9.12). Restrictions: $\lambda X = x_o - s^{-};\ \lambda Y = \varphi y_o + s^{+}$ (9.13). Decision variables: $\lambda, s^{+}, s^{-} \ge 0$ (9.14).

BCC-I
Multiplicative form. Objective function: $\text{Maximize } Z = \sum_{r=1}^{s}\mu_r y_{ro} + k_o$ (9.15); the numerator is maximized, and the only difference with the CCR-I model is the added term $k_o$, which introduces the scale performance. Normalization restriction: $\sum_{i=1}^{m}\delta_i x_{io} = 1$ (9.16); the denominator is kept constant, assuming a value of 1. Restrictions: $\sum_{r=1}^{s}\mu_r y_{rj} - \sum_{i=1}^{m}\delta_i x_{ij} + k_o \le 0,\ j=1,\dots,n$ (9.17). Decision variables: $\mu_r \ge \varepsilon;\ \delta_i \ge \varepsilon;\ \varepsilon \cong 0$ (9.18).

BCC-O
Multiplicative form. Objective function: $\text{Minimize } Z = \sum_{i=1}^{m}\delta_i x_{io} - k_o$ (9.19); the denominator is minimized. Normalization restriction: $\sum_{r=1}^{s}\mu_r y_{ro} = 1$ (9.20); the numerator is kept constant, assuming a value of 1. Restrictions: $\sum_{i=1}^{m}\delta_i x_{ij} - \sum_{r=1}^{s}\mu_r y_{rj} - k_o \le 0,\ j=1,\dots,n$ (9.21). Decision variables: $\mu_r \ge \varepsilon;\ \delta_i \ge \varepsilon;\ \varepsilon \cong 0$ (9.22).
Envelopment form. Objective function: $\text{Maximize } Z = \varphi + \varepsilon\left(\sum s^{+} + \sum s^{-}\right)$ (9.23). Restrictions: $\lambda X = x_o - s^{-};\ \lambda Y = \varphi y_o + s^{+};\ \sum \lambda = 1$ (9.24). Decision variables: $\lambda, s^{+}, s^{-} \ge \varepsilon > 0$ (9.25).

NDCCR-I
Envelopment form. Objective function: $\text{Minimize } Z = \theta - \varepsilon\left(\sum_{r \in OD} s_r^{+} + \sum_{i \in ID} s_i^{-}\right)$ (9.26, 9.27). Restrictions: $\sum_{j=1}^{n}\lambda_j Y_{rj} = y_{ro} + s_r^{+},\ \forall r = 1,\dots,s$ (9.28); $\sum_{j=1}^{n}\lambda_j X_{ij} = \theta x_{io} - s_i^{-},\ \forall i \in ID$ (9.29); $\sum_{j=1}^{n}\lambda_j X_{ij} = x_{io} - s_i^{-},\ \forall i \in IND$ (9.30). Decision variables: $\lambda, s^{+}, s^{-} \ge \varepsilon > 0$ (9.31).

NDBCC-I
Envelopment form. Objective function: $\text{Minimize } Z = \theta - \varepsilon\left(\sum_{r \in OD} s_r^{+} + \sum_{i \in ID} s_i^{-}\right)$ (9.32). Restrictions: $\sum_{j=1}^{n}\lambda_j Y_{rj} = y_{ro} + s_r^{+},\ \forall r = 1,\dots,s$ (9.33); $\sum_{j=1}^{n}\lambda_j X_{ij} = \theta x_{io} - s_i^{-},\ \forall i \in ID$ (9.34); $\sum_{j=1}^{n}\lambda_j X_{ij} = x_{io} - s_i^{-},\ \forall i \in IND$ (9.35); $\sum_{j=1}^{n}\lambda_j = 1$ (9.36). Decision variables: $\lambda, s^{+}, s^{-} \ge \varepsilon > 0$ (9.37).

NDCCR-O
Envelopment form. Objective function: $\text{Maximize } Z = \varphi + \varepsilon\left(\sum_{r \in OD} s_r^{+} + \sum_{i \in ID} s_i^{-}\right)$ (9.38). Restrictions: $\sum_{j=1}^{n}\lambda_j X_{ij} = x_{io} - s_i^{-},\ \forall i = 1,\dots,m$ (9.39); $\sum_{j=1}^{n}\lambda_j Y_{rj} = \varphi y_{ro} + s_r^{+},\ \forall r \in OD$ (9.40); $\sum_{j=1}^{n}\lambda_j Y_{rj} = y_{ro} + s_r^{+},\ \forall r \in OND$ (9.41). Decision variables: $\lambda, s^{+}, s^{-} \ge \varepsilon > 0$ (9.42).

NDBCC-O
Envelopment form. Objective function: $\text{Maximize } Z = \varphi + \varepsilon\left(\sum_{r \in OD} s_r^{+} + \sum_{i \in ID} s_i^{-}\right)$ (9.43). Restrictions: $\sum_{j=1}^{n}\lambda_j X_{ij} = x_{io} - s_i^{-},\ \forall i = 1,\dots,m$ (9.44); $\sum_{j=1}^{n}\lambda_j Y_{rj} = \varphi y_{ro} + s_r^{+},\ \forall r \in OD$ (9.45); $\sum_{j=1}^{n}\lambda_j Y_{rj} = y_{ro} + s_r^{+},\ \forall r \in OND$ (9.46); $\sum_{j=1}^{n}\lambda_j = 1$ (9.47). Decision variables: $\lambda, s^{+}, s^{-} \ge \varepsilon > 0$ (9.48).

Source The authors
Fig. 9.7 Technology transfer of tacit knowledge in the DEA. Source The authors
References Adler N, Friedman L, Sinuany-Stern Z (2002) Review of ranking methods in the data envelopment analysis context. Eur J Oper Res. North-Holland, 249–265 Al-Emran M, Mezhuyev V, Kamaludin A, Shaalan K (2018) The impact of knowledge management processes on information systems: a systematic review. Int J Inf Manage. 43:173–187 Beckman T (1997) A methodology for knowledge management, Canada Biron M, Hanuka H (2015) Comparing normative influences as determinants of knowledge continuity. Int J Inf Manag 35:655–661. https://doi.org/10.1016/j.ijinfomgt.2015.07.006 Cerchione R, Esposito E (2017) Using knowledge management systems: a taxonomy of SME strategies. Int J Inf Manag 37:1551–1562. https://doi.org/10.1016/j.ijinfomgt.2016.10.007 Charband Y, Jafari Navimipour N (2016) Online knowledge sharing mechanisms: a systematic review of the state of the art literature and recommendations for future research. Inf Syst Front 18:1131–1151. https://doi.org/10.1007/s10796-016-9628-z Chen LC, Lu WM, Yang C (2009) Does knowledge management matter? Assessing the performance of electricity distribution districts based on slacks-based data envelopment analysis. J Oper Res Soc. Palgrave Macmillan Ltd., 1583–1593 Coelli TJ, Prasada Rao DS, O’Donnell CJ, Battese GE (2005) An introduction to efficiency and productivity analysis. Springer, USA Cook WD, Seiford LM (2009) Data envelopment analysis (DEA)—thirty years on. Eur J Oper Res 192:1–17. https://doi.org/10.1016/j.ejor.2008.01.032 Cooper, (2004) Data envelopment analysis, a comprehensive text with models, applications, references and DEA-solver software. Eur J Oper Res 149:245–246. https://doi.org/10.1016/s03772217(02)00304-1 Davies J, Fensel D, Van Harmelen F (2003) Towards the semantic web: ontology-driven knowledge management. Chichester, John Wiley & Sons, Ltd. https://doi.org/10.1002/0470858060 Dayan R, Heisig P, Matos F (2017) Knowledge management as a factor for the formulation and implementation of organization strategy. J Knowl Manag 21:308–329. https://doi.org/10.1108/ JKM-02-2016-0068 Depres C, Chauvel D (2000) A thematic analysis of the thinking in knowledge management. In: Knowledge Horizons, the present and the promise of Knowledge Management. Butterworth Heinemann Galagedera DUA, Silvapulle P (2003) Experimental evidence on robustness of data envelopment analysis. J Oper Res Soc 54:654–660. https://doi.org/10.1057/palgrave.jors.2601507 Giraldo-Jaramillo LF, Montoya-Quintero DM (2015) Aplicación de la metodología Commonkads en la Gestión del Conocimiento. Rev CEA 1:99. https://doi.org/10.22430/24223182.133 Haldar A, Rao SVDN, Momaya KS (2016) Can flexibility in corporate governance enhance international competitiveness? Evidence from knowledge-based industries in India. Glob J Flex Syst Manag 17:389–402. https://doi.org/10.1007/s40171-016-0135-3 Holsapple CW, Joshi KD (2002) Knowledge management: a threefold framework. Inf Soc 18:47–64. https://doi.org/10.1080/01972240252818225 Jayawickrama U (2014) An ERP knowledge transfer framework for strategic decisions in knowledge management in organizations. Int J Innov Manag Technol. https://doi.org/10.7763/ijimt.2014. v5.530 Kao C (2016) Efficiency decomposition and aggregation in network data envelopment analysis. Eur J Oper Res 255:778–786. https://doi.org/10.1016/j.ejor.2016.05.019 Kuah CT, Wong KY (2013) Data envelopment analysis modeling for measuring knowledge management performance in Malaysian higher educational institutions. Inf Dev 29:200–216. https://doi. 
org/10.1177/0266666912460794 Kuah CT, Wong KY, Wong WP (2012) Monte Carlo data envelopment analysis with genetic algorithm for knowledge management performance measurement. Expert Syst Appl 39:9348–9358. https://doi.org/10.1016/j.eswa.2012.02.140
9 Technology Transfer from a Tacit Knowledge Conservation Model …
215
Mardani A, Zavadskas EK, Streimikiene D et al (2017) A comprehensive review of data envelopment analysis (DEA) approach in energy efficiency. Renew Sustain Energy Rev 70:1298–1322 Nonaka I, Takeuchi H (1996) The knowledge-creating company: How Japanese companies create the dynamics of innovation. J Int Bus Stud 27:196–201. https://doi.org/10.1057/jibs.1996.13 Ogiela L (2015) Advanced techniques for knowledge management and access to strategic information. Int J Inf Manag 35:154–159. https://doi.org/10.1016/j.ijinfomgt.2014.11.006 Ordóñez de Pablos P (2001) La gestión del cocnocimiento como base para el logor de una ventaja competitiva sostenible: La organización occidental versus japonesa. Investig Eur Dir Y Econ La Empres 7:91–108 Sveiby KE (1997) The new organizational wealth: Managing & measuring knowledge-based assets. Berrett-Koehler Publishers Venkitachalam K, Willmott H (2017) Strategic knowledge management—insights and pitfalls. Int J Inf Manage 37:313–316. https://doi.org/10.1016/j.ijinfomgt.2017.02.002 Wang Y, Huang Q, Davison RM, Yang F (2018) Effect of transactive memory systems on team performance mediated by knowledge transfer. Int J Inf Manage 41:65–79. https://doi.org/10. 1016/j.ijinfomgt.2018.04.001 Xu J, Wei J, Zhao D (2016) Influence of social media on operational efficiency of national scenic spots in China based on three-stage DEA model. Int J Inf Manage 36:374–388. https://doi.org/ 10.1016/j.ijinfomgt.2016.01.002 Zerafat Angiz LM, Mustafa A, Ghadiri M, Tajaddini A (2015) Relationship between efficiency in the traditional data envelopment analysis and possibility sets. Comput Ind Eng 81:140–146. https://doi.org/10.1016/j.cie.2015.01.001
Chapter 10
Performance Analysis of Decision Aid Mechanisms for Hardware Bots Based on ELECTRE III and Compensatory Fuzzy Logic

Claudia Castillo-Ramírez, Nelson Rangel-Valdez, Claudia Gómez-Santillán, M. Lucila Morales-Rodríguez, Laura Cruz-Reyes, and Héctor J. Fraire-Huacuja

Abstract The development of ubiquity in computing demands more intelligence from connected devices to perform tasks better. Users usually look for devices that proactively aid in an environment, making decisions as the users themselves would. Cognitive models for hardware agents have therefore increased in recent years. However, although numerous strategies emulate intelligent behavior in hardware, some problems remain to be overcome, such as developing a preference system for hardware agents with small computing capabilities. The present research proposes two novel cognitive preference models viable for hardware with few memory cells and small processing capacity; it also analyzes the performance achieved by both approaches.

Keywords Decision aid mechanism · Hardware · ELECTRE III · Compensatory fuzzy logic
10.1 Introduction

Preference modeling is a critical stage in the treatment of a decision problem. Although it is not generally evident, it plays a fundamental role in many real applications. For example, personalized learning requires knowing how a specific individual learns (Bisták 2019); autonomous vehicles can be better guided whenever they have a pre-specified identity (Parikh et al. 2018); and artificial agents interact better with individuals when adopting known behaviors (Delgado-Hernández et al. 2020).
Preferences themselves are an essential element in individuals' lives and something natural or familiar in any person's day to day. For this reason, their modeling is considered an indispensable step, not only in decision-making as a discipline but also in other areas of study (Ruiz 2015).

Nowadays, distinct mechanisms provide adequate cognitive levels to artificial systems. Some of them have achieved great success in the design of software agents, such as neural networks (Oltean et al. 2019), clustering (Rubio et al. 2016), reinforcement learning (Low et al. 2019), fuzzy logic and compensatory fuzzy logic (Espin-Andrade et al. 2016), and outranking relations (Chaux Gutiérrez 2017). However, not all of them have implementations in hardware, much less in low-cost hardware platforms. Outranking Relations (OR) and Compensatory Fuzzy Logic (CFL) are two techniques that lack a study of their performance in hardware as intelligence modelers; despite their simple definition and generalization capacity, hardware agents almost never use them.

Given this situation, this chapter analyzes the feasibility of implementing OR and CFL on hardware platforms with low resources. The main contributions are (1) two novel cognitive models based on OR and CFL for low-cost hardware platforms, and (2) the performance analysis of OR and CFL as cognitive modelers. The results show the competitiveness of the proposed tools and open new research lines.

The remainder of this chapter presents, in order, the related state of the art, the background formed by the basic concepts, the architectural proposal, the experimental design proposed for its validation, the results obtained, and the conclusions.
10.2 State of the Art

Artificial intelligence (AI) has been an object of study for multiple researchers in numerous computer science disciplines. AI aims at the development of autonomous devices capable of performing tasks efficiently. Among the mechanisms used to provide intelligence to software and hardware are neural networks (Thote et al. 2018; Oltean et al. 2019; García et al. 2019), clustering (Rubio et al. 2016; Pathoumvanh et al. 2016), reinforcement algorithms (Low et al. 2019), and classic Fuzzy Logic (Cárdenas León et al. 2016; Parikh et al. 2018). Recently, CFL (Andrade et al. 2014; Espin-Andrade et al. 2016) and OR (Delgado-Hernández et al. 2020) have increased the number of studies related to intelligent agents in software. Despite the numerous CFL and OR applications in software, their implementation in hardware agents remains an open question. The following paragraphs provide more in-depth insight into the development of intelligent agents; the end of the section offers a summary discussion of those mechanisms in relation to the present research.
Neural networks emulate intelligence in hardware and software for many applications, such as learning (Ortega Zamorano 2015), detection of deception (Romero Vargas 2016), fault identification in transformers (Thote et al. 2018), defect inspection in coffee beans (García et al. 2019), and general-purpose use (Oltean et al. 2019). Some neural network implementations exist for low-cost hardware platforms, e.g., the Artificial Neural Network (ANN) library Neurona for Arduino (Benatti Moretti 2020). However, those approaches mostly require complex, application-specific parameter configuration.

Clustering is a simple yet powerful tool in learning agents. Some hardware implementations use clustering for biometric identification (Pathoumvanh et al. 2016) or fruit classification (de Jesús Rubio 2017). Arduino has one general clustering library based on the K-Nearest Neighbors (KNN) algorithm, see (Arduino Blog 2020).

Reinforcement learning bases its strategy on a matrix of states and actions that dictates an agent's behavior. An example of using such a learning technique in hardware is the mobile robot developed by Low et al. (2019), which uses Q-learning to solve path planning. The matrix's size determines the complexity of reinforcement learning; the greater the number of states and actions, the larger the required matrix.

Fuzzy Logic supports several hardware agents (Cárdenas León et al. 2016) in systems that vary from temperature control (Espino Núñez 2017) to automated vehicles (Parikh et al. 2018). FL depends on rules formed by premises and consequences, and its expressivity is limited by the number of rules loaded in the system; a system with few memory cells will struggle to implement a complex FL rule base.

Finally, concerning CFL and OR, numerous works deal with intelligent software agents, but few involve hardware. Llorente Peralta (2019) makes use of CFL for the discovery of knowledge in smart systems; Bogantes Barrantes (2016) and Chaux Gutiérrez (2017) use OR to define agents for maintenance and monitoring purposes, respectively. More recently, Delgado-Hernández et al. (2020) developed a conversational agent using OR.

Based on the previous related work, CFL and OR have had success in developing intelligent software but are rarely used in hardware. In contrast with neural networks, clustering, reinforcement learning, and FL, the simplicity and generality of CFL and OR make them attractive alternatives for developing hardware agents. Hence, this work investigates how convenient those approaches are for low-cost hardware platforms. This chapter mainly evaluates the quality of the Arduino implementation of the two proposed cognitive models, based on CFL and ELECTRE III; note that ELECTRE III is the OR mechanism used.
10.3 Background

10.3.1 ELECTRE Method

The ELECTRE methods (ELimination Et Choix Traduisant la REalité) are multiattribute methods that can handle cardinal information. Bernard Roy and his collaborators developed these methods, which spread throughout Europe, aiding in many applications (Roy 1991; Figueira et al. 2010). ELECTRE methods seek to find the credibility index σ(x,y) of the statement "x is at least as good as y," which is supported by two conditions:

1. The concordance index, denoted c(x,y), indicates how strongly the criteria are in favor of such a claim.
2. The discordance index, denoted d(x,y), indicates the level of disagreement of the criteria with such an assertion.
The computation of σ(x,y) involves parameters such as weights w, indifference q, preference p, pre-veto u, and veto v.
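For concreteness, the following minimal Python sketch (an illustration, not the chapter's Arduino implementation) computes a classical ELECTRE III credibility index for one pair of alternatives. It assumes all criteria are to be maximized and omits the pre-veto threshold u used in the variant applied later in the chapter, so it does not reproduce the chapter's σ values exactly; the example parameter values are the initial configuration reported later in Table 10.5.

```python
# Minimal sketch of a classical ELECTRE III credibility computation for one
# pair of alternatives. Criteria are assumed to be maximized; the pre-veto
# threshold u used in the chapter's variant is not modeled here.

def partial_concordance(ga, gb, q, p):
    """c_j(a, b): support of one criterion for 'a is at least as good as b'."""
    diff = gb - ga
    if diff <= q:
        return 1.0
    if diff >= p:
        return 0.0
    return (p - diff) / (p - q)

def partial_discordance(ga, gb, p, v):
    """d_j(a, b): opposition of one criterion to 'a is at least as good as b'."""
    diff = gb - ga
    if diff <= p:
        return 0.0
    if diff >= v:
        return 1.0
    return (diff - p) / (v - p)

def credibility(a, b, w, q, p, v):
    """sigma(a, b): weighted concordance degraded by strong discordances."""
    n = len(a)
    c = [partial_concordance(a[j], b[j], q[j], p[j]) for j in range(n)]
    d = [partial_discordance(a[j], b[j], p[j], v[j]) for j in range(n)]
    C = sum(wj * cj for wj, cj in zip(w, c)) / sum(w)
    sigma = C
    for dj in d:
        if dj > C:
            sigma *= (1.0 - dj) / (1.0 - C)
    return sigma

# Criteria order: temperature, humidity, light intensity, distance.
w = [0.25, 0.25, 0.25, 0.25]          # initial weights of Table 10.5
q, p, v = [1, 2, 20, 4], [2, 4, 40, 8], [4, 8, 80, 12]
a = [23, 53, 532, 26.19]              # a sensor record
b = [23, 53, 596, 24.51]              # a reference value
print(credibility(a, b, w, q, p, v))  # -> 0.75 with these illustrative inputs
```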
10.3.2 Fuzzy Logic

Fuzzy Logic (FL) is a multivalent logic that represents uncertainty and vagueness mathematically, providing proper tools for their treatment (González Morcillo 2011). In conventional logic, an element belongs or does not belong to a set, using true or false values, or zero and one, to indicate that membership. In contrast, FL defines a degree of membership to the set, denoting how true it is that an element belongs to a specific group (Cejas Montero 2011). FL mimics how a person makes decisions based on imprecise information (D'Negri and Vito 2006).

The support of FL is an inference system based on the implication rule. This rule has the IF … THEN … form, where the IF part is the premise and the THEN part is the consequent. Both premise and consequent are fuzzy sets that help to form other fuzzy sets used to solve models.
10.3.3 Compensatory Fuzzy Logic

Compensatory Fuzzy Logic (CFL) was created in Havana, Cuba, by the multidisciplinary scientific group Business Management in Uncertainty: Research and Services (GEMINIS), and is an extension of the FL proposed by Lotfi A. Zadeh. CFL is a multivalent logic model that allows the simultaneous modeling of deductive and decision-making processes. Its most important characteristics are flexibility, tolerance of imprecision, the ability to model non-linear problems, and its foundation in common-sense language (Cejas Montero 2011).

In FL, the conjunction's truth value is less than or equal to that of all its components, while the disjunction's truth value is greater than or equal to that of all its components. The waiver of these restrictions constitutes the basic idea of CFL: an increase or decrease in the truth value of the conjunction or disjunction resulting from a variation in the truth value of some component can be offset by a change in another component (Bouchet et al. 2010).

A distinguishing quality of CFL is that it generalizes all the formulas of Boolean Logic (BL). Kleene's axioms show that the BL's valid formulas are precisely the formulas with a truth value greater than 0.5 in the CFL context (Andrade et al. 2014). CFL is a model that breaks with the classic axioms and seeks correct ways of doing things and thinking. CFL also pursues the reasoning in actual decision-making by establishing verifiable facts and demonstrating consistent results through repeated experiments (Racet-Valdés et al. 2017). The works of Alonso et al. (2014) and Andrade et al. (2014) present a more comprehensive review of the axioms considered by CFL.
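To illustrate the compensation idea in code, the sketch below implements the conjunction, disjunction, and negation of the geometric-mean-based compensatory logic (GMBCL), one common CFL instantiation (also mentioned in Chapter 11); it is an illustrative sketch rather than the chapter's implementation.

```python
# Illustrative sketch of the geometric-mean-based compensatory operators
# (GMBCL); truth values lie in [0, 1].
from math import prod

def conjunction(*x):
    """Compensatory AND: geometric mean of the component truth values."""
    return prod(x) ** (1.0 / len(x))

def disjunction(*x):
    """Compensatory OR: dual of the conjunction."""
    return 1.0 - prod(1.0 - xi for xi in x) ** (1.0 / len(x))

def negation(x):
    return 1.0 - x

# A low component pulls the conjunction down, but high values of the other
# components partially compensate it (unlike a min-based conjunction).
print(conjunction(0.9, 0.9, 0.3))   # ~0.62, whereas min would give 0.3
print(disjunction(0.1, 0.2, 0.8))
```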
10.4 Proposed Architecture for Decision Aid

Figure 10.1 presents the proposed general architecture to develop intelligent agents on an Arduino microcontroller using CFL and OR. The emulation of individuals' preferences is the approach followed to develop intelligence in an artificial agent.

Fig. 10.1 Proposed architecture
According to the scheme provided in Fig. 10.1, the architecture specifies five main modules required to implement preferences in hardware. These modules are: (1) Board Module; (2a) Module for Integration of Preference from Reference Sets for OR, or MIP-RS; (2b) Module for Integration of Preference from CFL, or MIP-CFL; (3a) Reference Set (RS) Preprocessing; and (3b) CFL Preprocessing. Of the five modules, two pairs are mutually exclusive: to integrate preferences using OR, the architecture must follow modules 1, 2a, and 3a; to use CFL, it must follow modules 1, 2b, and 3b. The details of each module are given below.

Board Module. This module is represented exclusively by the Arduino hardware and the sensors and actuators that connect it to the environment. It contains three submodules. The Sensors submodule is made up of the Arduino sensors, which collect information from the environment and communicate it to the Arduino. The Actuators submodule contains the actuators that allow the Arduino to carry out specific actions in the environment in response to what the sensors send and to the implemented preferences; it receives from the Arduino the actions to be executed in the form of state changes of the actuators. Finally, the Board submodule contains the Arduino microcontroller, which implicitly refers to the device's memory and processing capabilities. Its purpose is to hold the implementation of the selected preference model, and it interacts with sensors and actuators as previously described. This module will have only one of the preference modeling strategies, outranking or CFL, implemented at a time; this strategy is referred to as the Preference Incorporation Mechanism (MIP).

MIP-RS Module. This module is the MIP based on outranking relations (MIP-RS). It models preferences through the implementation of the ELECTRE III method and defines five submodules. The Sensor Reading submodule represents the information from the available sensors; such information constitutes the criteria. The VR (Reference Values) and U (Thresholds) submodules define the reference values and credibility thresholds for a specific individual. The reference values are a set of measurements of the criteria associated with states of satisfaction of the individual; in other words, they are abstract representations of the decision maker's wishes within the hardware. The Concordance Index (IC), Discordance Index (ID), and Credibility Index (σ) submodules implement the ELECTRE III method for the calculation of σ(a, b). In these submodules, the sensor values define the alternative a of the phrase "a is at least as good as b" (with credibility measured by σ(a, b)), and b is one of the reference values; if there is more than one reference value, the highest resulting σ value is returned.

RS Preprocessing. This module describes offline actions that must be carried out to properly configure the hardware agent for handling preferences through outranking. Because outranking requires thresholds to model an individual's preferences, a strategy must be defined to calculate them. Taking into consideration the work of Cruz-Reyes et al. (2017), the elicitation of these parameters is possible through the PDA approach (Preference Disaggregation Analysis). Therefore, the RS Preprocessing module receives information from the environment in the form of a
historical record of the satisfaction of the individual to be modeled (submodule H), and the required weights and threshold values w, q, p, u, v are obtained using PDA (PDA submodule). With these elements, and a reference value chosen by the individual, the information of the VR and U submodules is defined.

MIP-CFL Module. This module is the MIP based on Compensatory Fuzzy Logic (MIP-CFL). It models preferences through CFL rules obtained with the EUREKA UNIVERSE framework (Llorente Peralta 2019) and defines three submodules. The Sensor Reading submodule represents the information from the available sensors; such information constitutes the criteria. The CFL Rules submodule contains the rules obtained through EUREKA UNIVERSE, which are the elements that model the preferences of the specific individual to be modeled. Finally, the Preferences from CFL Rules submodule integrates the criteria and the CFL rules to define the preference over the current state of the environment captured through the sensors.

CFL Preprocessing Module. This module describes offline actions that must be carried out to properly configure the hardware agent for handling preferences through CFL. Because CFL requires association rules to model an individual's preferences, a strategy needs to be defined to compute them. Following the work of Llorente Peralta (2019), association rules can be built from a historical record of the individual to be modeled (submodule H) by means of EUREKA UNIVERSE (associated submodule). These rules, together with the methodology proposed in the present work on how to use them as preferences, give rise to the mechanism of the Preferences as CFL Rules submodule, which must be implemented in the CFL Rules submodule of the MIP-CFL Module to model the preferences of an individual. The history consists of records of environmental conditions and the level of satisfaction associated with each of them according to the individual to be modeled.

It is worth mentioning that the architecture reflects a difference between outranking and CFL regarding the need for a reference value. While outranking requires a reference value associated with the individual, not necessarily defined directly by him, CFL does not need it. This situation is apparently advantageous for CFL, which does not require the direct presence of the individual; however, for outranking it can easily be overcome by choosing a satisfactory record from the historical record. In other words, although the architecture indicates that the DM must provide the reference value, in the implementation it can be obtained directly or indirectly. On the other hand, the fact that CFL cannot at the moment consider a reference value could represent a disadvantage if an interactive process with the individual is needed, since in this proposal it would require recalculating the rules on each occasion, whereas outranking would only require updating the reference value (the criteria values).
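The following Python sketch (illustrative only; the actual agents run as Arduino firmware) outlines how the Board Module could delegate the satisfaction decision to exactly one MIP at a time. The sigma function, the 0.5 truth-value threshold for CFL rules, and the stub names read_sensors and actuate are assumptions made for the example.

```python
# Structural sketch (not the authors' firmware) of the Board Module delegating
# the satisfaction decision to exactly one preference incorporation mechanism.

class MipRS:
    """MIP-RS: satisfied when sigma(reading, VR) exceeds the credibility threshold."""
    def __init__(self, sigma, reference_values, lam):
        self.sigma, self.reference_values, self.lam = sigma, reference_values, lam

    def is_satisfied(self, reading):
        # With several reference values, the highest sigma is used.
        return max(self.sigma(reading, vr) for vr in self.reference_values) > self.lam

class MipCFL:
    """MIP-CFL: satisfied when some CFL rule reaches a sufficient truth value (assumed 0.5)."""
    def __init__(self, rules, threshold=0.5):
        self.rules, self.threshold = rules, threshold

    def is_satisfied(self, reading):
        return any(rule(reading) > self.threshold for rule in self.rules)

def control_loop(read_sensors, mip, actuate):
    """Board Module: sensors feed the selected MIP, whose answer drives the actuators."""
    reading = read_sensors()
    actuate(mip.is_satisfied(reading))
```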
10.5 Analysis of Proposed Decision Aid Mechanism

10.5.1 Experimental Design

According to the architecture presented in Fig. 10.1, the two methodologies contemplated for the integration of preferences in hardware are outranking and CFL. Their validation was done through the analysis of their performance in a specific case study. This section summarizes the conditions of the experiment where they were tested, as well as the main results obtained. The experimental design was made up of a case study where the hardware agent interacts with the environment; it defines the generic instance, the concrete instance, an artificial decision maker, the method for generating historical records, the performance indicators, and the statistical validation strategy.
10.5.2 Case of Study

The environment is defined by a closed room. The hardware agent estimates its distance to the wall where the entrance to the room is located. Additionally, the agent can measure the intensity of light, humidity, and temperature at its location. It is assumed that the agent makes a linear tour of the room and samples the state of the tuple (temperature, light, humidity, distance) at different times. The purpose is for the agent to identify, considering the preferences of a particular individual, which of the different samples taken are satisfactory for that individual.
10.5.3 Instance Definition

The study problem is the incorporation of preferences in Arduino. A generic instance of this problem, used to validate the architecture proposed in the previous section, is characterized by the following elements:

1. An environment E where the hardware agent is inserted;
2. A set of criteria of interest C measurable in the environment;
3. A Decision Maker (DM), an individual who has an opinion about his satisfaction with respect to specific values that describe the states of the measured criteria C;
4. A historical record H of opinions issued by the DM regarding E and the states of C that he has perceived;
5. A MIP methodology that allows implementing the DM preferences in the hardware agent;
6. A performance indicator I that allows establishing whether or not there is a significant difference between the opinions issued by the MIP and the opinions of the DM.

Taking into consideration the previously mentioned aspects for an instance, the aspects shown in Table 10.1 were selected as the basis for creating random instances.

Table 10.1 Instance composition

Parameter   Description
E           Case of study
DM          Randomly generated artificial decision maker
C           {Temperature, Humidity, Intensity, Distance}
MIP         MIP-RS, MIP-CFL
H           Random sampling in the range of possible sensor values
I           S1, satisfaction level achieved by the MIP in the training set; S2, satisfaction level achieved by the MIP in the test set
10.5.4 Artificial Decision Maker

To analyze the performance of the architecture, the opinion of a decision maker was artificially simulated with respect to a set of sensor values r, called a record. For this, the decision-maker's opinion was limited to two options: satisfied and not satisfied. The opinion of the artificial decision maker was emulated by associating the values {0, 1} with satisfied and not satisfied, respectively, and assigning one of them randomly to each record r. Table 10.2 presents the limits of the values used for the criteria of temperature, humidity, light intensity, and distance as ranges for generating the random values. The lower and upper bounds for each sensor are denoted [ti, tf], [hi, hf], [li, lf], and [di, df], respectively.

Table 10.2 Range of values considered for the sensors

Temperature (°C)   Humidity (RH)   Intensity     Distance (cm)
[15, 40]           [40, 70]        [100, 1000]   [50, 450]
10.5.5 Historical Data

With the previously defined artificial decision maker, the historical record H = HE ∪ HP was constructed. The historical record H consists of two parts: the training record set HE and the test record set HP. For this preliminary work, temperature and humidity were fixed, while intensity and distance were varied randomly; the assignment of satisfaction was also randomized. The resulting parts of this record are shown in Tables 10.3 and 10.4, where Columns 1–4 contain sensor values and Column 5 is the preference of the DM. Figure 10.2 presents the pseudocode of the random generation method of the historical data set H.

Table 10.3 Training set of the historical data, HE

Temperature (t)   Humidity (h)   Intensity (i)   Distance (d)   Satisfaction (s)
23                53             589             24.02          0
23                53             532             26.19          1
23                53             597             26.61          1
23                53             562             24.93          0
23                53             537             26.51          2
23                53             595             27.36          1
23                53             567             24.42          0
23                53             541             24.53          0
23                53             596             24.51          0
23                53             550             24.42          0
23                53             554             26.15          1
23                53             589             14.9           2
23                53             597             26.61          2
23                53             589             14.9           1
23                53             526             24.59          0
Table 10.4 Test set of the historical data, HP

Temperature   Humidity   Intensity   Distance   Satisfaction
23            53         585         18.02      1
23            53         566         18.42      1
23            53         554         18.02      1
23            53         585         17.56      1
23            53         535         18.02      1
Fig. 10.2 Generation method for historical data with max entries (s stands for satisfaction)
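The pseudocode of Fig. 10.2 is not reproduced here; the following Python sketch, consistent with the textual description, shows one way such a random historical record could be generated. The fixed temperature and humidity (23 °C, 53 RH) follow Tables 10.3 and 10.4, the sampling ranges are taken from Table 10.2, and the two-valued satisfaction label follows Sect. 10.5.4.

```python
# Sketch of one way to generate the random historical record H = HE ∪ HP;
# temperature and humidity are fixed while intensity, distance, and the DM's
# satisfaction label are random.
import random

def generate_history(max_entries, temp=23, hum=53,
                     intensity_range=(100, 1000), distance_range=(50, 450)):
    history = []
    for _ in range(max_entries):
        history.append((
            temp,                                       # fixed temperature (°C)
            hum,                                        # fixed humidity (RH)
            random.randint(*intensity_range),           # light intensity
            round(random.uniform(*distance_range), 2),  # distance (cm)
            random.choice([0, 1]),                      # s: DM satisfaction label
        ))
    return history

random.seed(0)
HE = generate_history(15)   # training part of H
HP = generate_history(5)    # test part of H
```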
10.5.6 Performance Indicators

In order to evaluate the performance of the MIP-RS and MIP-CFL preference models, the following indicators are defined:

1. Indicator S1: level of satisfaction obtained by the MIP in the training set.
2. Indicator S2: level of satisfaction obtained by the MIP in the test set.

Indicator S1 is calculated by counting the coincidences between the satisfaction issued by the MIP and that specified by the DM in the HE set; in other words, the satisfaction column of Table 10.3 is compared with the response given by the MIP. Indicator S2 is calculated analogously over the HP set; in other words, the satisfaction column of Table 10.4 is compared with the response given by the MIP.
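As a sketch of the indicator computation (illustrative only, with made-up answers), S1 and S2 reduce to counting matches between the MIP's satisfaction answers and the DM's answers over the corresponding set:

```python
# Sketch of the indicator computation: S1 and S2 count the records on which
# the MIP's satisfaction answer coincides with the DM's answer (toy data).

def satisfaction_matches(mip_answers, dm_answers):
    return sum(1 for mip, dm in zip(mip_answers, dm_answers) if mip == dm)

dm_HE = ["YES", "YES", "NO", "NO"]    # hypothetical DM answers on the training set
mip_HE = ["NO", "YES", "NO", "YES"]   # hypothetical MIP answers on the same records
S1 = satisfaction_matches(mip_HE, dm_HE)   # -> 2; S2 is computed likewise on HP
print(S1)
```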
10.5.7 Validation

Currently, the entire infrastructure of the agent is already in place. Having precisely defined the case of study and the indicators, the remaining activity to conclude this work is to complete the statistical validation. For this purpose, 30 random historical records are already being experimented with; in conjunction with the current one, they will be used to statistically validate whether the difference between the opinion issued by the hardware agent and the opinion of the DM is significant, according to the values of indicators S1 and S2.
10.6 Results

This section reports the results of implementing the architecture in the described case study, covering the implementations, the defined parameters, and the quality of the obtained performance.
10.6.1 Elicitation of Thresholds and Reference Values

ELECTRE III requires the definition of the weights w and the thresholds q, p, u, v. To carry out this task, an implementation of the PDA strategy reported by Cruz-Reyes et al. (2017) was used, taking as input the historical record presented in Table 10.3 and the initial values of weights and thresholds shown in Table 10.5. The initial values of weights and thresholds could have been random; however, on this occasion they were selected with an apparently relevant difference according to each criterion, respecting the order relationship q < p < u < v and the constraint that the weights wi sum to 1. Additionally, the initial threshold values (λ, β, ε) = {(0.51, 0.15, 0.07), (0.67, 0.15, 0.07), (0.70, 0.20, 0.10), (0.75, 0.20, 0.10)} of credibility, symmetry, and asymmetry suggested by the authors were considered; they are associated with an increasing degree of conservatism in the decision, from very flexible to conservative. However, for this work, the thresholds (β, ε) have no relevance in the outranking method, and the threshold λ was only used to establish satisfaction or not.

Table 10.5 Initial configuration of weights and thresholds used by the PDA strategy

Parameter   Temperature   Humidity   Intensity   Distance
w           0.25          0.25       0.25        0.25
q           1             2          20          4
p           2             4          40          8
u           3             6          60          8
v           4             8          80          12

Table 10.6 ELECTRE III parameters' values obtained through the PDA strategy

Parameter   Temperature   Humidity   Intensity   Distance
w           0.312         0.278      0.259       0.151
q           1.040         1.550      18.462      5.133
p           2.600         5.164      51.496      7.956
u           3.234         7.636      77.724      8.068
v           4.539         8.902      99.666      8.608
Table 10.6 shows the values of the weights w and thresholds q, p, u, v estimated by the PDA to be used by ELECTRE III; the value λ = 0.871 was also estimated in this procedure.

The method to select a single reference value r among the records in HE consists of choosing the record that was satisfactory for the DM and has the lowest average credibility index σ(r',r), obtained by comparing it with each record r' in HE different from r. Each comparison measures how good a record r' is against a record r considered satisfactory for the DM, and the lowest average indicates the record that the others were least able to outrank, identifying it as the best candidate. Table 10.7a identifies the potential records to be reference values in its last column; Table 10.7b shows the comparison of all of them and the final selection of the reference value. According to Table 10.7b, the record r9, with criteria values (23, 53, 596, 24.51), attains the lowest average and is therefore taken as the reference value VR.
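A sketch of this selection rule is shown below; the credibility function sigma is passed in as a parameter (for example, an ELECTRE III implementation), and the exclusion of r itself from the average follows the textual description above.

```python
# Sketch of the reference-value selection: among the records marked satisfactory
# by the DM, pick the one with the lowest average credibility sigma(r', r) over
# the other records.

def choose_reference_value(records, satisfactory, sigma):
    best_record, best_avg = None, float("inf")
    for i, r in enumerate(records):
        if not satisfactory[i]:
            continue                      # only satisfactory records are candidates
        sigmas = [sigma(rp, r) for j, rp in enumerate(records) if j != i]
        avg = sum(sigmas) / len(sigmas)
        if avg < best_avg:                # lowest average: hardest record to outrank
            best_record, best_avg = r, avg
    return best_record, best_avg
```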
10.6.2 Definition of CFL Rules

The rules shown in Table 10.8 were used to predict the preference; these rules were obtained by feeding the historical record into the EUREKA UNIVERSE software (Llorente Peralta 2019).
10.6.3 Performance Measurement Using Indicators S1 and S2

The calculation of performance through indicators S1 and S2 for MIP-RS is shown in Tables 10.9 and 10.10, respectively. For this calculation, the threshold λ = 0.871 estimated by the PDA for the analyzed instance was used. Once the credibility index σ(r,VR) has been calculated, where r is a historical record and VR the defined reference value, the record is labeled as satisfied if said index exceeds the estimated value of λ, and as not satisfied otherwise.

The performance of the implementation of MIP-CFL was zero in both indicators S1 and S2, suggesting poor performance of that model so far. This means that some adjustment should be made to improve the model in the future.
10.7 Successful Cases and Discussion

Table 10.11 presents the success case identified for this research. The case considers the Arduino Mega as the low-cost platform under study. The analysis only included mechanisms with a general-purpose implementation for Arduino; these were the artificial neural network (ANN),
Table 10.7 Chosen reference value for outranking

(a) Possible reference values (VR)

Record r   Temperature   Humidity   Intensity   Distance   Satisfaction   Could be VR?
r1         23            53         589         24.02      0              YES
r2         23            53         532         26.19      1              NO
r3         23            53         597         26.61      1              NO
r4         23            53         562         24.93      0              YES
r5         23            53         537         26.51      2              NO
r6         23            53         595         27.36      1              NO
r7         23            53         567         24.42      0              YES
r8         23            53         541         24.53      0              YES
r9         23            53         596         24.51      0              YES
r10        23            53         550         24.42      0              YES
r11        23            53         554         26.15      1              NO
r12        23            53         589         14.9       2              NO
r13        23            53         597         26.61      1              NO
r14        23            53         589         14.9       2              NO
r15        23            53         526         24.59      0              YES

(b) Comparison by averages of σ(r',r) and identification of the VR

                 Identified candidates to be Reference Values (VR)
Records          r1       r4       r7       r8       r9       r10      r15
r'1              0.410    0.500    0.510    0.540    0.450    0.530    0.520
r'2              0.557    0.440    0.626    0.560    0.430    0.510    0.490
r'3              0.530    0.540    0.530    0.580    0.520    0.490    0.430
r'4              0.520    0.530    0.510    0.490    0.560    0.540    0.550
r'5              0.508    0.550    0.578    0.570    0.510    0.495    0.490
r'6              0.570    0.510    0.520    0.530    0.460    0.530    0.510
r'7              0.560    0.520    0.510    0.560    0.440    0.520    0.560
r'8              0.510    0.490    0.539    0.580    0.480    0.590    0.490
r'9              0.490    0.530    0.540    0.597    0.500    0.490    0.480
r'10             0.480    0.540    0.550    0.610    0.510    0.530    0.540
r'11             0.530    0.510    0.520    0.540    0.490    0.510    0.530
r'12             0.420    0.480    0.40     0.490    0.515    0.480    0.495
r'13             0.510    0.530    0.550    0.510    0.460    0.530    0.440
r'14             0.480    0.490    0.510    0.480    0.410    0.490    0.530
r'15             0.616    0.510    0.684    0.530    0.520    0.530    0.500
Average σ(r',r)  0.5127   0.5113   0.5384   0.5444   0.4836   0.5176   0.5036
Table 10.8 CFL rules generated using EUREKA UNIVERSE

Generated predicate                                            Truth value   Linguistic variables
(IMP (AND "Distancia" "Humedad") "Temperatura")                0.93548       ((label "Temperatura", colname "Temperatura1"))
(IMP (AND "Distancia" "Intensidad" "Humedad") "Temperatura")   0.92871       ((label "Temperatura", colname "Temperatura1"))
(IMP (AND "Humedad" "Distancia") "Temperatura")                0.92491       ((label "Temperatura", colname "Temperatura1"))
(IMP (AND "Humedad" "Distancia") "Temperatura")                0.92398       ((label "Temperatura", colname "Temperatura1"))
(IMP (AND "Distancia" "Intensidad" "Humedad") "Temperatura")   0.91644       ((label "Temperatura", colname "Temperatura1"))
(IMP (AND "Distancia" "Humedad") "Temperatura")                0.90839       ((label "Temperatura", colname "Temperatura1"))
(IMP (AND "Humedad" "Intensidad" "Distancia") "Temperatura")   0.90399       ({:label "Temperatura", :colname "Temperatura"
Table 10.9 Value of S1 for HE

Record   σ(r,VR)   λ       Is DM satisfied?   Initial DM answer in HE   Do they match?
r'1      0.450     0.871   NO                 YES                       NO
r'2      0.430     0.871   NO                 YES                       NO
r'3      0.520     0.871   NO                 YES                       NO
r'4      0.560     0.871   NO                 YES                       NO
r'5      0.510     0.871   NO                 NO                        YES
r'6      0.460     0.871   NO                 YES                       NO
r'7      0.440     0.871   NO                 YES                       NO
r'8      0.480     0.871   NO                 YES                       NO
r'9      0.500     0.871   NO                 YES                       NO
r'10     0.510     0.871   NO                 YES                       NO
r'11     0.490     0.871   NO                 YES                       NO
r'12     0.515     0.871   NO                 NO                        YES
r'13     0.460     0.871   NO                 NO                        YES
r'14     0.410     0.871   NO                 YES                       NO
r'15     0.520     0.871   NO                 YES                       NO
                                                                        S1 = 3
Table 10.10 Value of S2 for HP

r        σ(r,VR)   λ       Is DM satisfied?   Initial DM answer in HP   Do they match?
r'1      0.510     0.871   NO                 YES                       NO
r'2      0.590     0.871   NO                 YES                       NO
r'3      0.535     0.871   NO                 YES                       NO
r'4      0.529     0.871   NO                 YES                       NO
r'5      0.540     0.871   NO                 YES                       NO
                                                                        S2 = 0
Table 10.11 Description of the success case

Platform     Arduino Mega
ANN          Neurona (Benatti Moretti 2020)
KNN          Arduino_KNN (Arduino Blog 2020)
CFL          MIP-CFL
OR           MIP-RS
Indicator    Memory consumption
Conditions   Using at most 1 sensor
Table 10.12 Memory consumption in comparison with other mechanisms

Mechanism      Memory consumption (bytes)
MIP-RS         2798
MIP-CFL        2834
KNN            5896
Neurona, ANN   8166
KNN, CFL, and OR. The libraries taken into account for ANN and KNN were Neurona and Arduino_KNN, while CFL and OR were covered by the proposed cognitive models MIP-CFL and MIP-RS. The performance indicator was memory consumption (in bytes), and at most one sensor was used. Note that the specific Arduino implementations were the SimpleKNN example from the Arduino_KNN library, the ColorSensor example from the Neurona library, and the CFL rules and ELECTRE III parameters under an instance with four attributes.

Table 10.12 shows the results of the analysis of the success case. Column 1 lists the mechanisms considered, and Column 2 presents the Arduino memory required to implement each of them using at most one sensor. Some key observations arise from the results in Table 10.12. First, there is no significant difference in memory consumption between the proposed cognitive models MIP-RS and MIP-CFL. Second, MIP-RS and MIP-CFL consume fewer memory resources than a neural network and a clustering mechanism, making them competitive on low-cost platforms. Third, MIP-RS is more convenient than MIP-CFL when reconfiguration of parameters is considered: MIP-RS only requires changing parameter values, whereas MIP-CFL needs an entirely new set of rules. Finally, CFL and OR show a positive impact, which reflects the need to continue studying their performance in new research lines.
10.8 Conclusions

Several implications arise from this research work; these are enumerated in the following list:

• The use of outranking requires a reference value, whereas CFL does not.
• The reference value does not necessarily need to be defined by an individual directly; it can be estimated indirectly, as through the methodology proposed in Sect. 10.6.1 (Elicitation of Thresholds and Reference Values).
• A new strategy is proposed to define a reference value based on the average credibility index obtained against the other records.
• The use of a reference value for outranking does not represent a disadvantage in contrast to CFL.
• The use of a reference value by outranking represents an advantage over CFL if the mechanism's configuration is considered: the reference values in OR just need an update, whereas CFL must regenerate the rules and load them again into the hardware platform.
• There is no significant difference in memory consumption between MIP-RS and MIP-CFL.
• The historical record of opinions of the DM can be managed offline; no module needs to be implemented in the hardware agent to translate the DM's views or past actions into decision parameters.
• MIP-RS is easier to reconfigure than MIP-CFL.
• The proposed strategies consume fewer memory cells in their implementations than the other mechanisms.
Some open questions that remain of interest for future research are: (1) how do the approaches perform in more general or complex situations; (2) how can their advantages and disadvantages be better understood when executed in real applications; (3) can MIP-CFL be reconfigured with ease; and (4) what is the impact of wireless communication on MIP-RS through the use of hardware such as WiFi, Bluetooth, or radio-frequency communicators, a research line derived from the current work.

Acknowledgements The authors want to thank the support from CONACYT projects 3058 Cátedras CONACyT 2014, A1-S-11012 Ciencia Básica 2017–2018, and 312397 PAACTI 2020-1. They also thank the support from TecNM Project 5797.19-P and from the Laboratorio Nacional de Tecnologías de Información (LaNTI) del TecNM/Campus ITCM.
References Alonso MM, Espín Andrade RA, Batista VL, Suárez AR (2014) Discovering knowledge by fuzzy predicates in compensatory fuzzy logic using metaheuristic algorithms. Stud Comput Intell 537:161–174. https://doi.org/10.1007/978-3-642-53737-0_11 Andrade RAE, Fernández E, González E (2014) Compensatory fuzzy logic: a frame for reasoning and modeling preference knowledge in intelligent systems. Stud Comput Intell 537:3–23. https:// doi.org/10.1007/978-3-642-53737-0_1 Arduino Blog (2020) Arduino Blog » Simple machine learning with Arduino KNN. https://blog. arduino.cc/2020/06/18/simple-machine-learning-with-arduino-knn/. Accessed 11 Dec 2020 Benatti Moretti C (2020) Neurona-Arduino Libraries. https://www.arduinolibraries.info/libraries/ neurona. Accessed 11 Dec 2020 Bisták P (2019) Arduino support for personalized learning of control theory basics. In: IFACPapersOnLine. Elsevier B.V., pp 217–221 Bogantes Barrantes BG (2016) Desarrollo de la estructura necesaria para la implementación de un modelo de toma de decisiones para mantenimiento basado en el deterioro multiestado para el parque eólico los santos. Tecnológico de Costa Rica, Costa Rica Bouchet A, Pastore J, Brun M, Ballarin V (2010) Lógica Difusa Compensatoria basada en la media aritmética y su aplicación en la Morfología Matemática Difusa Cárdenas León A, Barranco Gutiérrez A, Pérez Pinal F (2016) Implementación de Sistema Difuso en Arduino Uno. Acad J Celaya 8:821–826 Cejas Montero J (2011) The compensatory fuzzy logic. Rev Ing Ind 22:157–161
Chaux Gutiérrez AF (2017) Monitoreo de anomalías en máquinas rotativas con agentes inteligentes Jade y Arduino. Institución Universitaria Politécnico Cruz-Reyes L, Fernandez E, Rangel-Valdez N (2017) A metaheuristic optimization-based indirect elicitation of preference parameters for solving many-objective problems. Int J Comput Intell Syst 10:56–77. https://doi.org/10.2991/ijcis.2017.10.1.5 D´Negri CE, Vito ED (2006) Introducción al razonamiento aproximado: lógica difusa. undefined de Jesús Rubio J (2017) A method with neural networks for the classification of fruits and vegetables. Soft Comput 21:7207–7220. https://doi.org/10.1007/s00500-016-2263-2 Delgado-Hernández XS, Morales-Rodriguez ML, Rangel-Valdez N et al (2020) Development of conversational deliberative agents driven by personality via fuzzy outranking relations. Int J Fuzzy Syst 22:2720–2734. https://doi.org/10.1007/s40815-020-00817-w Espin-Andrade RA, Gonzalez E, Pedrycz W, Fernandez E (2016) An interpretable logical theory: the case of compensatory fuzzy logic. Int J Comput Intell Syst 9:612–626. https://doi.org/10. 1080/18756891.2016.1204111 Espino Núñez A (2017) Control de temperatura con lógica difusa para un sistema espectroscopia laser. Universidad Nacional Autónoma de México Figueira JR, Greco S, Roy B, Słowi´nski R (2010) ELECTRE methods: main features and recent developments. Springer, Berli, pp 51–89 García M, Candelo-Becerra JE, Hoyos FE (2019) Quality and defect inspection of green coffee beans using a computer vision system. Appl Sci 9:4195. https://doi.org/10.3390/app9194195 González Morcillo C (2011) Lógica Difusa, una introducción práctica. Técnicas de Softcomputing Llorente Peralta CE (2019) Algoritmo evolutivo para descubrir conocimiento de asociación usando lógica difusa compensatoria. TecNM/Instituto Tecnológico de Ciudad Madero Low ES, Ong P, Cheah KC (2019) Solving the optimal path planning of a mobile robot using improved Q-learning. Rob Auton Syst 115:143–161. https://doi.org/10.1016/j.robot.2019.02.013 Oltean G, Oltean V, Balea HA (2019) Method for rapid development of Arduino-based applications enclosing ANN. In: IECON proceedings (industrial electronics conference). IEEE computer society, pp 138–143 Ortega Zamorano F (2015) Algoritmos de aprendizaje neurocomputacionales para su implementación hardware. Universidad de Málaga Parikh P, Sheth S, Vasani R, Gohil JK (2018) Implementing fuzzy logic controller and PID controller to a DC encoder motor-“a case of an automated guided vehicle.” In: Procedia manufacturing. Elsevier B.V., pp 219–226 Pathoumvanh S, Bounnady K, Indahak P, Viravong V (2016) Implementation of the ECG biometric identification by using Arduino Microprocessor. In: 2016 13th international conference on electrical engineering/electronics, computer, telecommunications and information technology, ECTI-CON 2016. Institute of Electrical and Electronics Engineers Inc Racet-Valdés A, Espinosa-Gonzalez L, Suárez-Quintana J et al (2017) Modelo matemático para medir el nivel de servicio al cliente basado en la lógica difusa compensatoria. Rev Ing Ind 28:193–200 Romero Vargas AB (2016) Sistema de Detección de Engaños basado en Redes Neuronales y Arduino. Universidad Mayor de San Andrés Roy B (1991) The outranking approach and the foundations of electre methods. Theory Decis 31:49–73. https://doi.org/10.1007/BF00134132 Rubio JA, Ontañón M, Perea V, Megia A (2016) Health care of pregnant women with diabetes in Spain: approach using a questionnaire. Endocrinol y Nutr (English Ed 63:113–120. 
https://doi.org/10.1016/J.ENDOEN.2015.11.017 Ruiz J (2015) Métodos de Decisión Multicriterio ELECTRE y TOPSIS aplicados a la elección de un Dispositivo Movil. Universidad de Sevilla, Escuela Técnica Superior de Ingeniería Thote PB, Daigavane MB, Daigavane PM et al (2018) Hardware-in-loop implementation of ANN based differential protection of transformer. In: WIECON-ECE 2017-IEEE international WIE conference on electrical and computer engineering 2017. Institute of Electrical and Electronics Engineers Inc., pp 80–83
Chapter 11
A Brief Review of Performance and Interpretability in Fuzzy Inference Systems

José Fernando Padrón-Tristán, Laura Cruz-Reyes, Rafael Alejandro Espín-Andrade, and Carlos Eric Llorente-Peralta

Abstract Business analytics allows understanding the current state of an organization and identifying the needs of the company. Currently, there are commercial and scientific tools that focus on solving some of the stages of a business analytics system based on machine learning, which helps in decision making through the discovery of knowledge. Network-based learning methods solve problems that were challenging for many years, but the results they produce are not interpretable. On the other hand, methods based on fuzzy logic allow the discovery of predicates and their evaluation on unseen data, which is called inference. Due to their construction based on predicates, fuzzy logic facilitates the presentation of results in an interpretable way, but with less precision than network-based methods. A general line open to research is to integrate learning methods that compensate between accuracy and interpretability. This work presents a scoping review on compensatory fuzzy logic and Archimedean compensatory fuzzy logic, as well as works on inference, accuracy, and interpretability. It also presents a list of lines open to research on these topics.

Keywords Fuzzy logic · Archimedean compensatory fuzzy logic · Inference · Performance and interpretability perspectives
11.1 Introduction

Commercial and scientific tools can help in decision making through the discovery of knowledge. These products are primarily supported by learning methods developed by the artificial intelligence (AI) community. In machine learning (ML), the process of making predictions by applying models trained with sample data is known as inference.

In particular, network-based learning methods, such as deep learning neural networks, have dramatically improved the state of the art in many application domains, solving problems that were challenging for many years in the AI community. Despite the high level of accuracy achieved, the results they produce do not allow a decision-maker to provide understandable justifications; in other words, they are not interpretable. On the other hand, methods based on fuzzy logic (FL), due to their construction based on predicates, are naturally closer to natural language, facilitating the presentation of results in an interpretable way, but with less precision than network-based methods. It should be noted that inference is the traditional object of study of logic in general; in particular, fuzzy inference systems apply specific predicates (rules) to the knowledge base to deduce new information.

A general line open to research is to integrate learning methods that compensate between accuracy and interpretability. Their integration alone is not enough to provide explanations to a decision-maker; new models, methods, and metrics are required to guarantee a useful interpretation.

This work is organized as follows: Sect. 11.2 describes concepts related to fuzzy logic, emphasizing compensatory fuzzy logic (CFL), Archimedean compensatory fuzzy logic (ACFL), and inference systems. Section 11.3 presents a literature analysis of surveys and general reviews to contrast our scoping review. Section 11.4 details the methodology followed to carry out this scoping review. Section 11.5 presents the research analysis through a case study, a comparison of compensatory fuzzy logic with other fuzzy logic structures, and other related works. Section 11.6 reports the research findings and discussion, showing analytics of the research papers and open research trends. Section 11.7 details future research, and Sect. 11.8 presents conclusions and recommendations.
11.2 Background of the Study

11.2.1 Knowledge Discovery

The discovery of knowledge from data is becoming an urgent issue whose relevance is growing along with technological advances that allow the manipulation of large amounts of data (Cios et al. 1998).
Knowledge discovery from data (KDD) refers to the study of mechanisms that allow the recognition of precise patterns in a set of data that can be exploited as a form of knowledge in particular learning tasks (Fayyad 1996). In reality, KDD strategies also have an impact on a series of issues related to data mining, affecting data preprocessing methods, data grouping, and the explanation of the results of the obtained queries (Castellano et al. 2005). In this context, the emerging role of fuzzy logic assumes a crucial relevance among innovative approaches that attempt to use the expressiveness of natural language to address KDD problems, thus improving the understanding of the results obtained. Fuzzy sets lend themselves well to handling incomplete and heterogeneous data and their application to knowledge discovery processes is very useful in terms of interpretability.
11.2.2 Fuzzy Logic

The formalization of knowledge continues to be one of the main problems of computerization (Serov et al. 2019). Applied knowledge can be brought to a logical level using fuzzy logic (FL), which is the integration of mathematical logic and the theory of fuzzy sets. The use of fuzzy predicates to represent knowledge facilitates problem-solving by allowing complex real-world objects to be approached with vagueness and uncertainty.

FL is a discipline proposed in the sixties by Zadeh (1965), who defined the principle of incompatibility: "as the complexity of a system increases, our ability to be precise and construct instructions about its behavior decreases to the threshold beyond which precision and meaning are exclusive characteristics" (Pérez-Pueyo 2005). FL makes use of the concepts of Łukasiewicz logic and sets (which govern the principles of implication and equivalence in a trivalued logic) by defining degrees of membership. FL starts from the idea that human thought is constructed not from numbers but from linguistic labels. Linguistic terms are inherently less precise than numerical data, but they express knowledge in terms more accessible to human understanding (Pérez-Pueyo 2005). FL can acceptably reproduce the usual modes of reasoning, considering that the certainty of a proposition is a matter of degree, where classical logic is the limiting case. Therefore, the most attractive characteristics of FL are its flexibility, its tolerance for imprecision, its ability to model non-linear problems, and its base in natural language (Pérez-Pueyo 2005).

A predicate is a propositional function P(x1, x2, ..., xn), defined on individual variables x1, x2, ..., xn, whose range of values are true or false statements (1 or 0). Such formalism reflects only the logical side of the problem (Rey et al. 2017). The other side of knowledge, vagueness and uncertainty, can be considered using the fuzzy set theory proposed in Zadeh (1965). The fundamental concept in fuzzy set theory is the membership function (MF) (Rey et al. 2017). Let
M be a set and x an element of M; then a fuzzy subset A of M is defined as a set of ordered pairs {(x, µA(x))}, ∀x ∈ M, where µA(x) is a characteristic MF that takes its values in a well-ordered set E and indicates the degree or level of membership of an element x in the subset A. The set E is called the membership set. If E = {0, 1}, then the fuzzy subset A is treated as an ordinary subset. A variable z is called fuzzy if it is defined on the set M and its range of values is a set of fuzzy subsets {Az}. The variable in the usual sense is a special case of a fuzzy variable in which each fuzzy subset consists of one element. According to Rey et al. (2017), a fuzzy predicate is a function G(z1, z2, …, zn), defined on fuzzy variables z1, z2, …, zn, whose range of values is a statement whose truth is estimated by values in the interval (0, 1). This truth value is obtained from the composition of fuzzy variables (known as individual fuzzy predicates) and fuzzy logic operators.
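As a small illustration of these definitions, the Python sketch below evaluates a trapezoidal membership function and uses it as an individual fuzzy predicate; the set name "warm" and the breakpoints are hypothetical, not taken from the text.

```python
# Illustrative membership function used as an individual fuzzy predicate.

def trapezoid(x, a, b, c, d):
    """Degree of membership of x in a trapezoidal fuzzy set (a <= b <= c <= d)."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# mu_warm(t): hypothetical fuzzy subset "warm" of the universe of temperatures.
def mu_warm(t):
    return trapezoid(t, 18, 22, 26, 30)

# Truth value of the individual fuzzy predicate "the temperature is warm".
print(mu_warm(23))   # -> 1.0
print(mu_warm(28))   # -> 0.5
```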
11.2.3 Compensatory Fuzzy Logic

Compensatory fuzzy logic (CFL) is a multivalent logic approach different from the axiomatic norm and conorm approach. Its operators must satisfy the characteristics of a descriptive and normative approach to decision making (Espin-Andrade et al. 2015). CFL is formed by a quartet of continuous operators, conjunction (c), disjunction (d), fuzzy strict order (o), and negation (n), which satisfy a group of axioms among which are those of compensation and veto (Espin-Andrade et al. 2015).
11.2.4 Archimedean Compensatory Fuzzy Logic

ACFL is obtained from probabilistic logic and from the compensatory logic based on the geometric mean (GMBCL); together, these two behavioral trends describe preferences better than either of them alone (Espin-Andrade et al. 2015). Archimedean continuous logic is a system of t-norm and t-conorm logic, while compensatory fuzzy logic can be obtained from quasi-arithmetic mean operators. ACFL preserves the property that the preference over a pair of truth-value vectors is the same for certain predicates in compensatory fuzzy logic and in Archimedean continuous logic. On the other hand, CFL satisfies the idea that, under certain terms, it allows compensation, which is an interesting approach to modeling human decision making.

An Archimedean logic is a trio of operators (c, d, n), where c is a continuous Archimedean t-norm, d is its corresponding t-conorm, and n is the negation operator. ACFL has properties of affirmation and refutation at the same time, which are complementary. These operators are another approach to the well-known concepts of Necessity and Possibility that appeared in Modal Logic and are included in the possibility theory of fuzzy logic.
For a given logical proposition, a quartet is defined, formed by what are called the non-refutation value of the necessity of the proposition, the affirmation value of the necessity of the proposition, the non-refutation value of the possibility of the proposition, and the affirmation value of the possibility of the proposition (Espin-Andrade et al. 2015). This quartet indicates two semantic approaches, which are:

(1) the non-refutation approach, given by the pair formed by the universal quantifier of non-refutation and the existential quantifier of non-refutation; and
(2) the affirmation approach, given by the pair formed by the universal quantifier of affirmation and the existential quantifier of affirmation.
11.2.5 Fuzzy Inference Systems

Fuzzy inference systems (FIS) are systems that make use of the logic proposed by Zadeh in 1965, a theory that addresses the imprecision and uncertainty inherent to human judgments in decision-making (Amindoust et al. 2012; Carmona and Castro 2020). To carry out the inference process, a FIS makes use of fuzzy rules, also called fuzzy conditional rules, which are expressions of the form "if A then B", where A and B are labels of fuzzy sets (Jang 1993; Amindoust et al. 2012; Sabri et al. 2013; Carmona and Castro 2020). In the literature, it is possible to find various methods for the design of FIS, such as heuristic methods, neuro-fuzzy techniques, the discovery of association rules, genetic algorithms, and methods based on evolving systems and support vector machines (Carmona and Castro 2020).

The main modules of a FIS are (Jang 1993; Amindoust et al. 2012; Sabri et al. 2013):

Fuzzifier: In this process, the crisp inputs are transformed into fuzzy inputs through a mathematical process. Various functions with various shapes are used to model the fuzziness; among them are concave, linear, and trapezoidal functions. For illustrative purposes, assume two sets A and B that belong to the universe of discourse X; the fuzzification process receives the elements a, b ∈ X and obtains the fuzzy degrees µA(a), µA(b), µB(a), and µB(b) (Amindoust et al. 2012; Sabri et al. 2013; Espín-Andrade et al. 2014).

Knowledge base: Generally known in the literature as the knowledge base, it is the union of the fuzzy rule base and the database to be used (Jang 1993). The database is the container in which the data used to carry out the inference process are stored. The rule base contains a set of fuzzy rules, also called fuzzy logical predicates; they are expressions of the type "if a then b", where appropriate labels are assigned to a and b for the knowledge they represent. In turn, they are characterized by their corresponding membership functions (Amindoust et al. 2012).
In other words, these rules are fuzzy predicates that make use of fuzzy sets, fuzzy logic, and fuzzy inference. These rules are generated through the knowledge and experience of experts in the subject matter. In the literature, two main types of fuzzy rules are found: Mamdani fuzzy rules and Takagi-Sugeno fuzzy rules (Sabri et al. 2013). The general form of a Mamdani rule can be expressed as follows:
IF var_1 is T_1 AND ... AND var_m is T_m THEN x_1 is Q_1, ..., x_p is Q_p
where var_i, i = 1, ..., m, is an input variable and x_j, j = 1, ..., p, is an output variable; T_i and Q_j are input and output fuzzy sets, respectively. The general form of a Takagi-Sugeno fuzzy rule is expressed in the following way:
IF var_1 is T_1 AND ... AND var_m is T_m THEN x_1 = f_1(var_1, ..., var_m), ..., x_p = f_p(var_1, ..., var_m)
where each f_j can be any real function.
Inference engine: the membership values are combined in the premise section using a specific t-norm, usually the multiplication or the minimum, thus calculating the firing strength of each rule in the rule base; if the firing strength equals 0, the rule is inactive. In the next step, the qualified consequent is calculated for each activated rule through its firing strength. As a result, a series of fuzzy output values is obtained (Jang 1993; Sabri et al. 2013).
Defuzzification: a mathematical process in which a fuzzy set is transformed into a real number. Through this process, the fuzzy values of the inference process are combined, obtaining a single number as the final result (Sabri et al. 2013). There are different ways to carry out the defuzzification process (Jang 1993); the following are commonly found in the literature. One scheme uses a weighted average of the output induced through the firing strength and each rule's membership functions, obtaining a general output; for this case, the membership functions must be monotonic. In other schemes, the general output is obtained by applying the maximum operation to the qualified fuzzy outputs; several methods have then been proposed to obtain the final output as a function of the general fuzzy output, such as the centroid of area, bisector of area, mean of maximum, and maximum criterion. In another scheme, Takagi-Sugeno if-then rules are used: a linear combination of the input variables plus a constant term gives the output of each rule, and a general output is obtained by a weighted average of these outputs.
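The following minimal Python sketch illustrates the fuzzifier, the inference engine (Mamdani rules fired with the minimum t-norm), and centroid defuzzification described above. The membership functions, rules, and variable names are invented for the illustration; this is not the implementation used in any of the reviewed works.

```python
import numpy as np

def trimf(x, a, b, c):
    # Triangular membership function with feet a, c and peak b (requires a < b < c)
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Discretized universe of discourse for the output variable "power"
y = np.linspace(0.0, 10.0, 1001)
low_power = trimf(y, -5.0, 0.0, 5.0)
high_power = trimf(y, 5.0, 10.0, 15.0)

def infer(temperature):
    # Fuzzifier: membership degrees of the crisp input in two linguistic states
    mu_cold = trimf(temperature, -20.0, 0.0, 20.0)
    mu_hot = trimf(temperature, 10.0, 30.0, 50.0)

    # Inference engine: firing strengths clip the consequent sets (min t-norm)
    rule1 = np.minimum(mu_cold, high_power)  # IF temperature is cold THEN power is high
    rule2 = np.minimum(mu_hot, low_power)    # IF temperature is hot  THEN power is low

    # Defuzzification: centroid of the aggregated (max) fuzzy output
    aggregated = np.maximum(rule1, rule2)
    return float(np.sum(y * aggregated) / (np.sum(aggregated) + 1e-12))

print(infer(12.0))
```

With the minimum t-norm each rule's consequent set is clipped at its firing strength, the clipped sets are aggregated with the maximum, and the centroid of the aggregated set gives the crisp output.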
11.2.6 Inference in Compensatory Fuzzy Systems
In Espín-Andrade et al. (2014), an inference system based on compensatory fuzzy logic is presented, which the authors call the compensatory inference system (CIS). It is defined as a generalization of deduction as in mathematical logic, with implication operators found in the fuzzy logic literature. A CIS is the combination of a compensatory logic and an implication operator, and each CIS is a logically rigorous inference system of easy application. CIS is a way of representing inaccurate knowledge and data in a way similar to how human thinking does. A CIS defines a non-linear correspondence between one or more input variables and an output variable, which provides a basis from which to make decisions or define patterns (Fig. 11.1). A CIS has the following modules:
(1) data entry,
(2) a fuzzifier module,
(3) a core that includes a predicate discovery and inference module, as well as the knowledge base generated by this module,
(4) a defuzzifier module and,
(5) data output to a numeric value.
The predicate discovery module searches for an implication predicate with the maximum truth value. The linguistic states and their corresponding membership functions are also optimized, adjusting them to a given dataset, and all of them are stored in the system. Notice that the obtained predicate is close to its natural interpretation. Afterwards, in the inference module, each variable of an unseen data record is fuzzified, taking the values of the given entry to the range [0, 1] according to the linguistic states of the best discovered predicate. These fuzzy values are evaluated in the discovered predicate to obtain the truth value of its components (antecedent and consequent). Finally, in the denormalization module, the truth values of the discovered predicate components are used to find the corresponding crisp value.
Fig. 11.1 Inference system based on compensatory fuzzy logic (Espín-Andrade et al. 2014)
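A hedged sketch of how a CIS-style evaluation could look is given below. The sigmoid membership functions, their parameters, the implication choice i(a, b) = d(not a, b), the variable names, and the record values are all assumptions made for illustration; they are not taken from Espín-Andrade et al. (2014).

```python
import numpy as np

def sigmoid_mf(x, gamma, beta):
    # Simple sigmoid membership function standing in for a discovered linguistic state
    return 1.0 / (1.0 + np.exp(-beta * (x - gamma)))

def c_and(*truths):
    # Geometric-mean compensatory conjunction
    t = np.array(truths, dtype=float)
    return float(t.prod() ** (1.0 / t.size))

def c_or(*truths):
    # Dual compensatory disjunction
    t = np.array(truths, dtype=float)
    return 1.0 - float((1.0 - t).prod() ** (1.0 / t.size))

def implication(antecedent, consequent):
    # One common fuzzy implication choice: i(a, b) = d(not a, b)
    return c_or(1.0 - antecedent, consequent)

# Unseen record with two crisp variables (values are illustrative)
record = {"temperature": 27.0, "humidity": 0.55}

# Fuzzification according to the linguistic states of the discovered predicate
high_temp = sigmoid_mf(record["temperature"], gamma=25.0, beta=0.8)
high_hum = sigmoid_mf(record["humidity"], gamma=0.6, beta=12.0)

antecedent = c_and(high_temp, high_hum)        # truth value of the premise
truth_of_rule = implication(antecedent, 0.9)   # consequent truth value assumed known
print(antecedent, truth_of_rule)
```

The antecedent and consequent truth values obtained this way would then feed the denormalization step described above to recover a crisp output.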
Models and applications of decision making and knowledge discovery have been developed using compensatory fuzzy logic and Archimedean compensatory fuzzy logic (Espin-Andrade 2019); few works have been oriented to the design of inference methods.
11.2.7 Performance of Systems Based on Fuzzy Rules
Fuzzy rule-based systems (FRBSs) are a common alternative for applying fuzzy logic in different real-world areas and problems. The schemes and algorithms used to generate these types of systems imply that their performance can be analyzed from different points of view, not only the precision of the model (Rey et al. 2011). Any model, including fuzzy models, must be accurate enough, but other perspectives, such as interpretability, are also possible for FRBSs. Therefore, the trade-off between precision and interpretability arises as a challenge for fuzzy systems, since current approaches can generate FRBSs with different trade-offs. The relevance of the rules can be added to precision and interpretability for a better compromise in FRBSs. Gunning (2017) also points out that, in general, for the development of models and methods for knowledge discovery, the precision and interpretability of the results must be considered. Although the ideal would be to reach a high degree of both, this is a complex task given that they conflict.
11.2.8 Semantic Interpretability in Fuzzy Systems
Defining interpretability is a challenging task, as it deals with the analysis of the relationship between two heterogeneous entities: a model of the system to be designed (usually formalized through a mathematical definition) and a human user (not as a passive beneficiary of the result of a system, but as an active reader and interpreter of the model's working engine). In this sense, interpretability is a quality that is inherent to the model and yet refers to an act performed by the user who is willing to understand and explain the meaning of the model (Alonso et al. 2015). Interpretability is essential for applications with high human interaction, for instance support systems in fields such as medicine and economics, among others. Since interpretability is not guaranteed by definition, an enormous effort has been made to find the basic constraints to be imposed in the fuzzy modeling process (Alonso et al. 2009). Among others, the reasons to justify interpretability in fuzzy systems are integration, interaction, validation, and, the most important, trust: the ability to convince users of the reliability of a model. Trust is also provided because the model explains the inference process used to produce its results (Alonso et al. 2015).
11.2.9 Restrictions and Interpretability Criteria
Interpretability is a quality of fuzzy systems that is not immediate to quantify. However, a quantitative definition is required both to assess the interpretability of a fuzzy system and to design new fuzzy systems. This requirement is especially strict when fuzzy systems are designed automatically from data through some knowledge extraction procedure (Alonso et al. 2015). Interpretability is achieved through a granulation process, which consists of transforming a crisp value into a linguistic variable, which allows obtaining the linguistic labels corresponding to fuzzy sets (Mencar and Fanelli 2008). To guarantee interpretability, this granulation process must be subject to several restrictions. Restrictions have been proposed in the literature, some of which can be very subjective. In the work of Mencar and Fanelli (2008), the constraints were critically reviewed, objectively describing and analyzing them to provide a guide for the design of interpretable fuzzy models. A common approach to defining interpretability is based on the adoption of a series of constraints and criteria that together define interpretability. This approach reflects the subjective nature of interpretability, because the validity of some conditions/criteria is not universally recognized and may depend on the context of the application. Alonso et al. (2015) describe a hierarchical organization that starts from the most basic components of a fuzzy system, that is, the fuzzy sets involved, and continues towards more complex levels, such as fuzzy partitions and fuzzy rules, even considering the model as a whole (Fig. 11.2).
Fig. 11.2 Restrictions and interpretation criteria (Alonso et al. 2015)
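To make the granulation idea above concrete, the following minimal Python sketch partitions an invented numeric universe into three triangular fuzzy sets and returns the linguistic label with the highest membership for a crisp value. The ranges and labels are assumptions for illustration, not taken from the reviewed works.

```python
def trimf(x, a, b, c):
    # Triangular membership function with support [a, c] and peak at b (a < b < c)
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def granulate(value, vmin=0.0, vmax=10.0):
    # Strong fuzzy partition of [vmin, vmax] into three linguistic labels
    mid = (vmin + vmax) / 2.0
    partition = {
        "low": (vmin - 1e-9, vmin, mid),      # left shoulder approximated by a narrow foot
        "medium": (vmin, mid, vmax),
        "high": (mid, vmax, vmax + 1e-9),     # right shoulder approximated the same way
    }
    memberships = {label: trimf(value, *abc) for label, abc in partition.items()}
    return memberships, max(memberships, key=memberships.get)

memberships, label = granulate(7.5)
print(memberships)   # memberships sum to 1 over the partition
print(label)
```

A partition like this one satisfies the common coverage requirement (every crisp value has nonzero membership in some label), which is one of the basic constraints imposed on the granulation process.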
11.3 Literature Analysis on Surveys and Reviews
Table 11.1 shows a list of reviews and surveys that have focused on interpretability, accuracy, and fuzzy inference. To our knowledge, these are all such existing works. Unlike them, we present the first scoping review that explores works dedicated to studying how to construct inference systems with high levels of accuracy and interpretability, opening opportunities for new fuzzy logics such as the compensatory and Archimedean ones. Ahmed and Isa review the balance between interpretability and precision through fuzzy information granules that allow this balance (Ahmed and Isa 2017). These entities allow quantifying the degree of interpretability of fuzzy sets. The level of granularity varies in direct proportion to the accuracy of a knowledge system. To design fuzzy information granules with high precision, it is desirable to avoid conflicting decisions, incorporate interpretability restrictions, and optimize parameters. Finally, that work describes a series of techniques to accomplish these tasks.
Table 11.1 Comparison of surveys and reviews related to this work
Columns: Paper; Survey (S) or Review (R); Interpretable approaches; Accurate approaches; Interpretability-accuracy; Fuzzy inference
Compared works: Ahmed and Isa (2017, R); Carvalho et al. (2019, S); Chakraborty et al. (2018, S); Cordón (2011, R); He et al. (2020, R); Mi et al. (2020, R); Mohseni et al. (2020, S); Otte (2013, R); Shukla and Tripathi (2011, S); Shukla and Tripathi (2012, R); Zhang and Zhu (2018, S)
Carvalho et al. present a review of the current state of machine learning research, proposing a series of metrics to compare the methods used to evaluate interpretability in the ML field, instead of creating new metrics (Carvalho et al. 2019). They also specify that a general-purpose definition of interpretability is not possible; the application domain and the use case of specific problems must be considered, so the concept must be contextualized. Cordón mentions that for a time the objective of interpretability was set aside to improve precision in fuzzy systems (Zhang and Zhu 2018), but he finally returned to the beginning, seeking a balance between accuracy and interpretability. The contribution consisted of reviewing the different proposed Mamdani-type fuzzy genetic systems to improve their precision without altering the degree of interpretability, or reducing it as little as possible. He et al. discuss the importance of extracting rules that balance precision and interpretability and the problems encountered when performing this activity using multi-layer perceptrons (MLP) and deep neural networks (DNN) (He et al. 2020), among which are: (1) the quantitative evaluation of the interpretability and precision of a set of rules, (2) the efficient extraction of rules from the MLP and the DNN and, (3) the search for the balance between interpretability and precision in the set of rules. That work also reviews existing works that seek this balance using DNNs, to guide this line of research. Ishibuchi et al. address the conflict between interpretability and precision in classifiers based on fuzzy rules, focused rather on the goal of interpretability, to demonstrate that there are still unanswered questions on this topic (Ishibuchi et al. 2011). That research work begins by conducting a brief survey about the design of classifiers based on fuzzy rules with linguistic conditions.
11.4 Research Methodology
Since ACFL is an emergent topic for knowledge discovery in general and inference processes in particular, this scoping review is used for exploring the research area dedicated to studying how to construct inference systems with high levels of accuracy and interpretability, simultaneously. The purpose is to identify opportunities for fuzzy models and how ACFL can contribute to this area. Data collection and analysis took place in five main phases, which are described below.
Step 1. We first created a draft set of four keywords [interpretability, interpretable, accuracy, fuzzy] to inform our literature scoping review.
Step 2. To take previous systematic literature analyses into account, we add to each combination of these words, separately, the words “survey”, “review”, “systematic review”, “scoping review”, “mapping review”, or none of them.
Step 3. We analyze the occurrence of each of these keyword combinations in searches using Google Scholar. The search was configured by time (any time or 2016–2020) and word location (throughout the paper or in the paper title).
Step 4. We made analytics to determine tendencies in journals, citations, leading journals, and the evolution of the area over time (see Sect. 11.6.1).
Step 5. Reading the papers obtained in Step 4 and their references, we identify those dealing with fuzzy inference systems and analyze some of them. The results are incorporated into several sections (Sects. 11.3, 11.5, 11.6.2, and 11.7).
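As an illustration of Steps 1 to 3, the following Python sketch enumerates the kind of query strings described above. The exact query syntax used by the authors is not reported, so this is only a hypothetical reconstruction of how the combinations could be generated.

```python
from itertools import combinations

keywords = ["interpretability", "interpretable", "accuracy", "fuzzy"]
suffixes = ["", "survey", "review", "systematic review", "scoping review", "mapping review"]

queries = []
for r in range(1, len(keywords) + 1):
    for combo in combinations(keywords, r):       # each combination of the base keywords
        for suffix in suffixes:                   # optionally add one review-type term
            terms = list(combo) + ([f'"{suffix}"'] if suffix else [])
            queries.append(" ".join(terms))

print(len(queries), "candidate query strings")
print(queries[:3])
```

Each resulting string would then be submitted to the search engine under the time and word-location filters mentioned in Step 3.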
11.5 Research Analysis
11.5.1 A Basic Case Study of Interpretability with CFL
The present case study is based on Cruz-Reyes et al. (2019), in which the order-picking problem within a warehouse is studied; this is an optimization problem of filling orders efficiently. Warehouse activities include the supply of orders; to carry out this task, the items included in each order must be collected (Saenz 2011), which is known as order-picking. Among the approaches to solve this problem is order batching, in which it is selected which orders are collected simultaneously in a tour. CFL is applied for the discovery of fuzzy predicates that define whether an order is integrated with another in the same batch, to minimize the distance traveled in the warehouse to pick all orders. For this purpose, a workflow was developed using the Eureka-Universe platform (Padrón-Tristán et al. 2020) (Fig. 11.3). In Cruz-Reyes et al. (2019), two instances are used for training and testing, which are transformed into instances of classification by taking combinations of orders to
Fig. 11.3 Workflow for the order-picking problem using Eureka-Universe (Padrón-Tristán et al. 2020)
determine if they are included in the same batch, which is the decision variable in the classification problem. The following processes are shown in the workflow of Fig. 11.3:
(1) Instances pre-processing. The training and testing instances are transformed into those of a classification problem.
(2) In the Eureka-Universe platform:
(a) with the training instance, n iterations of the discovery task are performed to generate a list of predicates p(x), where p(x) has the structure q(x) equals class;
(b) from each discovered predicate, the premise q(x) is taken to be evaluated using the testing instance; the truth value of each record in the instance is taken for comparison against a class separation threshold.
(3) Post-processing of results for each premise q(x) and their aggregation:
(a) a class separation threshold of 0.5 is proposed, and the truth values of each record i in the testing instance are analyzed:
class_i = 0 if truth value_i < threshold, and class_i = 1 if truth value_i ≥ threshold;
(b) the class obtained is compared against the real class of each record in the testing instance, and the accuracy is calculated according to the hits obtained (see the sketch below);
(c) the best predicate, according to its classification precision, is selected as the classification model for this problem.
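A small Python sketch of the post-processing step (3) is shown below. The threshold of 0.5 follows the text, while the truth values and real classes are illustrative, in the spirit of Table 11.3.

```python
def classify(truth_values, real_classes, threshold=0.5):
    # Assign class 1 when the premise truth value reaches the threshold, else class 0,
    # and measure accuracy as the proportion of hits against the real classes
    predicted = [1 if t >= threshold else 0 for t in truth_values]
    hits = sum(p == r for p, r in zip(predicted, real_classes))
    return predicted, hits / len(real_classes)

truths = [0.17141868, 0.07292227, 0.69229263, 0.1418835]   # illustrative truth values
real = [1, 0, 1, 0]                                        # illustrative real classes
predicted, accuracy = classify(truths, real)
print(predicted, accuracy)   # [0, 0, 1, 0] 0.75
```

The same procedure is repeated for each premise, and the premise (or aggregation) with the highest accuracy is kept as the classification model.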
The discovery task obtained the following three predicates:
1. (EQV (AND “added aisles” “aisles 2_1” “vol 2” “added aisles_0”) “batch”)
2. (EQV (AND “batch distance_1” “added aisles_0” “sp dist 1_0” “vol 1”) “batch”)
3. (EQV (OR “added aisles_1” “added aisles”) “batch”)
From these predicates, the premises were taken to be the classification model of the order-picking problem:
1. (AND “added aisles” “aisles 2_1” “vol 2” “added aisles_0”)
2. (AND “batch distance_1” “added aisles_0” “sp dist 1_0” “vol 1”)
3. (OR “added aisles_1” “added aisles”)
Table 11.2 shows the minimum and maximum values of the variables in the instances and the parameters of the MFs for their corresponding linguistic states for the first predicate obtained. Table 11.3 shows a sample of the classification method according to the truth value calculated for premise 1, where 0.5 is the proposed threshold to separate classes, and the calculated class is compared to the real class to get the accuracy according to the hits obtained.
Table 11.2 MF parameters values of linguistic states for premise 1
Label | Variable | Min value | Max value | Alpha | Gamma | M
vol 2 | vol 2 | 1 | 4 | 3.6755 | 1.1757 | 0.0136
added aisles | added aisles | 0 | 3 | 0.6227 | 0.0404 | 0.8704
aisles_2_1 | aisles 2 | 1 | 4 | 1.9038 | 1.1369 | 0.0246
added aisles_0 | added aisles | 0 | 3 | 0.9392 | 0.0378 | 0.9936
(Alpha, Gamma and M are the GMF parameters.)

Table 11.3 Sample of classification accuracy for premise 1
Batch | Truth value premise 1 | Class obtained | Classification result
1 | 0.17141868 | 0 | Fail
0 | 0.07292227 | 0 | Success
1 | 0.69229263 | 1 | Success
0 | 0.1418835 | 0 | Success

Table 11.4 Accuracy of premises obtained
Premises | Accuracy (%)
Premise 1 | 93.33
Premise 2 | 86.67
Premise 3 | 95.56
Aggregation | 95.56
Table 11.4 shows that the highest accuracy was obtained by premise 3 and by the aggregation operator. Finally, according to the MF parameters for the linguistic states, it is possible to express the discovered predicates in natural language; for example, for predicate 1, considering that added aisles appears twice with similar parameters: not low “added aisles” AND low “aisles visited order 2” AND not high “volume order 2” EQUALS “batch” orders.
11.5.2 Interpretability Comparison of CFL With Other Fuzzy Logic Structures
A transdisciplinary concept of interpretability has been introduced and applied in Espin-Andrade et al. (2015). That concept is derived from the interpretability of a theory A according to a theory B, which language theory usually defines as
the demonstration of all the results of A using the language of theory B. According to that transdisciplinary concept, a theory is 'interpretable by natural language' if it is interpretable by the theories and scientific methods closest to the social practices that use it. Those disciplines and scientific approaches are logic, decision-making theories and techniques, and mathematical statistics, among others. The mentioned paper specifically treats these three types of interpretability of Compensatory Fuzzy Logic (CFL) through its axioms and properties, connecting them with important selected results of bivalued logic, normative and constructive decision-making theories, the AHP and ANP methodologies, and the Central Limit Theorem. In that way, four shortcomings of many fuzzy approaches to interpretability by natural language are faced successfully: (1) incompatibility with classical rationality, (2) incompatibility with human behavior, (3) lack of treatment of fuzzy systems and (4) contradiction with accuracy. Interpretability is demonstrated by the capacity of CFL for the following wide range of tasks:
(1) evaluating how convenient an alternative is according to a predicate obtained from expressions of the DM preferences,
(2) searching for new convenient alternatives using that predicate,
(3) evaluating how true an expression is using facts and/or expert opinions,
(4) estimating how true an expression is using facts associated with a probabilistic sample,
(5) discovering new knowledge expressed in natural language using heuristics and/or optimization and,
(6) demonstrating and discovering new knowledge by reasoning.
Table 11.5, taken from the reviewed paper, summarizes the comparison of CFL with the principal fuzzy ways to use the natural language productively in fuzzy logic (Mamdani fuzzy inference systems (MFIS), integration of membership functions by operators (IMFO), and computing with words (CWW) and linguistic data summarization (LDS)), according to the six tasks explained in the paragraph above. Under this evaluation framework, CFL is the only one that can perform all the tasks, so it is assumed that it could have a higher degree of interpretability.
Table 11.5 Skill comparison by tasks with the principal fuzzy ways to use productively the natural language
Task 1. CFL: by predicates. MFIS: by rules. IMFO: by aggregation using operators. CWW and LDS: by specific preforms.
Task 2. CFL: searching by increasing truth values of a predicate. MFIS: not possible naturally. IMFO: searching the best according to information integration. CWW and LDS: searching by increasing truth values of a preform.
Task 3. CFL: by the evaluation of a predicate to obtain a truth value. MFIS: not possible naturally. IMFO: by aggregation to obtain a membership value or a label. CWW and LDS: by the evaluation of specific preforms.
Task 4. CFL: calculating the truth value over the sample. MFIS: not possible. IMFO: not possible. CWW and LDS: not possible.
Task 5. CFL: possible to get any kind of knowledge by predicates. MFIS: possible to get a set of rules in the form of a FIS. IMFO: not possible. CWW and LDS: possible to get knowledge by specific preforms.
Task 6. CFL: possible freely of its logical structure. MFIS: possible expressed as a set of rules and reasoning. IMFO: not possible. CWW and LDS: possible associated to specific preforms.
11.5.3 Related Works in Balancing Accuracy-Interpretability
This section presents a sample of the works reviewed on the search for a compromise between interpretability and accuracy of fuzzy systems. The section ends with a review of the only works that address the inference problem with compensatory fuzzy logic. The result of the analysis of all the reviewed works is presented in the next section to open lines for research. In the literature, two approaches address the issues of fuzzy systems considering their precision and interpretability (Cpałka 2017):
(a) Precise fuzzy modeling, aimed at obtaining fuzzy models with better accuracy.
(b) Linguistic fuzzy modeling, whose objective is to have fuzzy models distinguished by good readability.
In the design of interpretable fuzzy systems, the aim is to ensure that they are characterized by high accuracy, good readability of the accumulated knowledge, and low complexity, but in practice this is not possible for different reasons, among which are:
(1) to achieve high accuracy in modeling, a complex fuzzy system structure is used, which does not lead to good readability,
(2) to achieve good interpretability, a simple fuzzy system structure can be used, which does not lead to high precision.
In practice, the goal should be to develop solutions enabling a balance between the interpretability and accuracy of the system and of the knowledge stored in it (Fig. 11.4). Considering the design aspects of a FIS, attention should be paid to two types of interpretability:
(1) focused on the complexity of the system, in particular its fuzzy rule base and,
(2) focused on the readability of its fuzzy rule base.
Fig. 11.4 The balance between precision and interpretability (Cpałka 2017)
Accommodation of semantic-based interpretability in the evolutionary linguistic fuzzy controller. Cremer et al. (2019) propose the adaptation of
interpretability based on semantics in the evolutionary linguistic fuzzy controller through an overlay-based coding strategy for the parameters of the triangular MFs. With this strategy, the coverage and distinguishable appearance of the fuzzy partitions are guaranteed during evolution. The idea behind this proposal is that the set of MFs that make up the fuzzy partition is defined as overlapping functions instead of separate functions subject to some restrictions, as earlier works do. To evaluate the validity and usefulness of the proposed integer-coded evolutionary algorithm (IEA), they performed simulations. The results obtained suggest that the proposed IEA is efficient in evolving only valid and interpretable linguistic fuzzy logic controllers (FLCs). They also show the excellent dynamic performance of the evolved FLC under different operating conditions, reflecting the non-linear character of the designed linguistic FLCs. The proposed method can be extended to various MF classes (e.g., trapezoidal, Gaussian, flared, etc.) and possibly to fuzzy model design. Future work will investigate the stability analysis and robustness improvement of the evolved FLC, as well as comparisons with other advanced controllers.
FRBSs based on multiple objectives for the improvement of the accuracy-interpretability trade-off. Any model, including fuzzy models, must be accurate enough, but other perspectives, such as interpretability, are also possible for FRBSs. Therefore, the trade-off between precision and interpretability emerges as a challenge for fuzzy systems, since current approaches can generate FRBSs with different trade-offs (Rey et al. 2011). Here, the relevance of the rule is added to precision and interpretability for a better compromise in FRBSs. These three factors are involved in this approach to making a rule selection using an evolutionary multi-objective algorithm. Relevance is an idea managed and understood as "something that is or is considered worthwhile". Here, algorithmic or system relevance is managed as the relationship between a query and the information objects in a system, as retrieved, or not retrieved, by a certain procedure or algorithm (Megherbi et al. 2019). In Rey et al. (2011), a Multi-Objective Evolutionary Algorithm (MOEA) approach based on precision and interpretability, which includes relevance, was carried out. The latter was an earlier approach to the work in Rey et al. (2017). MOEAs are computational algorithms inspired by genetic foundations to find sets of solutions to problems subject to several objectives to be optimized simultaneously.
Fig. 11.5 Improving the balance between precision and interpretability (Megherbi et al. 2019)
The objective in Megherbi et al. (2019) is to improve the well-known balance between precision and interpretability for FRBSs (Fig. 11.5) while preserving the most relevant rules of each FRBS for the compromise. This objective is achieved through a rule selection based on multi-objective optimization that involves the concepts of precision, interpretability, and relevance.
Extracting balanced interpretability-precision rules from artificial neural networks. He et al. (2020) explain the importance of extracting rules balanced in terms of precision and interpretability in combination with practical application scenarios, and then present three central problems found when extracting such rules:
(1) how to define the interpretability and precision of a set of rules and evaluate the rule set quantitatively,
(2) how to efficiently extract rules from plainly trained multi-layer perceptrons and deep neural networks (MLP and DNN, respectively) and,
(3) the precision and interpretability of the rule set are often contradictory, so how can the two conflicting goals be balanced?
He et al. (2020) summarize the current main methods for quantitatively assessing the accuracy and interpretability of rule sets, and then review existing methods for extracting rules from trained MLP and DNN models. The reason rule extraction from the MLP is summarized in detail is that the DNN developed from the shallow MLP. To some extent, a DNN can be considered an MLP with more hidden layers, so shallow MLP rule extraction methods provide important guidance for extracting rules from DNNs. So far, little research has focused on extracting rules balanced in interpretability and accuracy from DNNs.
Fig. 11.6 Approaches to design rules considering accuracy and interpretability: a based on NSGA-II, b based on a fuzzy neural network
Figure 11.6 shows two approaches from the literature to balancing accuracy and interpretability in fuzzy inference systems; the first proposes an evolutionary algorithm based on NSGA-II and the second a shallow multilayer perceptron neural network. Figure 11.6a shows a multi-objective evolutionary optimization algorithm (MOEOA) for the automatic design of fuzzy rule-based classifiers (FRBCs); the algorithm generates solutions characterized by various levels of accuracy-interpretability trade-offs (Gorzałczany and Rudziński 2016). Figure 11.6b shows an algorithm that uses a decomposition approach in which rule extraction works at the neuron level instead of at the architecture level of a complete network. The entire neural network is divided into neural units, and rules are extracted from each neuron and subsequently added to represent the complete network. These kinds of algorithms may lose interpretability but retain accuracy (He et al. 2020).
Designing fuzzy inference systems from data: an interpretability-oriented review. Hierarchical fuzzy partitioning (HFP) is used to build the fuzzy inference system using both fuzzy decision trees and simple selection mechanisms. Alonso et al. (2008) describe a methodology to build highly interpretable linguistic knowledge (HILK) bases through integration processes, consistency analysis, simplification, and optimization, under the fuzzy logic formalism. The combination of expert knowledge and knowledge induced from data can lead to more accurate systems. Induced rules must be as interpretable as expert rules for efficient cooperation. Three conditions for a knowledge base (KB) to be interpretable have been established (Alonso et al. 2008):
(1) use interpretable fuzzy partitions,
(2) use a small number of rules and,
(3) use compact rules for large systems.
HILK is based on a linguistic fuzzy model approach although it takes care of accuracy and interpretability throughout the entire modeling process of the system.
11.5.4 Related Works in Interpretable Inference with CFL
Discrimination of brain tissues in magnetic resonance imaging (MRI) based on predicate analysis and CFL. In Meschino et al. (2008), MRI brain images are analyzed pixel by pixel using fuzzy logic predicates, reproducing the considerations that experts use when interpreting these images, to identify the tissues that the pixels represent. The goal is to determine which tissue corresponds to each pixel with simple operations and short processing time. The method discriminates cerebrospinal fluid, gray matter, and white matter in simulated and real images, validating the results using the Tanimoto coefficient.
Virtual Savant paradigm based on CFL. Virtual Savant (VS) (Pinel et al. 2013; Massobrio 2018) is a method to automatically generate new parallel solvers for optimization problems. In the learning stage, it models a reference algorithm from solutions to a given problem, and then it can accurately reproduce solutions on new instances. The result is a scalable classification model to solve problems of different dimensions. The execution of VS is carried out in two stages: in the first one, the learned model, which is executed in parallel, generates a population of solutions; in the second stage, a refining operator is applied to the population of solutions and the best solution is then selected as the VS output. In Massobrio (2018) the performance and precision of this technique were analyzed on the task assignment problem and the knapsack problem. Padrón-Tristán (2020) extends VS by solving the bin packing problem (BPP), transforming the instances from an optimization problem into a classification one by establishing, in each observation of the instance, a relationship between combinations of the weights of two objects, the residual capacity, and the classification variable, which indicates whether or not the objects were stored in the same container. The ML technique is replaced by a classifier based on CFL. In the learning stage, it selects from a list of discovered predicates the best one as the classification model. In the execution stage, the selected predicate is used to evaluate the observations of unseen instances. The calculated truth values are used as the probability vector to generate the population of solutions, which are later refined to obtain the best one as the output of the VS (Fig. 11.7).
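The following Python sketch illustrates, under assumed data, how truth values produced by the selected predicate could be used as a probability vector to sample a population of binary solutions, as described above. The function name and the values are hypothetical and are not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_population(truth_values, population_size=10):
    # Each truth value (in [0, 1]) is read as the probability that the
    # corresponding binary decision is 1 (e.g., "both objects share a container").
    p = np.asarray(truth_values, dtype=float)
    return (rng.random((population_size, p.size)) < p).astype(int)

truths = [0.92, 0.15, 0.67, 0.40]   # illustrative truth values from the selected predicate
population = generate_population(truths)
print(population)
```

A refining operator would then be applied to the sampled population, and the best resulting solution would be reported as the VS output.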
11.6 Research Findings and Discussion
11.6.1 Analytics of Research Papers
This section summarizes the main results of searching Google Scholar with four keywords (interpretability, interpretable, accuracy, and fuzzy) and applying filters on time, keyword location, and type of review.
Fig. 11.7 Execution of VS for the BPP (Padrón-Tristán 2020)
In Fig. 11.8, queries Q1–Q3 search the entire document, while Q4–Q8 search only the paper title. Several time intervals were explored to identify the age of the area. Because of the observed youth of the area, we finally decided to report the results for any time period and to select Q6 and Q7 for a detailed analysis of the papers obtained. As expected, the number of papers decreases when any type of systematic review is required and when the search is restricted to the title, leaving a manageable and relevant set of papers. Notice that mapping studies like the present work are few. Figure 11.9 shows the results of the two queries Q6 and Q7, in the number of papers and citations; they have in common the keywords interpretability and accuracy and their location in the paper title. The location filter is meant to identify papers that probably deal with the research problem in greater depth. The first query, Q6, considers papers approaching, in general, the problem of balancing interpretability and accuracy. In contrast, the second query, Q7, excludes those not related to the fuzzy approach of most interest in the present work. Q6 shows an increasing tendency, while Q7 resembles a uniform distribution. With the fuzzy approach, the number of publications and citations is approximately 50% of that of the other approaches. In Fig. 11.10, to obtain the blue graph, three queries were made that search the keywords only in the title (Q7, Q4-survey, Q4-review) and recover relevant papers for extracting useful information about journals. With post-processing, the number of articles per journal shown by the blue graph is obtained. The red graph shows the number of articles and citations per journal.
Fig. 11.8 Results of the search in the number of papers using eight different queries combining four keywords (interpretability, interpretable, accuracy, and fuzzy) and two filters (review type and keyword location)
Fig. 11.9 Results of the search in the number of papers (blue) and citations (red) per year, using two queries that search in the paper title and differ in the keyword fuzzy
This graph was obtained from the Q3 query, which searches all the keywords of interest in the entire article, with a filter to consider the previously found journals. The order of the journals allows identifying those that lead the area in the number of articles published. This scoping review allowed us to identify that balancing interpretability and accuracy with fuzzy approaches is a young and current area. Also, with this analysis, relevant journals and papers were identified; some of them are analyzed in the next sections. Applying the proposed queries, researchers can recover all papers to make a full systematic review.
Fig. 11.10 Number of research papers per journal
11.6.2 Discussion of Papers on Accuracy-Interpretability
Some papers were selected to show the scope of their proposals (Table 11.6). It can be noticed that they use different measures to establish the accuracy-interpretability balance, and these measures could be complementary. Similarly, the optimization algorithms use different objective functions, mainly maximizing precision and interpretability measures, and some incorporate other measures such as the minimization of a penalty. The analyzed studies quantify the improvement achieved in precision and interpretability on different data sets, which makes them difficult to compare. Table 11.6 is a sample of the articles retrieved by the different queries.
11.6.3 Open Trends to Research
This section presents new directions for research within the scope of the interpretability of efficient learning methods for inference systems.
Comparative analysis of proposed methods to compensate for accuracy and interpretability in the design of fuzzy systems (Shukla and Tripathi 2011). Various methods have been proposed in the specialized literature to maintain a good compromise in the design of complex fuzzy systems. They include multi-objective evolutionary optimization, context adaptation, fuzzy hierarchical modeling, and many other algorithms and approaches related to rules, rule bases, MFs, fuzzy partitions, and granularity (Mencar 2005), among others.
Table 11.6 Comparison of selected papers on accuracy and interpretability
Rey et al. (2017). Datasets: the quake, abalone, treasury, among others. Accuracy measure: mean square error (MSE). Interpretability measures: interpretability = complexity, rule antecedents, mean rules, active attributes. Other measures: relevance. Algorithm: fuzzy modeling: scatter, linguistic; rule selection: SPEA2. Objective: maximize accuracy and interpretability. Main results: accuracy = 48.4-62.1%, interpretability = 23.3-52.1%.
Rey et al. (2011). Datasets: the quake, abalone, treasury, among others. Accuracy measure: mean square error (MSE). Interpretability measures: number of rules, number of membership functions, incoherency, similarity. Other measures: penalty (singular value, r-value, variance). Algorithm: genetic algorithm (NSGA-II). Objective: maximize accuracy and interpretability, minimize penalty. Main results: accuracy = 62.2%, interpretability = 56.8%.
Hjørland (2010). Datasets: Australian Credit Approval, German Credit Approval, Credit Approval. Accuracy measure: mean of correct classifications. Interpretability measures: number of rules, similarity, redundancy, incoherency, completeness. Other measures: interpretability = number of rules, mean activity of the rule, overlapping. Algorithm: fuzzy modeling: FasArt; rule selection: NSGA-II. Objective: maximize accuracy and interpretability. Main results: mean = 86.7, interpretability = 5.8.
Cpałka (2017). Datasets: Box and Jenkins gas, chemical plant, delta ailerons. Accuracy measure: root mean square error (RMSE). Interpretability measures: complexity, similarity. Algorithm: proposed evolutionary algorithm. Objective: maximize accuracy and interpretability. Main results: accuracy = 0.2959, interpretability = 1.59.
Cremer et al. (2019). Accuracy measure: predictive performance of the classifier. Interpretability measures: DT depths for all the 22 contingencies. Other measures: impurity. Algorithm: optimal classification trees, greedy optimization-based tree. Objective: maximize accuracy. Main results: accuracy = 1% best.
The interpretability research field should focus more on comparing existing methods of explanation rather than simply creating new ones.
Formalization and unification of interpretability (Carvalho et al. 2019). The main reason the problem of interpretability remains unsolved is that interpretability is a highly subjective concept and therefore difficult to formalize. Some attempts have been made to formalize interpretations mathematically. It is also clear that a unified notion of interpretability is necessary.
Development of ad hoc interpretation methods based on deep networks and fuzzy logic for big data (Cpałka 2017). Post hoc interpretation methods explain existing models, while ad hoc methods build new models and emphasize deep network training, which is critical to the success of deep learning. Post hoc interpretability tends to be more biased, while ad hoc interpretability is more specialized and can sacrifice representative ability. Fuzzy rule-based systems have strengths in interpretation, although acquiring a large number of rules is a tedious and expensive process. A neural network, which represents the extracted knowledge in a distributed manner by weights between nodes, is efficient with large amounts of data but inefficient with small data and suffers from a lack of interpretability; see the study in Megherbi et al. (2019). It seems that neural networks and fuzzy logic are complementary to each other towards better interpretability. Some studies have followed this direction, highlighting the proposals with radial basis function (RBF) networks. In the literature, it has been shown that the RBF network allows the generation and encoding of fuzzy rules in its adaptive representation without loss of precision and in a simpler way compared to general networks. There is preliminary evidence that it is possible to extend these results to deep RBF networks, which would allow approximating very complex functions. In general, more research effort is required to synergize fuzzy logic and deep learning techniques, with the help of big data.
Development and interpretation of deep networks based on their understanding, optimization, and support from neuroscience (Cpałka 2017). To date, the only truly intelligent systems remain human. When it comes to interpretability, given that biological and artificial neural networks are deeply connected, advances in neuroscience should be relevant and even instrumental for the development and interpretation of deep learning techniques. The main aspects of deep learning interpretability include understanding the underlying mechanism of deep networks and understanding the optimization of these networks. It requires: (1) the construction of a cost function that reflects the brain's mode of operation and (2) the incorporation of optimization algorithms that replace backpropagation, incorporate neuromodulation, and allow learning with few exposures, which are conditions of the internal functioning of the human brain. There are very few studies that discuss the interpretability of training algorithms. A powerful and interpretable training algorithm based on non-convex optimization that has some kind of uniqueness, stability, and continuous dependence on the data, among other characteristics, would be highly desirable. The combination of Bayesian networks with deep learning is a promising alternative for training with few exposures (Matsubara 2020).
Visual and textual explanations for KD applications with multidisciplinary support, knowledge of context, and human interaction (Chakraborty et al.
2018;
Tjoa and Fellow 2019; Carvalho et al. 2019). The visual and textual explanation provided by an algorithm is an ideal choice. For deep learning methods, the study and correction of incorrect explanations do not seem to exist, so multidisciplinary work may be necessary, combining domain experience, applied mathematics, and data science, among others (Zhang and Zhu 2018). Also, a solution is required that provides explanations considering the domain of the problem, the case study, and the type of user. One promising line is the incorporation of human interaction into a convolutional network (CNN) to guide learning and produce an explanatory graph.
Evaluation of interpretation methods considering quality and utility measures (Shukla and Tripathi 2011; Tjoa and Fellow 2019). Quantifying precision and interpretability is very important to find a good balance between them. There are many workable methods for estimating precision, but measuring interpretability is quite a difficult task due to the involvement of several conflicting aspects, including transparency and complexity of the rules. Also, few publications emphasize good results in real applications, so it would be useful to conduct more comparative studies between the performance of interpretation methods, not only with performance metrics of precision and interpretation but also with standard utility judgments.
Methods oriented to interpretation and performance in fuzzy inference with CFL and ACFL (Meschino et al. 2008; Gorzałczany and Rudziński 2016). Few research papers use KD to develop predicate discovery tasks that search for MF parameters that fit the data. These works take advantage of the knowledge of a user, an expert in the area, who defines the predicates. For CFL and ACFL, proposals have been presented that address this problem with generalized MFs, which could facilitate the interpretation of results. In CFL and ACFL there are no investigations where interpretability can be specialized according to the context of the problem and where, with the support of an expert user (human or artificial), labels are assigned to the linguistic states according to the discovered MFs, meeting interpretability criteria and restrictions. On the other hand, as has been commented, the precision of the discovered predicates and their interpretability are conflicting objectives, so the investigations carried out tend to diminish one of these characteristics to increase the other, depending on the case. For CFL and ACFL, the trade-off between precision and interpretability has not been addressed. As for inference, although many research works perform this task, they focus on Zadeh's fuzzy logic, while for compensatory fuzzy logic there are only two proposals. From the above, there is still much work to do on inference using compensatory fuzzy logic.
11.7 Future Research Focus
This section suggests some future work based on the reviewed works that seek to balance the approaches of accuracy and interpretability. It is highly recommended to formalize interpretations mathematically and to unify the notions of interpretability. In Alonso et al. (2015), the need to carry out investigations
regarding interpretability's blurred nature is observed, thus projecting interpretable fuzzy systems towards transdisciplinary investigations and mentioning the need to investigate metrics that allow evaluating subjective values of interpretability. It is also necessary to use new representations of fuzzy systems that translate complex relationships into highly interpretable rules. This is also mentioned in Alonso et al. (2009), who point out the interest in making use of various rule base structures. Other representations are needed to add interpretability features to neural-network-based methods; neuroscience could be relevant for the development and interpretation of deep learning techniques. In works such as Lughofer et al. (2018) and Škrjanc et al. (2019), the need for systematic and formal approaches is mentioned to achieve a balance between stability and plasticity that allows adaptability to the work environment, together with the study of the aggregation operators used in these models. The need to define labels that allow interpretability, and thus also methods that allow modeling the data, are approaches on which it is necessary to investigate. Besides, it is necessary to define whether the results achieved should satisfy the interpretability needs of experts on the subject, or whether the understanding of the general public will be sought (Yeganejou et al. 2020). This demands the incorporation of human interaction to guide learning and produce useful explanatory results. The definition of more powerful inference approaches than those currently found in the literature is necessary (Yeganejou et al. 2020). A promising approach is the formulation of models and the development of methods oriented to interpretation and accuracy in inference systems with compensatory fuzzy logic (CFL) and Archimedean compensatory fuzzy logic (ACFL). A fair comparative analysis of the methods proposed to compensate for accuracy and interpretability in the design of fuzzy systems should also be performed; the evaluation, with statistical support, must consider measures of both characteristics. In summary, there are still areas within fuzzy inference systems on which research is necessary; the results shown so far still have room for improvement and growth. It is necessary to deepen various concepts that bring the community closer to a better understanding of them.
11.8 Conclusions and Recommendations
This work presents a scoping review on achieving an adequate balance between interpretability and accuracy, putting it in the context of other systematic review works. Some analytics on publications in this area show its youth and aspects of relevance and leadership, laying the groundwork for a complete study that considers the elements described below. Concepts common to fuzzy logic are reviewed, emphasizing compensatory fuzzy logic (CFL) and its application to inference. Through a wide range of tasks involving interpretability, a comparative study of the literature shows that CFL is the only one
that can perform all these tasks, and thus has a higher degree of interpretability. Quantitative experimental studies could support this conclusion. Although fuzzy approaches are a powerful alternative due to their high interpretation capacity, they are limited in precision with respect to other approaches, mainly those derived from neural networks. The hybridization of these approaches is an alternative that is being investigated and has not yet been explored with CFL. The comparative performance of works in the literature on the balance between interpretability and precision shows the need to carry out broad evaluations under homogeneous experimental conditions between similar and different approaches. Most of the studies analyzed do not support their performance conclusions with tests of statistical significance. Finally, this review discusses research opportunities both in conceptual aspects and in the construction and evaluation of models and methods that contribute to the accuracy-interpretability problem for fuzzy inference systems.
References Ahmed M, Isa N (2017) Knowledge base to fuzzy information granule: a review from the interpretability-accuracy perspective. Appl Soft Comput J 54:121–140 Alonso JM, Castiello C, Mencar C (2015) Interpretability of fuzzy systems: current research trends and prospects. Springer handbook of computational intelligence. Springer, Berlin, pp 219–237 Alonso JM, Magdalena L, González-Rodríguez G (2009) Looking for a good fuzzy system interpretability index: an experimental approach. Int J Approx Reason 51:115–134. https://doi.org/ 10.1016/j.ijar.2009.09.004 Alonso JM, Magdalena L, Guillaume S (2008) HILK: A new methodology for designing highly interpretable linguistic knowledge bases using the fuzzy logic formalism. Int J Intell Syst 23:761– 794. https://doi.org/10.1002/int.20288 Amindoust A, Ahmed S, Saghafinia A, Bahreininejad A (2012) Sustainable supplier selection: a ranking model based on fuzzy inference system. Appl Soft Comput J 12:1668–1677. https://doi. org/10.1016/j.asoc.2012.01.023 Carmona P, Castro JL (2020) FuzzyFeatureRank. Bringing order into fuzzy classifiers through fuzzy expressions. Fuzzy Sets Syst 401:78–90. https://doi.org/10.1016/j.fss.2020.03.003 Carvalho DV, Pereira EM, Cardoso JS (2019) Machine learning interpretability: a survey on methods and metrics. Electronics 8:832. https://doi.org/10.3390/electronics8080832 Castellano G, Castiello C, Fanelli AM, Mencar C (2005) Knowledge discovery by a neuro-fuzzy modeling framework. Fuzzy Sets Syst 149:187–207 Chakraborty S, Tomsett R, Raghavendra R, et al (2018) Interpretability of deep learning models: a survey of results. In: 2017 IEEE SmartWorld ubiquitous intelligence and computing, advanced and trusted computed, scalable computing and communications, cloud and big data computing, internet of people and smart city innovation, SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI 2017. Institute of Electrical and Electronics Engineers Inc., pp 1–6 Cios KJ, Pedrycz W, Swiniarski RW (1998) Data mining and knowledge discovery. Data mining methods for knowledge discovery. Springer, US, Boston, MA, pp 1–26 Cordón O (2011) A historical review of evolutionary learning methods for Mamdani-type fuzzy rulebased systems: designing interpretable genetic fuzzy systems. Int J Approx Reason 52:894–913 Cpałka K (2017) Design of interpretable fuzzy systems. Springer International Publishing, Cham
Cremer JL, Konstantelos I, Strbac G (2019) From optimization-based machine learning to interpretable security rules for operation. IEEE Trans Power Syst 34:3826–3836. https://doi.org/10. 1109/TPWRS.2019.2911598 Cruz-Reyes L, Espin-Andrade RA, Irrarragorri FL, Medina-Trejo C et al (2019) Use of compensatory fuzzy logic for knowledge discovery applied to the warehouse order picking problem for real-time order batching. In: Handbook of research on metaheuristics for order picking optimization in warehouses to smart cities. IGI Global, pp 62–88 Espín-Andrade R, González E, Fernández E, Alonso MM (2014) Compensatory fuzzy logic inference. Stud Comput Intell 537:25–43. https://doi.org/10.1007/978-3-642-53737-0_2 Espin-Andrade RA (2019) Semantic data analytics by Fuzzy Logic Predicates: New ways of representation, searching and inference based on Deep Learning Espin-Andrade RA, Caballero EG, Pedrycz W, Fernández González ER (2015) Archimedeancompensatory fuzzy logic systems. Int J Comput Intell Syst 8:54–62. https://doi.org/10.1080/ 18756891.2015.1129591 Fayyad UM (1996) Advances in knowledge discovery and data mining Gorzałczany MB, Rudzi´nski F (2016) A multi-objective genetic optimization for fast, fuzzy rule-based credit classification with balanced accuracy and interpretability. Appl Soft Comput J 40:206–220. https://doi.org/10.1016/j.asoc.2015.11.037 Gunning D (2017) Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency He C, Ma M, Wang P (2020) Extract interpretability-accuracy balanced rules from artificial neural networks: a review. Neurocomputing 387:346–358. https://doi.org/10.1016/j.neucom.2020. 01.036 Hjørland B (2010) The foundation of the concept of relevance. J Am Soc Inf Sci Technol 61:217–237. https://doi.org/10.1002/asi.21261 Ishibuchi H, Kaisho Y, Nojima Y (2011) Design of linguistically interpretable fuzzy rule-based classifiers: a short review and open questions. J Mult Log Soft Comput 17 Jang JSR (1993) ANFIS: adaptive-network-based fuzzy inference system. IEEE Trans Syst Man Cybern 23:665–685. https://doi.org/10.1109/21.256541 Lughofer E, Pratama M, Skrjanc I (2018) Incremental rule splitting in generalized evolving fuzzy systems for autonomous drift compensation. IEEE Trans Fuzzy Syst 26:1854–1865. https://doi. org/10.1109/TFUZZ.2017.2753727 Massobrio R (2018) Savant Virtual: generación automática de programas Matsubara T (2020) NOLTA, IEICE Bayesian deep learning: a model-based interpretable approach. https://doi.org/10.1587/nolta.11.16 Megherbi H, Megherbi AC, Benmahammed K (2019) On accommodating the semantic-based interpretability in evolutionary linguistic fuzzy controller. J Intell Fuzzy Syst 36:3765–3778. https://doi.org/10.3233/JIFS-18637 Mencar C (2005) Theory of fuzzy information granulation: contributions to interpretability issues. University of Bari Mencar C, Fanelli AM (2008) Interpretability constraints for fuzzy information granulation. Inf Sci (Ny) 178:4585–4618. https://doi.org/10.1016/j.ins.2008.08.015 Meschino GJ, Andrade RE, Ballarin VL (2008) A framework for tissue discrimination in magnetic resonance brain images based on predicates analysis and compensatory fuzzy logic. IC-MED Int J Intell Comput Med Sci Image Process 2:207–222. https://doi.org/10.1080/1931308X.2008. 10644165 Mi J-X, Li A-D, Zhou L-F (2020) Review study of interpretation methods for future interpretable machine learning. IEEE Access 8:191969–191985. 
https://doi.org/10.1109/access.2020.3032756 Mohseni S, Zarei N, Ragan ED (2020) A multidisciplinary survey and framework for design and evaluation of explainable AI systems. ACM Trans Interact Intell Syst 1, 1(1):46. https://doi.org/ 10.1145/3387166 Otte C (2013) Safe and interpretable machine learning: a methodological review. Stud Comput Intell 445:111–122
266
J. F. Padrón-Tristán et al.
Padrón-Tristán JF (2020) Algoritmo de Virtual Savant con Lógica Difusa Compensatoria para problemas de empacado de objetos Padrón-Tristán JF, Cruz-Reyes L, Espin-Andrade RA et al (2020) No title. In: Eureka-Universe. https://www.dropbox.com/s/tfer1ac5ftx2yjw/eureka-client-2.8.4_1.jar?dl=0 Pérez-Pueyo R (2005) Procesado y optimización de espectros raman mediante técnicas de lógica difusa: aplicación a la identificación de materiales pictóricos. Universidad Politecnica de Catalunya Pinel F, Dorronsoro B, Bouvry P (2013) Solving very large instances of the scheduling of independent tasks problem on the GPU. J Parallel Distrib Comput 73:101–110. https://doi.org/ 10.1016/j.jpdc.2012.02.018 Rey MI, Galende M, Fuente MJ, Sainz-Palmero GI (2017) Multi-objective based Fuzzy Rule Based Systems (FRBSs) for trade-off improvement in accuracy and interpretability: a rule relevance point of view. Knowl Based Syst 127:67–84. https://doi.org/10.1016/j.knosys.2016.12.028 Rey MI, Galende M, Sainz GI, Fuente MJ (2011) Checking orthogonal transformations and genetic algorithms for selection of fuzzy rules based on interpretability-accuracy concepts. In: IEEE international conference on fuzzy systems, pp 1271–1278 Sabri N, Aljunid SA, Salim MS et al (2013) Fuzzy inference system: short review and design. Int Rev Autom Control 6:441–449. https://doi.org/10.15866/ireaco.v6i4.4093 Saenz HF (2011) Data mining framework for batching orders in real-time warehouse operations. Iowa State University Serov VV, Sokolov IV, Budnik AA (2019) Applied calculus of fuzzy predicates for the formalization of knowledge, 537. https://doi.org/10.1088/1757-899X/537/4/042043 Shukla PK, Tripathi SP (2011) A survey on Interpretability-Accuracy (I-A) Trade-Off in evolutionary fuzzy systems. In: Proceedings-2011 5th international conference on genetic and evolutionary computing, ICGEC 2011, pp 97–101 Shukla PK, Tripathi SP (2012) A review on the interpretability-accuracy trade-off in evolutionary multi-objective fuzzy systems (EMOFS). Information 3:256–277. https://doi.org/10.3390/inf o3030256 Škrjanc I, Iglesias J, Sanchis A et al (2019) Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: a Survey. Inf Sci (Ny) 490:344–368. https://doi.org/10.1016/j.ins.2019.03.060 Tjoa E, Fellow CG (2019) A Survey on Explainable Artificial Intelligence (XAI): Towards Medical XAI. arXiv 14 Yeganejou M, Dick S, Miller J (2020) Interpretable deep convolutional fuzzy classifier. IEEE Trans Fuzzy Syst 28:1407–1419. https://doi.org/10.1109/TFUZZ.2019.2946520 Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353. https://doi.org/10.1016/S0019-9958(65)902 41-X Zhang QS, Zhu SC (2018) Visual interpretability for deep learning: a survey. Front Inf Technol Electron Eng 19:27–39
Chapter 12
Quality and Human Resources, Two JIT Critical Success Factors
Jorge Luis García-Alcaraz, José Luis Rodríguez-Álvarez, Jesús Alfonso Gil-López, Mara Luzia Matavelli de Araujo, and Roberto Díaz-Reza
Abstract Timely product delivery is one of the quality characteristics that clients value the most, which is why production plans and programs are designed to guarantee it. One of the most widely used quality-focused production programs is the Just-In-Time (JIT) philosophy; however, it remains unclear to what extent the JIT factors of quality planning and quality management influence the performance of manufacturing companies in Mexico. To address this gap, this chapter presents a structural equation model associating the JIT elements of quality planning and quality management with human resources and economic performance in the context of Mexican maquiladoras (cross-border assembly plants). Results indicate that even though there is no direct relationship between quality planning and management and economic benefits, these variables are indirectly related through human resources. In conclusion, human resources are key to achieving the financial success of maquiladoras.
Keywords Human capital · Quality management · Quality improvement · Organizational performance · Human resource management
J. L. García-Alcaraz Department of Industrial Engineering and Manufacturing, Universidad Autónoma de Ciudad Juárez, Chihuahua, México J. L. Rodríguez-Álvarez Doctoral Program in Engineering Sciences, Instituto Tecnológico y de Estudios Superiores de Occidente (ITESO), Tlaquepaque, Jalisco, México J. A. Gil-López · M. L. M. de Araujo Department of Business and Economy, University of La Rioja, La Rioja, Logroño, Spain R. Díaz-Reza (B) Department of Electric Engineering and Computation, Universidad Autónoma de Ciudad Juárez, Chihuahua, México e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. A. Zapata-Cortes et al. (eds.), New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Studies in Computational Intelligence 966, https://doi.org/10.1007/978-3-030-71115-3_12
12.1 Introduction
Just-In-Time (JIT) first appeared in Japan at the Toyota Company during the 1950s with the purpose of eliminating unnecessary elements in the production area. Since its conception, JIT has been considered a soft advanced manufacturing technology (Small and Chen 1997; Alcaraz et al. 2014; Green et al. 2014), is frequently described as a process for achieving continuous improvement through the systematic elimination of waste and variations (Kiran 2019), and is currently highly popular among all kinds of companies seeking to reduce production costs and meet product delivery times. JIT is also one of the pillars of lean manufacturing, and most authors view it as a production philosophy; one of the many implications of this conception is that human resources play a major role in fulfilling JIT implementation objectives and obtaining the highest production quality standards (Cua et al. 2001).
JIT is a manufacturing philosophy in which "the company produces only what is needed, when it is needed, and in the quantity that is needed" (Amin et al. 2020). JIT manufacturing tries to smooth the flow of materials from the suppliers to the customers, thereby increasing the speed of the manufacturing process (Htun et al. 2019). The JIT mechanism requires a moving continuous-flow assembly line in which flow time and installation demand fit one another (Ong and Sui Pheng 2021). The main reason for JIT implementation is cost reduction by eliminating waste (muda), variability (mura), and overloads (muri) (Atti 2019; Kosky et al. 2021). Therefore, most JIT benefits can be discussed in terms of material flow and operational performance (Green et al. 2014; Agyabeng-Mensah et al. 2020), agility and fast response to uncertainty in demand (Inman et al. 2011), profits (García-Alcaraz et al. 2015), product quality (Green et al. 2014), customer satisfaction (Agyabeng-Mensah et al. 2020), efficiency improvement, scheduling simplicity, warehousing cost reduction (Lyu et al. 2020), and human resources satisfaction (Amasaka 2014).
However, certain elements are necessary for obtaining such benefits. In this sense, previous research lists 33 elements or components required to guarantee or maximize JIT benefits. The benefits of implementing a JIT system impact all entities involved in supply chain management (Taghipour et al. 2019). Most of these elements relate to factors such as quality management, supplier integration, management commitment, and lean techniques applied to the production process (Singh and Garg 2011). The absence of one of these JIT elements may compromise quality and the benefits obtained, and the implementation may slow down or fail; when this happens, many companies simply quit.
Since JIT implementation may be a challenge, the relationship between JIT elements and corporate benefits has become the focus of much attention. Green et al. (2014) found that certain JIT elements were related to supply chain and organizational performance, while Alcaraz et al. (2014) evaluated the JIT elements of education and empowerment and measured their impact on inventory metrics and quality costs. In the context of Mexican maquiladoras, Montes (2014) conducted a descriptive factor analysis of JIT elements and benefits, grouping such elements into factors and categories. In those works, results pointed out that quality and human resources guarantee JIT benefits.
Although it has been reported that Quality Planning and Quality Management directly depend on Human Resources abilities and skills to benefit companies, JIT studies conducted in Mexico have merely indicated relationships among these variables but have failed to measure the extent to which they depend on one another. To address this gap, this research seeks to analyze the effects of Quality Planning and Quality Management as JIT factors on Human Resources and Economic Performance. Results will highlight the essential JIT elements for maximizing Human Resources and economic benefits by providing a quantitative dependency measure.
12.1.1 Quality and JIT
Specific activities are necessary for successful JIT implementation; they are usually known as JIT elements and are grouped into categories called critical success factors (CSFs). Two CSFs of JIT are Quality Planning and Quality Management. The former is composed of elements ensuring proper planning of JIT implementation, whereas in the latter, elements provide the methodology to implement the established JIT plans. The implementation of JIT is increasingly being seen as a vital way for manufacturing organizations to enhance their competitiveness (Tortorella et al. 2019). JIT implementation is grounded on the assumption that quality, as any administrative process, is not accidental and must be carefully planned according to product specifications. Quality Planning must take employees into account and, foremost, companies must show high management commitment (Fichera et al. 2014).
Table 12.1 lists activities and programs that, according to the literature, ensure proper Quality Planning in production systems. These elements were also grouped by Montes (2014) and appear sorted in descending order according to their citation count. The reason why quality circles seem to be the top Quality Planning element is that they integrate employees' opinions into production process improvement. Total quality control and statistical quality control programs are equally important for avoiding defective products. Quality in the production process ensures timely product deliveries, mainly by complying with product technical specifications and avoiding defects or rework (Priestman 1985; Kumar 2010). In the end, quality translates into satisfied customers that are likely to become recurrent clients and recommenders of both the products and the manufacturer (Fullerton et al. 2003). As regards the last two JIT elements of Table 12.1, training and quality culture strictly depend on human resources, who must have enough opportunities to develop professional skills and abilities.
As for the Quality Management factor, companies need to maintain the quality levels reached through the programs listed in Table 12.1. In this sense, Table 12.2 shows the Quality Management elements identified by Montes (2014). The first element refers to high visibility of quality results through different techniques (e.g., control graphs). Such information must be accessible to everybody, not just to managerial staff, and must be in sight (Colledani et al. 2014). Likewise, quality results must show that employees' effort is yielding significant benefits for them and the company.
Table 12.1 Quality planning
Elements | Authors | Total
Quality circles | Priestman (1985), Nandi (1988), Maass et al. (1989), Singh (1989), Bonito (1990), Padukone and Subba (1993), Vrat et al. (1993), Garg et al. (1994, 1996a), Singh and Bhandarkar (1996), Saxena and Sohay (1999), Chong and Rundus (2000), Yasin et al. (2003) | 13
Total quality control | Hall (1983), Dutton (1990), Padukone and Subba (1993), Vrat et al. (1993), Ebrahimpour and Withers (1993), Flynn et al. (1995), Garg et al. (1996a), Saxena and Sohay (1999), Yasin et al. (2003) | 9
Statistical quality control | Priestman (1985), Voss (1986), Singh (1989), Dutton (1990), Padukone and Subba (1993), Garg et al. (1996a), Kumar and Garg (2000), Kumar et al. (2001) | 8
Quality development programs | Hall (1983), Voss and Robinson (1987), Macbeth et al. (1988), Baker et al. (1994), Garg et al. (1996a), Chong and Rundus (2000), Yasin et al. (2003) | 7
Quality continuous improvement | Hall (1983), Singhvi (1992), Ebrahimpour and Withers (1993), Garg et al. (1996a), Garg and Deshmukh (1999a), Chong and Rundus (2000) | 6
Zero defects | Ebrahimpour and Withers (1993), Garg and Deshmukh (1999a), Kumar and Garg (2000) | 5
Quality-focused training | Singh (1989), Singhvi (1992), Vrat et al. (1993), Garg et al. (1994, 1996a) | 5
Quality culture | Singhvi (1992), Flynn et al. (1995), Kumar et al. (2001) | 3
Table 12.2 Quality management
Elements | Authors | Total
High visibility in quality control | Hall (1983), Priestman (1985), Dutton (1990), Ebrahimpour and Withers (1993), Kumar et al. (2001) | 5
Long-term commitment to quality control | Hall (1983), Priestman (1985), Dutton (1990), Ebrahimpour and Withers (1993), Kumar et al. (2001) | 4
100% quality check | Singh (1989), Kumar and Garg (2000), Kumar et al. (2001) | 4
Quality regulation and reliability of auditing | Nandi (1988), Garg and Deshmukh (1999a), Kumar et al. (2001) | 3
Simplify the total quality process | Nandi (1988), Garg and Deshmukh (1999a), Kumar et al. (2001) | 3
Quality success and maintenance are impossible without long-term commitment of all parties involved. Each company department as well as suppliers and customers can be accountable for production quality success and stability. Likewise, to maintain timely product delivery, audits must be regularly conducted at all organizational levels, thereby inspecting facilities, production processes, and their results (Kouaib and Jarboui 2014). Finally, manufacturers need to simplify the quality process by conducting statistical analyses of information generated along the production process. Such information must be easily communicated to employees. As previously mentioned, production quality must be planned and administered. Therefore, we can propose the following hypothesis in the context of Mexican maquiladoras as regards the relationship between Quality Planning and quality management: H1. In a JIT implementation environment, Quality Planning has a positive direct effect on Quality Management in Mexican maquiladoras.
12.1.2 Human Factors and JIT Benefits Six categories of JIT benefits are established in (Montes 2014), but this research will merely focus on JIT benefits for Human Resources and profits, the ultimate goal of investors (García et al. 2014). As for the former category, Table 12.3 highlights motivation as the most important (Balakrishnan et al. 1996), since it is reported by 13 works. Employee motivation may be the result of appropriate training and skills for ensure that if any production operator is missing, any other operator can occupy his/her position without compromising the material flow. Substitute operators feel motivated as they work in the full confidence that their interventions are safe and effective. Table 12.3 JIT benefits for human resources Benefit
Table 12.3 JIT benefits for human resources
Benefit | Authors | Total
Improved employee motivation | Hall (1983), Voss and Robinson (1987), Dutton (1990), Miltenbirg (1990), Hong et al. (1992), Padukone and Subba (1993), Balakrishnan et al. (1996), Garg et al. (1996b), Singh and Bhandarkar (1996), Ha and Kim (1997) | 11
Increased teamwork | Bonito (1990), Hong et al. (1992), Singhvi (1992), Fiedler et al. (1993), Balakrishnan et al. (1996), Garg et al. (1996a), Ha and Kim (1997) | 8
Reduced classified positions | | 4
Better use of work resources | Flynn et al. (1995), Garg et al. (1996a), Kumar and Garg (2000), García-Alcaraz et al. (2015) | 4
Improved communication | Voss (1986), Garg and Deshmukh (1999a) | 2
Motivated employees create strong work teams that efficiently and effectively communicate problems for continuous improvement (Power and Sohal 2000), although other factors may threaten JIT implementation. In fact, companies with excellent quality plans may fail to apply JIT principles when Human Resources lack a sense of belonging (Amasaka 2014), when turnover increases, or when trained operators are dismissed. All these events may bring work insecurities and poor integration, let alone the loss of intellectual capital (Sumedrea 2013). From this perspective, it seems that JIT benefits for Human Resources are influenced by education, but also by the way quality is planned and managed. This conclusion allows two working hypotheses to be constructed as follows:
H2. In a JIT implementation environment, Quality Planning has a positive direct effect on Human Resources benefits in Mexican maquiladoras.
H3. In a JIT implementation environment, Quality Management has a positive direct effect on Human Resources benefits in Mexican maquiladoras.
As regards Economic Performance, Table 12.4 lists the main JIT benefits reported by Montes (2014). According to the list, the main impact of JIT implementation from a financial perspective is the reduction of both production costs and labor force costs (Balakrishnan et al. 1996; Kumar 2010; Inman et al. 2011). Cost reduction can be the result of effective quality plans, Human Resources education (training included), and reduced rejected products and reprocessing, which both solve quality inconsistencies. Such findings support the fourth working hypothesis of this research, which states as follows:
H4. In a JIT implementation environment, Quality Planning has a positive direct effect on the Economic Performance of Mexican maquiladoras.
Table 12.4 highlights administrative costs reduction as the third main payback of JIT implementation, although for some authors administrative work is an indirect labor force.
Table 12.4 JIT economic benefits
Benefit | Authors | Total
Production costs reduction | Priestman (1985), Guinipero and Law (1990), Ebrahimpour and Withers (1993), Garg et al. (1994), Vuppalapati et al. (1995), Garg and Deshmukh (1999b), AM et al. (2014) | 8
Labor force costs reduction | Priestman (1985), Guinipero and Law (1990), Ebrahimpour and Withers (1993), Garg et al. (1994), Vuppalapati et al. (1995), Garg and Deshmukh (1999b), AM et al. (2014), Filippini and Forza (2016) | 8
Administrative costs reduction | Hall (1983), Singh (1989), Vrat et al. (1993), Garg and Deshmukh (1999a), Kumar and Garg (2000) | 6
Product cost reduction | Hall (1983), Filippini and Forza (2016) | 2
Improved competitiveness | Filippini and Forza (2016) | 1
Increased profit margin | Filippini and Forza (2016) | 1
Similarly, successful JIT implementation can impact product costs as a consequence of effective Quality Management programs (Elzahar et al. 2015) and, in this sense, quality is often viewed as a reliable performance indicator that brings profits and sustains business competitiveness (Tomšič et al. 2015). Furthermore, it is the business card of companies. In conclusion, quality is the means by which businesses reach a competitive position, thereby increasing profits (Cubas et al. 2016). Under this assumption, we propose the fifth working hypothesis of this study:
H5. In a JIT implementation environment, Quality Management has a positive direct effect on the Economic Performance of Mexican maquiladoras.
Quality may be the source of much profit, but the financial performance of a company certainly depends on many other important factors, such as human resources. The human factor is the main asset of any organization, so employee motivation and promoting a sense of belonging are essential when looking to increase or maintain profits. Authors have analyzed Human Resources management in the JIT implementation of Australian companies, pointing out that Human Resources with high levels of training and education can solve more problems in the production process irrespective of companies' management practices (Power and Sohal 2000). Others have reported that Human Resources are the success factor of JIT, since employees, especially operators, possess the necessary knowledge to build product quality (Alcaraz et al. 2014). From this perspective, we can propose the last working hypothesis, which links Human Resources benefits with the Economic Performance of Mexican maquiladoras:
H6. In a JIT implementation environment, Human Resources benefits have a positive direct effect on the Economic Performance of Mexican maquiladoras.
Fig. 12.1 Proposed model
Figure 12.1 illustrates the six hypotheses to be tested in this research. The objective of this work is to quantitatively validate the effects among the analyzed latent variables. The major contribution of this research is therefore the estimation of dependency measures explaining the relationships among the variables Quality Planning, Quality Management, Human Resources benefits, and Economic Performance in the context of Mexican maquiladoras during JIT implementation. Results will contribute to current discussions on JIT implementation in the manufacturing sector
and will support decision-making of maquiladoras managers in Mexico when implementing JIT philosophy or improving its application. The structure of this research chapter is thus as follows: Sect. 2 presents the methodology used to statistically validate hypotheses depicted in Fig. 12.1, whereas Sect. 3 reports results obtained from the construction of the structural equation model. Finally, Sect. 4 presents research conclusions.
12.2 Methodology
12.2.1 Survey Development
To collect data, we designed a questionnaire to assess the JIT elements and benefits reported in other research (Kumar 2010). The questionnaire analyzed four latent variables: Quality Planning (Table 12.1, 8 items), Quality Management (Table 12.2, 5 items), Human Resources benefits (Table 12.3, 5 items), and Economic Performance (Table 12.4, 6 items), and it was answered using a five-point Likert scale for subjective assessments, where the lowest value (1) indicated that a JIT element was never executed or a JIT benefit was never gained. In contrast, the highest value (5) implied that a JIT element was always executed or a JIT benefit was always obtained. The questionnaire was subjected to validation by a group of experts, composed of industrial engineers and academics, to ensure consistency and survey applicability, since the assessed JIT elements and benefits were originally identified in the Indian industrial sector. The first version of the survey thus included blank spaces for experts to incorporate additional items to be evaluated according to their own experience in JIT implementation.
12.2.2 Data Collection
To collect data, we used a stratified sampling technique and focused on maquiladoras with mature supply chains and solid JIT programs. In total, 447 maquiladoras located in the Mexican state of Chihuahua were invited via e-mail to participate in the project. As for the questionnaire administration mode, we conducted face-to-face interviews with supply chain managers and supervisors in the area of material flow. All participants had at least two years of experience in their position.
12.2.3 Data Analysis and Survey Validation
Data collected were analyzed using SPSS 21® software. Before validating the latent variables, we conducted a screening process to identify missing values and outliers. Cases with more than 10% missing values were discarded and, to detect outliers, we standardized each item, considering a standardized value an outlier if its absolute value was above 4. A similar procedure was used in supply chain research by García-Alcaraz et al. (2015). Once the data were screened, we assessed the internal consistency and reliability of the latent variables using Cronbach's alpha and the composite reliability index, considering 0.7 as the minimum value (Adamson and Prion 2013). Additionally, we computed other indices: average variance extracted (AVE), full variance inflation factor (VIF), R2, adjusted R2, and Q2. AVE measured the discriminant validity of latent variables, considering 0.5 as the minimum accepted value; for convergent validity, the same coefficient and the correlations among latent variables were analyzed. A similar procedure was described in supply chain research by García-Alcaraz et al. (2015). On the other hand, we computed the full collinearity VIF to measure collinearity among latent variables, setting 3.3 as the threshold, although some works suggest values above 10. Finally, we computed R2 and adjusted R2 for parametric predictive validity and Q2 for nonparametric predictive validity. Acceptable predictive validity in connection with an endogenous latent variable is suggested by a Q-squared coefficient greater than zero (Kock et al. 2009), which should, preferably, be similar to the R-squared values.
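As a minimal illustration of this screening and reliability procedure, the following Python sketch computes the outlier filter and Cronbach's alpha described above; the file name survey.csv, the item labels QP1–QP8, and the data themselves are hypothetical assumptions for the example, while the thresholds (10% missing data, |z| > 4, alpha ≥ 0.7) are those stated in the text.

```python
import pandas as pd

def screen_cases(df, max_missing=0.10, z_cut=4.0):
    """Drop cases with >10% missing items and rows containing standardized outliers (|z| > 4)."""
    df = df.loc[df.isna().mean(axis=1) <= max_missing]
    z = (df - df.mean()) / df.std(ddof=1)          # standardize each item
    return df[(z.abs() <= z_cut).all(axis=1)]       # keep rows with no outlying item

def cronbach_alpha(items):
    """Cronbach's alpha for the items (columns) of one latent variable."""
    items = items.dropna()
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical usage: 'survey.csv' holds 1-5 Likert answers; QP1..QP8 are the quality planning items.
survey = screen_cases(pd.read_csv("survey.csv"))
alpha_qp = cronbach_alpha(survey[[f"QP{i}" for i in range(1, 9)]])
print(f"Quality Planning alpha = {alpha_qp:.3f} (acceptable if >= 0.7)")
```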
12.2.4 Descriptive Analysis A univariate analysis of items to identify measures of central tendency and deviation of data was conducted. The median or 50th percentile was estimated as a measure of central tendency, while the interquartile range (IQR) was calculated to measure data deviation. A high median value indicated that the involved JIT element was always performed in Mexican maquiladoras, or the JIT benefit was always obtained. In contrast, a low median value implied that the involved JIT element was never performed, or a JIT benefit was never obtained. As for IQR, highest values denoted low consensus among participants regarding the median value of an item, whilst low values indicated little data dispersion, and thus high consensus among respondents (Tastle and Wierman 2007). A similar procedure was reported in supply chain studies (Qi et al. 2016; Brusset and Teller 2017).
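A short sketch of this univariate analysis, assuming the same hypothetical 1–5 Likert survey matrix as in the previous example, could be:

```python
import pandas as pd

def describe_items(df):
    """Median (50th percentile) and interquartile range (IQR) for every item."""
    q = df.quantile([0.25, 0.50, 0.75]).T
    q.columns = ["p25", "median", "p75"]
    q["IQR"] = q["p75"] - q["p25"]
    return q.sort_values("median", ascending=False)

# Hypothetical usage:
# print(describe_items(pd.read_csv("survey.csv")))
# Items with a high median and a low IQR indicate frequent execution and strong consensus.
```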
12.2.5 Structural Equation Model
To test the hypotheses depicted in Fig. 12.1, we employed the structural equation modeling (SEM) technique, a popular modeling technique to assess and validate causal relationships, especially in supply chain environments. For instance, SEM-based research has reported the impact of JIT on supply chain performance (Green et al. 2014), the effects of flexibility, uncertainty, and firm performance on the supply chain (Merschmann and Thonemann 2011), and the effects of green supply chain management on green performance and firm competitiveness. In this research, the model was run in WarpPLS 5.0® software, since its main algorithms are based on Partial Least Squares (PLS), widely recommended for small-sized samples and non-normal or ordinal data (Kock 2017). In addition, four model fit indices were estimated to assess the reliability of the model: Average Path Coefficient (APC), Average R-Squared (ARS), Average Variance Inflation Factor (AVIF), and the Tenenhaus goodness of fit (GoF). Such indices were proposed by Kock (2017) and adopted in supply chain studies such as Couto et al. (2016) and Blome et al. (2014). First, we computed and analyzed the P-values of APC and ARS to determine the model's efficiency, establishing 0.05 as the threshold. This value means that statistical inferences were significant at a 95% confidence level, thereby testing the null hypothesis that APC and ARS equaled zero versus the alternative hypothesis that APC and ARS were different from zero. Second, we estimated AVIF, for which SEM usually accepts values below 5. Finally, the Tenenhaus GoF measured the model's explanatory power, and an acceptable value must be above 0.36 (Kock 2017).
Finally, to validate and understand how the latent variables were related, we measured three types of effects: direct, indirect, and total effects. Direct effects appear in Fig. 12.1 as arrows directly connecting two latent variables; indirect effects between two latent variables occur through other latent variables and follow two or more segments; and total effects in a relationship are the sum of its direct and indirect effects. Every effect has a P-value to determine its statistical significance at a 95% confidence level, considering the null hypothesis βi = 0 versus the alternative hypothesis βi ≠ 0; for every effect, the effect size is also reported.
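As a rough illustration of how direct, indirect, and total effects combine in such a model, the sketch below propagates standardized path coefficients through the structure of Fig. 12.1; the numeric values are placeholders, not the estimates reported in the results section.

```python
# Latent variables: QP (Quality Planning), QM (Quality Management),
# HR (Human Resources benefits), EP (Economic Performance).
# paths[(a, b)] = standardized direct effect of a on b (placeholder values).
paths = {
    ("QP", "QM"): 0.70, ("QP", "HR"): 0.30, ("QM", "HR"): 0.40,
    ("QP", "EP"): 0.00, ("QM", "EP"): 0.00, ("HR", "EP"): 0.60,
}

def total_effect(src, dst, paths):
    """Sum of the products of path coefficients over every directed route src -> dst."""
    if src == dst:
        return 1.0
    total = 0.0
    for (a, b), beta in paths.items():
        if a == src:
            total += beta * total_effect(b, dst, paths)
    return total

direct = paths[("QP", "EP")]
total = total_effect("QP", "EP", paths)
print(f"QP -> EP: direct = {direct:.2f}, indirect = {total - direct:.2f}, total = {total:.2f}")
```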
12.3 Results
12.3.1 Sample Description
The survey was administered over three months in Mexican maquiladoras located in the Mexican state of Chihuahua. Table 12.5 summarizes the demographic information. We received 162 questionnaires, 18 of which were discarded since they were incomplete. Table 12.5 illustrates the sample characteristics, where 59 of the 144 respondents were production managers with knowledge of suppliers and internal customers.
Table 12.5 Descriptive analysis of the sample
Position | 2–5 years | 5–10 years | >10 years | Total
Production manager | 24 | 22 | 13 | 59
Supply chain manager | 18 | 29 | 6 | 53
Procurement manager | 5 | 21 | 6 | 32
Total | 47 | 72 | 25 | 144
Table 12.6 Industrial sectors and number of employees
Similarly, 85 participants were supply chain managers familiar with issues concerning external customers and suppliers. As regards length of work experience, 72 (50%) of the polled managers had 5–10 years of experience in their current position, giving reliability to this research. Table 12.6 compares the surveyed industrial sectors with the number of respondents. The automobile industry was the most prominent surveyed sector (56 administered questionnaires), closely followed by electronics and electric manufacturers.
12.3.2 Data Validation
In SEM, data must be validated before constructing the model. Table 12.7 presents the results of the validation process, where the indices previously discussed measured the latent variables. Results showed all R-squared and adjusted R-squared values above 0.02, implying that all latent variables had enough predictive validity from a parametric perspective. Likewise, each Q-squared value was higher than 0 and similar
Table 12.7 Latent variable coefficients
Index | Quality planning | Quality management | Human resources | Economic performance
R-squared | | 0.617 | 0.481 | 0.63
R-squared adjusted | | 0.614 | 0.474 | 0.627
Composite reliability | 0.932 | 0.883 | 0.867 | 0.899
Cronbach alpha | 0.915 | 0.834 | 0.795 | 0.859
AVE | 0.664 | 0.603 | 0.62 | 0.64
Full VIF | 2.885 | 2.478 | 3.27 | 2.733
Q-squared | | 0.62 | 0.478 | 0.628
to its corresponding R-squared and adjusted R-squared values. Thus, from a nonparametric point of view, latent variables had sufficient validity. Values of the Cronbach’s alpha and the composite reliability index were above 0.7 in all cases, demonstrating that all latent variables were statistically reliable. Finally, results showed AVE values above 0.5 and full VIF values below 5, showing convergent validity between latent variables and no collinearity problems, respectively.
12.3.3 Descriptive Analysis
Table 12.8 shows the descriptive analysis of the items, which are sorted in descending order according to their median or second-quartile values. On average, all median values were above 4, indicating that Mexican maquiladoras frequently execute the critical success JIT elements or activities and usually obtain the expected JIT benefits. Also, based on the results, the most important Quality Planning elements to ensure JIT benefits for Mexican maquiladoras are continuous improvement and quality culture, since both showed the highest median values and the lowest IQR values. As for Quality Management elements, it appears that for Mexican maquiladoras, long-term commitment to quality control is a top priority while working with JIT. Here is where quality culture regains importance, as quality plans and programs are not enough to ensure JIT benefits. In addition, Mexican manufacturing companies seem to understand the role of Human Resources in ensuring and maintaining quality, since the results show that high visibility of quality control results is the second top-rated item. Spreading quality results along the company is a means to motivate employees and make them feel proud of their contributions, thereby ensuring they will give their best in subsequent challenges. As for Human Resources benefits, the results revealed that Mexican maquiladoras experience a considerable increase in group collaboration or teamwork as a result of
Table 12.8 Descriptive analysis of items and variables
Item | 25th percentile | 50th percentile | 75th percentile | IQR
Quality planning:
Quality continuous improvement | 3.79 | 4.47 | 4.82 | 1.03
Quality culture | 3.67 | 4.40 | 5.00 | 1.33
Total quality control | 3.56 | 4.34 | 4.95 | 1.39
Statistical quality control | 3.60 | 4.33 | 4.91 | 1.31
Zero defects | 3.57 | 4.33 | 4.93 | 1.36
Quality-focused training | 3.51 | 4.28 | 4.89 | 1.38
Quality development programs | 3.45 | 4.27 | 4.90 | 1.45
Quality circles | 3.42 | 4.22 | 4.85 | 1.43
Quality management:
Long-term commitment to quality control | 3.56 | 4.31 | 4.92 | 1.36
High visibility of quality control | 3.50 | 4.25 | 4.86 | 1.36
Simplify the total quality process | 3.47 | 4.25 | 4.87 | 1.40
Quality regulation and reliability of auditing | 3.36 | 4.15 | 4.80 | 1.44
100% quality check | 3.03 | 3.97 | 4.73 | 1.70
Human resources benefits:
Better use of work resources | 3.39 | 4.20 | 4.85 | 1.46
Increased teamwork | 3.39 | 4.18 | 4.83 | 1.44
Improved communication | 3.33 | 4.14 | 4.81 | 1.48
Increased employee motivation | 3.28 | 4.09 | 4.77 | 1.49
Reduced classified positions | 3.10 | 3.89 | 4.66 | 1.56
Economic performance benefits:
Production costs reduction | 3.48 | 4.25 | 4.87 | 1.39
Improved competitiveness | 3.44 | 4.23 | 4.86 | 1.42
Increased profit margin | 3.37 | 4.17 | 4.83 | 1.46
Administrative costs reduction | 3.32 | 4.10 | 4.77 | 1.45
Product costs reduction | 3.18 | 4.05 | 4.76 | 1.58
Labor force costs reduction | 3.16 | 4.00 | 4.75 | 1.59
JIT implementation. Teamwork is an invaluable asset, since it fosters the third benefit ranked in the table: communication. When employees work together, communication increases and improves at all levels of the organizational structure. Finally, Economic Performance benefits as a result of JIT implementation in Mexican maquiladoras mainly include production costs reduction and improved competitiveness. Both elements have the highest median values and the lowest IQR values. In conclusion, there is high consensus among managers of Mexican manufacturing companies as regards the economic impact of JIT implementation on their production systems.
Fig. 12.2 Evaluated model
12.3.4 Structural Equation Model
After the items and dimensions were validated, we tested the structural equation model shown in Fig. 12.1 as described in the methodology section. Figure 12.2 presents the results of the evaluation process, where hypotheses H4 and H5 were rejected, since their P-values were above 0.05, meaning that they were not statistically significant. In Fig. 12.2, rejected hypotheses are depicted as dotted lines, whereas statistically significant hypotheses are represented by solid arrows.
12.3.4.1 Model Fit and Quality Indices
To evaluate the model as a whole construct, we estimated several model fit and quality indices: Average Path Coefficient (APC), Average R-Squared (ARS), Average Adjusted R-Squared (AARS), Average block VIF (AVIF), Average Full collinearity VIF (AFVIF), and the Tenenhaus goodness of fit (GoF). Results showed P-values of APC, ARS, and AARS below 0.001, meaning that the parameters were statistically significant on average. Similarly, AVIF and AFVIF values were below 5, the threshold, while the Tenenhaus GoF was much higher than 0.36. These indices appear below and allow us to conclude that the model was adequate to be interpreted.
APC = 0.577, P < 0.001
ARS = 0.576, P < 0.001
AARS = 0.572, P < 0.001
AVIF = 2.024, acceptable if ≤ 5

At high blood lead concentrations (above 0.7 mg/l), lead is prone to cause seizures, abortion, body paralysis (partial or total), coma, and death. It should be noted that lead exposure can occur by touching a plate of this material and not washing one's hands, for example, or by inhaling its particles. On the one hand, according to Govindan et al. (2017), reverse logistics networks (RLN) are often designed to collect used, refurbished, or defective products from customers and then carry out some recovery activities. The reverse logistics issue is to take back the used products, either under warranty or at the end of use or the end of leasing, so that the products or parts are appropriately disposed of, recycled, reused, or remanufactured (Kannan et al. 2010). But in this case, the most important aspects relate to complex issues that depend on social, technical, and legislative factors: how to prevent the environmental deterioration caused by the generation of hazardous wastes, how to minimize the generation of hazardous wastes, and finally how to recover the valuable material contained in the wastes (Jayant 2015).
Finally, operational risk has gained increasing attention in academic research and practice because operational risk directly affects the economic results of the company. Due to the influence of risk on logistics performance, implementing risk management has become a critical aspect, and the reverse logistics process is not exempt from operational risks. These situations can affect companies by generating significant economic losses due to sanctions, fines, and compensation, and can also be passed on to the environment (soil, air, and water), bringing fatal consequences for the health of people and ecosystems.
According to Manotas et al. (2014), a risk management system has four phases, as presented in Fig. 13.1; these phases are considered in this chapter. The same authors argue that the system's success depends on the first two phases, because if the risks are adequately identified and prioritized, effective action can be taken by the organization so that such risks are mitigated or eliminated. For that reason, identifying and prioritizing these risks can be a fundamental activity in establishing actions oriented to preserve the health of the people involved in these processes and to avoid harmful effects on the environment derived from lead.
Fig. 13.1 Operational risk management system in supply chains (Manotas et al. 2014): risk identification, risk assessment and prioritization, risk management, and risk monitoring
13.2 Methodology
The methodological design for the development of the project is presented in Fig. 13.2. The methodology has four phases and seeks to define actions to mitigate or eliminate the main risks identified in the processes related to the reverse logistics of lead-acid batteries.
13.2.1 Characterization of the Recovery Process
Traditionally, the supply chain operates in one direction, from the manufacturer to the end-user, who is responsible for the product's disposal, which in the case of Colombia is carried out in landfills. The environmental impacts generated have been the source of several studies that propose recycling as a viable solution to recover discarded products and obtain economic returns. However, the possibilities that a double-flow chain could generate, where the products return to one of the actors for the use of materials, for repair, dismantling, remanufacturing, recycling, or proper disposal, have been exploited by very few companies in the country (Jarrín et al. 2011).
Fig. 13.2 Methodological design: characterization of the recovery process, identifying risks at every stage of the process, risk prioritization using Fuzzy Quality Function Deployment, and actions to mitigate or eliminate critical risks
It is essential to characterize the reverse supply chain for the recovery of lead-acid batteries, additionally, to identify the processes necessary for this recovery. In this way, with the chain characterized and the processes identified, it will be possible to determine their operational risks.
13.2.2 Identifying Risks at Every Stage of the Process
Risk identification allows understanding the association between possible risks and problems that may arise in a supply chain (Pastrana-jaramillo and Osorio-gómez 2018). According to Aqlan and Lam (2015), risk identification is the most important activity of a supply chain risk management system. First, to appreciate existing risks, potential failures that can cause adverse results should be listed. Furthermore, for each failure, the sources that can affect or influence the organization are to be defined (Tummala and Schoenherr 2011).
We use the methodology presented in Pastrana-jaramillo and Osorio-gómez (2018) and depicted in Fig. 13.3. This methodology uses questionnaires applied to experts to identify risks, based on a literature review. The questionnaire was applied individually and allowed the experts to rate each risk, both in probability and impact, using the linguistic scale
Fig. 13.3 Risk identification approach: identify the risks in the literature, design questionnaires and select experts, apply the questionnaire, consolidate results, and build the probability matrix
Table 13.1 Linguistic scale for risk identification and fuzzy equivalence for FQFD (Pastrana-jaramillo and Osorio-gómez 2018)
Linguistic scale | Very low (VL) | Low (L) | Medium (M) | High (H) | Very high (VH)
Numerical equivalence | 1 | 2 | 3 | 4 | 5
Triangular fuzzy number | (0, 1, 2) | (2, 3, 4) | (4, 5, 6) | (6, 7, 8) | (8, 9, 10)
illustrated in Table 13.1. The same table presents the numerical equivalence used in the prioritization stage. With the data obtained, Eqs. 13.1 and 13.2 are applied to get the percentages of risk applicability and the weighted averages of the occurrence probability and the magnitude of impact (Pastrana-jaramillo and Osorio-gómez 2018; Osorio et al. 2019).

X_i = \frac{\sum_{j=1}^{n} (B_{i,j} \times M_{i,j})}{n}, \quad \forall i \qquad (13.1)

Y_i = \frac{\sum_{j=1}^{n} (B_{i,j} \times P_{i,j})}{n}, \quad \forall i \qquad (13.2)

where X_i is the weighted average of the magnitude of risk i, Y_i is the weighted average probability of risk i, B_{i,j} is expert j's criterion on whether risk i applies (1, 0), M_{i,j} is expert j's rating of the impact of risk i, and P_{i,j} is expert j's rating of the probability of risk i.
After getting these data, we built the probability-impact matrix (Fig. 13.4). The following terminology is used per section: the green zone contains negligible risks, while the yellow and red zones contain the critical ones; the latter are selected for the next phase.
Fig. 13.4 Probability and impact matrix (Osorio-Gomez et al. 2018)
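A small numerical sketch of Eqs. 13.1 and 13.2 and of the subsequent classification in the probability-impact matrix is shown below; the expert answers and the zone thresholds are hypothetical placeholders, since the chapter does not report the exact cut-off values of the matrix.

```python
import numpy as np

# Rows = risks, columns = experts. B: applicability (1/0); M: impact (1-5); P: probability (1-5).
B = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])   # hypothetical answers of 4 experts for 2 risks
M = np.array([[3, 4, 2, 0], [4, 5, 3, 4]])
P = np.array([[2, 3, 2, 0], [3, 4, 3, 3]])

n = B.shape[1]
X = (B * M).sum(axis=1) / n   # Eq. 13.1: weighted average magnitude of each risk
Y = (B * P).sum(axis=1) / n   # Eq. 13.2: weighted average probability of each risk

def zone(x, y, cut_low=2.0, cut_high=3.0):
    """Classify a risk in the probability-impact matrix (thresholds are placeholders)."""
    if x >= cut_high and y >= cut_high:
        return "critical (red)"
    if x >= cut_low or y >= cut_low:
        return "attention (yellow)"
    return "negligible (green)"

for i, (x, y) in enumerate(zip(X, Y), start=1):
    print(f"R{i}: impact={x:.2f}, probability={y:.2f} -> {zone(x, y)}")
```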
13.2.3 Risk Prioritization Using Fuzzy Quality Function Deployment
The importance of prioritizing risks is that it determines which risks should be accepted and addressed, and which may be disregarded based on their impact level. Giannakis and Louis (2011) also emphasize that risks involve a wide range of criteria such as the probability of an event, its risk level, and especially its impact. In this regard, risk prioritization must be aligned with the objectives of the company and organized strategically, so that the risks threatening the core business are addressed first and their negative impacts are mitigated (Pastrana-jaramillo and Osorio-gómez 2018).
Risk prioritization is fundamental to success in defining actions to mitigate or eliminate risks. In this sense, it is crucial to determine this priority based on the companies' strategic objectives, so that the risks that directly impact these objectives are the first to be treated. Some papers use QFD and FQFD in supply chain management and risk management, such as Bevilacqua et al. (2006) and Wang et al. (2007); Gento et al. (2001), Costantino (2012), and Lam et al. (2016) focus especially on risk management, but those applications do not use fuzzy logic and are not aimed at risk prioritization. The Fuzzy QFD methodology for risk prioritization is shown in Fig. 13.5, following Osorio-Gomez et al. (2018).
Fig. 13.5 Methodological approach to risk prioritization (Osorio-Gomez et al. 2018)
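To make the fuzzy arithmetic concrete, the following sketch averages hypothetical expert ratings expressed with the triangular fuzzy numbers of Table 13.1 and defuzzifies the result; the ratings and the centroid defuzzification rule are illustrative assumptions, not the exact computation reported by Osorio-Gomez et al. (2018).

```python
# Triangular fuzzy numbers (TFN) are handled as (a, b, c) tuples.
SCALE = {"VL": (0, 1, 2), "L": (2, 3, 4), "M": (4, 5, 6), "H": (6, 7, 8), "VH": (8, 9, 10)}

def tfn_mean(tfns):
    """Component-wise average of several TFNs (e.g., the ratings of all team members)."""
    n = len(tfns)
    return tuple(sum(t[k] for t in tfns) / n for k in range(3))

def defuzzify(t):
    """Centroid of a triangular fuzzy number."""
    return sum(t) / 3

# Hypothetical: three team members rate the importance of W1 'Preserving the environment'.
ratings_w1 = [SCALE["VH"], SCALE["VH"], SCALE["H"]]
w1 = tfn_mean(ratings_w1)
print(f"W1 fuzzy weight = {w1}, crisp value = {defuzzify(w1):.2f}")
```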
13.2.4 Actions to Mitigate or Eliminate Critical Risks
Once the critical risks are established, cause-and-effect analyses are developed to identify the main causes associated with these risks. In this way, through activities such as brainstorming and literature review, actions are defined to eliminate or mitigate the risks.
13.3 Results
Following the methodology presented above, the following results were obtained.
13.3.1 Characterization of the Recovery Process
Fig. 13.6 presents the typical reverse logistics network for lead-acid batteries, and Figs. 13.7 and 13.8 show the process of recovering lead-acid batteries. The next phase is to identify the risks in each of the processes. It is essential to mention that this type of chain typically involves the following actors:
The batteries' producers: In Colombia, there are few authorized lead-acid battery manufacturers and recoverers. Some only recover lead and other metals and send them to manufacturers, while others perform both processes.
The distributors: Lead battery distributors who supply batteries to users of different types of cars. When they provide technical service for batteries with defects or customer nonconformities, they can repair the batteries or send them back to the manufacturer.
Fig. 13.6 Reverse logistics network to lead-acid batteries
Fig. 13.7 The battery recovery process in a Colombian company
Fig. 13.8 Battery recovery process (https://www.ambientebogota.gov.co/, 2008)
The users: They become the suppliers of the reverse logistics process once the batteries end their useful life.
End-users: They may belong to the industrial, mining-energy, commercial, banking, construction, and telecommunications sectors, and buy the batteries from manufacturers or distributors. When the product does not meet their expectations or has manufacturing defects, it is returned to the distributor or the manufacturer.
Third parties: External companies responsible for collecting, selecting, and disposing of batteries. The batteries are usually delivered to the lead and plastic recovery agent, and a certificate of disposal is provided to end-users.
Between the distributor, the third party, and the manufacturer, the transport of lead-acid materials or batteries must be carried out by an authorized conveyor, who holds the necessary licenses for handling hazardous materials, as shown in Fig. 13.6. Fig. 13.7 presents a traditional process in Colombia.
13.3.2 Identification of Operational Risks
We identified the main risks associated with the battery recovery process shown in Fig. 13.8. Table 13.2 presents the identified risks and the results of the application of the questionnaire. The questionnaire was answered by the staff involved in this process; it asked whether or not each risk applied and, if it applied, its probability of occurrence and its impact were requested using the scale and numerical equivalence presented in Table 13.1.
Table 13.2 Validation and weighted averages of operational risks
Risk description | Id | Probability of occurrence | Impact
Sulphuric acid spills in collection and transport | R1 | 2.13 | 2.25
Lead leaks that are incorporated into the soil | R2 | 1.13 | 1.75
Electrolyte spillage (sulfuric acid) in rivers or lakes | R3 | 1.38 | 1.88
Manual battery handling | R4 | 3.00 | 3.50
Soil and environmental pollution from surrounding areas | R5 | 2.50 | 2.63
Dust released from shredders and mills | R6 | 1.38 | 1.63
Inhalation of lead settled in the vibration equipment | R7 | 1.75 | 2.25
Water pollution | R8 | 1.50 | 1.50
Lead dust from process water | R9 | 1.13 | 1.13
Release of fragments and lead dust at the recycling plant | R10 | 2.25 | 2.50
Inhale lead vapor | R11 | 1.75 | 1.88
Contaminated dust in the screening | R12 | 2.38 | 3.00
Poor knowledge about lead toxicity and its management | R13 | 2.75 | 3.13
Fig. 13.9 Risk probability-impact matrix
The experts were selected from the POSCONSUMO PLAN (MINISTERIO DE AMBIENTE, 2019), from the CVC's list of environmentally licensed companies or environmental management plans for the management of hazardous waste, especially lead-acid batteries, and, additionally, from personal contacts. Fig. 13.9 shows the probability-impact matrix. In this particular case, the risks that continued to the next stage are manual battery handling, soil and environmental pollution from surrounding areas, release of fragments and lead dust at the recycling plant, contaminated dust in the screening, and poor knowledge about lead toxicity and its management.
13.3.3 Prioritization of Operational Risks
Following the methodology presented in Fig. 13.5, the first step is to define the team that will develop the FQFD methodology; in this case, some members of the related companies were involved. Then, phases 1 and 2 were implemented. Table 13.3 shows the What's and their importance. For defining the weights, the fuzzy numbers and the linguistic scale presented in Table 13.1 were used. Table 13.3 shows that the team defined seven internal variables related to the reverse logistics associated with the recovery process in the lead-acid batteries case. The weights are the result of the average ratings that each of the team members generated individually; to obtain this average, fuzzy mathematics related to triangular fuzzy numbers is used.
Table 13.3 Internal variables and their relative importance
What's | Weight of What's
W1 Preserving the environment | (8, 9, 10)
W2 Improvement of people's health | (8, 9, 10)
W3 Satisfy the need of customers to generate energy | (6, 7, 8)
W4 Having a competitive portfolio in the market | (7, 8, 9)
W5 Profitability of the company | (7, 8, 9)
W6 Alignment with the standards of the parent company | (6, 7, 8)
W7 Use of adequate equipment and logistics for handling batteries | (4, 5, 6)
Continuing the proposed methodology, the How's were defined and then their relationships with the What's were established to obtain the weights of the How's, presented in Table 13.4. Finally, Table 13.5 presents the results of the risk prioritization. In this case, the most critical risk is insufficient knowledge about lead toxicity and its management. However, it is essential to mention that all risks remained in the range between High and Very High, which means that the five risks must have high priority when establishing actions oriented towards their mitigation or elimination by those involved.
Weight of how’s
H1 Continuously improving processes
45
H2 Safe working methods
43
59 76 57 74
H3 Preserving the environment
46
61 77
H4 Ensuring the quality, reliability, and technology of machinery 44 and materials
58 74
H5 Customer satisfaction
43
57 73
H6 Ensuring battery tightness
43
57 73
H7 Generate added value to the services offered
45
60 76
H8 Support people’s personal and professional growth
42
56 72
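The step from the What's of Table 13.3 to the How weights of Table 13.4, and then to a risk index such as the IPRF of Table 13.5, can be sketched as a fuzzy weighted sum. The relationship scores below are hypothetical placeholders (the chapter does not reproduce the full relationship matrices), so the sketch only illustrates the mechanics of the FQFD computation, not the published figures.

```python
# Generic FQFD-style propagation with triangular fuzzy numbers (a, b, c).
def tfn_add(x, y):
    return tuple(xi + yi for xi, yi in zip(x, y))

def tfn_scale(x, k):
    return tuple(k * xi for xi in x)

def fuzzy_weighted_sum(weights, scores):
    """Sum over i of (fuzzy weight_i x crisp relationship score_i)."""
    total = (0.0, 0.0, 0.0)
    for w, s in zip(weights, scores):
        total = tfn_add(total, tfn_scale(w, s))
    return total

def defuzzify(t):
    return sum(t) / 3   # centroid of a triangular fuzzy number

# What weights from Table 13.3 (W1..W7); relationship scores below are placeholders.
what_w = [(8, 9, 10), (8, 9, 10), (6, 7, 8), (7, 8, 9), (7, 8, 9), (6, 7, 8), (4, 5, 6)]
rel_what_h1 = [0.9, 0.9, 0.3, 0.3, 0.3, 0.9, 0.3]      # hypothetical What -> H1 relationships
h1_w = fuzzy_weighted_sum(what_w, rel_what_h1)

# A risk index can then be obtained the same way, from How weights and risk-How relationships.
how_w = [h1_w]                                          # in practice, H1..H8 from Table 13.4
rel_risk_how = [0.9]                                    # hypothetical risk -> How relationship
iprf = defuzzify(fuzzy_weighted_sum(how_w, rel_risk_how))
print(f"H1 fuzzy weight = {h1_w}, illustrative risk index = {iprf:.1f}")
```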
Table 13.5 Results of prioritization
N° | Description of the risk | IPRF
Very high | | 478
R13 | Poor knowledge about lead toxicity and its management | 462
R10 | Release of fragments and lead dust at the recycling plant | 462
R5 | Soil and environmental pollution from surrounding areas | 458
R4 | Manual battery handling | 439
R12 | Contaminated dust in the screening | 436
High | | 375
Fig. 13.10 Cause-effect diagram of operational risks in the reverse logistics of lead-acid batteries
13.3.4 Actions to Mitigate or Eliminate Critical Risks
From the prioritized risks (Table 13.5), an Ishikawa or cause-effect diagram was constructed, as depicted in Fig. 13.10. The prioritized risks can be grouped within a common range with the same nature: Group 1 considers the workforce and includes R13 (poor knowledge about lead toxicity and its management) and R4 (manual battery handling). On the other hand, Group 2 considers machinery and equipment, with R10 (release of fragments and lead dust at the recycling plant), R5 (soil and environmental pollution from surrounding areas), and R12 (contaminated dust in the screening). Based on a literature review and expert consultations, mitigation actions for the prioritized risks were established considering the mentioned groups.
Group 1: The first risk group consists of R13 and R4. These risks have their origins in the ignorance of lead toxicity, its management, and good practices. Therefore, actions aimed at attacking both risks can be defined.
Poor knowledge about lead toxicity and its management
This risk has the highest score in the ranking (462.08). This lack of knowledge derives from the inappropriate management given to lead by some people. Direct contact with this element in any of its presentations (solid, suspended particles, and liquid residues mixed with sulfuric acid) can bring illness and discomfort, ranging from colic to death.
Mitigation-oriented actions:
To mitigate the negative impact of this risk on people and the environment, we proposed:
1. Periodic training on lead toxicity with validation of results: This applies to companies that participate in any process that is part of the reverse logistics of lead batteries, specifically through their Occupational Health or Occupational Safety area. We suggest an 8-h training day, with two breaks, for new hires just as they arrive at the company and before they start work at their posts. We recommend that current company workers be trained every three months, for half a working day (4 h). The training for both groups will address topics such as:
1. Lead toxicity and diseases caused by this material.
2. Lead exposure pathways.
3. Good handling practices for elements containing lead and its derivatives.
4. Rights and duties of the worker regarding the management of lead-acid batteries and their derivatives.
5. Allowed lead level in the blood and follow-up of each employee's current level. We also suggest that the company take these samples every three months, so that existing workers already know their blood status at the time of the training.
The information needed for the presentation is easily accessible on the internet and in the literature, for example, on the World Health Organization's website (who.int), one of whose objectives is to reduce cases of lead poisoning (Organización Mundial de la Salud (OMS) 2019). Besides, Occupational Health staff have academic training on this subject and on the regulations for the management of hazardous waste and lead in Colombia. Knowledge should be validated by means of a written test, both to record what the workers have learned and so that Occupational Safety and Health managers can analyze the employees' working conditions and, if possible, improve them. The written validation test might contain the questions in Fig. 13.11, which includes two types of questions: those highlighted, intended for existing workers tracked with good practices, and those not highlighted, which both existing workers and new hires must answer.
2. Mitigation actions focused on informal recycling: The focus of this work was not informal recycling, as it is very complex to control and thus to train those who practice it. However, actions were proposed that could be useful in this practice. The life cycle of a lead-acid battery involves manufacturers, retailers, scrap collectors, smelting furnaces, and consumers. Each can contribute to the prevention of harmful risks in the process through training, such as the one mentioned above, and by taking into account other informal recycling factors. Suggested approaches include the following:
- Promote the collection of used batteries by licensed retailers when spare batteries are purchased, for example, by asking for a returnable deposit or a return plan that sets the price of lead batteries at a level closer to the actual intrinsic value of lead and above what the informal sector would be willing to pay, to reduce illegal collection and the income it generates.
Fig. 13.11 Questionnaire for knowledge validation of new hires and monitoring of existing workers
- Inform consumers about the value of used lead-acid batteries and the dangers of discarding them or supplying them to unlicensed smelting furnaces.
- Inform the general public and recyclers straightforwardly about the health and environmental hazards of lead.
- Create an alternative for the informal sector by developing the necessary infrastructure that encourages people who scavenge dumpsters to take batteries to licensed smelting furnaces. This can be achieved by providing transportation to collectors, as some are located in industrial areas outside the city, and by buying their batteries even in small quantities (PNUMA 2013).
3. Training and constant monitoring of the proper use of personal protective equipment: The occupational health area must explain the importance of personal protective equipment and carry out training to motivate its proper use and avoid possible sanctions. The training will last half a working day for new hires and must be completed before they begin work at their posts. Training will be provided every three months for existing workers and will last a quarter of a working day, approximately two hours. Emphasis should be placed on those involved in the process who do not work under a permanent supervisor, as is the case with battery conveyors, because if they are not aware of the advantages of the correct use of the equipment, they could overlook the standard, bringing consequences for their health and subsequently for the company.
Figure 13.12 exemplifies some minimum personal protection elements that operators must take into account. The companies must also provide safety gloves, enforce hygiene standards, review the health history of operators, and consider their place of residence, among other measures.
Group 2: The second risk group consists of R10, R5, and R12. These risks stem from the deficiency or absence of machinery and equipment suitable for the operations, since the equipment releases steam, particles, or lead dust. The following actions can be established to mitigate these risks.
Release of fragments and lead dust at the recycling plant
This risk was the second in the ranking, with a score of 461.53. As discussed above, lead is considered one of the materials most harmful to health, according to the WHO.
Fig. 13.12 Personal protective equipment
Lead should be treated in accordance with appropriate regulations, as it can have a high negative impact on people and the environment. An individual is at risk of lead exposure in three ways: inhalation, ingestion, and skin contact. The most dangerous route for humans is inhaling the dust, smoke, or vapor given off by the process. Lead travels in the form of vapor particles generated by the combustion of materials containing this metal, for example, during casting or recycling of lead-acid batteries under unsafe conditions; these vapors are deposited in the soil, on other surfaces, and on hair and clothing. For these reasons, it is necessary to identify the correct measures for the proper functioning of the process in order to prevent or minimize the inhalation of lead and avoid the negative impact that this chemical generates on people.
Mitigation-oriented actions
To mitigate the risk impact on people and the environment, the following actions were proposed:
1. Use of hoods and exhaust ventilation in open areas of operation, e.g., electric saws and crushers, conveyor belts, and furnace loading points, to trap dust and vapors. Keeping molten lead at low temperatures will also reduce the amount of vapor.
2. Dust and particulate emissions must be trapped in a filter chamber using a wet electrostatic precipitator or any other similar device. These filters should be cleaned periodically, and the debris must be put into the oven to recover the lead. To reduce exposure to lead dust, it is especially important to keep all painted surfaces in good condition and to clean frequently to decrease the chances of chips or dust forming.
3. Prohibition of smoking, eating, and drinking in the workplace.
4. Provision of a separate dining area away from recycling operations.
5. Provision of a clean air space, maintained with positive pressure and filtered air, for the removal of respirators.
6. Provision and use of facilities where workers can change into clean clothes before starting work and wash and change clothes at the end of the day.
Although the intention is to mitigate the risk before humans inhale lead vapor particles, there may be cases where people become intoxicated, either in manufacturing/recycling batteries or in the use of paint. Emergency mechanisms or treatments should be in place for people who have inhaled high levels of lead, and these must be carried out by the Occupational Health area of the companies and the corresponding health institutions.
Soil and environmental pollution from surrounding areas
This is the third risk in the ranking; it is closely related to the first two, so the strategies proposed above can also be useful against it. Equipment used in the lead battery recycling process, such as shredders and mills, releases particulate matter of lead oxide and lead sulfate when fulfilling its function: it must reduce the lead to small particles so that it can be melted and used again in a new battery.
Hammer mills and shredders can release lead mist which, as it dries out, releases lead dust if disturbed. The dust settled on the vibrating equipment may be suspended back into the air and inhaled.
Mitigation-oriented actions
To mitigate the negative impact of this risk on people and the environment, the following actions were proposed:
1. Install insulation or seismic dissipation structures to reduce vibration and release fewer lead particles. These elements lessen the transmission of vibrations to structural and secondary systems generated by rotating or static mechanical equipment in heavy machinery such as crushers and mills.
2. Keep the anti-corrosive paint of the machinery in good condition, which will prevent lead dust from becoming trapped in corrosion of the metal parts.
3. Take administrative measures to ensure that personnel operating near that machinery wear caps and special protection at all times, in addition to changing their garments before leaving the company.
Finally, it is possible to note that most actions to mitigate or eliminate risks related to reverse logistics of lead batteries are economical if performed preventively.
13.4 Conclusions
Lead is an element recognized for its toxicity, whose handling carries many risks in its recycling process, including storage, transport, and manufacturing. According to the results, the most critical risks are related to human participation, precisely because of such toxicity. The application of the FQFD allows for prioritizing risks while taking into account the strategic objectives of the processes associated with the reverse supply chain. By considering these objectives and the impacts that risks have on them, better actions can be defined or targeted, achieving a more significant effect on the reverse supply chain's overall performance. Most actions aimed at mitigating or eliminating prioritized risks are focused on staff training and awareness of the use of personal protection at each stage of reverse battery logistics (storage, transportation, and manufacturing), as well as mandatory monitoring of current regulations. It is possible to note that most actions to mitigate or eliminate risks related to lead batteries' reverse logistics are inexpensive if they are carried out preventively, by informing those who will be in contact with lead before their first approach to this element and by continuously monitoring the way they perform their jobs. Those involved in the lead battery reverse logistics process agreed that, despite the 13 risks extracted from the literature and present at each stage of lead battery reverse logistics, those with the most significant negative impact are those related to lead inhalation and handling.
Because lead is a highly polluting material whose extraction and processing negatively affect the environment, recovering it is important in order to minimize such effects. However, if the risks of this process are not considered, the environmental benefits that can be obtained would be counteracted by the negative impact on people's health. This is one reason why the prioritization of and intervention on these risks is critical. Future work could consider how this process affects the environment, for instance, using system dynamics models to evaluate the environmental impact in the long term.
References Aqlan F, Lam SS (2015) A fuzzy-based integrated framework for supply chain risk assessment. Int J Prod Econ 161:54–63. https://doi.org/10.1016/j.ijpe.2014.11.013 Bevilacqua M, Ciarapica FE, Giacchetta G (2006) A fuzzy-QFD approach to supplier selection. J Purch Supply Manag 12:14–27. https://doi.org/10.1016/j.pursup.2006.02.001 Costantino (2012) On the use of quality function deployment (QFD) for the identification of risks associated to warranty programs. In: ESREL conference, Helsinki, pp 4440–4449 Gento AM, Minambres MD, Redondo A, Perez ME (2001) QFD application in a service environment: a new approach in risk management in an university. Oper Res 1:115–132. https://doi.org/ 10.1007/bf02936289 Giannakis M, Louis M (2011) A multi-agent based framework for supply chain risk management. J Purch Supply Manag 17:23–31. https://doi.org/10.1016/j.pursup.2010.05.001 Govindan K, Fattahi M, Keyvanshokooh E (2017) Supply chain network design under uncertainty: a comprehensive review and future research directions. Eur J Oper Res 263:108–141. https://doi. org/10.1016/j.ejor.2017.04.009 Jarrín J, Bernal J, Pirachicán C, Guevara C (2011) Análisis y caracterización de la logística inversa de baterías recargables en Bogotá. Universidad de La Sabana Jayant A (2015) Reverse logistics practices in lead acid battery recycling plant: a case study Kannan G, Sasikumar P, Devika K (2010) A genetic algorithm approach for solving a closed loop supply chain model: a case of battery recycling. Appl Math Model 34:655–670. https://doi.org/ 10.1016/j.apm.2009.06.021 Lam J, Siu L, Bai X (2016) A quality function deployment approach to improve maritime supply chain resilience. Transp Res Part E Logist Transp Rev 92:16–27. https://doi.org/10.1016/j.tre. 2016.01.012 Li M, Liu J, Han W (2016) Recycling and management of waste lead-acid batteries: a mini-review. Waste Manag Res 34:298–306 Manotas DF, Osorio JC, Rivera L (2014) Handbook of research on managerial strategies for achieving optimal performance in industrial processes. i. https://doi.org/10.4018/978-1-52250130-5 Organización mundial de la salud (OMS) (2019) Intoxicación por plomo y salud Osorio-Gomez JC, Manotas-Duque DF, Rivera L, Canales I (2018) Operational risk prioritization in supply chain with 3PL using Fuzzy-QFD. In: New perspectives on applied industrial tools and techniques, management an industrial engineering, pp 91–109 Osorio-Gómez JC, Naranjo-Sanchez D, Agudelo-Ibarguen N (2019) Operational risk in storage and land transport of blood products. Res Comput Sci 149:23–31 Pastrana-jaramillo CA, Osorio-gómez JC (2018) Operational risk management in a retail company. Res Comput Sci 148:57–66 PNUMA (2013) Environmental risks and challenges of anthropogenic metals flows and cycles
Tummala R, Schoenherr T (2011) Assessing and managing risks using the Supply Chain Risk Management Process (SCRMP). Supply Chain Manag 16:474–483. https://doi.org/10.1108/135 98541111171165 Wang F, Li XH, Rui WN, Zhang Y (2007) A fuzzy QFD-based method for customizing positioning of logistics Service products of 3PLS. In: 2007 international conference on wireless communications, networking and mobile computing, WiCOM 2007, pp 3326–3329 Zhang J, Chen C, Zhang X, Liu S (2016) Study on the Environmental Risk Assessment of Lead-Acid Batteries. Procedia Environ Sci 31:873–879. https://doi.org/10.1016/j.proenv.2016.02.103
Chapter 14
Dynamic Evaluation of Livestock Feed Supply Chain from the Use of Ethanol Vinasses Rocío Ramos-Hernández, Cuauhtémoc Sánchez-Ramírez, Yara Anahí Jiménez-Nieto, Adolfo Rodríguez-Parada, Martín Mancilla-Gómez, and Juan Carlos Nuñez-Dorantes Abstract The sugarcane industry is the economic mainstay of many developing countries, such as Brazil, India, and Thailand. As a result, these countries are well known as world’s top sugar producers. As a by-product of sugar production, molasses is used for animal feed and as a fermentation source in ethanol production. In turn, the production of ethanol from sugarcane molasses generates vinasse, a liquid residue that, if left untreated, can be hazardous to the environment. However, since vinasse is rich in organic materials and minerals, if properly treated, it can have applications in energy generation and soil fertilization. Unfortunately, the viability of using vinasse to produce livestock feed (LF) is not sufficiently explored. To address this gap, in this chapter we propose the conceptual design of the vinasse-based LF supply chain. To this end, we rely on System Dynamics (SD) to identify the key variables of the chain and assess how vinasse can be efficiently used to produce animal feed. Keywords Livestock feed · Vinasses · System dynamics · Simulation
R. Ramos-Hernández · C. Sánchez-Ramírez (B) · J. C. Nuñez-Dorantes Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Mexico e-mail: [email protected] R. Ramos-Hernández e-mail: [email protected] J. C. Nuñez-Dorantes e-mail: [email protected] Y. A. Jiménez-Nieto · A. Rodríguez-Parada · M. Mancilla-Gómez Faculty of Accounting and Administration, Universidad Veracruzana Campus Ixtaczoquitlán, Ixtaczoquitlán, Veracruz, Mexico e-mail: [email protected] M. Mancilla-Gómez e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. A. Zapata-Cortes et al. (eds.), New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Studies in Computational Intelligence 966, https://doi.org/10.1007/978-3-030-71115-3_14
14.1 Introduction
Due to its contribution to the economy of many countries, sugarcane is one of the world's most important crops. According to (Silalertruksa and Gheewala 2020), global sugarcane production in 2017 reached 1,841 million tons, with Brazil as the largest producer (758 million tons), followed by India (306 million tons) and Thailand (103 million tons). Moreover, since the sugar industry normally applies the principles of circular economy and waste reduction (Meghana and Shastri 2020), the large amounts of various types of waste generated during cane sugar production can be efficiently used to create new products, such as biofuels and biochemicals. A 2019 report jointly prepared by the Organization for Economic Cooperation and Development (OECD), the Food and Agriculture Organization of the United Nations (FAO), and Chapingo Autonomous University (FAO) on the 2019–2028 agricultural outlook indicated that per-capita consumption of staple foods is expected to stagnate as demand is saturated for most of the world's population. However, the report estimates that per-capita sugar consumption will grow as a result of the increasing global demand for processed foods. Unfortunately, the latest COVID-19 (SARS-CoV-2) crisis has had a critical impact on the agricultural industry, and in this sense, the FAO foresees a 6-million-ton global sugar production deficit for 2019/2020, with the European Union, India, Pakistan, and Thailand being the most affected producers, along with Brazil, China, and Russia, to a lesser extent. Meanwhile, market expectations for ethanol and molasses remain uncertain (Zafranet). As a sub-product of sugar production, molasses plays a key role in ethanol production and is commonly used to produce livestock feed (LF) (Palmonari et al. 2020). In turn, vinasse results from the distillation of molasses. It is a liquid residue rich in organic materials and minerals commonly used for fertirrigation; however, if left untreated, vinasse can harm crops significantly (Rulli et al. 2020). Since treatments for vinasse are costly, the scientific and industrial communities have studied processes such as incineration and anaerobic digestion to use vinasse as an alternative source to generate energy (Palacios-Bereche et al. 2020). Additionally, vinasse can be used for soil fertilization, while contributing to current efforts from the ethanol industry to minimize wastewater (Cerri et al. 2020). A much less explored application of vinasse concerns LF production (Ramos-Hernández et al. 2016). This alternative can be equally beneficial to both the livestock industry and the biofuels industry. That is, molasses destined for livestock industries can be replaced by vinasse-based LF supplements, thus increasing the availability of molasses to produce ethanol. Hence, in this chapter, we propose the conceptual design of the vinasse-based LF supply chain. The remainder of this chapter is structured as follows: Sect. 2 discusses state-of-the-art applications of vinasse, including energy generation and organic fertilizer production, among others. Then, Sect. 3 details the methods used to develop the conceptual design of the vinasse-based LF supply chain. Next, in Sect. 4 we introduce the resulting conceptual design and run multiple simulations to determine how the vinasse-based LF supply chain depends on both molasses availability and ethanol production levels. Finally, Sect. 5 presents our conclusions and suggestions for future work.
14.2 Background This section discusses the most common industrial applications of sugarcane vinasse.
14.2.1 Vinasse for Energy Production The anaerobic digestion of vinasse generates considerable amounts of biogas, which can be used to produce electricity. In (Silva Neto and Gallo 2021), authors compared the potential of vinasse biogas to that of oil power plants in terms of energy generation, costs, and greenhouse gas (GHG) emissions in Brazil. In the end, the authors found that vinasse could be a viable alternative to generate electricity in the country. Also, in (Pereira et al. 2020), the authors performed an economic analysis of vinasse biogas for electricity generation in Sao Paulo. The analysis calculated the amount of sugarcane and volume of vinasse produced in the region. The study concluded that in Sao Paolo biogas from vinasse could help generate a total of 659 GWh/year of electric power, to be supplied to 295,702 inhabitants and covering 0.45% of the state’s energy demands. Finally, according to (Parsaee et al. 2019), global vinasse production is estimated to be of 22.4 gigaliters, which have the potential to produce 407.68 gigaliters of biogas. In this sense, vinasse is a potential source of renewable energy.
14.2.2 Vinasse as Soil Fertilizer Sugarcane vinasse is typically used as the raw material of organic fertilizer. In this sense, vinasse-based fertilizers significantly reduce the negative impact that is caused by releasing excessive amounts of vinasse into the environment. According to (Cerri et al. 2020), the physicochemical and morphological properties of pectin and chitosan, when combined with sugarcane vinasse, have great potential for soil fertilization. Moreover, a study conducted in (Bettani et al. 2019) combined high methoxyl pectin gel with sugarcane vinasse to produce a slow-release soil fertilizer. Vinasse acted as the biopolymer solvent, providing greater stability to the pectin gel, and as a source of nitrogen (N), potassium (K), calcium (Ca), and magnesium (Mg). Finally, a study in (Lourenço et al. 2019) revealed that vinasse combined simultaneously with mineral N fertilizers increases N2O emissions 2.9 fold, if compared to N-fertilizers alone. Therefore, the authors suggested not combing both inputs, but rather applying vinasse before or after mineral N fertilization.
14.2.3 Vinasse for Animal Feed Production Food waste and crop residues are commonly reused as LF. In this practice, animals act as natural bioprocessors capable of converting food residues that humans cannot eat into human edible food, such as meat, eggs, and milk (Dou et al. 2018). The use of food waste and agro-residues as sources of animal feed often requires comprehensive analyses in terms of food safety, natural resource conservation, and climate change. A study conducted by (Mathews et al. 2011) examined whether land could be shared between food production and biofuel production. The study pointed out at sugarcane as an example of both a food crop and a biofuel crop. On the one hand, sugarcane is used for ethanol production (biofuel); on the other hand, it can produce biomass yeast, a single-cell protein used as animal feed additive. In the end, the multiple applications of sugarcane and its byproducts has an impact on the annual global yield of this crop. Finally, researchers in (Salami et al. 2019) discussed the feasibility of replacing edible feed grains with human-inedible biomass in animal diets as a strategy to reduce food-feed competition and mitigate the environmental impact of livestock. From this discussion on state-of-the-art vinasse applications, we conclude that research on the use of vinasse for animal feed is scarce, if compared to the number of initiatives exploring vinasse potential for energy production and soil fertilization. The background contributed to the classification of the main uses of vinasses. Figure 14.1 summarizes the most common industrial applications of sugarcane vinasses. Fig. 14.1 Most common industrial applications of sugarcane vinasses
14.3 Conceptual Design of the Vinasse-Based LF Supply Chain
In this section, we discuss the systemic methodology used to design the vinasse-based LF supply chain. We introduce a case study to illustrate the principle of the methodology.
14.3.1 Methodology Figure 14.2 depicts the methodology followed to develop the conceptual model of the vinasse-based LF supply chain. The methodology comprises six main steps: (1) identifying the model variables, (2) defining the relationships between these variables, (3) building a block diagram, (4) developing a causal diagram, (5) analyzing
the causal relationships resulting from the diagram, and (6) developing the final conceptual model. All the steps are explained below:
Fig. 14.2 A conceptual model of the vinasse-based LF supply chain
1. Search for model variables. We performed a systematic review of the literature using the following keywords: vinasse, supply chain, and ethanol production.
2. Define relationships between variables. Primary relationships between variables were defined as follows: A → B.
3. Build block diagram. The block diagram was built as a graphical representation of the primary relationships identified between variables.
4. Develop causal diagram. We built a causal diagram to graphically visualize complex relationships between latent variables. The diagram also helped us find whether a given variable (the cause) had either a positive or a negative impact on another variable (the effect). Causal diagrams also depict reinforcing and/or balancing feedback loops.
5. Test causal relationships. We analyzed whether the identified causal relationships, impacts, and interrelated variables were representative of the system being studied, i.e. the vinasse-based LF supply chain. Once the causal diagram was successfully tested, we proceeded to develop the conceptual model.
6. Develop conceptual model. The final conceptual model corresponds to the successfully tested version of the causal diagram. This model is intended to be used in further research to simulate the vinasse-based LF supply chain and consequently propose implementation policies and strategies based on the results of the simulation.
14.3.2 Case Study
The case study is conducted at XYZ (name changed for confidentiality reasons), a company that distills ethanol from molasses and manufactures LF. As depicted in Fig. 14.3, the vinasse-based LF supply chain derives from the ethanol supply chain. LF production initiates with the harvest of sugarcane, which is subsequently taken to sugar mills to be processed into sugar. Molasses are generated as a byproduct during the sugar production process, and although they can be used to produce animal feed, in this case study molasses are used to produce bioethanol. In turn, ethanol has a wide range of applications in the make-up, medical, beverage, and biofuel industries. XYZ produces around 11 L of vinasse per liter of ethanol (FAO). Finally, the vinasse is used to manufacture LF to avoid its release into water bodies (e.g. local lakes and rivers), thus preserving local fauna and flora and maintaining oxygen in aquatic ecosystems. As mentioned above, vinasse can be used for different purposes; at the XYZ company it is used in two different ways: as boiler fuel (using a biodigester) or for the generation of livestock feed (through an evaporation process). In both uses, the waste generated can be used as compost in the sugarcane fields. When this happens, the sugarcane cycle closes. This cycle is explained in Fig. 14.4, which shows the possible ways to feed the system back.
Fig. 14.3 Illustration of the vinasse-based LF supply chain
Fig. 14.4 Different ways to system feedback
Considering the above, a systemic approach is used for the design and evaluation of the vinasse-based LF supply chain, which allows a broader view of the interconnections between the elements of the system.
14.3.2.1 Vinasse as a Source of Animal Feed
Vinasse from ethanol distillation is one of the most environmentally harmful organic wastes. In fact, the chemical oxygen demand (COD) of vinasse ranges between 60 and 70 g/l, with a pH of 4. Table 14.1 summarizes the physicochemical characteristics of this byproduct. As previously mentioned, XYZ uses vinasse from ethanol distillation to produce LF. Overall, industrial LF production involves concentrating vinasse via evaporation, leading to a thick nutrient-rich substance. Figure 14.5 shows the mass balance to obtain one tonne of LF out of 8.78 tonnes of vinasse. The resulting effluent is liquid water with a greater amount of impurities than drinking water, but it is still allowed to be disposed of in local water bodies or used in the company's own processes.
Table 14.1 Physicochemical characteristics of raw vinasse
Indicator         Amount (%)
Brix              13.1
pH                4.1
Total nitrogen    0.25
Total solids      11.9
Ash               3.9
Organic content   7.1
P2O5              0.014
CaO               0.57
K2O               1.53
Fig. 14.5 Mass balance of LF
14.3.2.2 Variable Selection
It is important to accurately identify the critical variables involved in the logistic processes of the vinasse-based LF supply chain. The main processes of the vinassebased LF supply chain are as follows: • Procurement. Vinasse is procured as the raw material necessary to satisfy LF production. • Production. Vinasse is converted to LF, an animal-edible product. • Distribution. The feed is brought to final consumers via warehouses and retailers. Table 14.2 introduces the critical variables identified in the block diagram and used to build the causal diagram of the vinasse-based LF supply chain. As previously mentioned, causal diagrams are a tool for graphically visualizing and defining the multiple relationships governing a system. We also considered the causal diagram of the ethanol production process, since the vinasse used for the LF supply chain is a byproduct of said process.
14.4 Results This section introduces the conceptual design of the vinasse-based LF supply chain using a causal diagram. The diagram identifies the key variables of the supply chain system. Additionally, we present the system’s equations and propose a simulation model run with STELLA ® to understand the behavior of the key performance indicators involved in the vinasse-based LF supply chain, namely molasses procurement, ethanol production, vinasse usage, and LF production, among others.
14.4.1 Causal Diagram Causal diagrams are graphs that help identify how the variables within a particular system interact either positively or negatively (Ramos-Hernández et al. 2021). To this end, causal diagrams are based upon feedback loops, which interconnect the variables and occur as a result of the system’s own complexity. Also, feedback loops may be either open or closed. In a closed-loop system, a given variable has an impact on another variable, which in turns has feedback on the first → → variable A B A B . Additionally, feedback loops can be either balancing ← ← (negative) loops or reinforcing (positive) loops. The causal diagram of the vinasse-based LF supply chain comprises three balancing loops, further explained as follows:
Table 14.2 Variables involved in the vinasse-based LF supply chain (Variable: Descriptor; References)
Molasses inventory (from supplier): Amount of molasses that a sugar mill or refinery can supply to an ethanol production factory (Palmonari et al. 2020; Ramos-Hernández et al. 2016).
Molasses procurement: The process through which an ethanol production factory procures molasses from a supplier (i.e. sugar mill or refinery) (Palmonari et al. 2020; Ramos-Hernández et al. 2016).
In-factory molasses inventory: Amount of molasses (Ton) stored within the factory and necessary for ethanol production (Rendón-Sagardi et al. 2014).
Master production schedule: Production program predefined by the company according to the number of customer orders received (Palmonari et al. 2020; Rendón-Sagardi et al. 2014).
Ethanol production capacity: Amount of ethanol produced on a daily basis (Rendón-Sagardi et al. 2014).
Ethanol production: Process of converting molasses to alcohol through distillation (Ramos-Hernández et al. 2016; Rendón-Sagardi et al. 2014).
Vinasse: Residue or byproduct generated during molasses distillation (Bettani et al. 2019; FAO Food Outlook Biannual Report on Global Food Markets 2020b).
Pollution: Emissions to water (local rivers or lakes) in case vinasse is not treated or contained within factory facilities (Meghana and Shastri 2020; Palacios-Bereche et al. 2020; Palmonari et al. 2020).
Vinasse inventory: Amount of vinasse (Ton) stored and necessary for LF production (Palmonari et al. 2020).
LF production capacity: Amount of vinasse processed on a daily basis (Palmonari et al. 2020).
Production costs: Costs incurred by the factory from manufacturing LF, including steam power, electric power, and workforce, among others (Ramos-Hernández et al. 2016).
LF production: The production process of LF (Palmonari et al. 2020).
Finished product inventory: Amount of LF ready to meet the demand.
Waste management: Amount of vinasse daily processed to obtain LF (Palmonari et al. 2020).
Finished product demand: Amount of LF consumed by final consumers (Ramos-Hernández et al. 2016; Rendón-Sagardi et al. 2014).
Demand satisfaction: Percentage of customer orders successfully processed on time (Palmonari et al. 2020; Ramos-Hernández et al. 2016; Rendón-Sagardi et al. 2014).
Fig. 14.6 Balancing loop B1
Balancing loop B1. As long as Molasses Inventory (from supplier) increases, Molasses Procurement levels at XYZ will also rise. However, as greater amounts of molasses are requested by XYZ, Molasses Inventory (from supplier) decreases (see Fig. 14.6).
Balancing Loop B2. Ethanol Production Capacity has a direct effect on Ethanol Production, since the latter may increase or decrease as the former either increases or decreases, respectively. Likewise, greater amounts of in-factory Molasses Inventory will lead to an increase in Ethanol Production. In turn, as Ethanol Production levels increase or decrease, Vinasse Inventory increases or decreases, respectively (see Fig. 14.7).
Fig. 14.7 Balancing loop B2
Balancing Loop B3. Greater LF Production Capacity leads to higher LF Production. In turn, a surge in LF Production causes Vinasse Inventory to decrease. Additionally, Vinasse Inventory has a negative effect on the environment that can be explained as follows: the more vinasse is used for LF Production, the lower the environmental impact is caused. Notice that as LF Production levels increase, Vinasse Inventory levels also rise, which consequently reduces environmental pollution (see Fig. 14.8).
Fig. 14.8 Balancing loop B3
The causal diagram comprises other variables, not forming loops but equally important to understand the performance of the vinasse-based LF supply chain. Figure 14.9 depicts the variables related to factors such as costs and profits. As LF Sales increase, Profits also increase but LF Inventory is diminished, which at some point may affect Demand Satisfaction. On the other hand, higher levels of LF Production lead to higher levels of Finished Product Inventory, which in turn entail more Costs. Finally, Fig. 14.10 depicts our proposal for the causal diagram of the vinasse-based LF supply chain. Notice that the proposal also encompasses the causal diagram of the ethanol production process, from which vinasse is derived. The following section introduces the simulation model of the vinasse-based LF supply chain, which takes into account the causal diagram discussed in this section.
Fig. 14.9 Behavior of utilities within the vinasse-based LF supply chain
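To make the balancing behavior concrete, the short Python sketch below integrates a single stock (vinasse inventory) whose outflow grows with the stock itself, in the spirit of loop B3. It is a minimal illustration under assumed parameter values, not data from XYZ or output of the STELLA model.

```python
# Minimal stock-and-flow sketch of a balancing loop (illustrative values only).
# The stock (vinasse inventory) is fed by a constant inflow and drained by an
# outflow that grows with the stock, so the stock settles instead of exploding.

def simulate_balancing_loop(hours=200, vinasse_inflow=45_000.0, processing_fraction=0.12):
    stock = 0.0
    trajectory = []
    for _ in range(hours):
        outflow = processing_fraction * stock   # LF production pulls harder as the stock grows
        stock += vinasse_inflow - outflow       # net accumulation during this hour
        trajectory.append(stock)
    return trajectory

if __name__ == "__main__":
    levels = simulate_balancing_loop()
    print(f"after 200 h the stock settles near {levels[-1]:,.0f} L "
          f"(equilibrium = inflow / fraction = {45_000 / 0.12:,.0f} L)")
```

The stock converges to the point where outflow equals inflow, which is the defining trait of a balancing (negative) loop.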
Fig. 14.10 Causal diagram of vinasse-based LF production
14.4.2 Simulation Model We took as reference the causal diagram developed in Sect. 3 to build the simulation model of the vinasse-based LF supply chain. The model equations are introduced and explained below.
14.4.2.1 Equations
This section explains the equations used to run the simulation model of the vinasse-based LF supply chain.
Orders from Supplier. XYZ places daily supply orders (DOS) of molasses as indicated by Eq. 14.1. During harvest season, XYZ receives 20–30 30-tonne batches of molasses per day. On the other hand, when it is not harvest season, the company may receive from 0 to eight batches daily of the same weight.

$$DOS = \begin{cases} 25 \le DOS \le 30, & \text{harvest season} \\ 0 \le DOS \le 8, & \text{no harvest season} \end{cases} \tag{14.1}$$

Ethanol Production. XYZ distills ethanol from molasses. Hence, ethanol production is calculated by multiplying the amount of distilled molasses (DM) by molasses yield (MY) (see Eq. 14.2). We calculated daily EP with a normal distribution of 421.26 tonnes of molasses/day and a standard deviation of 24.58 (see Eq. 14.3). Additionally, we acknowledged that MY levels may vary due to distillation being a chemical reaction. Hence, we used the RANDOM function to represent this variation in the model, thereby working with random MY values within the range of 235.118 and 292.397 L of ethanol per tonne of molasses (see Eq. 14.4).

$$EP = DM \times MY \tag{14.2}$$
$$DM = Normal(421.26,\ 24.58) \tag{14.3}$$
$$MY = Random(235.11,\ 292.39) \tag{14.4}$$

Vinasse. It is the primary source for LF production. The amount of vinasse generated at XYZ directly depends on EP levels. Overall, XYZ produces 11 L of vinasse (GW) per liter of ethanol. This relationship is expressed in Eq. 14.5.

$$V = EP \times GW \tag{14.5}$$

LF Production. LF production (LFP) directly depends on the flow of in-factory vinasse inventory ($CF_h$). This flow is calculated in the following sections and is multiplied by the percent recovery (PR) of vinasse (see Eq. 14.6). In this sense, PR has a normal distribution with 11.58% as the mean value and 1.8% as the standard deviation (see Eq. 14.7).

$$LFP = \sum_{i=1}^{n} PR\,(CF_h) \tag{14.6}$$
$$PR = Normal(11.58,\ 1.8) \tag{14.7}$$

Inventories. XYZ works with three different types of in-factory inventories: LF Inventory (LFI), Vinasse Inventory (VI), and Molasses Inventory (MI). Equations 14.8, 14.9, and 14.10 are used to calculate these inventories as follows:

$$I_{LF} = I_{MD}\big|_{t=0} + \int_{0}^{t} (V \cdot PR - S)\,dt \tag{14.8}$$
$$I_{LF} = I_{MD}\big|_{t=0} + \int_{0}^{t} (V \cdot PR - S)\,dt \tag{14.9}$$
$$I_{M} = I_{M}\big|_{t=0} + \int_{0}^{t} (DOS \cdot Q_m - DM)\,dt \tag{14.10}$$
In Eq. 14.8, S represents actual LF sales, whereas Qm in Eq. 14.10 stands for the amount of molasses per order requested to a supplier.
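As a rough cross-check of Eqs. 14.1–14.10, the following Python sketch advances the model one day at a time. It simplifies Eq. 14.6 by applying the recovery percentage directly to the day's vinasse flow, and the harvest-season length and daily LF sales figure are illustrative assumptions, not values reported by XYZ.

```python
import random

QM = 30   # tonnes of molasses per delivered batch (Qm in Eq. 14.10)
GW = 11   # litres of vinasse generated per litre of ethanol (Eq. 14.5)

def simulate_day(harvest_season, molasses_inv, vinasse_inv, lf_inv, daily_sales):
    """One simulated day following Eqs. 14.1-14.10 (simplified)."""
    dos = random.randint(25, 30) if harvest_season else random.randint(0, 8)  # Eq. 14.1
    dm = random.normalvariate(421.26, 24.58)          # tonnes of molasses distilled, Eq. 14.3
    dm = max(0.0, min(dm, molasses_inv + dos * QM))   # cannot distil more than is on hand
    my = random.uniform(235.11, 292.39)               # litres of ethanol per tonne, Eq. 14.4
    ep = dm * my                                      # ethanol production, Eq. 14.2
    v = ep * GW                                       # vinasse generated, Eq. 14.5
    pr = random.normalvariate(11.58, 1.8) / 100       # percent recovery of vinasse, Eq. 14.7
    lfp = v * pr                                      # livestock feed produced (simplified Eq. 14.6)
    molasses_inv += dos * QM - dm                     # molasses inventory, Eq. 14.10
    vinasse_inv += v - lfp                            # vinasse inventory balance
    lf_inv += lfp - daily_sales                       # LF inventory, Eq. 14.8 (S = daily sales)
    return molasses_inv, vinasse_inv, lf_inv

if __name__ == "__main__":
    m_inv = v_inv = lf_inv = 0.0
    for day in range(1, 181):
        m_inv, v_inv, lf_inv = simulate_day(
            harvest_season=(day <= 120),              # assumed 120-day harvest window
            molasses_inv=m_inv, vinasse_inv=v_inv,
            lf_inv=lf_inv, daily_sales=120_000)       # assumed sales of 120,000 L of LF per day
    print(f"molasses: {m_inv:,.0f} t | vinasse: {v_inv:,.0f} L | LF: {lf_inv:,.0f} L")
```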
14.4.2.2 Validation
We conducted a dimensional consistency analysis and performed an outlier test to verify whether our simulation model of the vinasse-based LF supply chain behaved as expected.
• Dimensional consistency analysis. As the most basic test offered by System Dynamics (SD), a dimensional consistency analysis involves making sure that each term in a given equation has the same dimensions as the other terms in that equation. To this end, we checked all the input parameters of the simulation model.
• Outlier test. It is a validation test well known across the many approaches to simulation modeling. Under an SD approach, the outlier test relies on a causal diagram that has conceptual validity from the moment it is first developed. To demonstrate the validity of the simulation model, we employed a portion of the causal diagram proposed earlier in Fig. 14.4. Validating this diagram portion, illustrated below as Fig. 14.11, implied confirming the following premise: as downtime length due to system maintenance increases, days of actual ethanol production decrease.
Fig. 14.11 Feedback loop used in outlier test
Figure 14.12 summarizes the model validation results throughout four scenarios, which helped us test the model under extreme conditions. In the first scenario, represented by the blue line (1), planned downtime due to system maintenance lasts for 200 days. Evidently, ethanol production is halted during this period of time, thus having an adverse impact on ethanol inventory. In the second scenario, represented by a red line (2), downtime is planned for 100 days and begins on day 180. Hence, there may be more ethanol production days in the second scenario than in the first scenario. In the third scenario, indicated by the fuchsia line (3), downtime lasts for 15 days only, as XYZ normally plans, and hence has little effect on ethanol inventory. Finally, as the green line indicates (4), the fourth scenario shows continuous production of ethanol at XYZ, since no downtime sessions are scheduled.
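The extreme-conditions logic of the outlier test can be reproduced in a few lines of Python; the daily output figure below is a rough illustration built from the mean values of Eqs. 14.3 and 14.4, not a result taken from the STELLA model.

```python
def ethanol_after_downtime(total_days=365, downtime_days=15):
    """Longer maintenance downtime must leave fewer production days and
    therefore less cumulative ethanol (the premise checked in the outlier test)."""
    production_days = max(0, total_days - downtime_days)
    mean_daily_output = 421.26 * (235.11 + 292.39) / 2   # tonnes distilled x mean yield (L/tonne)
    return production_days, production_days * mean_daily_output

if __name__ == "__main__":
    for downtime in (200, 100, 15, 0):   # the four validation scenarios of Fig. 14.12
        days, litres = ethanol_after_downtime(downtime_days=downtime)
        print(f"downtime {downtime:3d} days -> {days:3d} production days, {litres:,.0f} L of ethanol")
```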
14.4.2.3 Simulation
XYZ uses three pits to store the molasses supplied by sugar mills and later used to produce ethanol. Figure 14.13 shows the filling behavior of the three pits, indicated by blue, red, and fuchsia lines, respectively. As can be observed, each pit is filled at a different time, since molasses are delivered to the factory daily in limited amounts. The figure also shows that the molasses inventory in the company decreases steadily in each pit as ethanol is being produced. On average, XYZ produces 120 L of ethanol on a daily basis (see Fig. 14.14), yet as explained in the previous section, production levels are affected by a planned 15-day downtime due to system maintenance. XYZ generates about 11 L of vinasse per liter of ethanol. This relation is depicted in Fig. 14.15, which indicates that on a daily basis XYZ generates more than one million liters of vinasse that can be used to produce LF. Figure 14.16 shows the changing behavior of vinasse inventory at XYZ.
Fig. 14.12 Model validation results
Fig. 14.13 Molasses storage at XYZ
Fig. 14.14 Daily ethanol production at XYZ
In fact, vinasse inventory increases with daily ethanol production but simultaneously decreases with daily LF production. Figure 14.17 represents the simulation of 500 h of this relationship. Figure 14.18 illustrates the simulated trend of LF production. Our simulations estimate that on average 6,876 L of LF can be obtained per hour; that is approximately 164,000 L of LF on a 24-h basis. This amount of final product can be stored in steel tanks to be subsequently delivered to final customers in either bulks or 20-L buckets. Additionally, as the figure shows, planned downtimes during ethanol production have a subsequent impact on both vinasse and LF production.
Fig. 14.15 Daily ethanol production versus vinasse generated
Fig. 14.16 Vinasse inventory
Fig. 14.17 Inventory for 500 h
Fig. 14.18 Behavior of LF production on a yearly basis
14.4.2.4 Sensitivity Analysis
In order to evaluate the overall behavior of the vinasse-based LF supply chain, we needed to incorporate its last stage: product distribution. To this end, we simulated multiple LF demand scenarios, whose values are listed in Table 14.3. As can be observed, the first scenario initiates with a demand of 150,000 L of LF, which increases by 10,000 L in each subsequent scenario, thus reaching a demand of 190,000 L of LF in the fifth scenario. Figure 14.19 illustrates the results of the sensitivity analysis of LF demand, taking into account a finished product storage capacity of 5,000,000 L of LF. Scenarios 1, 2, and 3 point to an excess of LF inventory as a result of production being higher than demand. On the other hand, the fourth scenario indicates a more balanced relationship between the amount of LF that is produced and the amount sold, thus resulting in an ending inventory of 1,320,037 L of LF. Finally, in the fifth scenario, XYZ would not be able to meet a 190,000-L demand, as it would exceed its production capacity.
Table 14.3 Scenarios for sensitivity analysis of LF demand
Scenario   Demand (L)
1          150,000
2          160,000
3          170,000
4          180,000
5          190,000
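A back-of-the-envelope version of this sensitivity analysis can be scripted as below. It assumes a constant daily LF output equal to the average reported in Sect. 14.4.2.3, so the exact numbers differ from those produced by the stochastic STELLA model, but the qualitative pattern across the five scenarios is the same.

```python
def run_scenario(daily_production=164_000, daily_demand=150_000,
                 days=365, storage_capacity=5_000_000):
    """Accumulate the daily production-demand gap, capped by storage capacity."""
    inventory = 0.0
    unmet_demand = 0.0
    for _ in range(days):
        inventory += daily_production                 # today's output goes into stock
        sold = min(daily_demand, inventory)           # only what is in stock can be sold
        unmet_demand += daily_demand - sold
        inventory = min(inventory - sold, storage_capacity)
    return inventory, unmet_demand

if __name__ == "__main__":
    for demand in (150_000, 160_000, 170_000, 180_000, 190_000):   # Table 14.3
        ending, unmet = run_scenario(daily_demand=demand)
        print(f"demand {demand:,} L/day -> ending inventory {ending:,.0f} L, "
              f"unmet demand {unmet:,.0f} L")
```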
14.5 Conclusions and Future Work The ethanol industry is key to the economic development of many regions. Among its multiple applications, ethanol is supplied as a raw material to other industries to produce alcoholic beverages, pharmaceutical products, biofuels, and most recently, sanitizers and disinfectants against COVID-19. As a byproduct of ethanol production, vinasse is a highly toxic substance that, if released untreated, becomes highly harmful to the environment. Hence, both industrial and scientific efforts are made to find alternative uses for vinasse in energy generation, soil fertilization, and LF production. However, this last alternative has not been sufficiently explored. To address this gap, we propose the conceptual design of the vinasse-based LF supply chain, in which we consider vinasse generated from ethanol production. Using the SD methodology, we first built a causal diagram of the vinasse-based LF supply chain system to identify its key variables. Then, we ran the simulation model on STELLA ®, thus generating the necessary scenarios to validate and evaluate the behavior of the key logistics processes involved in the vinasse-based LF supply chain system. Our results demonstrate that a continuous supply of molasses ensures a continuous production of ethanol, which in turn guarantees constant vinasse availability to produce LF. Using vinasse in animal feed production contributes to the world’s efforts to protect the environment. However, such a virtuous circle may easily become a vicious circle if vinasse is not used as much and as efficiently as in this case study. Similarly, the simulation model allowed us to generate different LF demand satisfaction scenarios that can support XYZ in its decisions to entirely meet LF demand from its customers. Finally, as future work, we propose to perform a cost-benefit analysis of the vinasse-based LF supply chain and simulate other scenarios, such as that of operating risks during product demand shocks, occurred as a result of unexpected events, such as the COVID-19 crisis.
Fig. 14.19 LF inventory behavior under different LF demand scenarios
References Bettani SR, de Oliveira Ragazzo G, Leal Santos N et al (2019) Sugarcane vinasse and microalgal biomass in the production of pectin particles as an alternative soil fertilizer. Carbohydr Polym 203:322–330. https://doi.org/10.1016/j.carbpol.2018.09.041 Cerri BC, Borelli LM, Stelutti IM et al (2020) Evaluation of new environmental friendly particulate soil fertilizers based on agroindustry wastes biopolymers and sugarcane vinasse. Waste Manag 108:144–153. https://doi.org/10.1016/j.wasman.2020.04.038 Dou Z, Toth JD, Westendorf ML (2018) Food waste for livestock feeding: feasibility, safety, and sustainability implications. Glob. Food Sec. 17:154–161 FAO OCDE-FAO Perspectivas Agrícolas 2019–2028. http://www.fao.org/3/ca4076es/CA4076ES. pdf. Accessed 10 Nov 2020a FAO Food Outlook Biannual Report on Global Food Markets. http://www.fao.org/3/ca9509en/ca9 509en.pdf. Accessed 9 Nov 2020b Lourenço KS, Rossetto R, Vitti AC et al (2019) Strategies to mitigate the nitrous oxide emissions from nitrogen fertilizer applied with organic fertilizers in sugarcane. Sci Total Environ 650:1476– 1486. https://doi.org/10.1016/j.scitotenv.2018.09.037 Mathews JA, Tan H, Moore MJB, Bell G (2011) A conceptual lignocellulosic “feed + fuel” biorefinery and its application to the linked biofuel and cattle raising industries in Brazil. Energy Policy 39:4932–4938. https://doi.org/10.1016/j.enpol.2011.06.022 Meghana M, Shastri Y (2020) Sustainable valorization of sugar industry waste: status, opportunities, and challenges. Bioresour Technol 303:122929 Palacios-Bereche MC, Palacios-Bereche R, Nebra SA (2020) Comparison through energy, exergy and economic analyses of two alternatives for the energy exploitation of vinasse. Energy 197:117231. https://doi.org/10.1016/j.energy.2020.117231 Palmonari A, Cavallini D, Sniffen CJ et al (2020) Short communication: characterization of molasses chemical composition. J Dairy Sci 103:6244–6249. https://doi.org/10.3168/jds.2019-17644 Parsaee M, Kiani Deh Kiani M, Karimi K (2019) A review of biogas production from sugarcane vinasse. Biomass Bioenerg 122:117–125 Pereira IZ, dos Santos IFS, Barros RM et al (2020) Vinasse biogas energy and economic analysis in the state of São Paulo, Brazil. J Clean Prod 260:121018. https://doi.org/10.1016/j.jclepro.2020. 121018 Ramos-Hernández R, Mota-López DR, Sánchez-Ramírez C et al (2016) Assessing the impact of a vinasse pilot plant scale-up on the Key Processes of the Ethanol supply chain. Math Probl Eng. https://doi.org/10.1155/2016/3504682 Ramos-Hernández R, Sánchez-Ramírez C, Mota-López DR et al (2021) Evaluation of bioenergy potential from coffee pulp trough system dynamics. Renew Energy 165:863–877. https://doi.org/ 10.1016/j.renene.2020.11.040 Rendón-Sagardi M, Sánchez-Ramírez C, Cortes-Robles G et al (2014) Dynamic analysis of feasibility in ethanol supply chain for biofuel production in Mexico. Appl Energy 123:358e367. https:// doi.org/10.1016/j.apenergy.2014.01.023 Rulli MM, Villegas LB, Colin VL (2020) Treatment of sugarcane vinasse using an autochthonous fungus from the northwest of Argentina and its potential application in fertigation practices. J Environ Chem Eng 8:104371. https://doi.org/10.1016/j.jece.2020.104371 Salami SA, Luciano G, O’Grady MN et al (2019) Sustainability of feeding plant by-products: a review of the implications for ruminant meat production. 
Anim Feed Sci Technol 251:37–55 Silalertruksa T, Gheewala SH (2020) Competitive use of sugarcane for food, fuel, and biochemical through the environmental and economic factors. Int J Life Cycle Assess 25:1343–1355. https:// doi.org/10.1007/s11367-019-01664-0
Silva Neto JV, Gallo WLR (2021) Potential impacts of vinasse biogas replacing fossil oil for power generation, natural gas, and increasing sugarcane energy in Brazil. Renew Sustain Energy Rev 135:110281. https://doi.org/10.1016/j.rser.2020.110281 Zafranet El azúcar en tiempos de virus. https://www.zafranet.com/2020/03/el-azucar-en-tiemposde-virus/. Accessed 20 Oct 2020
Part III
Artificial Intelligence Techniques
Chapter 15
Comparative Analysis of Decision Tree Algorithms for Data Warehouse Fragmentation
Nidia Rodríguez-Mazahua, Lisbeth Rodríguez-Mazahua, Asdrúbal López-Chau, Giner Alor-Hernández, and S. Gustavo Peláez-Camarena
Abstract One of the main problems faced by Data Warehouse (DW) designers is fragmentation. Several studies have proposed data mining-based horizontal fragmentation methods, which focus on optimizing query response time and execution cost to make the DW more efficient. However, to the best of our knowledge, there is no horizontal fragmentation technique that uses a decision tree to carry out fragmentation. Given the importance of decision trees in classification, since they allow obtaining pure partitions (subsets of tuples) in a data set using measures such as Information Gain, Gain Ratio, and the Gini Index, the aim of this work is to use decision trees in DW fragmentation. This chapter presents the analysis of different decision tree algorithms to select the best one to implement the fragmentation method. Such analysis was performed under version 3.9.4 of Weka, considering four evaluation metrics (Precision, ROC Area, Recall, and F-measure) for different selected data sets using the SSB (Star Schema Benchmark). Several experiments were carried out using two attribute selection methods, Best First and Greedy Stepwise; the data sets were pre-processed using the Class Conditional Probabilities filter, and the analysis of two data sets (24 and 50 queries) with this filter was included to know the behavior of the decision tree algorithms for each data set. Once the analysis was concluded, we can determine that for the 24-query data set the best algorithm was RandomTree, since it won in two methods. On the other hand, in the 50-query data set, the best decision tree algorithms were LMT and RandomForest, because they obtained the best performance for all methods tested. Finally, J48 was the selected algorithm when neither an attribute selection method nor the Class Probabilities filter is used. But, if only the latter is applied to the data set, the best performance is given by the LMT algorithm.
N. Rodríguez-Mazahua · L. Rodríguez-Mazahua (B) · G. Alor-Hernández · S. G. Peláez-Camarena
Tecnológico Nacional de México/ IT Orizaba, Orizaba, Veracruz, Mexico
e-mail: [email protected]
N. Rodríguez-Mazahua e-mail: [email protected]
G. Alor-Hernández e-mail: [email protected]
S. G. Peláez-Camarena e-mail: [email protected]
A. López-Chau
Universidad Autónoma Del Estado de México, Centro Universitario UAEM Zumpango, Estado de México, Mexico
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
J. A. Zapata-Cortes et al. (eds.), New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Studies in Computational Intelligence 966, https://doi.org/10.1007/978-3-030-71115-3_15
15.1 Introduction A Data Warehouse (DW) is a theme-oriented, integrated, time variable, and nonvolatile data collection in support of management’s decision-making process. Data Warehousing provides architectures and tools for business executives to systematically organize, understand, and use the data to make strategic decisions. Data warehouse systems are valuable tools in the fast-changing and competitive world. In recent years, many companies have spent millions of dollars building company-wide data warehouses. Many people think that with the increasing competition across industries, data warehousing is the newest indispensable marketing strategy and a way to keep customers by learning more about their needs (Han et al. 2012). On the other hand, fragmentation is a distributed database design technique that consists of dividing each database relation into smaller fragments and treating each fragment as an object in the database separately, there are three alternatives for that: horizontal, vertical, and hybrid fragmentation (Ozsu and Valduriez 2020). One of the main problems faced by Data Warehouse designers is fragmentation. Several studies have proposed data mining-based horizontal fragmentation methods, which focus on optimizing query response time and execution cost to make the DW more efficient. However, to the best of our knowledge, it does not exist a horizontal fragmentation technique that uses a decision tree to carry out fragmentation. In this work, we propose using decision tree classifiers in the DW fragmentation because their construction does not require any domain knowledge or parameter setting, they can handle multidimensional data, the learning and classification steps of decision tree induction are simple and fast, and they have good accuracy (Han et al. 2012). Furthermore, decision trees allow obtaining pure partitions (subsets of tuples) in a data set using measures such as Information Gain, Gain Ratio, and the Gini Index. This chapter presents the analysis of different decision tree algorithms to select the best one to implement the fragmentation method performed under version 3.9.4 of Weka considering four evaluation metrics (Precision, ROC Area, Recall, and Fmeasure) for different selected data sets using the SSB (Star Schema Benchmark). Then, new experiments were performed to analyze the algorithm behavior. For it, two attribute selection methods were used included similarly in the version of Weka. The methods used were: Best First and Greedy Stepwise. Also, the data sets were pre-processed using the filter Class Conditional Probabilities and the analysis of the
15 Comparative Analysis of Decision Tree Algorithms …
339
same data sets (24 and 50 queries) was included for the purpose of knowing the differences between the decision tree algorithms. This chapter is made up of the following parts: Sect. 15.2 describes some basic concepts, Sect. 15.3 goes through the related works on DW horizontal fragmentation, Sect. 15.4 sets the method used in this work for the analysis of decision tree algorithms and a description of each algorithm is given, Sect. 15.5 reports the preliminary results in the work, and finally, the chapter is concluded and the future work is described in Sect. 15.6.
15.2 Background Horizontal fragmentation divides a database relation into subsets of tuples. There are two versions: primary and derived. Primary horizontal fragmentation of a relation is performed using predicates that are defined on that relation. Derived horizontal fragmentation, on the other hand, is the partitioning of a relation that results from predicates defined on another relation (Ozsu and Valduriez 2020). In the last decades, several horizontal fragmentation methods for data warehouses have been proposed because this technique is able to reduce the response time of OLAP (On-line Analytical Processing) queries and it has significant advantages during table loading and maintenance operations (Kimball and Ross 2008). Dimensional modeling is widely accepted as the preferred technique for presenting analytic data because it delivers data that is understandable to the business users and it achieves fast query performance. Dimensional models implemented in relational database management systems are called star schemas. These schemas have two key components: a fact table and several dimension tables. The former stores the performance measurements resulting from the business process events of an organization. The latter contains the textual context associated with a business process measurement event (Kimball and Ross 2013). In the relational context, derived horizontal partitioning is known as more suitable for DW because it takes into account the OLAP query requirements and avoids unnecessary calculates in join operations (Mahboubi and Darmont 2008). Therefore, most of the horizontal fragmentation techniques for data warehouses are derived, i.e., the partitioning of the fact table is developed according to the fragmentation schema of a selected dimension table. Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the web, other information repositories, or data that are streamed into the systems dynamically. Classification is a data mining task that obtains a model that describes and distinguishes data classes or concepts. The model is discovered based on the analysis of a training data set, which consists of data objects with a class label. The model predicts the class label of objects for which the class label is unknown (Han et al. 2012). Several classification based horizontal fragmentation methods for data
340
N. Rodríguez-Mazahua et al.
warehouses have been proposed (Amina and Boukhalfa 2013). In the next section, a number of approaches are discussed.
15.3 Related Works In Amina and Boukhalfa (2013), an approach based on classification and election to select a horizontal fragmentation scheme in the case of large workloads was proposed. First, the authors classify queries in classes to reduce the size of the workload using k-means. Then, they choose a query of each class to build a smaller workload. After that, they analyze the workload and collect metadata and select the fragmentation scheme using a genetic algorithm (Bellatreche et al. 2008). In contrast, a technique with a goal to divide (horizontally) data into fine-grained, size-balanced blocks in a way that queries can maximize the block jump was presented in Sun et al. (2014). This is an offline process that runs at data load time and can be run later to consider a more recent workload. Representative filters on a workload are first extracted as characteristics using frequent itemset mining. Based on these characteristics, each tuple can be represented as a vector of characteristics. Then, the blocking problem is formulated as an optimization problem on the feature vectors, called Balanced MaxSkip Partition, which was tested as NP-hard. To find a rough solution efficiently, the bottom-up clustering framework (Ward 1963) was adopted. Another approach is (Hanane and Kamel 2014), which is focused on the combined selection of horizontal fragmentation and bitmap join indexes. According to the authors, all the proposed approaches use algorithms to share attributes between these two techniques. This work shows that attribute sharing based approaches ignore some interesting workarounds. Therefore, a new approach based on data mining was proposed, which consists of classifying queries between the horizontal partition and the bitmap join indexes using k-means. Each subset of queries is exploited using a suitable optimization technique to select the appropriate optimization settings. Query sharing enables pruning of search space and reduced complexity of selection problems. A methodology based on statistics (Jaccard index), data mining (hierarchical clustering), and meta-heuristics (particle swarm optimization) was presented in Toumi et al. (2015) to solve the problem of horizontal fragmentation in data warehouses using a relatively large workload. First, it calculated the attraction between the predicates using the Jaccard index, followed by a hierarchical grouping of the predicate set with the Ward algorithm (Ward 1963). In the second step, it used DPSO (Discrete Particle Swarm Optimization) to select the best fragmentation scheme. Cloud SDW (Spatial DW) and spatial OLAP (On-line Analytical Processing) as a Service concept were presented in Mateus et al. (2016). Later, those concepts were used to describe two different hierarchy-based data partitioning techniques for the SDW hosted in the cloud: Spatial-based partitioning which fragments spatial dimension tables horizontally according to a spatial hierarchy and replicates the
15 Comparative Analysis of Decision Tree Algorithms …
341
remaining tables of the SDW, and Conventional-based that aims to fragment tables of conventional dimensions horizontally and replicate the remaining tables of the SDW. It also takes into account the existence of a hierarchy between conventional attributes in the dimension tables. In contrast, the approach proposed in Abdelaziz and Ouzzif (2017) consisted of an incremental horizontal fragmentation technique for the DW through a web service. This technique is based on updating the query load by adding new frequent queries and removing queries that no longer remain frequent. The goal was to automate the implementation of incremental fragmentation in order to optimize a new query load. In Kechar and Nait-Bahloul (2017), the authors established that horizontal fragmentation of the data warehouse is considered as one of the important performance optimization techniques of the decision-support queries. This optimization is reached only if the large data volume of the fact table is horizontally fragmented. For that reason, the fragments of the fact table are always derived from the fragments of the dimension tables. Unfortunately, in this type of fragmentation, the fragment number can dramatically increase, and their maintenance becomes quite hard and costly. Thus, to reduce the number of the fragments and to further optimize the decisionsupport queries performances, the authors proposed to fragment horizontally only the fact table by exploiting jointly: the selectivity of the selection predicates, their occurrence numbers, and their access frequencies. On the other hand, (Barkhordari and Niamanesh 2018) proposed a method called Chabok, which uses two-phase Map-Reduce to solve DW problems with big data. Chabok is used for star-schema data warehouses and can compute distributive measures. This method can also be applied to big dimensions, which are dimensions where data volume is greater than the volume of a node. Chabok fragments horizontally the fact table. If there are homogeneous nodes, the same number of records is allocated to each Fact-Mapper node. The proposed method was implemented on Hadoop, and TPC-DS queries were executed for benchmarking. The query execution time on Chabok surpassed prominent big data products for data warehousing. As part of their ongoing work on workload-driven partitioning, in Boissier and Daniel (2018), the authors implemented an approach called aggressive data skipping and extended it to handle both analytical and transactional access patterns. The main objective was to determine a partitioning scheme that (i) partitions the data set into a given number of fragments and (ii) is optimized for efficient partition pruning given a workload. This approach was evaluated with the workload and data of a production system of a global 2000 company. In contrast, the method presented in Barr et al. (2018) used linear programming to solve the NP-hard problem of determining a horizontal fragmentation scheme in relational DW. In addition to designing and solving the problem of selection of the horizontal fragmentation technique, the problem was considered in two simultaneous objectives, called: the number of Inputs /Outputs necessary to execute the global workload, and the number of fragments generated to identify the best solutions compared to the Pareto dominance concept.
Also, in Nam et al. (2018), it was established that as the amount of data to process increases, an efficient horizontal database partitioning method becomes more important for OLAP query processing on parallel database platforms. Existing methods have major disadvantages, such as a large amount of data redundancy, and in many cases they do not support join processing without restructuring (shuffle) despite their high data redundancy. The authors proposed a graph-based database partitioning method called GPT that improves query performance with less data redundancy. A technique based on frequent itemset mining was proposed by Ramdane et al. (2019a) to Partition, Bucket, and Sort the Tables (PBSTs) of a big data warehouse with the most frequent predicate attributes in the queries. The authors took into account the density of the attributes of the tables, data skew, and the physical characteristics of the cluster nodes. This approach uses a hash-partitioning technique and consists of building horizontal fragments of the fact and dimension tables of a big relational data warehouse, using PBST techniques based on the query workload.

To improve their previous work, in Nam et al. (2019), the authors integrated their proposed GPT method into a parallel query processing system, Spark SQL, across all the relevant layers and modules, including the query plan generator and the scan operator. Through extensive experiments using three benchmarks, TPC-DS, IMDB, and BioWarehouse, Nam et al. showed that GPT significantly outperforms the state-of-the-art method in terms of both storage overhead and query performance. In Letrache et al. (2019), the authors stated that most strategies for DW fragmentation focus on relational DW fragmentation and ignore OLAP cubes, although the latter are the first affected by multidimensional user queries. To address this problem, a dynamic fragmentation strategy for OLAP cubes using association rule mining was proposed. Letrache et al. considered that even if fragmentation is supported by most OLAP vendors, the definition of an efficient fragmentation strategy cannot be done by tools alone.

On the other hand, in Kechar and Nait-Bahloul (2019), the authors developed an enhanced version of their previous work (Kechar and Nait-Bahloul 2017). They presented a horizontal data partitioning approach tailored to a large DW interrogated through a high number of queries. The idea was to fragment only the large fact table horizontally, based on partitioning predicates elected from the set of selection predicates used by the analytic queries. In contrast, the authors of Ramdane et al. (2019b) observed that horizontal partitioning techniques have been used for many purposes in big data processing, such as load balancing, skipping unnecessary data loads, and guiding the physical design of a data warehouse. Therefore, they proposed a new data placement strategy in the Apache Hadoop environment called "Smart Data Warehouse Placement (SDWP)", which allows performing the star join operation in only one Spark stage. The problem of partitioning and load balancing in a cluster of homogeneous nodes was investigated; experiments using the TPC-DS benchmark showed that the proposed method enhances OLAP query performance in terms of execution time.
Likewise, Ramdane et al. (2019c) mixed a data-driven and a workload-driven model to create a new scheme for distributed big data warehouses over Hadoop, called "SkipSJoin". First, SkipSJoin builds horizontal fragments (buckets) of the fact and dimension tables of the DW using a hash-partitioning method and distributes these buckets evenly over the nodes of the cluster. Then, it allows skipping the scanning of some unnecessary data blocks by hash-partitioning some DW tables on frequent filter attributes. Through experiments using the TPC-DS benchmark, Ramdane et al. showed that the proposal outperforms some approaches in terms of query execution time. In contrast, Hilprecht et al. (2019a) noted that commercial data analytics products such as Microsoft Azure SQL Data Warehouse or Amazon Redshift provide ready-to-use scale-out database solutions for OLAP-style workloads in the cloud. Whereas the provisioning of a database cluster is in general fully automated by cloud providers, customers still have to make important design decisions that were traditionally made by the database administrator, such as selecting the partitioning schemes. Therefore, the authors proposed a learned partitioning advisor for analytical OLAP-style workloads based on Deep Reinforcement Learning (DRL). The leading idea was that a DRL agent learns its decisions based on experience by monitoring the rewards for different workloads and partitioning schemes. Similarly, Hilprecht et al. (2019b) evaluated the learned partitioning advisor on three different database schemata and workloads (SSB, TPC-DS, and TPC-CH) varying in complexity, ranging from a simple star schema to a complex normalized schema. In the experiments, PostgreSQL-XL was used as the distributed database system. Likewise, in their next work (Hilprecht et al. 2020), the authors showed that their approach is not only able to find partitionings that outperform existing approaches for automated partitioning design, but that it can also adjust to different workloads and new queries.

Finally, Parchas et al. (2020) focused on the "Dist-Key" style of horizontal partitioning used by most commercial data warehouse systems, which hashes the tuples of a relation on the values of a specific attribute known as the distribution key. The authors' purpose was to reduce the network cost of a given workload by selecting the best attribute to hash-distribute each table of the data warehouse. They proposed BaW (Best of All Worlds), a hybrid approach that combines heuristic and exact algorithms to choose the optimal distribution key for a subset of the relations.

After an exhaustive analysis of the state of the art, the methods for horizontal fragmentation of data warehouses were classified according to their main characteristics, such as how the proposed method was validated and the basis of the fragmentation method. Tables 15.1 and 15.2 show the main characteristics of the methods described above.
Table 15.1 Comparative table of works on horizontal fragmentation (A)

Work | Classification | Validation
Amina and Boukhalfa (2013) | Data mining-based (clustering) | APB-1 benchmark
Sun et al. (2014) | Data mining-based (association and clustering); Cost-based | TPC-H benchmark and workload from a video streaming company, called Conviva
Hanane and Kamel (2014) | Data mining-based (clustering) | APB-1 Release II benchmark
Toumi et al. (2015) | Data mining (clustering) and meta-heuristic based | APB-1 benchmark
Mateus et al. (2016) | Hierarchy-based | Spatial data warehouse benchmark (Spadawan)
Abdelaziz and Ouzzif (2017) | Cost-based | APB-1 benchmark
Kechar and Nait-Bahloul (2017) | Cost-based | APB-1 benchmark
Barkhordari and Niamanesh (2018) | Map-Reduce-based | TPC-DS benchmark
Boissier and Daniel (2018) | Cost-based | TPC-C and TPC-CH (CH-benCHmark) benchmarks, plus data and workload of a SAP ERP system of a Global 2000 company
Barr et al. (2018) | Metaheuristic-based | APB-1 benchmark
Nam et al. (2018) | Cost-based; Graph-based | TPC-DS benchmark, the Internet Movie Database (IMDB), and BioWarehouse
Ramdane et al. (2019a) | Data mining-based (association) | TPC-DS benchmark
Nam et al. (2019) | Cost-based; Graph-based | TPC-DS benchmark, the Internet Movie Database (IMDB), and BioWarehouse
Letrache et al. (2019) | Data mining-based (association) | TPC-DS benchmark
Kechar and Nait-Bahloul (2019) | Cost-based | SSB (Star Schema Benchmark)
Ramdane et al. (2019b) | Hash-based | TPC-DS benchmark using the Scala language on a cluster of homogeneous nodes, a Hadoop-YARN platform, a Spark engine, and Hive
Table 15.2 Comparative table of works on horizontal fragmentation (B)

Work | Classification | Validation
Ramdane et al. (2019c) | Hash-based | TPC-DS benchmark
Hilprecht et al. (2019a) | DRL-based (neural networks) | SSB, TPC-DS, and TPC-CH benchmarks
Hilprecht et al. (2019b) | DRL-based | SSB, TPC-DS, and TPC-CH benchmarks
Hilprecht et al. (2020) | DRL-based (neural networks) | SSB, TPC-DS, and TPC-CH benchmarks
Parchas et al. (2020) | Hash-based | Real1 and Real2 join graphs extracted at random from real-life users of Redshift, with various sizes and densities, and the TPC-DS benchmark
As we can see in Tables 15.1 and 15.2, several data mining-based approaches have been proposed. Nevertheless, most apply clustering (Amina and Boukhalfa 2013; Hanane and Kamel 2014; Sun et al. 2014; Toumi et al. 2015) or association (Sun et al. 2014; Letrache et al. 2019; Ramdane et al. 2019a) tasks, specifically, partitioning clustering (k-means) (Amina and Boukhalfa 2013; Hanane and Kamel 2014), agglomerative hierarchical clustering (Ward method) (Sun et al. 2014; Toumi et al. 2015), and frequent itemset mining techniques (Apriori algorithm) (Letrache et al. 2019). References Hilprecht et al. (2019a) and Hilprecht et al. (2020) used neural networks implemented in Keras to learn the partitioning advisor. In this chapter, we propose a derived horizontal fragmentation method that utilizes decision trees: it divides the previously selected dimension table considering the selectivity of the predicates and the frequency of the OLAP queries, and it partitions the fact table according to the horizontal fragmentation scheme of the dimension table obtained by the decision tree.
15.4 Method

In this section, the process followed for the analysis of the decision tree algorithms is established; after that, each of the algorithms available in the version of Weka used is described.
15.4.1 Collection and Preparation of Data

To carry out the study of decision tree algorithms and select the best one to fragment the DW, we used SSB (Star Schema Benchmark) and PostgreSQL. We constructed eight data sets: the first four considering 24 queries and from two to five fragments, and the second four with 50 queries, also from two to five fragments. We adapted the algorithm presented by Rodríguez-Mazahua et al. (2014) to build the data sets. The resulting data set for 24 queries and two fragments is shown in Fig. 15.1.

Fig. 15.1 Data set with 24 queries and 2 fragments

Each data set has the OLAP queries as rows and, as columns, the attributes of the dimension table selected for primary horizontal fragmentation and the frequency of the queries. For each attribute of the dimension table there are two variables in the data set: a categorical one that holds the attribute value if the attribute is used by the predicate, or NOT USED otherwise, and a numerical one that takes the value 1 if the attribute is used in the query and 0 otherwise. The class label attribute is fragment, which indicates to which fragment the OLAP query corresponds.

Figures 15.2 and 15.3 present the algorithms for building the data sets. Algorithm 1 takes as input the Predicate Usage Matrix (PUM) of the fact table and the maximal number of fragments that the fragmentation scheme can have (W). The PUM is composed of a set of queries, a set of predicates, the selectivity of each predicate, and the frequency of the queries. The output of Algorithm 1 is the W-1 data sets. A Partition Tree (PT) is built; in the first step all the predicates are located in the same fragment, and a Partitioning Profit Matrix (PPM) is constructed, which measures the partitioning profit of a pair of predicates. The partitioning profit is calculated as the increased number of remote tuples (IRT) accessed plus the decreased number of irrelevant tuples (DIT) accessed.

Fig. 15.2 Algorithm 1. Generation of data sets

Fig. 15.3 Algorithm 2. Get PPM
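As a rough illustration of the data set layout described above, the sketch below assembles one row per OLAP query, with the two variables per dimension attribute (the categorical predicate value or NOT USED, plus a 0/1 usage flag), the query frequency, and the fragment class label. The attribute names, predicates, and fragment assignments are hypothetical; the sketch only mirrors the structure of Fig. 15.1 and does not reproduce Algorithms 1 and 2.

```python
import pandas as pd

# Hypothetical workload: for each OLAP query, the predicate values it uses on
# the selected dimension table's attributes, and its frequency.
queries = [
    {"id": "Q1", "predicates": {"d_year": "1997", "d_region": "AMERICA"}, "frequency": 15},
    {"id": "Q2", "predicates": {"d_year": "1998"}, "frequency": 5},
    {"id": "Q3", "predicates": {"d_region": "EUROPE"}, "frequency": 10},
]
attributes = ["d_year", "d_region"]

# Fragment label per query, as would be produced by the partition-tree step
# (fixed by hand here for illustration).
fragment_of = {"Q1": 1, "Q2": 1, "Q3": 2}

rows = []
for q in queries:
    row = {"query": q["id"], "frequency": q["frequency"], "fragment": fragment_of[q["id"]]}
    for att in attributes:
        used = att in q["predicates"]
        row[att] = q["predicates"][att] if used else "NOT USED"  # categorical variable
        row[att + "_used"] = 1 if used else 0                    # numerical usage flag
    rows.append(row)

dataset = pd.DataFrame(rows)
print(dataset)
```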
15.4.2 Application of Decision Tree Algorithms

The seven decision tree algorithms offered by version 3.9.4 of Weka were applied to the eight data sets. A description of these algorithms is presented below.
• Hoeffding Tree: It is an incremental, anytime decision tree induction algorithm that is capable of learning from massive data streams, assuming that the distribution generating the examples does not change over time. Hoeffding trees exploit the fact that a small sample can often be enough to choose an optimal splitting attribute. This idea is supported mathematically by the Hoeffding bound, which quantifies the number of observations needed to estimate some statistics within a prescribed precision (Hulten et al. 2001).
• Logistic Model Tree: Classifier for building logistic model trees (LMT), which are classification trees with logistic regression functions at the leaves. The algorithm can deal with binary and multi-class target variables, numeric and nominal attributes, and missing values (Landwehr et al. 2005).
• J48: It is the Weka implementation of the C4.5 decision tree, one of the most broadly used approaches in real-world applications. In C4.5 the learned classifier is represented by a decision tree that can also be expressed as a set of if-then rules to improve human readability. The decision tree is simple to understand and interpret; besides, it can handle nominal and categorical data and performs well on large data sets in a short time. In C4.5 training, the decision tree is built in a top-down recursive way (Saeh et al. 2016).
• Decision Stump: It is a one-level decision tree that classifies instances by sorting them based on feature values. In a decision stump, each node represents a feature of an instance to be classified and each branch represents a value that the node can take. Instances are classified starting at the root node and sorted based on their feature values (Kotsiantis et al. 2005; Shi et al. 2018).
• Random Forest: This algorithm uses bootstrap methods to create an ensemble of trees, one for each bootstrap sample. Additionally, the variables eligible to be used in splitting are randomly varied in order to decorrelate the trees. Once the forest of trees is created, they vote to determine the predicted value for the input data (Dean 2014).
• Random Tree: It constructs a tree that considers a given number of random features at each node (Witten et al. 2016).
• REPTree: It builds a decision or regression tree using information gain/variance reduction and prunes it using reduced-error pruning. Optimized for speed, it only sorts values for numeric attributes once. It deals with missing values by splitting instances into pieces, as C4.5 does. The minimum proportion of training-set variance for a split and the number of folds for pruning can be set (Witten et al. 2016).

We applied the Class Conditional Probabilities supervised filter to transform the input data sets and evaluate the behavior of the decision tree algorithms with this pre-processing (Witten et al. 2016). This filter converts the values of nominal and/or numeric attributes into class conditional probabilities: if there are k classes, then k new attributes are created for each of the original ones, giving Pr(att value | class k). It can be useful for converting nominal attributes with many distinct values into something more manageable for learning schemes that cannot handle nominal attributes (as opposed to creating binary indicator attributes). For nominal attributes, the user can specify the number of values above which an attribute will be converted by this method. Normal distributions are assumed for numeric attributes (Documentation—Weka Wiki).

The attributes of the data sets were selected using two search methods:
Best First: performs greedy hill climbing with backtracking; it can be specified how many consecutive non-improving nodes must be encountered before the system backtracks. It can search forward from the empty set of attributes, backward from the full set, or start at an intermediate point (specified by a list of attribute indexes) and search in both directions by considering all possible single-attribute additions and deletions. Subsets that have been evaluated are cached for efficiency; the cache size is a parameter.
Greedy Stepwise: searches greedily through the space of attribute subsets. Like Best First, it may progress forward from the empty set or backward from the full set. Unlike Best First, it does not backtrack but terminates as soon as adding or deleting the best remaining attribute decreases the evaluation metric. In an alternative mode, it ranks attributes by traversing the space from empty to full (or vice versa) and recording the order in which attributes are selected.
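The effect of this filter on nominal attributes can be approximated outside Weka as in the following sketch, which estimates Pr(att value | class k) from relative frequencies and adds one new column per class; this is only an illustration of the idea, not the exact implementation of the Weka filter, and the example attribute names are hypothetical.

```python
import pandas as pd

def class_conditional_probabilities(df, nominal_atts, class_att):
    """For each nominal attribute, add one new column per class holding the
    estimated probability of the observed value given that class."""
    out = df.copy()
    classes = df[class_att].unique()
    for att in nominal_atts:
        # Pr(att = value | class k), estimated from relative frequencies.
        cond = pd.crosstab(df[att], df[class_att], normalize="columns")
        for k in classes:
            out[f"{att}_p_class{k}"] = df[att].map(cond[k])
    return out

# Tiny example with one nominal attribute and a binary fragment label.
data = pd.DataFrame({
    "d_region": ["AMERICA", "EUROPE", "AMERICA", "ASIA"],
    "fragment": [1, 1, 2, 2],
})
print(class_conditional_probabilities(data, ["d_region"], "fragment"))
```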
15.5 Results

After analyzing the different decision tree algorithms, the following results were found for the ROC Area, Precision, Recall, and F-measure metrics. Figures 15.4, 15.5, 15.6 and 15.7 show that, considering the Recall, Precision, ROC Area, and F-Measure metrics, respectively, for the 24 queries data sets, the J48 algorithm was better for three, four, and five fragments; only for two fragments was it overcome by Random Forest.

Fig. 15.4 Results of the Recall metric for 24 queries data sets

Fig. 15.5 Results of the Precision metric for 24 queries data sets

Fig. 15.6 Results of the ROC Area metric for 24 queries data sets

Fig. 15.7 Results of the F-Measure metric for 24 queries data sets

Regarding the data sets of 50 queries, the results of the application of the decision tree algorithms presented in Table 15.3 show that for two fragments the best algorithm was REPTree, because it had a better behavior for three metrics. In contrast, Table 15.4 shows that for three fragments the best algorithm was Random Forest, since it presented a better performance than the others. In Table 15.5, the results for four fragments are shown; J48 was the best for the majority of the metrics. Finally, Table 15.6 shows that the best decision tree algorithm for five fragments was Random Forest.
Table 15.3 Results of decision tree algorithms with 50 queries for two fragments

Algorithm | Precision | Recall | ROC area | F-Measure
Decision stump | 0.875 | 0.843 | 0.682 | 0.832
Hoeffding tree | 0.875 | 0.843 | 0.885 | 0.832
J48 | 0.857 | 0.843 | 0.924 | 0.836
LMT | 0.963 | 0.961 | 0.910 | 0.960
Random forest | 0.963 | 0.961 | 0.998 | 0.960
Random tree | 0.864 | 0.863 | 0.929 | 0.860
REPTree | 0.964 | 0.961 | 0.994 | 0.961
Table 15.4 Results of decision tree algorithms with 50 queries for three fragments

Algorithm | Precision | Recall | ROC area | F-Measure
Decision stump | 0.561 | 0.686 | 0.691 | 0.617
Hoeffding tree | – | 0.745 | 0.830 | –
J48 | 0.681 | 0.725 | 0.679 | 0.693
LMT | 0.722 | 0.725 | 0.907 | 0.723
Random forest | 0.770 | 0.784 | 0.934 | 0.767
Random tree | 0.654 | 0.647 | 0.782 | 0.627
REPTree | 0.459 | 0.608 | 0.510 | 0.521
Once the analysis of the decision tree algorithms for 24 and 50 queries was concluded, it was determined that the two best algorithms were Random Forest and J48.
Table 15.5 Results of decision tree algorithms with 50 queries for four fragments

Algorithm | Precision | Recall | ROC area | F-Measure
Decision stump | – | 0.431 | 0.610 | –
Hoeffding tree | 0.500 | 0.353 | 0.645 | 0.353
J48 | 0.709 | 0.706 | 0.825 | 0.707
LMT | 0.572 | 0.588 | 0.830 | 0.579
Random forest | 0.690 | 0.686 | 0.886 | 0.678
Random tree | 0.501 | 0.490 | 0.715 | 0.487
REPTree | 0.548 | 0.490 | 0.701 | 0.487
Table 15.6 Results of decision tree algorithms with 50 queries for five fragments

Algorithm | Precision | Recall | ROC area | F-Measure
Decision stump | – | 0.294 | 0.591 | –
Hoeffding tree | 0.464 | 0.314 | 0.654 | 0.304
J48 | 0.671 | 0.647 | 0.834 | 0.642
LMT | 0.657 | 0.667 | 0.895 | 0.661
Random forest | 0.749 | 0.745 | 0.927 | 0.743
Random tree | 0.610 | 0.569 | 0.779 | 0.566
REPTree | 0.613 | 0.569 | 0.770 | 0.582
In this case, it was decided to select J48, since it is more efficient in building the model: the computational complexity of the J48 algorithm for a given training set D is O(n × |D| × log(|D|)), where n is the number of attributes describing the tuples in D and |D| is the number of training tuples in D (Han et al. 2012). In contrast, the time complexity for building a forest of M randomized trees is O(M × K × Ñ × log²(Ñ)), where K is the number of variables randomly drawn at each node and Ñ = 0.632|D| (Louppe 2015). Figure 15.8 represents a decision tree created by J48 for the 50 queries data set and four fragments.
Fig. 15.8 Decision tree created by J48
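For a rough sense of why J48 was preferred, the following back-of-the-envelope comparison plugs hypothetical values (n = 14 attributes, |D| = 50 queries, M = 100 trees, K = 3 variables per node) into both bounds; the numbers are illustrative only and do not come from the chapter's experiments.

```latex
\begin{align*}
\text{J48:} \quad & n\,|D|\log_2|D| = 14 \times 50 \times \log_2 50 \approx 3.9 \times 10^{3}\\
\text{Random Forest:} \quad & M\,K\,\tilde{N}\log_2^2\tilde{N}
  = 100 \times 3 \times 32 \times (\log_2 32)^2 \approx 2.4 \times 10^{5},
  \qquad \tilde{N} = 0.632 \times 50 \approx 32
\end{align*}
```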
After the analysis of the different decision tree algorithms with the data sets of 24 and 50 queries, new experiments were carried out using two attribute selection methods also included in version 3.9.4 of Weka: Best First (S-BF) and Greedy Stepwise (S-GSW). In addition, the data sets were pre-processed with the supervised filter Class Conditional Probabilities, and the analysis of the same data sets (24 and 50 queries) with this filter but without attribute selection (WS-CCP) was included, in order to know the behavior of the decision tree algorithms using the attribute selection methods and the Class Conditional Probabilities filter for each data set. The results of this analysis are shown in Figs. 15.9, 15.10, 15.11, 15.12, 15.13, 15.14, 15.15 and 15.16; both data sets, 24 and 50 queries, were considered in this evaluation. Figure 15.9 shows that for the 24 queries data set with two fragments, the best decision tree algorithm was RandomTree for three metrics, while J48 and DecisionStump presented the worst performance for this data set. In Fig. 15.10, LMT, J48, and RandomTree were the best decision tree algorithms for the four evaluated metrics in the data set with three fragments; in contrast, HoeffdingTree and REPTree presented a bad performance for these metrics.
Fig. 15.9 Results of decision tree algorithms for 24 queries and 2 fragments
Fig. 15.10 Results of decision tree algorithms for 24 queries and 3 fragments
Fig. 15.11 Results of decision tree algorithms for 24 queries and 4 fragments
Figure 15.11 shows that for the 24 queries data set with four fragments, the best decision tree algorithm was LMT for the four evaluated metrics, and the worst evaluated metric was F-measure for the three methods. Finally, Fig. 15.12 shows the results for the 24 queries data set with five fragments. The best decision tree algorithms were LMT and RandomForest, while J48 obtained the best result only for WS-CCP (Without Selection, Class Conditional Probabilities). Overall, for the 24 queries data sets the best evaluated algorithm was RandomTree, because it obtained the best result for both the GreedyStepwise and the Best First methods, while LMT and J48 only had the best performance for one evaluated method.

Fig. 15.12 Results of decision tree algorithms for 24 queries and 5 fragments

Fig. 15.13 Results of decision tree algorithms for 50 queries and 2 fragments

Fig. 15.14 Results of decision tree algorithms for 50 queries and 3 fragments
Fig. 15.15 Results of decision tree algorithms for 50 queries and 4 fragments
On the other hand, the data set of 50 queries underwent the same pre-processing, and the same methods mentioned previously were applied. In the case of two fragments, as can be seen in Fig. 15.13, the results indicate that LMT and RandomForest were the best algorithms for this data set; REPTree also had a good behavior for the metrics, although only for the WS-CCP method. Figure 15.14 shows that for the data set with three fragments, the best decision tree algorithms were LMT, RandomTree, and RandomForest for most of the evaluated metrics, while J48 presented a good performance but only for the S-BF method. With respect to four fragments, the results shown in Fig. 15.15 indicate that the best decision tree algorithm was LMT; moreover, J48 also presented a good behavior for the three methods analyzed. Finally, the results for five fragments, depicted in Fig. 15.16, determined that LMT was the best decision tree algorithm for most of the methods used; although RandomTree obtained a good behavior, it was overcome by LMT for all the evaluated metrics. Once the analysis was concluded, we can determine that for the 24 queries data sets the best algorithm was RandomTree, since it won for two methods, Best First and Greedy Stepwise, as can be seen in Table 15.7. On the other hand, for the data sets of 50 queries, the best decision tree algorithms were LMT and Random Forest, because they obtained the best performance using all the methods.
358
N. Rodríguez-Mazahua et al.
Fig. 15.16 Results of decision tree algorithms for 50 queries and 5 fragments
Table 15.7 Results of decision tree algorithms for 24 and 50 queries

Data set | Attribute selection method | Class conditional | Best algorithm
24 | No | No | J48
50 | No | No | Random forest
24 | No | Yes | LMT
50 | No | Yes | LMT
24 | Best first | Yes | Random tree
50 | Best first | Yes | LMT and Random forest
24 | Greedy stepwise | Yes | Random tree
50 | Greedy stepwise | Yes | Random forest, Random tree, and LMT
However, J48 was the best algorithm without applying any selection method and without the Class Conditional Probabilities filter; lastly, if only this filter is used, the winner is LMT. In this chapter, we also present the flow diagram of the decision tree-based horizontal fragmentation method, which can be seen in Fig. 15.17. It takes as input the PUM (Predicate Usage Matrix) and W, which corresponds to the number of fragments allowed. The decision tree algorithm is applied to the W-1 data sets obtained by Algorithms 1 and 2, presented in Figs. 15.2 and 15.3, respectively.
Fig. 15.17 Horizontal Fragmentation method diagram
The output of the method is the best horizontal fragmentation scheme (HFS) of the previously selected dimension table according to the workload provided. The best HFS is the one that obtains the highest values for the evaluation metrics (Precision, Recall, ROC Area, and F-measure). Since the values of the metrics decrease considerably when the number of fragments increases, we propose to create synthetic queries to maintain the number of instances per class of the first data set (i.e., the data set with two fragments). Finally, the fact table of the data warehouse is fragmented according to the dimension table HFS obtained by the proposed method.
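A minimal sketch of this selection step is shown below: given the evaluation metrics obtained for each candidate scheme, the one with the highest average metric value is kept. The metric values in the dictionary are hypothetical and only illustrate the decision rule, which the chapter states informally.

```python
# Hypothetical metrics (Precision, Recall, ROC Area, F-measure) per candidate
# horizontal fragmentation scheme, indexed by its number of fragments.
candidate_schemes = {
    2: {"precision": 0.96, "recall": 0.96, "roc_area": 0.92, "f_measure": 0.96},
    3: {"precision": 0.68, "recall": 0.73, "roc_area": 0.68, "f_measure": 0.69},
    4: {"precision": 0.71, "recall": 0.71, "roc_area": 0.83, "f_measure": 0.71},
    5: {"precision": 0.67, "recall": 0.65, "roc_area": 0.83, "f_measure": 0.64},
}

def best_scheme(schemes):
    """Return the number of fragments whose scheme maximizes the average metric value."""
    return max(schemes, key=lambda w: sum(schemes[w].values()) / len(schemes[w]))

print(best_scheme(candidate_schemes))  # -> 2 for these illustrative values
```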
15.6 Conclusions and Future Work

Data warehouses are applied in several areas and allow efficient data analysis. Fragmentation of a DW allows optimizing response times and execution costs of OLAP queries. In this work, we propose to take advantage of the potential of decision trees for classification and adapt them to the process of horizontal fragmentation of the DW. For that reason, this chapter described the process in which the analysis of different decision tree algorithms was carried out, in order to determine the best of them to be implemented in a horizontal fragmentation method for data warehouses. As a result of the analysis, both J48 and Random Forest were the best algorithms for decision tree induction without applying any filter or selection method, and J48 was the algorithm selected for the method implementation because it has a lower time complexity than Random Forest.

After the analysis of the different decision tree algorithms with the data sets of 24 and 50 queries, new experiments were carried out using the two attribute selection methods included in version 3.9.4 of Weka, Best First and Greedy Stepwise; the data sets were also pre-processed using the Class Conditional Probabilities filter, and the analysis of the same data sets (24 and 50 queries) with this filter was included, to know the behavior of the algorithms for each data set and each method used. Once the new analysis was concluded, we determined that for the 24 queries data sets the best algorithm was RandomTree, since it won for two methods, Best First and Greedy Stepwise. On the other hand, for the data sets of 50 queries, the best decision tree algorithms were LMT and RandomForest, because they obtained the best performance for all the methods tested. Finally, we conclude that without applying the Class Conditional Probabilities filter and an attribute selection method, the best decision tree algorithm was J48, but if only the filter is used, the winner is LMT.

Future work includes the implementation of the fragmentation method, which consists of determining the most frequent OLAP queries, analyzing the predicates used by the queries, and, based on this, building the decision tree from which the horizontal fragments will be generated. The method will be evaluated in a Tourist Data Warehouse that integrates data from official sources that regulate tourist activity in Mexico.
Acknowledgements The authors are very grateful to the Tecnológico Nacional de México for supporting this work. Also, this research was sponsored by the National Council of Science and Technology (CONACYT).
References

Abdelaziz E, Ouzzif M (2017) Web service for incremental and automatic data warehouses fragmentation. Int J Adv Comput Sci Appl 8. https://doi.org/10.14569/ijacsa.2017.080661
Amina G, Boukhalfa K (2013) Very large workloads based approach to efficiently partition data warehouses. Studies in computational intelligence. Springer, Cham, pp 285–294
Barkhordari M, Niamanesh M (2018) Chabok: a Map-Reduce based method to solve data warehouse problems. J Big Data 5:40. https://doi.org/10.1186/s40537-018-0144-5
Barr M, Boukhalfa K, Bouibede K (2018) Bi-objective optimization method for horizontal fragmentation problem in relational data warehouses as a linear programming problem. Appl Artif Intell 32:907–923. https://doi.org/10.1080/08839514.2018.1519096
Bellatreche L, Boukhalfa K, Richard P (2008) Data partitioning in data warehouses: hardness study, heuristics and ORACLE validation. Lecture notes in computer science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer, Berlin, Heidelberg, pp 87–96
Boissier M, Daniel K (2018) Workload-driven horizontal partitioning and pruning for large HTAP systems. In: Proceedings—IEEE 34th international conference on data engineering workshops, ICDEW 2018. Institute of Electrical and Electronics Engineers Inc., pp 116–121
Dean J (2014) Big data, data mining, and machine learning: value creation for business leaders and practitioners
Documentation—Weka Wiki (2020). https://waikato.github.io/weka-wiki/documentation/. Accessed 11 Dec 2020
Han J, Kamber M, Pei J (2012) Data mining: concepts and techniques. Elsevier Inc
Hanane A, Kamel B (2014) A data mining-based approach for data warehouse optimisation. 2émes Journées Int Chim Organométallique Catal Jicoc'2014
Hilprecht B, Binnig C, Roehm U (2019a) Learning a partitioning advisor with deep reinforcement learning. arXiv
Hilprecht B, Binnig C, Röhm U (2019b) Towards learning a partitioning advisor with deep reinforcement learning. In: Proceedings of the ACM SIGMOD international conference on management of data. Association for Computing Machinery, New York, NY, USA, pp 1–4
Hilprecht B, Binnig C, Röhm U (2020) Learning a partitioning advisor for cloud databases. In: Proceedings of the ACM SIGMOD international conference on management of data. Association for Computing Machinery, New York, NY, USA, pp 143–157
Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. Association for Computing Machinery (ACM), New York, NY, USA, pp 97–106
Kechar M, Nait-Bahloul S (2017) Performance optimisation of the decision-support queries by the horizontal fragmentation of the data warehouse. Int J Bus Inf Syst 26:506–537. https://doi.org/10.1504/IJBIS.2017.087750
Kechar M, Nait-Bahloul S (2019) Bringing together physical design and fast querying of large data warehouses: a new data partitioning strategy. ACM international conference proceeding series. Association for Computing Machinery, New York, NY, USA, pp 1–8
Kimball R, Ross M (2008) The data warehouse lifecycle toolkit, 2nd edn. Wiley Publishing
Kimball R, Ross M (2013) The data warehouse toolkit: the definitive guide to dimensional modeling, 3rd edn. John Wiley & Sons, Inc
Kotsiantis SB, Tsekouras GE, Pintelas PE (2005) Local bagging of decision stumps. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). Springer Verlag, pp 406–411
Landwehr N, Hall M, Frank E (2005) Logistic model trees. Mach Learn 59:161–205. https://doi.org/10.1007/s10994-005-0466-3
Letrache K, El Beggar O, Ramdani M (2019) OLAP cube partitioning based on association rules method. Appl Intell 49:420–434. https://doi.org/10.1007/s10489-018-1275-2
Louppe G (2015) Understanding random forests: from theory to practice. University of Liège
Mahboubi H, Darmont J (2008) Data mining-based fragmentation of XML data warehouses. In: DOLAP: proceedings of the ACM international workshop on data warehousing and OLAP. ACM Press, New York, NY, USA, pp 9–16
Mateus RC, Siqueira TLL, Times VC et al (2016) Spatial data warehouses and spatial OLAP come towards the cloud: design and performance. Distrib Parallel Databases 34:425–461. https://doi.org/10.1007/s10619-015-7176-z
Nam YM, Han D, Kim MS (2019) A parallel query processing system based on graph-based database partitioning. Inf Sci (Ny) 480:237–260. https://doi.org/10.1016/j.ins.2018.12.031
Nam YM, Kim MS, Han D (2018) A graph-based database partitioning method for parallel OLAP query processing. In: Proceedings—IEEE 34th international conference on data engineering, ICDE 2018. Institute of Electrical and Electronics Engineers Inc., pp 1037–1048
Ozsu MT, Valduriez P (2020) Principles of distributed database systems, 4th edn. Springer Nature Switzerland AG
Parchas P, Naamad Y, Van Bouwel P et al (2020) Fast and effective distribution-key recommendation for Amazon Redshift. Proc VLDB Endow 13:2411–2423. https://doi.org/10.14778/3407790.3407834
Ramdane Y, Boussaid O, Kabachi N, Bentayeb F (2019a) Partitioning and bucketing techniques to speed up query processing in Spark-SQL. In: Proceedings of the international conference on parallel and distributed systems—ICPADS. IEEE Computer Society, pp 142–151
Ramdane Y, Kabachi N, Boussaid O, Bentayeb F (2019b) SDWP: a new data placement strategy for distributed big data warehouses in Hadoop. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). Springer, pp 189–205
Ramdane Y, Kabachi N, Boussaid O, Bentayeb F (2019c) SkipSJoin: a new physical design for distributed big data warehouses in Hadoop. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). Springer, pp 255–263
Rodríguez-Mazahua L, Alor-Hernández G, Abud-Figueroa MA, Peláez-Camarena SG (2014) Horizontal partitioning of multimedia databases using hierarchical agglomerative clustering. In: Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics). Springer Verlag, pp 296–309
Saeh IS, Mustafa MW, Mohammed YS, Almaktar M (2016) Static security classification and evaluation classifier design in electric power grid with presence of PV power plants using C-4.5. Renew Sustain Energy Rev 56:283–290
Shi L, Duan Q, Dong P et al (2018) Signal prediction based on boosting and decision stump. Int J Comput Sci Eng 16:117–122. https://doi.org/10.1504/IJCSE.2018.090450
Sun L, Franklin MJ, Krishnan S, Xin RS (2014) Fine-grained partitioning for aggressive data skipping. In: Proceedings of the ACM SIGMOD international conference on management of data. Association for Computing Machinery, New York, NY, USA, pp 1115–1126
Toumi L, Moussaoui A, Ugur A (2015) EMeD-Part: an efficient methodology for horizontal partitioning in data warehouses. ACM international conference proceeding series. Association for Computing Machinery, New York, NY, USA, pp 1–7
Ward JH (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58:236–244. https://doi.org/10.1080/01621459.1963.10500845
Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Elsevier Inc
Chapter 16
Data Analytics in Financial Portfolio Recovery Management Jonathan Steven Herrera Román, John W. Branch, and Martin Darío Arango-Serna
Abstract Financial inclusion is a social need that is gaining more and more strength in developing countries. Microcredit is an effective way to enable financial inclusion, but it represents a challenge in portfolio management. This work applies data analytics and machine learning techniques to predict the behavior of the loan default in a non-financial entity. Decision trees have shown the best prediction performance to determine whether a loan will be paid or become irrecoverable after running five predictive models. Keywords Data analytics · Financial portfolio · P2P lending
16.1 Introduction

Colombia's financial system covers a constantly rising share of the adult population: while in 2014 the financial inclusion indicator was 73.9%, by the end of 2019 it had already reached 81.4%. This leaves the challenge of financially including 6.3 million Colombian adults (Banca de las Oportunidades and Superintendencia Financiera de Colombia 2018). Seeking to be an alternative to banks, there are non-financial entities that aim to be a link between users with no credit history and the banking system, trusting people and granting financial inclusion. These entities specialize in so-called microcredits, or P2P lending, whereas banks have products such as personal loans and credit cards.

P2P lending coverage only increased from 3 to 3.3 million adults between 2014 and 2017, a 10% increase; however, for 2018 it fell to 3.1 million adults (Banca de las Oportunidades and Superintendencia Financiera de Colombia 2017; Banca de las Oportunidades and Superintendencia Financiera de Colombia 2018). Stimulating these microcredits as a form of financial inclusion becomes essential for clients with no credit history, but it exposes the organizations that grant them to increasing risk. Loan default, in general, has grown (Superintendencia Financiera de Colombia 2019). However, non-financial entities, the majority of which offer P2P lending, face a higher level of risk than the banking system, among other reasons because they cover a large number of people without stable income, such as students, housewives, and independent informal workers. Although the increase in the loan default of these entities can be seen as proportional to the increase in P2P lending itself (Fig. 16.1), the percentage increase in past-due loans is greater for microcredits, operated by non-financial entities, than for personal loans controlled by banks (Fig. 16.2). This represents a high risk, as it threatens to intensify the requirements faced by people who want a microcredit, with the consequent financial exclusion, contrary to what is desirable.

Even though loan default management is better in banks than in non-financial entities, the latter cannot benefit from the experience of banks due to several factors: the knowledge of banks is rarely shared with entities outside the financial system, their default management requires entire departments with many employees dedicated to highly specialized tasks, and, finally, banks are exposed to less risk by denying loans than non-financial entities are. In non-financial entities, the default management area may be limited to a group of people who contact customers over the phone. In other cases, they may have automatic systems that send emails or SMS to customers, but without metrics that allow knowing whether these channels are effective or not.
Fig. 16.1 Increase in P2P lending loans since January 2016 (Superintendencia Financiera de Colombia 2019)
Fig. 16.2 Personal loans default versus P2P lending default (Superintendencia Financiera de Colombia 2019)
If we add that the internal loan default management process is totally reactive, without any analysis of the data, we see that a first approach to the default problem of non-financial entities consists in allowing them to understand their own data and carry out analyses on them. For several years, information and communication technologies have been used to assist default management, through techniques such as predictive dialing, automated management, automatic default segmentation, and reporting systems (Deloitte 2012). In 2014, the use of machine learning in loan default management was presented as a corporate service, predicting those credits that are more likely to be paid in order to concentrate default management on them. This technique exceeds 85% effectiveness in prediction and is now used by major call centers and default management entities (CNN Periodismo Digital 2016). Other corporate alternatives in the market allow a debtor to pay his debt without the intervention of a collection entity. Such is the case of the eResolve platform launched by Experian in 2017; this service is available 24 h a day and without the need for a call or other more aggressive tactics (Experian 2017). The use of artificial intelligence techniques for loan default management has cemented the need to handle large volumes of data, so data analytics has also taken its part in the process. Some companies have specialized in the use of analytics to offer financial risk management solutions and credit and loan default management optimization (Infórmese 2018). This proposal seeks to apply data analytics in a non-financial entity, specifically in its default management process, to find the relationships between its data that allow it to carry out effective and efficient default management.
16.2 Literature Review

The following are artificial intelligence investigations applied to microcredits, to the prediction of their behavior, and to the use of techniques that take advantage of both structured and unstructured data to define who will be a good or bad client, together with other works related to microcredit, credit risk, and portfolio management.

An approach based on neural networks shows that this technique can be used to make better decisions about who can access a microcredit. Compared with regression models, neural networks reflect the behavior more precisely, allowing credit risk to be reduced (Byanjankar et al. 2015). From a statistical point of view, Bayesian models with non-linear regressions have been implemented, making use of both structured and unstructured data, with the objective of predicting the behavior of microcredit and identifying the variables behind the movements of the microcredit market (Bitvai and Cohn 2015). The authors find that P2P markets are predictable and, through feature selection, they identified the variables that explain moving secondary market rates.

An interesting investigation was carried out by Ruyi Ge et al., who used data from social networks to predict the behavior of delinquent microcredit clients. The authors used two datasets, one with the microcredit information and the other with data from social networks. Interestingly, they achieved a reduction in the past-due portfolio and, in addition, customers were more willing to pay again after being contacted through social networks (Ge et al. 2017). A later work by Byanjankar to predict microcredit credit risk allowed more than just classifying clients as good or bad: using survival analysis, he was able to predict the probability of credit survival in specific periods. However, only one dataset was used for the analysis, which does not allow generalizing the result for microcredits (Byanjankar 2018).

Returning to the usual classification of good and bad clients, we have the work of Archana Gahlaut, this time using a data mining model for the classification. The prediction was generated with a dataset oriented to banks and allows the decision to accept or deny a credit to be made early, in order to continue with other applicants; it should be noted that the most representative variables found by the author for credit analysis were the age, duration, and amount of the credit (Gahlaut et al. 2017). Bearing in mind that the methods mentioned so far use so-called 'hard', verifiable data and leave out 'soft' data such as unstructured text, Cuiqing Jiang shows us a work where precisely these soft data are demonstrated to carry as much or more information than the data normally used by prediction models. The author uses the LDA (Latent Dirichlet Allocation) method to extract important characteristics from the descriptive texts of the credits, then designs a two-stage feature selection method to generate effective variables for modeling. Finally, he uses a credit portfolio prediction model that relies on four classification methods, all with real microcredit data from China's bankrupt platforms. The result shows that a traditional model that also includes soft data can improve its performance based on the same analysis (Jiang et al. 2018).
In a different paper, focused on portfolio prevention in bank mortgage loans, the authors present a three-step selection model that includes a probabilistic framework for the portfolio. The result shows better performance than ordinary least squares regression models, and the novel model has an impact on the efficiency of the use of the banks' capital (Do et al. 2018). Finally, in the latest work by Kim and Cho, the authors describe a model that combines the label propagation algorithm with a transductive support vector machine (TSVM) and the Dempster–Shafer theory to fit it to the prediction of microcredit with unlabeled data. The authors compare the results of their method with several of the previous works mentioned, obtaining up to 10% more accuracy in the prediction (Kim and Cho 2019).

The following are works applied to Colombian entities. The objective of Velásquez (2013) is to categorize a client at a qualitative risk level. The applied method uses a set of indicators obtained from variables of the client, its credits, and the market; each of these indicators is given a numerical scale according to the values in which it can vary. The variables were related through a linear function in which each indicator has a weight, and the weighted sum is the output variable defined as the customer's risk. Although the predictive model used does not apply data analytics techniques, it conforms to the entity for which it was designed. Ultimately, the advantage of a model is precisely its practical utility and its potential for decision-making support.

Another important work, which also seeks to predict the behavior of clients' portfolios, is that of Daza Sandoval (2015). In this case, the author used a systematic process that included expert judgment for the selection of the explanatory variables regarding a client's portfolio. Interestingly, each expert had a different relevance in their contribution to the process, and the experts did not belong to the same organizational hierarchy. The questionnaires answered by said experts had weightings per variable to be taken into account in the study; the final weight of each variable was defined according to the relevance of the expert and the weightings of the variables. This methodology reduced the dimensionality of the problem from 66 variables to only 19. Subsequently, the author used decision trees as a predictive model for the data and obtained a small but efficient tree with a depth of 8 and 6 leaves, with a precision greater than 92%. The contribution of the work is not only its results, but also the generation of policies with a direct impact on the management of the portfolio of the banking entity under study.

Also using machine learning techniques, but this time neural networks and Bayesian classification, we have the work of León Sánchez (2015), which assesses the liquidity risk of the Colombian collective portfolio. The analysis is important because it mainly uses the country's macroeconomic variables and considers their behavior in liquidity risk. On the other hand, in the health sector portfolio, the article by Castaño and Ayala (2016) also sought to predict portfolio behavior using machine learning techniques. The author initially used logistic regressions, but the level of prediction did not appear to be acceptable, while the use of decision trees proved to be the appropriate model for the analysis. Unlike León Sánchez (2015), the data set that the author used lacked macroeconomic information, and the use of exogenous market information to adjust the predictive capacity of the model remains only a resulting recommendation.

Considering the work done on portfolio prediction and credit risk in the Colombian market and other related work abroad, we find several common factors in the literature. The initial use of data sets with many variables is necessary: Zhu et al. (2019) had a dataset of more than 100 variables, while Zhou et al. (2019) used more than a thousand initial variables. However, starting with a large data set requires data preprocessing and the use of techniques to reduce dimensionality; correlation analysis is one of them (Zhu et al. 2019), expert judgment should not be discarded as a useful technique, especially if we have specific data for an entity (Daza Sandoval 2015), and the use of learning techniques can also be useful for this purpose (Zhou et al. 2019). The application of various techniques is essential to have a comparative analysis between them; the sensitivity, specificity, and accuracy of the models are important factors when deciding which one best predicts the behavior we expect.
16.3 Proposed Approach

In this chapter, five artificial intelligence techniques are applied to a data set created with information from a non-financial entity. Then, the techniques are compared with each other in order to choose the one with the best behavior in predicting those credits that will not be paid. The five techniques used are neural networks, decision trees, support vector machines, logistic regression, and K-nearest neighbors. The company under study is a non-financial entity, classified as a fintech, which offers its clients the possibility of buying products using a credit quota covered by the company, thus allowing financial inclusion. For several months, the information held by the non-financial entity was reviewed, the variables to be used were selected, and a transition matrix study was carried out. The binary condition of whether a credit is paid or not becomes the dependent variable of the dataset.
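A compact way to run such a comparison is sketched below using scikit-learn as a stand-in for the actual tooling, which the chapter does not name; the generated feature matrix X and the binary paid/unpaid label y are placeholders for the entity's prepared data set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the credit data set (imbalanced binary label).
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8], random_state=0)

models = {
    "neural network": MLPClassifier(max_iter=500, random_state=0),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
}

for name, model in models.items():
    pipeline = make_pipeline(StandardScaler(), model)
    auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:20s} mean ROC AUC = {auc:.3f}")
```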
16.3.1 Data Collection

Despite having a technological basis as a principle, the market changes faster than the rate at which some companies can respond, and fintech companies are no exception. The analyzed entity did not have an area performing data analytics; in fact, its information has grown day by day without it being possible to evaluate whether any of the collected information is redundant, not useful for the company, or ineffectual. The work of obtaining the data was carried out over several months and was developed in three phases: defining the data, creating the transition matrix, and creating the dataset.
16.3.1.1 Defining the Data

Once the business activity of the company under study was understood, it was jointly defined that each record of the related data would focus on a credit and not on a person, because many clients have few credits, or even only one, which makes it unfeasible to rely on previous behavior on which to study them. The following explanation allows us to understand the nature of the data. A person can have several microcredits at the same time, and these can have a different number of payment installments, since they are chosen by the client at the time the credit is taken. The client's total quota limits the number of active credits that can be held at any time, but this depends entirely on the amounts of the credits, since a credit can consume any percentage of the client's quota.

Debt collection is also carried out per credit; that is, for each credit that falls into default, a debt collection focused on that credit and not on the person is made. For example, if a client has two credits, one in default and the other within a few days of becoming default, when contacted, only the collection of the credit in default will be made, and the client may be contacted days later to collect the credit that has just entered into default, ignoring the previous one for which he had already been contacted. This dynamic makes it very complex to identify effective collection strategies if we focus on the person and not on the credit.

Once it was clarified that the information would be analyzed by credit, it was necessary to define the analysis period, since a month, a quarter, a semester, or a whole year could be considered as a period. In the entity, the credits can be deferred to a term defined by the client. In general, credits can be agreed to be paid in terms between 1 and 18 months, but more than 80% of the credits are agreed to 4 or fewer months. Considering this, it was agreed to define a month as the period unit.
16.3.1.2 Creating the Transition Matrix
Defining when a credit is considered to be in the irrecoverable portfolio is crucial for the analysis. Although category E, defined by (Superintendencia Financiera de Colombia 2001) as an irrecoverable credit or contract, does exist, it applies to defaults of 6 months or more, which is too permissive for the microcredits under study. Therefore, it is necessary to find alternative credit risk definitions in which the unrecoverable portfolio can be determined for earlier periods.
Transition matrices are useful for this purpose and can help define entity-specific parameters for the unrecoverable portfolio. Transition matrices were defined in 1997 by Morgan as the probability that an obligation or credit will migrate from an acceptable state to an unrecoverable state in a defined period (Morgan and Morgan Grenfell 1997). Usually a matrix is defined with state 'i' as the initial state of a credit and state 'j' as the final state; the intersection gives the number of credits that changed between these states. The probability that a credit in an acceptable state 'i' will change to an unrecoverable state 'k' is defined as the number of credits that went from state 'i' to state 'k', divided by the total number of credits that started in state 'i'. The point where the probability that a loan enters the unrecoverable portfolio from an acceptable state is greater than 50% is considered the default point, that is, the point from which credits become unrecoverable. In Colombia, the (Zapata, 2003) study concludes that the probability that a credit is unrecoverable depends on the state of the economic cycles; consequently, it is necessary to establish the transitions of those cycles and combine them with credit transitions in order to anticipate countercyclical losses. Since a monthly period was defined for the credit analysis and most credits have a maximum term of 4 months, 1 million credits were selected and analyzed during the periods between December 2018 and April 2019, taking December as period 0; this leaves another 4 periods in which a credit can be paid or end in an unrecoverable state. Regarding the status of the portfolio, for each period the credit health was defined according to Table 16.1. Since category '-1' indicates that a credit has already been paid, the credit should remain in that category for any future period; all credits were reviewed and those in which this condition was not fulfilled were removed from the analysis, since they imply an error in the database, either by direct manipulation or by software problems. In total, 751 credits were eliminated, leaving a total of 999,249 credits for the analysis.

Table 16.1 Health category for microcredits, according to default age

Category   Credit state
0          Non-default credits
1          Default less than 30 days
2          Default greater than or equal to 30 and less than 60 days
3          Default greater than or equal to 60 and less than 90 days
4          Default greater than or equal to 90 days
-1         Paid credits
Table 16.2 Transition matrix (values in %; columns: initial state, rows: final state)

Initial state:     0        1        2        3        4
Final state 0      0.96     18.06    6.30     6.79     3.85
Final state 1      0.06     6.58     0.12     0.01     0.00
Final state 2      0.20     3.50     0.19     0.04     0.00
Final state 3      0.43     4.19     1.10     0.05     0.00
Final state 4      3.83     7.83     59.39    74.49    92.36
Final state -1     94.53    59.83    32.89    18.62    3.78
For all the analyzed credits, the transition matrix in Table 16.2 was defined, taking into account the probability P(i, j) that a credit that begins in state 'i' will go to state 'j' within the range of analyzed periods. If we take P(i, 4) for each category, we obtain the probability that credits in that category will default, that is, become unrecoverable. Thus, for a credit that starts in the non-default state '0' it is unlikely to end in default (3.83%), but a credit that is already past due between 30 and 60 days (state '2') is far more likely to end in the irrecoverable state (59.39%). Identifying the category from which credits worsen in most cases is vital to define the dependent variable of the dataset. Therefore, for all the credits to be analyzed, a binary dependent variable was defined: '0' if in the next period (January 2019) the credit changes to state 2 or worse, and '1' otherwise.
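A transition matrix of this kind can be computed directly with pandas. The sketch below is only illustrative and assumes a hypothetical DataFrame with one row per credit and the columns state_p0 (category in period 0) and state_final (category at the end of the window), coded as in Table 16.1.

```python
import pandas as pd

# Hypothetical toy data; in practice this comes from the entity's database.
states = pd.DataFrame({
    "state_p0":    [0, 0, 2, 4, 1, 0],
    "state_final": [-1, 4, 4, 4, -1, -1],
})

counts = pd.crosstab(states["state_final"], states["state_p0"])
# P(i, j): probability that a credit starting in state i (column) ends in
# state j (row); each column is normalised so that it sums to 100%.
transition = counts.div(counts.sum(axis=0), axis=1) * 100
print(transition.round(2))
```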
16.3.1.3 Creating the Dataset
We have defined the data to use: 1 million credits, each with a unique identifier. Likewise, the dependent variable has been defined: a binary value that indicates '0' if a credit will default and '1' otherwise. Now, to define the independent variables, three categories were considered: customer data, credit data and calculated data.
Customer data: These variables include information associated with the customer who owns the microcredit; even though the analysis is carried out by credit, it is important to consider the person's own variables. Among the data obtained related to the client are occupation, gender, marital status, residence and workplace information, and score in credit bureaus. There are 21 variables related to the customer:
• Credit code
• Client code
• Birthday
• Marital status
• Profession
• Student
• Country Code
• State code
• City code
• Neighborhood Code
• Gender
• Residence address
• Dependents
• Occupation Code
• Date client creation
• Job address
• Job title code
• Salary code
• Date of employment
• Job City Code
• Living place code
Credit data: These are variables directly related to the credit obligation; the information obtained includes the date of creation of the credit, the value of the installment, the total amount of the credit, the balance owed, the value paid and the age of default. There are 31 variables related to the credit:
• Central Risk Score
• Point of sale
• Credit Creation Date
• Bill value
• Debt term
• Debt frequency
• Guarantee percentage
• Business name
• Quota value
• Quotas quantity
• Report period code
• Paid out
• Payment date
• Opening date
• Periodicity
• Rating
• Overdue by age
• Initial value
• Balance
• Monthly Quota
• Default value
• Total Value of Capital Quota
• Total value paid capital
• Total Debt Value
• Due Date
• Last Payment Date
• Pay Qty
• Paid quotas
• Overdue fees
• Income
• Days in Default
Calculated data: In addition to the data obtained directly from the databases, other fields that could be useful for the analysis were defined; these fields result from operating on several pre-existing fields, giving rise to information that does not exist as a field in any database of the study subject. Some of the calculated data take into account the number of credits active for the same client, the days elapsed from the creation of the credit until the first collection action, or the client's maximum default, considering all of his credits in the last year. There are 14 calculated variables:
• Create Date
• Months in Default
• Max Days in Default
• Average Days in Default
• Total balance
• Credit creation day
• Number of payments made
• Number of active credits
• Agent calls
• SMS Sent
• Email Sent
• Auto-agent calls
• Payment commitments
• First contact day
This dataset, before data preprocessing, had a million records and 66 fields.
16.3.2 Data Preprocessing

In the data preparation phase, several factors that can affect the overall result of the analysis must be considered, such as date variables, categorical variables and null data; variables with zero variance and normalization must also be addressed before creating the prediction model.
16.3.2.1 Date Type Variables

Table 16.3 Date variables conversion

Variable             Original                 Numeric equivalence
Birthday             1995-08-14 09:33:01.00   808,392,781
Date of employment   15/07/2013               1,373,846,400
Opening date         20150715                 1,436,918,400
The date type variables must be identified and converted to a single common format so that they can be considered in the normalization of the data. The dataset has 9 date variables and it is observed that some do not share a common format; some are not even detected as dates but as strings. The format 'YYYYmmdd-HHmmss' is defined for each variable. After unifying the format of each variable, and to allow its integration with the other numerical variables, the value of each date is replaced by its numeric equivalent, a control number that uniquely identifies each date and also turns the date variables into continuous variables, as can be seen in Table 16.3.
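A minimal sketch of this conversion with pandas is shown below; it assumes a hypothetical DataFrame whose date columns arrive in mixed formats (some as strings, some as integers such as 20150715), and uses seconds since 1970-01-01 as the numeric equivalent.

```python
import pandas as pd

df = pd.DataFrame({
    "birthday": ["1995-08-14 09:33:01"],
    "opening_date": [20150715],
})

date_columns = ["birthday", "opening_date"]
for col in date_columns:
    parsed = pd.to_datetime(df[col].astype(str))      # each column has one format
    # Replace the date by its numeric equivalent (epoch seconds), which turns
    # the column into a continuous variable.
    df[col] = parsed.astype("int64") // 10**9
print(df)
```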
16.3.2.2 Categorical Variables
There are other types of discrete variables that are also expressed as text; these variables must be analyzed case by case to determine what to do with each of them. For each variable, the number of distinct categories is evaluated in order to encode them. Gender: This variable only has two categories, Feminine and Masculine. Therefore, it is coded as 1 or 2: {"Feminine": 1, "Masculine": 2}. Rating: Has 5 categories, each expressed as a letter (A, B, C, D, E). It is coded as a number between 1 and 5: {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}. Credit creation day: Represents the day of the week in which the credit was created; it is coded as a number between 1 and 7: {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4, "Friday": 5, "Saturday": 6, "Sunday": 7}. Other categorical variables, like Residence address and Job address, have too many categories to be encoded; they are also free-text values, which gives rise to typographical errors and to the subjective interpretation of whoever enters the data, so these fields are eligible to be removed from the dataset.
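The encoding above can be expressed as simple mapping dictionaries. The sketch below is illustrative only and assumes hypothetical English column names for the fields mentioned in the text.

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Feminine", "Masculine"],
    "rating": ["A", "C"],
    "credit_creation_day": ["Monday", "Friday"],
    "residence_address": ["Calle 1 # 2-3", "Cra 4 # 5-6"],
    "job_address": ["Av 7 # 8-9", "Calle 10"],
})

gender_map = {"Feminine": 1, "Masculine": 2}
rating_map = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
day_map = {"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
           "Friday": 5, "Saturday": 6, "Sunday": 7}

df["gender"] = df["gender"].map(gender_map)
df["rating"] = df["rating"].map(rating_map)
df["credit_creation_day"] = df["credit_creation_day"].map(day_map)

# Free-text fields with too many distinct values are dropped instead of encoded.
df = df.drop(columns=["residence_address", "job_address"])
print(df)
```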
16.3.2.3 Null Data
Fields with null data can be problematic when applying a prediction model. However, deleting the records could mean losing vital information for the analysis, so each field with null data in Table 16.4 must be analyzed individually. There are 15 variables with null data; the procedure for each case is as follows:
• Payment date: the field is deleted, since it has more than 30% of null data (82%).
• Last Payment Date: the field is deleted, since it has more than 30% of null data (45%).
• Job address: the field is deleted, as previously defined.
• Date of employment: nulls are replaced by a default date (1900-01-01), taken as invalid data.
• Job City Code: nulls are replaced with 0, meaning no data.
• Central Risk Score: nulls are replaced with 0, meaning no data.
• Occupation Code: nulls are replaced with 4, the majority value.
• Country Code: the field is deleted, since it is known that all customers are from country code 57.
• Dependents: nulls are replaced with 0, the majority value.
• Salary code: nulls are replaced with 2, the majority value.
• Job title code: nulls are replaced with 0, meaning no data.
• Residence address: the field is deleted, as previously defined.
• Gender: nulls are replaced with 1, the majority value.
• Student: nulls are replaced with 0, the predominant value of the field.
• Neighborhood Code: nulls are replaced with 0, meaning no data.

Table 16.4 Variables with null data

Variable             Null data (%)
Payment date         82.487348
Last payment date    44.601095
Job address          13.177496
Date of employment   12.582149
Job city code        11.815874
Central risk score   10.195957
Occupation code      8.913694
Country code         8.798107
Dependents           8.618673
Salary code          8.473464
Job title code       7.878617
Residence address    0.263198
Gender               0.107781
Student              0.012910
Neighborhood code    0.000100
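These rules combine a null-share threshold with per-field defaults. The sketch below is a minimal, assumption-laden illustration with hypothetical column names; it is not the entity's actual code.

```python
import pandas as pd

df = pd.DataFrame({
    "payment_date": [None, None, "2019-01-10"],
    "date_of_employment": [None, "2013-07-15", "2010-02-01"],
    "occupation_code": [None, 4, 2],
    "dependents": [1, None, None],
})

# Fields with more than 30% of null data are dropped.
null_share = df.isna().mean()
df = df.drop(columns=null_share[null_share > 0.30].index)

# Remaining nulls are replaced by a default or majority value per field.
defaults = {"date_of_employment": "1900-01-01", "occupation_code": 4, "dependents": 0}
df = df.fillna(value=defaults)
print(df)
```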
16.3.2.4 Data Variance

Table 16.5 Variables variance (ascending order)

Variable                       Variance
Guarantee percentage           0.00E+00
Report period code             0.00E+00
Total value of capital quota   0.00E+00
Total debt value               0.00E+00
Create date                    0.00E+00
Student                        1.27E+04
…
Quotas quantity                2.66E+06
Pay qty                        2.66E+06
…
Birthday                       6.45E+22
Date of employment             1.47E+24
Although the variance of a field can itself be analyzed to determine how varied or repetitive its data are, in this case only the fields whose variance is 0 are considered, because this means the data do not vary for any of the 999,249 records, making the field reasonable to delete. Table 16.5 shows the variance of the variables in ascending order. We observe that 5 fields can be eliminated, since all records have the same value for them, so they are not useful for the dataset analysis. Although we also see fields with very large variance, no action is taken with them, as they could be useful for the prediction models.
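A minimal sketch of the zero-variance filter is shown below; it assumes a hypothetical numeric DataFrame, since constant fields carry no information for the classifier.

```python
import pandas as pd

df = pd.DataFrame({
    "guarantee_percentage": [0.0, 0.0, 0.0],
    "balance": [120.0, 80.5, 300.0],
})

variances = df.var(numeric_only=True)
zero_variance_cols = variances[variances == 0].index
df = df.drop(columns=zero_variance_cols)
print(list(df.columns))  # ['balance']
```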
16.3.2.5 Normalization
As the final phase of data preprocessing, the data are normalized in order to avoid comparing data with different scales. For each field, normalization subtracts the mean of the field and divides by its standard deviation:

X_norm = (X − μ) / σ
After preprocessing, the dataset has 999,249 records and 56 fields. It is then divided into training, testing and validation subsets, with 70%, 10% and 20% of the data, respectively. The techniques used for prediction are logistic regression, decision trees, the K-neighbors method, neural networks, and support vector machines. Each technique is
trained with the training data and its performance is evaluated against the test and validation data.
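The sketch below illustrates this modelling step end to end with scikit-learn; it is only an assumed, minimal reconstruction (synthetic data, default hyperparameters), not the study's actual code. The target follows the text: '1' means the credit will be paid, '0' means it will not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                       # stand-in for the 56 fields
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > -1.5).astype(int)

# 70% training, 10% testing, 20% validation.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.70, random_state=1)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, train_size=1/3, random_state=1)

scaler = StandardScaler().fit(X_train)                # z-score: (X - mean) / std
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)

models = {
    "neural network": MLPClassifier(max_iter=500),
    "decision tree": DecisionTreeClassifier(),
    "support vector machine": LinearSVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "k neighbors": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_val, model.predict(X_val)).ravel()
    sensitivity = tp / (tp + fn)   # credits that will be paid, correctly predicted
    specificity = tn / (tn + fp)   # credits that will not be paid, correctly predicted
    print(f"{name}: sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```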
16.4 Results

For each technique, the sensitivity (the rate at which credits that will be paid are correctly identified) and the specificity (the rate at which credits that will not be paid are correctly identified) are averaged, and an overall accuracy is obtained. This information is summarized in Table 16.6. Considering the overall specificity, the neural networks and the decision trees show the best performance. When comparing both techniques, decision trees were chosen for several reasons: first, their parameterization allows finding the best trade-off between precision and complexity; second, a decision tree is easy to interpret through its graphic representation; third, the level of importance of each input variable in the prediction can be obtained, which not only serves to reduce the size of the dataset by eliminating variables of low importance, but is also of great value for the non-financial entity, which learns which variables have the greatest influence on whether a microcredit is paid. The decision tree trained as one of the five techniques has a depth of 76 and 17,595 leaves, a highly complex tree that is impossible to show graphically. Greater depth of a decision tree is not necessarily reflected in greater accuracy. To find the best relation between accuracy and complexity, the performance of decision trees with varied depths was evaluated on the same training data; the results in Fig. 16.3 show that the best result is obtained with a depth of 12. The performance of this optimized decision tree was evaluated on the test and validation data; the results, shown in Table 16.7, reveal an improvement in the prediction of credits that will not be paid. The specificity rises to 98%, when before it did not even reach 70% with any algorithm. This comes at the cost of more false negatives, which is acceptable to the entity of the exercise. Furthermore, the sensitivity only dropped by 5 percentage points, while the specificity improved by more than 30 percentage points.

Table 16.6 Evaluation metrics comparison of the five techniques

Rank   Classifier                 Accuracy (%)   Specificity (%)   Sensitivity (%)
1      Neural networks            96.8           65.2              98.7
2      Decision trees             96.3           64.8              98.2
3      Support vector machines    95.9           37.6              99.3
4      Logistic regression        95.8           42.3              99.0
5      K neighbors                95.8           44.1              98.8
Fig. 16.3 Accuracy versus depth on decision trees
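The accuracy-versus-depth search behind Fig. 16.3 can be reproduced with a simple loop over max_depth. The sketch below uses synthetic data and is only an assumed illustration of the procedure, not the chapter's original experiment.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)

best_depth, best_accuracy = None, 0.0
for depth in range(2, 40):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    accuracy = tree.score(X_test, y_test)
    if accuracy > best_accuracy:
        best_depth, best_accuracy = depth, accuracy
print(best_depth, round(best_accuracy, 3))
```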
Table 16.7 Optimized decision tree performance

Accuracy: 95.6%

               Test (%)   Validation (%)   Overall (%)
Specificity    98.1       98.0             98.0
Sensitivity    93.2       93.2             93.2
The decision tree of depth 12 has 423 leaves; although this is less than 3% of the size of the unoptimized tree, it is still too large to be shown graphically, so a sub-tree is shown in Fig. 16.4. The graphical representation allows a detailed analysis of the way in which the decision tree predicts the behavior of a specific microcredit, which is important for the entity in order to understand the internal process of the technique used.
Fig. 16.4 Sub-tree from the optimized decision tree
Table 16.8 Variables and their importance in prediction

Variable                   Importance (%)
SMS sent                   36.46
Credit creation date       17.03
Days in default            12.03
Balance                    11.67
Total value paid capital   6.67
Due date                   3.49
Paid quotas                3.19
Email sent                 2.16
First contact day          2.12
Number of payments made    1.64
Finally, the level of importance that each variable has in the prediction is one of the most valuable results of this work. Table 16.8 shows the ten most important variables. This result is very explanatory for the entity under analysis, since it helps to understand which data are useful to decide whether a microcredit will enter the unrecoverable portfolio, and it allows the entity to focus its efforts on the most important variables, since some of them are directly related to activities of the entity.
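With scikit-learn, an importance ranking of this kind can be read from the fitted tree's feature_importances_ attribute. The sketch below uses synthetic data and hypothetical feature names purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
feature_names = ["sms_sent", "credit_creation_date", "days_in_default", "balance"]
X = rng.normal(size=(500, len(feature_names)))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=12, random_state=0).fit(X, y)
ranking = sorted(zip(feature_names, tree.feature_importances_),
                 key=lambda item: item[1], reverse=True)
for name, importance in ranking[:10]:
    print(f"{name}: {100 * importance:.2f}%")
```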
16.5 Conclusion and Future Work

In this chapter, five classification algorithms (neural networks, decision trees, support vector machines, logistic regression, and K neighbors) were run on a dataset of credit behavior data. Neural networks and decision trees showed the best relation between specificity and sensitivity. Decision trees were the technique chosen to be refined, due to their ability to explain the importance of the input variables in the classification decision. The result shows an accuracy of 95.6% with a specificity of 98% and a sensitivity of 93.2%. In future analyses of similar data in the same or another entity, it is important to obtain data for the various time cycles in which each entity works. In this way, there would be transition matrices for the seasonal periods of the entity, for normal periods and for off-peak periods. Carrying out the analysis for these different periods would allow understanding the behavior of the portfolio in each period, and the model adjusted to each of them could be used in the corresponding case. Even though the portfolio is typically evaluated in 30-day periods, it would be very helpful to test with certain clients or credits and take actions on this portfolio at shorter periods, such as every 10 days. Once this is done and having data
on the behavior of these credits, the entire analysis of the present work could be repeated, and early default ages at which collection is effective could surely be observed, perhaps identifying the most accurate range of days at which a credit becomes irrecoverable, so that the entity's debt collection work can act before the credit reaches those days of default. A more drastic reduction in the dimensionality of the data could also be carried out, seeking to reduce the number of variables, since the literature shows that predictive portfolio models, and specifically decision trees, work adequately with about 20 variables. This reduction could be done using statistical techniques and expert judgment. The use of different techniques guarantees a more consistent and accurate predictive model. In addition to the techniques already tested in the present study, others could be included to verify whether the precision of the model can be improved; specifically for decision trees, a random forest could be used, which, by definition, would be more precise than any single decision tree of the set that conforms it. The generation of a database of updated macroeconomic variables is also recommended, to serve as input along with the variables defined in this study and to ensure coverage of the behavior of a microcredit from both the customer's perspective and the market's.
References

Banca de las Oportunidades, Superintendencia Financiera de Colombia (2018) Reporte de Inclusión Financiera 2018. https://bancadelasoportunidades.gov.co/sites/default/files/2019-06/RIFFINAL.pdf
Banca de las Oportunidades, Superintendencia Financiera de Colombia (2017) Reporte de Inclusión Financiera 2017. https://bancadelasoportunidades.gov.co/sites/default/files/2018-07/RIF2017LIBROFINAL_WEB02_1.pdf
Bitvai Z, Cohn T (2015) Predicting peer-to-peer loan rates using Bayesian non-linear regression. Proc Natl Conf Artif Intell 3:2203–2209
Byanjankar A (2018) Predicting credit risk in peer-to-peer lending with survival analysis. 2017 IEEE Symp Ser Comput Intell SSCI 2017 - Proc 2018-Janua:1–8. https://doi.org/10.1109/SSCI.2017.8280927
Byanjankar A, Heikkila M, Mezei J (2015) Predicting credit risk in peer-to-peer lending: a neural network approach. Proc - 2015 IEEE Symp Ser Comput Intell SSCI 2015, 719–725. https://doi.org/10.1109/SSCI.2015.109
Castaño L, Ayala C (2016) Diagnóstico de la Gestión de Cartera en una Empresa Proveedora del Sector Salud en Colombia
CNN Periodismo Digital (2016) Ricard Bonastre: Cómo generar más ingresos por campaña con algoritmos predictivos. https://www.callcenternews.com.ar/entrevistas/320-cgmi. Accessed 3 Jun 2019
Daza Sandoval LC (2015) Estrategias basadas en el modelo de análisis predictivo árbol de decisión para la mejora del proceso de recaudo de cartera de la línea de vehículo particular del banco Davivienda S.A. Pontificia Universidad Javeriana
Deloitte (2012) Tendencias de cobranza y recuperación de cartera en el sector financiero a partir de la crisis. https://www2.deloitte.com/content/dam/Deloitte/pa/Documents/financial-services/2015-01-Pa-FinancialServices-CobranzaCartera.pdf. Accessed 3 Jun 2019
Do HX, Rösch D, Scheule H (2018) Predicting loss severities for residential mortgage loans: a three-step selection approach. Eur J Oper Res 270:246–259. https://doi.org/10.1016/j.ejor.2018.02.057
Experian (2017) Move debt collection practices into the digital age: eResolve. https://www.experian.com/consumer-information/virtual-debt-resolution-negotiation-eResolve.html
Gahlaut A, Tushar, Singh PK (2017) Prediction analysis of risky credit using data mining classification models. 8th International Conference on Computing, Communication and Networking Technologies ICCCNT 2017. https://doi.org/10.1109/ICCCNT.2017.8203982
Ge R, Feng J, Gu B, Zhang P (2017) Predicting and deterring default with social media information in peer-to-peer lending. J Manag Inf Syst 34:401–424. https://doi.org/10.1080/07421222.2017.1334472
Infórmese (2018) Gestión Analítica de Crédito y Cobranza. https://www.informese.co/gestion-analitica-credito-cobranza/. Accessed 3 Jun 2019
Jiang C, Wang Z, Wang R, Ding Y (2018) Loan default prediction by combining soft information extracted from descriptive text in online peer-to-peer lending. Ann Oper Res 266:511–529. https://doi.org/10.1007/s10479-017-2668-z
Kim A, Cho SB (2019) An ensemble semi-supervised learning method for predicting defaults in social lending. Eng Appl Artif Intell 81:193–199. https://doi.org/10.1016/j.engappai.2019.02.014
León Sánchez DP (2015) Modelo predictivo para riesgo de liquidez de una entidad fiduciaria usando minería de datos. Universidad Nacional de Colombia
Morgan J, Morgan Grenfell D (1997) Introduction to CreditMetrics. New York
Superintendencia Financiera de Colombia (2019) Evolución cartera de créditos. https://www.superfinanciera.gov.co/inicio/evolucion-cartera-de-creditos-60950. Accessed 3 Jun 2019
Superintendencia Financiera de Colombia (2001) Cartera de Crédito. https://www.superfinanciera.gov.co/publicacion/18575. Accessed 15 Mar 2020
Velásquez AB (2013) Diseño de un modelo predictivo de seguimiento de riesgo de crédito para la cartera comercial, para una entidad financiera del Valle de Aburrá
Zhou J, Li W, Wang J et al (2019) Default prediction in P2P lending from high-dimensional data based on machine learning. Phys A Stat Mech Its Appl 534:122370. https://doi.org/10.1016/j.physa.2019.122370
Zhu L, Qiu D, Ergu D et al (2019) A study on predicting loan default based on the random forest algorithm. Procedia Comput Sci 162:503–513. https://doi.org/10.1016/j.procs.2019.12.017
Chapter 17
Task Thesaurus as a Tool for Modeling of User Information Needs J. Rogushina and A. Gladun
Abstract We consider the task thesaurus as an element of the user model that reflects dynamic aspects of the user's current work. Such a thesaurus is based on a domain ontology and contains the subset of its concepts that deal with the user task. A task thesaurus can be generated automatically by analysis of the task description or with the help of semantic similarity estimations of ontological concepts. The task thesaurus represents a personalized user view of the domain and depends on his/her abilities, experience and aims. We describe some examples of task thesaurus usage in intelligent applications for adaptation to user needs. Keywords Domain ontology · Task thesaurus · Semantic similarity
17.1 Introduction

Many intelligent information systems (IISs) collect data about their users to model various aspects of their behavior. A lot of researchers integrate user modeling with ontological representation of knowledge. Many user-modeling approaches are based on content-based characteristics of users (users' knowledge, interests, etc.) and on user beliefs about the world that define their behavior. Knowledge, beliefs, and background are the main dimensions of user modeling. Sosnovsky and Dicheva (2010) analyze the application of ontological technologies to various aspects of user modeling and profiling. Various IISs model user tasks from different perspectives and typically distinguish between long-term and short-term user information preferences. The term
“user profile” traditionally denotes various knowledge structures with information about users and usually means a compound knowledge structure comprising static information about users, such as demography, background, cognitive style, etc. It differs from the term “user model”, which represents more dynamic aspects of the user (conceptual knowledge, interests, preferences, etc.), but many researchers use both terms for a complex structure with static and dynamic parts. User profiles contain information about the goals, needs, plans and tasks of the user; various demographic characteristics such as age, gender, native language, geographic location, qualification and education level (the relevant characteristics depend on the specifics of the IIS that works with the profile); environment and access devices, etc. Such profiling is widely used in retrieval, e-learning and recommender systems to provide a personified representation of user interests as a subset of domain knowledge built from atomic components (categories, topics, learning outcomes). Adaptive IISs collect diverse information about users and maintain a complex representation of user profiles with multiple aspects. But even the most typical characteristics of users can be modeled by user profiles with various terms and categories. Commonly a user works with multiple IISs that often cannot access each other's user model collections, and therefore each of them has to extract user interests on its own. The problem of translation between different user representations of independent IISs is a great challenge caused by different approaches to user modeling and different representations of the user profile structure. The task of interfacing between different user models involves the resolution of model discrepancies. Integration of different models of user profiles is based on ontologies: a centralized reference high-level ontology or matching (manual or automated) of independent ontologies. Representation of the user's knowledge, interests and needs can be based on a domain ontology. User models based on ontologies can support a “personal ontology view” (POV), an ontological representation of individual beliefs about the domain conceptualization (Kalfoglou and Schorelmmer 2003). Two basic variations of POVs are implemented in applied IISs: (1) a subontology of the domain ontology; (2) a unique conceptualization whose structure deviates from the underlying domain ontology. Now a lot of adaptive IISs use open user models that provide various means of control such as content view, visualization and modification. Users can analyze the model of their knowledge according to their understanding of domain concepts. In this work we consider domain ontologies as the main instrument for formalization of static long-term user interests, while ontology-based task thesauri, which can be considered a special case of POV, reflect the dynamic part of the user model: his/her current informational needs and requests. The development of a POV as a sub-graph of the domain ontology can be considered an extended version of the model containing only a subset of domain concepts. Generation of such a POV can be based on semantic similarity between ontological concepts selected by the user and analysis of current user needs.
In this work, we analyze problems related to the creation and usage of task thesauri, the dynamic part of the user profile that models current information needs pertinent to one of the user's tasks. We consider task thesauri as a user-oriented view on domain ontologies. In the first part of this work, we propose an ontology-based formal model of the task thesaurus that represents domain knowledge and analyze the expressiveness of this model. We analyze user profiles that represent the domain of user interests with the help of ontologies and various ontology-based structures such as thesauri. The second part deals with thesauri generation methods. We analyze the elements of the domain ontology formal model that can be used for the generation and normalization of a task thesaurus and propose various approaches that use properties of ontological concepts and taxonomic and non-taxonomic relations between them. Special attention is paid to mereological relations (as defined by Lesnevsky). Semantic similarity estimations between ontology concepts are used as an instrument for quantitative rating of the generated thesaurus. Estimation of the degree of semantic similarity/distance between concepts is a very common problem in research areas such as knowledge management, knowledge acquisition and semantic information retrieval, but in this research we use only the parameters that are processed in different approaches to semantic similarity estimation (information content-based measures, path and depth based measures, feature-based measures, etc.) and do not consider their calculation and normalization specifics. Our task consists in matching these parameters with properties of various problem-specific simplifications of domain ontologies (such as Wiki ontologies and competence ontologies). In the third part of this chapter we propose some examples of practical use of task thesauri in various intelligent applications. Task thesauri are utilized as an element of the user profile in semantic search, in competence analysis, in retrieval of pertinent learning courses, and in various recommender systems for personalized information processing. The ontological approach provides reuse of such a profile: knowledge about the user generated by one intelligent system can be exported to another one. In the conclusion we analyze the prospects of automated generation of ontology-based task thesauri, sources of domain ontologies and their processing for adaptation of intelligent applications to user needs.
17.2 Domain Ontologies and Task Thesauri as Main Semantic Elements of User Profile

Ontologies are widely used now in distributed intelligent applications to explicitly describe the knowledge system of a domain or information resource. In modern research in the field of distributed knowledge management, the term "ontology" is used for explicit conceptualization of some subject domain (Gruber 1993). Ontologies provide a common vocabulary for a particular field of activity and determine (with various formalization levels) the meaning of terms and the relationships between them. In the most general case, an ontology is an agreement on the shared
use of concepts that provides the means of domain knowledge representation and agreement about their understanding. A thesaurus is a special case of ontology, which represents concepts in a form suitable for machine and automated processing. It can be considered as a model of the logical-semantic structure of domain terminology. An ontology is a knowledge base that describes facts that are always assumed to be true within a particular community, based on the generally accepted meaning of the thesaurus.
17.2.1 Domain Ontologies and Their Formal Models

Ontology is a formalized description of the worldview in a particular sphere of interest. It consists of a set of terms and rules for the use of these terms, limiting their meaning within a particular domain. At present, the usefulness of domain ontologies is generally recognized. They are widely used for knowledge representation in intelligent Web applications. A domain ontology is a knowledge base of a special kind containing semantic information about this domain. It is a set of definitions, in some formal language, of a declarative knowledge fragment focused on joint repeated use by various users in their applications. The most general formal model of a domain ontology O is an ordered triple

O = <X, R, F>,  (17.1)

where X is a finite set of subject domain concepts represented in ontology O; R is a finite set of relations between concepts of the given subject domain; and F is a finite set of axioms and interpretation functions defined on the concepts and relations of ontology O. Relations represent the type of interaction between the domain concepts. Axioms are used to model statements that are always true. This model can be concretized depending on development aims. The set X of the terms of the ontology is completely determined by its domain, while the set R of the relations used in ontologies is more domain-independent. But the elements and the structure of domain ontologies are not defined in a standard way across applications. This is caused by the specifics of practical tasks that demand further concretization of different ontological aspects. For example, some formal models distinguish various subsets of domain-specific relations or relation properties. On the meaningful level, a domain ontology is a set of agreements (domain term definitions, their commentary, statements restricting the possible meanings of these terms, and also commentary on these statements). A domain ontology can be defined as follows: the part of domain knowledge that is not to be changed; the part of domain knowledge that restricts the meanings of domain terms; a set of agreements about the domain; an explicitly represented external approximation of a conceptualization given implicitly as a subset of the set of all the situations that can be represented (Uschold 1998).
All existing approaches to the definition of domain ontology can be grouped into three main categories by the kind of science used for ontological analysis (Kleshchev and Artemjeva 2001). The first one, the humanitarian approach, suggests definitions in terms intuitively understandable for humans, but these definitions cannot be used for solving technical problems. For example, ontology is defined as a consensus about a domain for certain purposes (Sánchez-Cervantes et al. 2016), i.e. ontology in this interpretation is explained as freely as possible. The second one, the computer approach, is based on formal languages (such as OWL (Antoniou and Van Harmelen 2004), DAML+OIL (DAML+OIL 2001)) for representation of the domain ontology and on applied software that processes knowledge represented in these languages. These definitions of domain ontology are used in the development of knowledge-based applications and provide creation of interoperable knowledge bases. The third one, the mathematical approach, defines domain ontologies in mathematical terms or by mathematical constructions (for example, by description logics (Calvanese et al. 2007)). These definitions provide inference of ontology characteristics and help in estimation of ontology analysis time and other theoretical questions, but they are not used directly in practical software projects. Usually the humanitarian approach is used to decide whether an ontological approach is necessary for some problem, then a mathematical model of the ontology is constructed (the definitions of the third approach help in selection of a language from the second approach that is pertinent to the solved problem), and at last the software realization is developed. Ontological commitments are agreements aimed at coordinated and consistent use of the common dictionary. The agents (human beings or software agents) that jointly use the dictionary do not need a common knowledge base: one agent can know something that the other ones do not know, and the agent that handles the ontology is not required to answer all the questions that can be formulated with the help of the common dictionary. For many tasks an ontology can be used as a controlled vocabulary expressed in a formal language of knowledge representation with unambiguous interpretation. This language has a grammar for using ontological concepts and relations as vocabulary terms to express something meaningful within the specified domain.
17.2.2 Task Thesaurus as a Personified View on the Domain Ontology

The term "thesaurus" was first used as the name of an encyclopedia back in the XIII century by B. Datiny. Now in information technologies (IT) a thesaurus means a complete systematized data set about some domain of knowledge that allows humans and software to orient themselves in it. A thesaurus is a collection of controlled vocabulary
terms. It uses associative relationships in addition to parent-child relationships. The expressiveness of the associative relationships in a thesaurus varies and can be as simple as "related to term", as in "term A is related to term B" (The differences between a vocabulary, a taxonomy, a thesaurus, an ontology 2003). Wikipedia defines a thesaurus as a synonym dictionary that characterizes the distinctions between similar words and can be used for finding synonyms and antonyms of words. Usually thesauri do not contain definitions of terms. Some thesauri group words (monolingual, bilingual or multilingual) in a hierarchical taxonomy of concepts, others are organized alphabetically or by sphere of science. Now a lot of thesauri are created for various spheres of human activity: the medical domain, mathematics, computer science, etc. A thesaurus can be created for a single information resource (IR), a natural language (NL) document or a set of documents. It can contain all words of the source or some subset of them (for example, nouns, words of a reference vocabulary or concepts of a domain ontology). Thesaurus terms can be extracted from text by means of linguistic analysis or manually. The three main international standards for the library and information field define the relations to be used between terms in monolingual thesauri (ISO 2788:1986), the additional relations for multilingual thesauri (ISO 5964:1985), and methods for examining documents, determining their subjects, and selecting index terms (ISO 5963:1985). ISO 2788 contains separate sections covering indexing terms, compound terms, basic relationships in a thesaurus, display of terms and their relationships, and management aspects of thesaurus construction. The general principles in ISO 2788 are considered language- and culture-independent. As a result, ISO 5964:1985 refers to ISO 2788 and uses it as a point of departure for dealing with the specific requirements that emerge when a single thesaurus attempts to express "conceptual equivalencies" among terms selected from more than one NL (Pastor et al. 2009). Formal models both of ontologies and of thesauri include as basic elements the terms and the connections between these terms. A collection of the domain terms with indication of the semantic relations between them is a domain thesaurus. The thesaurus can be considered as a special case of ontology, Th ⊆ O. The formal model of a thesaurus is based on the formal model of ontology (17.1):

Th = <Tth, Rth, I>,  (17.2)

where Tth ⊆ X is a finite set of terms; Rth ⊆ R is a finite set of relations between these terms; and I is additional information about terms (this information depends on the specifics of the thesaurus goals and can contain, for example, the weight of a term or its definition). A task thesaurus has a simpler structure because it does not include ontological relations (everything about relations that is important for the task is already used in the construction of Tth) and has additional information about every concept: its weight wi ∈ W, i = 1, …, n. Therefore, the formal model of a task thesaurus is defined as a set of ordered pairs Thtask = <{(ti ∈ Tth, wi ∈ W)}, ∅, I> with additional information in I about the source ontologies. The user's task is the practical problem that he/she solves with the help of a certain information system. The user has to formalize this task if he/she needs personified
processing of information. Examples of such tasks are the retrieval of documents or information objects of a certain type (organizations, vacancies, people, etc.), obtaining recommendations, or distance learning. The domain of the task is formally characterized by the domain ontology, and the task itself can be characterized formally by a task thesaurus or informally by its NL description, keywords or example documents. The task thesaurus can either be built by the user manually or generated automatically by analysis of available NL documents and other IRs. A task thesaurus is personalized and represents the user's POV through the selection of the subset of the domain ontology that is pertinent to the currently solved problem. Task thesauri of different users developed for the same task and based on the same domain ontology can differ significantly from each other. This is caused by the individual beliefs and preferences of users, their experience and competencies. Therefore, every task thesaurus can be considered as an element of the user model. It should be noted that construction of a task thesaurus is time-consuming, so it is advisable to perform this operation only in cases where the task deals with regular, repetitive and personified information needs of the user. For example, such tasks may include the search for new scientific literature or tools on a particular issue that interests the user more than other ones with similar information. It is impractical to build a task thesaurus for one-time queries in an area where the user is not an expert and therefore cannot take into account a sufficient amount of external knowledge. In this case, the effort to build the thesaurus will be greater than the obtained effect.
17.3 Methods of Thesaurus Development

Thesaurus development involves the import of domain information from existing ontologies to model the current information need of the user. A task thesaurus can be constructed as a combination of thesauri of natural language documents selected by the user or obtained from the relevant domain ontology. In practice, it is advisable to use a combination of both approaches, as well as to allow the user to adjust the constructed thesaurus manually. The thesaurus of a natural language document is built by linguistic analysis of its content and metadata using knowledge of the domain ontology, i.e. the task thesaurus does not include all available ontological terms, but only those related to the current task. This reduces the thesaurus volume and the time of its construction and analysis. The proposed approach differs from the existing ones by personification of domain modeling: it allows the user to clearly define the sphere of interests and provides automated construction of a simplified ontological model of this sphere as a subset of the domain. The user can adjust the boundaries of current needs according to the specifics of the individual problems. A task thesaurus is generated as a sum of IR thesauri. We can obtain the thesaurus of the user's domain of interests as a sum of the thesauri of all user tasks. Therefore, at first we have to define an algorithm of thesaurus generation for a single IR and then apply set-theoretic operations to these IR thesauri. If we use the ontological approach then
we generate thesauri that contain only concepts from one or more ontologies. Analysis of NL documents is aimed at detection of semantic links between text fragments and ontological concepts.
17.3.1 IR Thesaurus Generation Algorithm

Every IR is described by a non-empty set of textual documents connected with this IR: the text of its content, metadescriptions, results of indexing, etc. If an IR contains multimedia content then this content can be transformed into text (by speech and text recognition methods, etc.). The algorithm of IR thesaurus generation has the following steps (a minimal code sketch follows the list):
1. Formation of the initial non-empty set A of textual documents ai connected with this IR as input data for the algorithm: A = {ai}, i = 1, …, n. Each document ai from the set A has a coefficient of importance (for example, metadata of a video are more important than the recognized speech) that allows different weights of document elements in the IR thesaurus.
2. IR dictionary construction. For every ai the set of words D(ai) is constructed; D(ai) is a dictionary that contains all words occurring in the document. The dictionary of A is formed as the union of the D(ai): DIR = ∪i=1..n D(ai).
3. Generation of the IR thesaurus. Using the domain ontology, the IR thesaurus TIR is created as a projection of the set of ontological concepts X onto the set DIR, TIR ⊆ X. This step of processing is aimed at removing stop-words and terms from other domains that are not of interest for the user. The main problem deals with the semantic connection of NL fragments (words) from TIR with concepts from the set X of the domain ontology O. This problem can be solved by linguistic methods that use lexical knowledge bases for every NL and is beyond the scope of this chapter. Each word from the thesaurus has to be linked with one of the ontological terms; if such a relation is lacking, the word is considered a stop-word or a markup element (for example, an HTML tag) and is rejected: ∀p ∈ T(ai) ∃ Term(p, o) ∈ X.
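The sketch below is a minimal illustration of steps 1-3, assuming each IR is a list of text documents with an importance coefficient and that the ontology is given as a simple mapping from surface words to ontological terms (the linguistic linking of words to concepts is only mocked here).

```python
from collections import Counter

# Hypothetical word-to-concept mapping standing in for the domain ontology.
ontology_term = {"credit": "Credit", "loan": "Credit", "portfolio": "Portfolio"}

def ir_thesaurus(documents):
    """documents: list of (text, importance) pairs for one IR."""
    weights = Counter()
    for text, importance in documents:
        for word in text.lower().split():
            term = ontology_term.get(word)      # step 3: keep only ontology terms
            if term is not None:
                weights[term] += importance     # words of one semantic bunch share a term
    return dict(weights)

print(ir_thesaurus([("Credit portfolio report", 1.0), ("loan and credit data", 0.5)]))
```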
The group of IR thesaurus terms connected with one ontological term, named the semantic bunch Rj, j = 1, …, n, is considered as a single unit: ∀p ∈ TIR: p ∈ Rj, where Rj = {p ∈ DIR : Term(p) = xj ∈ X}. This allows integrated processing of the semantics of documents written in various languages and thus ensures multilinguistic analysis of IRs from the Web. If the user does not define a domain ontology O then we consider that the user's domain of interests has no restrictions and therefore we do not remove any elements from the IR dictionary: TIR = DIR. The user can generate a task thesaurus by processing the task definition and other IRs pertinent to the task (Fig. 17.1). These thesauri can be integrated by such set-theoretic operations as union, intersection and complement of sets.
Fig. 17.1 Generalized algorithm of task thesaurus generation
For example, the thesaurus of some domain can be formed as the union of the thesauri of the IRs pertinent to this domain. The weight of a term under the union operation is defined as the sum of its weights in every IR, multiplied by the importance SIR of that IR: ∀p ∈ T = ∪j=1..m TIRj: w(p) = Σj=1..m wIRj(p) · SIRj. If the user has to create a thesaurus for some subset of the domain, then the operations of set intersection and complement are used.
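A minimal sketch of this weighted union is shown below, assuming each thesaurus is a dict {term: weight} and each IR has an importance S_IR as in the formula above.

```python
from collections import defaultdict

def combine(thesauri):
    """thesauri: list of (thesaurus_dict, importance) pairs."""
    combined = defaultdict(float)
    for thesaurus, importance in thesauri:
        for term, weight in thesaurus.items():
            combined[term] += weight * importance
    return dict(combined)

print(combine([({"Credit": 2.0, "Portfolio": 1.0}, 1.0),
               ({"Credit": 1.0, "Risk": 3.0}, 0.5)]))
```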
17.3.2 Semantic Similarity Estimations as a Theoretic Basis of Ontology-Based Thesaurus Generation

Semantically similar concepts (SSC) are a subset of the domain concepts that can be joined by some relations or properties. If the domain is modeled by an ontology then the SSC are a subset of the domain ontology concepts. There are several ways to build the SSC, which can be used separately or together. The user can define the SSC directly (manually, by choosing from the set of ontology concepts) or automatically (by some mechanism of comparison of the ontology with a description of the user's current interests that uses linguistic or statistical properties of this description).
The SSC can also join concepts linked with the initial set of concepts by some subset of the ontological relations (directly or through other concepts of the ontology). Each SSC concept has a weight (positive or negative) which determines the degree of semantic similarity of the concept to the initial set of concepts. A lot of different approaches used now for quantifying the semantic distance between concepts are based on ontologies that contain these concepts and define their relations and properties. (Taieb et al. 2014) classifies methods of such semantic similarity measuring and their software realizations. Methods are grouped by the parameters used in the estimations and differ within the groups by the calculation of these parameters. Many measures take into account only the path length between concepts. The basic idea of such estimates is that the similarity of two concepts is a function of the length of the path that connects the concepts (by the taxonomic relation "is a") and of their positions in the taxonomy. The same approach can be applied to an arbitrary domain ontology where the path between concepts can consist of all ontological relations. For example, in (Rada et al. 1989) the ontology is considered as a directed graph where concepts are interconnected by universal and domain-specific relations, mainly taxonomic (is-a). The simplest way to estimate the SS between concepts is to calculate the minimum path length that connects the corresponding ontological nodes using the "is-a" relation. The longer the path between concepts, the greater the semantic distance between them. If we define a path path(c1, c2) = l1, …, lk as a set of links that connect the concepts c1 and c2, where |path(c1, c2)| = k is the length of this path, then by analysis of all possible paths between c1 and c2 we can define the semantic distance between them as the minimum value of this length:

SS_Rada(c1, c2) = min |path(c1, c2)|  (17.3)
Despite the simplicity of such an estimation, the assumption that different edges of the ontological graph reflect the same semantic distance does not always correspond to the domain and causes many problems. Other estimations are based on the analysis of the path between concepts and their depth in the hierarchy. For example, (Wu and Palmer 1994) define the SS estimation between the concepts as follows:

SS_WP(c1, c2) = 2H / (N1 + N2 + 2H)  (17.4)
where N1 and N2 are the numbers of "is a" relations from c1 and c2, respectively, to their lowest common generic object c, and H is the number of "is a" connections between c and the taxonomy root. Measures of similarity based on information content (Resnik 1995) define the similarity of two concepts as the information content of their lowest common generic object:

SS_Res(c1, c2) = IC(LCS(c1, c2))  (17.5)
The SS estimation parameters from various approaches (for example, from (17.3)-(17.5)) can be used for generation of a task thesaurus. We can consider such a thesaurus as the set of concepts whose semantic similarity to some initial set of concepts exceeds some threshold. In these estimations we can use different coefficients for universal and domain-specific relations R of the domain ontology O.
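A minimal sketch of a path-and-depth based estimation in the spirit of (17.4) is given below; it assumes the taxonomy is provided as a child-to-parent mapping, and the concept names are hypothetical.

```python
# Hypothetical 'is a' taxonomy: child -> parent (root has parent None).
parent = {"microcredit": "credit", "mortgage": "credit", "credit": "product",
          "product": None}

def ancestors(c):
    chain = [c]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain                                  # c, ..., root

def wu_palmer(c1, c2):
    a1, a2 = ancestors(c1), ancestors(c2)
    lcs = next(a for a in a1 if a in a2)          # lowest common generic object
    n1, n2 = a1.index(lcs), a2.index(lcs)         # 'is a' steps from c1, c2 to the LCS
    h = len(ancestors(lcs)) - 1                   # 'is a' steps from the LCS to the root
    return 2 * h / (n1 + n2 + 2 * h)

print(wu_palmer("microcredit", "mortgage"))       # LCS = 'credit' -> 0.5
```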
17.3.3 Ontological Relations as an Instrument of Task Thesaurus Generation

Another approach to the generation of a task thesaurus uses ontological knowledge only, and the user defines semi-manually what part of the ontology is of interest to him/her. The user selects an initial set of ontological terms T0 and then defines ontological relations (or types of relations) for expansion of this set by concepts linked with T0 by the selected relations. Users can form this initial set manually on the basis of the domain ontology (for example, with the help of Protégé and its visualization plugins (Protege Plugin Library 2019)), but this is not efficient. Application of specialized software tools can speed up and simplify the generation of initial task thesauri. For example, we use OntoSearch, which was developed for extraction of ontological elements into a text file (Fig. 17.2).
the relation of belonging (set theory ZF and NF), the relation between function, its argument and result (fon Neumann set theory), the naming relation (ontology of Lesnevsky), the relation “a part of” (mereology).
Taxonomy is a system of classification. A typical example of taxonomy is the hierarchical list (qualifier). The classes of ontology are usually organized in the taxonomy. The taxonomy in which the object can meet more than in one branch, refers to as “poly-hierarchical”. From the ontological point of view, taxonomy is the ontological organization based on the partial ordered relation called « x is A » by means of which the objects are grouped together or more high level is referred
396
J. Rogushina and A. Gladun
Fig. 17.2 Use of OntoSearch for selection of ontological classes to initial task thesaurus
to classes. Properties of taxonomic relations are asymmetry, transitivity and antireflexiveness. The most widely initial set of concepts is expanded with the help of taxonomic relations “is a subclass” by subclasses and superclasses of selected entities (Fig. 17.3a). User has to define the depth of expansion. This depth can be different for subclasses and superclasses. More complex way to use this relation is to find iteratively subclasses of selected superclasses with restriction of path length (Fig. 17.3b) or number of elements (Fig. 17.3c). User can expand task thesaurus by all hierarchical relations (Fig. 17.4a) or select single-relation (Fig. 17.4b) and multi-relation (Fig. 17.4c) chains. For example, user can select all concepts linked by single relation “is a parent” or by the set of relations “is older”, “has more competencies” and “has more experience”. The mereological relations “part—whole” are extremely important because they form a concept basis of scientific knowledge system. Mereology is a formal theory about parts and concepts connected to them. This term was used by the Polish philosopher Stanisław (2020) who had analyzed philosophical, logic and mathematical
Fig. 17.3 Expansion of the thesaurus on the basis of taxonomic relations
Fig. 17.4 Expansion of the thesaurus on the basis of various types of hierarchical relations
In contrast to Russell, who strictly distinguished proper names from general names, Leśniewski considered all names (general, individual and empty) as belonging to the same semantic category. It is necessary to notice that the Leśniewski systems were originally stated in such an original form that researchers still have no standard formulation of the triad of systems (protothetics, ontology and mereology). It is more correct to speak not about three systems of Leśniewski but about three sections of one system, named "the basis of mathematics", that consists of the appropriate formal theories. The main axioms of mereology are: capacity—systems connecting the same objects in the same ways are equal; a system having even one part is a part of itself; there can be more than one connection between two objects, but the presence of an empty connection denies the possibility of any other connection. Other axioms of mereology describe relations between a system and its elements. Mereology exceeds the bounds of the study of the partial relations between elements of general systems. It is also concerned with those objects whose parts are relevant to the whole. Such objects are identified as instances. Among the mereological relations it is possible to distinguish seven different classes, and in general transitivity is not accepted among instances of different classes:
1. Component—object: page—book;
2. Member—collection: tree—wood;
3. Part/weight: piece—bread;
4. Material—object: aluminum—airplane;
5. Property—activity: to see—to read;
6. Stage/process: boiling—preparation of tea;
7. Place/area: Ukraine—Europe.
When generating a task thesaurus, the user can select from the domain-specific ontological relations all mereological relations or the mereological relations of one group. For example, the domain-specific relations "fabricated", "consisting of" and "containing" belong to the group "Material—object". They can be used for thesaurus generation as described above for Fig. 17.3c. If the user identifies some ontological relation r ∈ R as mereological (part—whole), then it is profitable to define its type and group it with the other relations of this group. Such grouping needs domain semantics and cannot be automated. The user can define the weights of mereological relations from the various groups and then use them for thesauri generation (Gladun and Rogushina 2010).
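The expansion procedure described in this subsection can be summarized by a small sketch. The fragment below is only an illustration under simplifying assumptions: the domain ontology is represented as a plain list of (concept, relation, concept) triples already extracted from OWL (for example, with a tool such as OntoSearch), and the function and relation names are hypothetical rather than part of any cited software.

from collections import deque

def expand_thesaurus(triples, t0, relations, max_depth=2):
    """Expand the initial term set T0 along the selected relations up to max_depth."""
    # index the ontology by concept for fast neighbour lookup
    neighbours = {}
    for s, r, o in triples:
        if r in relations:
            neighbours.setdefault(s, set()).add(o)
            neighbours.setdefault(o, set()).add(s)   # follow relations in both directions
    thesaurus = set(t0)
    frontier = deque((term, 0) for term in t0)
    while frontier:
        term, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for linked in neighbours.get(term, ()):      # concepts linked by the chosen relations
            if linked not in thesaurus:
                thesaurus.add(linked)
                frontier.append((linked, depth + 1))
    return thesaurus

# Example: expand by the taxonomic relation only, two levels deep
onto = [("Dog", "is_a", "Mammal"), ("Mammal", "is_a", "Animal"), ("Dog", "part_of", "Pack")]
print(expand_thesaurus(onto, {"Dog"}, {"is_a"}, max_depth=2))  # {'Dog', 'Mammal', 'Animal'}

Restricting the relation set (for example, to one mereological group) or the depth parameter reproduces the different expansion strategies of Figs. 17.3 and 17.4.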
17.4 Practical Use of Task Thesauri in Various Intelligent Applications 17.4.1 Search Results Filtering by Task Thesaurus Task thesauri are very useful for filtering the results obtained from various information retrieval systems. Such filtering is based on a comparison of the user task thesaurus with the thesauri of the obtained information resources (IRs). The normalized IR thesauri T_IR and the task thesaurus T_task are subsets of the set X of domain ontology terms chosen by the user: T_IR ⊆ X, T_task ⊆ X. If an IR description contains more words linked with the domain terms of interest to the user (which is reflected in the normalized domain thesaurus), then it is possible to suppose that this IR satisfies the informational needs of the user with a higher probability than other IRs relevant to the same formal query. Thus, it is necessary to find the IR q that satisfies the condition f(T_q, T_task) = max f(T_IR, T_task), where the function f is defined as the number of elements in the intersection of the sets T_IR and T_task: f(A, B) = |A ∩ B|. If the various terms of the normalized thesauri have different importance for the user, it is possible to use appropriate weight coefficients w_j that take their importance into account. In that case the criterion function is f(A, B) = Σ_{j=1}^{z} y(t_j), where the function y is determined for all terms of the domain ontology as y(t_j) = w_j if t_j ∈ A ∧ t_j ∈ B, and y(t_j) = 0 if t_j ∉ A ∨ t_j ∉ B. The user can form a set of thesauri for typical tasks (for example, the search for a scientific article with an algorithm, or for information about the publication rules of a journal) and then add them to requests with current conditions. An example of such personalized search is proposed in Rogushina (2018), which describes MAIPS—a semantic retrieval system that uses domain ontologies and task thesauri for user profiling (Fig. 17.5).
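A minimal sketch of this filtering criterion is given below; it assumes that each retrieved information resource already has a normalized thesaurus represented as a set of ontology terms, and the identifiers and weights are illustrative only.

def overlap_score(ir_thesaurus, task_thesaurus, weights=None):
    """f(A, B): sum of weights of terms present in both sets (|A ∩ B| when unweighted)."""
    common = set(ir_thesaurus) & set(task_thesaurus)
    if weights is None:
        return len(common)
    return sum(weights.get(t, 1.0) for t in common)

def best_resource(resources, task_thesaurus, weights=None):
    """resources: dict mapping an IR identifier to its thesaurus; returns the best-scoring IR."""
    return max(resources, key=lambda ir: overlap_score(resources[ir], task_thesaurus, weights))

# Example: two retrieved resources compared against a task thesaurus
resources = {"ir1": {"ontology", "thesaurus", "search"}, "ir2": {"ontology", "wiki"}}
print(best_resource(resources, {"thesaurus", "search", "user"}))  # ir1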
17.4.2 Analysis of User Competencies for Selection of Learning Resources Automated methods for matching qualifications can be applied to various information objects (humans, organizations, learning courses, requirements of an employer, etc.) that are formalized by different terms and concepts and based on different qualification systems. Nowadays many ontologies are developed for the unified representation of such knowledge. For example, the ESCO (European Skills, Competencies, Qualifications and Occupations) ontology (ESCO) contains information about professions, skills and qualifications. ESCO can be used as a dictionary describing, identifying and classifying professional occupations, skills and qualifications relevant for the EU labor market and for education and training.
Fig. 17.5 Use of task thesaurus for semantic retrieval in MAIPS
The user competence thesaurus is generated as a sum of the NL descriptions of the user's qualifications and learning outcomes (formal and non-formal) and then can be matched with vacancies or learning resources. In Gladun et al. (2015) the following method of competence matching is proposed:
1. define the documental content A that can be used for the description of the set of atomic competencies that define some complex information object (for example, the requirements of an employer or the passport of a postgraduate speciality);
2. transform these documents from A into the Wiki representation for a structured representation of the content;
3. select a competence ontology;
4. semantically mark up these Wiki resources by the concepts of this ontology, which can be used as classes, and by the object properties of this ontology, which can be used as semantic properties in Semantic MediaWiki;
5. generate the thesauri of the wikified documents from A;
6. match the sum of these thesauri with the thesauri of learning resources from the Web.
This approach is based on the use of semantic Wiki resources (Krotzsch et al. 2017) and Wiki ontologies that formalize the knowledge structure of such resources (Rogushina 2016).
Fig. 17.6 e-VUE ontology and user interface
The integration of and interaction between ontologies and Wiki resources is a topical research field. Nowadays many semantic extensions of the Wiki technology are used for the development of distributed knowledge bases. There are many important differences in ontological expressivity and functionality between the various extensions. In this work we use the portal version of the Great Ukrainian Encyclopedia (e-VUE—vue.gov.ua) as a source of domain ontologies and of initial sets of task thesauri (Fig. 17.6). e-VUE is based on MediaWiki expanded by Semantic MediaWiki. MediaWiki provides content structuring by the mechanisms of namespaces, templates and categories, and Semantic MediaWiki enriches them with semantic properties and their types. e-VUE thus becomes a knowledge source not only for humans, but also for other Semantic Web applications that can use knowledge exported from e-VUE in generally accepted presentation formats.
17.5 Conclusion This work has attempted to bridge two research areas important for the future generation of intelligent knowledge-based applications: user modeling in adaptive applications (such applications provide personalized processing of information according
to user needs and specifics) and ontological analysis that ensures the reuse and sharing of domain knowledge. The role played by ontologies in adaptive system design, which is often reduced to the representation of a particular knowledge component, can be broadened by the experience of user interaction both with this system and with other applications. We propose to use the task thesaurus as a dynamic element of the user model that is based on the domain ontology representing the more stable aspects of user interests. The simple structure of the task thesaurus provides its fast and efficient processing, and the use of domain ontologies for thesaurus generation helps to avoid the loss of important information. Semantic similarity estimations provide the theoretical basis for the generation of a task thesaurus as a set of concepts similar to the user's current task. Similarity is an important and fundamental concept in AI and many other fields. The analysis of various approaches to these estimations helps in the selection of the concept parameters that influence the semantic distance between concepts. The prospects of automated generation of ontology-based task thesauri depend on the accessibility of pertinent domain ontologies and of well-structured, trusted and up-to-date IRs that characterize user information needs and interests. Therefore, we need information resources where such parameters are defined explicitly and can be processed without additional preprocessing. Semantic Wiki resources, where the relationships between concepts and their characteristics are defined through semantic properties, correspond to such conditions. We propose some examples of the practical usage of task thesauri for personalized Web information search and for the retrieval of complex information objects to demonstrate the advantages and disadvantages of the proposed approach, but the sphere of task thesauri usage is not limited to search. Task thesauri as an element of the user profile can be applied for the personalized adaptation of intelligent system behavior: different users receive different answers to the same request according to their needs, competencies and cognitive abilities. Task thesauri as an element of the user profile can also be applied to e-commerce, distance learning and the control of its results, cybersecurity, e-medicine, etc.
References
Antoniou G, Van Harmelen F (2004) Web ontology language: OWL. Handbook on ontologies. Springer, Berlin Heidelberg, pp 67–92
Calvanese D, De Giacomo G, Lembo D (2007) Tractable reasoning and efficient query answering in description logics: the DL-Lite family. JAR 39:385–429
DAML + OIL (2001) http://www.daml.org/2001/03/daml+oil-index.html. Accessed 30 Oct 2020
ESCO. ESCO (European Skills, Competencies, Qualifications and Occupations). https://ec.europa.eu/esco/portal/home. Accessed 20 Oct 2019
Gladun A, Khala K, Sammour G (2015) Semantic web and ontologies for personalisation of learning in MOOCs. In: Proceedings of IEEE 7th international conference on intelligent computing and information systems, ICICIS-2015, pp 185–190
Gladun A, Rogushina J (2010) Mereological aspects of ontological analysis for thesauri constructing. In: Publishing N (ed) Buildings and the environment. New York, USA, pp 301–308
Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5:199–220
Gómez-Pérez A, Moreno A, Pazos J (2000) Knowledge maps: an essential technique for conceptualization. Data Knowl Eng 33:169–190
Kalfoglou Y, Schorlemmer M (2003) Ontology mapping: the state of the art. The Knowl Eng Rev 18:1–31
Kleshchev A, Artemjeva I (2001) A structure of domain ontologies and their mathematical models
Krotzsch M, Vrandecic D, Volkel M (2017) Semantic MediaWiki. http://c.unik.no/images/6/6d/SemanticMW.pdf. Accessed 22 Sep 2020
Pastor JA, Martínez FJ, Rodríguez JV (2009) Advantages of thesaurus representation using the Simple Knowledge Organization System (SKOS) compared with proposed alternatives. Inf Res 14:1–16
Protege Plugin Library (2019) Protege Plugin Library. http://protegewiki.stanford.edu/wiki/Protege_Plugin_Library. Accessed 22 Sep 2020
Rada R, Mili H, Bicknell E (1989) Development and application of a metric on semantic nets. IEEE Trans Syst Man Cybern 19:17–30
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy
Rogushina J (2016) Semantic Wiki resources and their use for the construction of personalized ontologies. In: CEUR workshop proceedings, pp 188–195
Rogushina J (2018) Models and methods of ontology use for the web semantic search. In: Proceedings of the 11th international conference of programming UkrPROG, pp 197–203
Sánchez-Cervantes JL, Radzimski M, Rodriguez-Enriquez CA et al (2016) SREQP: a solar radiation extraction and query platform for the production and consumption of linked data from weather stations sensors. J Sensors. https://doi.org/10.1155/2016/2825653
Sosnovsky S, Dicheva D (2010) Ontological technologies for user modeling. Int J Metadata Semant Ontol 5:32–71
Stanisław L (2020) Stanisław Leśniewski. In: Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/entries/lesniewski/. Accessed 22 Sep 2020
Taieb M, Aouicha M, Hamadou A (2014) Ontology-based approach for measuring semantic similarity. Eng Appl Artif Intell, 238–261
The differences between a vocabulary, a taxonomy, a thesaurus, an ontology and a meta-model (2003). http://www.welchco.com/02/14/60/03/01/1501.HTM. Accessed 12 Aug 2020
Uschold M (1998) Knowledge level modeling: concepts and terminology. Knowl Eng Rev 13:5–29
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA
Chapter 18
NHC_MDynamics: High-Throughput Tools for Simulations of Complex Fluids Using Nosé-Hoover Chains and Big Data Analytics Luis Rolando Guarneros-Nolasco , Manuel Suárez-Gutiérrez , Jorge Mulia-Rodríguez , Roberto López-Rendón , Francisco Villanueva-Mejía , and José Luis Sánchez-Cervantes Abstract One of the most promising alternatives to acquire knowledge of the behavior of biological systems are the methods of molecular simulation such as Molecular Dynamics (MD). MD has become one of the most powerful numerical tools to study thermodynamic properties of macromolecules such as proteins and other biomolecules. However, MD has two major limitations when it comes to study large systems such as proteins at the molecular level, these are: the size of the system (the number of atoms to study) and the integration time used to evaluate the equations of motion. Therefore, this work is focused on the implementation of the NVT algorithm of molecular dynamics based on the Nosé-Hoover thermostat chains with the high performance computing technology such as Graphical Processing Units (GPUs) and Big Data analytics for the generation of knowledge to help understand the functioning of thermodynamic properties in simulated systems of Lennard-Jones L. R. Guarneros-Nolasco Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Veracruz, Mexico e-mail: [email protected] M. Suárez-Gutiérrez Universidad Veracruzana, Xalapa, Veracruz, Mexico e-mail: [email protected] J. Mulia-Rodríguez · R. López-Rendón Universidad Autónoma del Estado de México, Toluca, Mexico e-mail: [email protected] R. López-Rendón e-mail: [email protected] F. Villanueva-Mejía Instituto Tecnológico de Aguascalientes, Aguascalientes, Mexico e-mail: [email protected] J. L. Sánchez-Cervantes (B) CONACYT—Tecnológico Nacional de México/I. T. Orizaba, Orizaba, Veracruz, Mexico e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 J. A. Zapata-Cortes et al. (eds.), New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Studies in Computational Intelligence 966, https://doi.org/10.1007/978-3-030-71115-3_18
fluids, as well as the study of the behavior of proteins involved in diseases such as diabetes, in order to refine their structures and to formulate and integrate them into improved foods for the diet of people with this condition. Keywords Molecular Dynamics (MD) · Nosé-Hoover Chains (NHC) · Graphic Processing Units (GPUs) · Canonical ensemble (NVT) · Knowledge acquisition
18.1 Introduction Molecular systems are complex and consist of a large number of atoms, so it would be impossible to determine their properties analytically. To use molecular dynamics in the study of biological systems, a robust computing infrastructure is needed to overcome these difficulties. Currently, the technology used for the manufacture of graphics cards or GPUs (Graphics Processing Units) has evolved significantly, so that they are no longer processors exclusively for graphics and have become sophisticated low-cost, high-performance coprocessors. This has made it possible to increase the capacity of a desktop or laptop personal computer thanks to the improvements in their architecture and a flexible programming model for the massive handling of parallel data, making them an attractive alternative for high-performance computing. For this reason, the next generation of supercomputers will have dramatically higher performance than current systems, generating more data that needs to be analyzed (for example, in terms of the number and length of trajectories in molecular dynamics). In the future, the coordination of data generation and analysis can no longer depend on the manual and centralized analysis that is traditionally performed after the simulation is completed, or on the current data representations that have been available for the traditional visualization tools. Powerful data preparation phases (i.e., phases in which the data are transformed from the original raw form into concise and still meaningful representations) will have to precede the data analysis phases. According to the above, the problem is now that the volumes of data exponentially generated by high-performance applications, such as the application in our case study and the molecular dynamics methodology, need to be stored in structured, semi-structured and unstructured formats for later analysis, which will make it possible to take advantage of machine learning methods in order to identify recurrent patterns and predict infrequent events in the trajectories. Therefore, visualization models should be elaborated to identify important characteristics of the atoms by adequately analyzing the positions and velocities through time. This chapter presents a software tool that generates datasets of positions and velocities of a Lennard-Jones fluid in order to analyze thermodynamic properties with the molecular dynamics methodology using a constant-temperature ensemble, as well as Big Data technologies to perform the corresponding analyses. In the dataset, metadata are stored for each step of the dynamics to identify the positions, velocities and temperature reached, in
order to visualize the trajectories traced by each atom during the simulation time. This chapter is structured as follows: Sect. 18.2 presents a set of software packages that implement the NVT ensemble with Nosé-Hoover chains related to our proposal, comparing the technology in which the integration thermostat is programmed as well as whether each of them performs the canonical ensemble (NVT). Sect. 18.3 describes the molecular dynamics models and methods on which the tool is based. In Sect. 18.4 we present the proposed tool for the simulation of a Lennard-Jones system to acquire knowledge of the thermodynamic properties of molecular dynamics based on an NVT ensemble and using the Nosé-Hoover chain thermostat. Case studies of systems of 2048, 10,976 and 23,328 particles for the simulation of the fluid with our proposal are presented in Sect. 18.5 and, finally, conclusions and future work are presented in Sect. 18.6.
18.2 Related Work Several molecular dynamics packages have been developed that include the Nosé-Hoover chain thermostat, such as LAMMPS (Plimpton 1995), AMBER (Case et al. 2020), HOOMD-blue (Anderson et al. 2020) and Gromacs (Hess et al. 2008), which run it on the CPU, while other developments use GPUs but with a different thermostat algorithm, such as NAMD (Phillips et al. 2020) and ACEMD (Harvey et al. 2009).
18.2.1 Software that Implements the Nosé-Hoover Chain Thermostat LAMMPS (Plimpton 1995) is the most popular code that makes use of the chains and in which parallel computing acceleration schemes have been implemented to obtain the maximum performance in CPU clusters; however, the routine that implements the thermostat is still programmed for the CPU. LAMMPS is a classical molecular dynamics code. It has potentials for solid-state and soft-matter materials, as well as coarse-grained and mesoscopic systems. It can be used to model atoms or, more generally, as a parallel particle simulator at the atomic, meso or continuum scale. Assisted Model Building with Energy Refinement (AMBER) (Case et al. 2020) is a family of force fields for molecular dynamics of biomolecules originally developed by Peter Kollman's group at the University of California, San Francisco. AMBER is also the name of the molecular dynamics software package that simulates these force fields. It is maintained by an active collaboration coordinated by David Case at Rutgers University. The AMBER software suite provides a set of programs to apply the AMBER force fields to simulations of biomolecules. It is written in the programming languages Fortran 90 and C, with support for most major Unix-like operating systems
and compilers. The routine that implements the thermostat is still programmed for the CPU. HOOMD-blue (Anderson et al. 2020) is a freely available software package designed explicitly for GPU execution that comprises a set of general-purpose particle simulation tools. It specializes in molecular dynamics simulations of polymer systems. HOOMD-blue keeps all the simulation data inside the GPU memory to overcome the transfer bottleneck from the CPU to the GPU. Several GPU-specific algorithms and approaches are used, including atom classification to reduce branch divergence, effective use of pair lists, and optimizations that take advantage of atomic operations, along with other features found only in state-of-the-art GPUs. The routine that implements the thermostat, however, is still programmed for the CPU. GROMACS (Hess et al. 2008) is a versatile package for performing molecular dynamics, i.e. simulating the Newtonian equations of motion for systems with hundreds to millions of particles. It is designed mainly for biochemical molecules such as proteins, lipids and nucleic acids that have many complicated bonded interactions, but since GROMACS is extremely fast at calculating non-bonded interactions (which generally dominate the simulations) many groups are also using it for research on non-biological systems, for example polymers. Here too, the routine that implements the thermostat is still programmed for the CPU. On the other hand, with the appearance of graphics processing units (GPUs), nowadays it is possible to have a computer with an integrated GPU that offers massive numerical calculation capacities, which has revolutionized molecular simulation techniques through the development of codes such as NAMD and ACEMD, the most representative ones; nevertheless, they implement the Langevin thermostat on GPUs.
18.2.2 Software that Implements the Langevin Thermostat NAMD (Phillips et al. 2020) is one of the first packages to incorporate GPU acceleration. It is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems. Based on the parallel objects of Charm++, NAMD scales to hundreds of cores for typical simulations and beyond 500,000 cores for the largest simulations. NAMD uses the popular VMD molecular graphics program for simulation setup and trajectory analysis, but also supports the AMBER, CHARMM and X-PLOR formats. NAMD is distributed free of charge with the source code. ACEMD (Harvey et al. 2009) is production-level molecular dynamics software specially optimized to run on NVIDIA graphics processing units (GPUs) and is one of the world's fastest molecular dynamics engines. This software has a powerful scripting interface and extensions in Python through HTMD (Doerr et al. 2016), allows the use of the popular CHARMM (Vanommeslaeghe et al. 2010) and AMBER force field formats without any changes, and allows execution on multiple hosts for replica exchange methods. ACEMD has been used to perform molecular dynamics simulations of globular and membrane proteins, oligosaccharides, nucleic acids, and synthetic polymers.
Table 18.1 Comparative table of thermostats and the architecture for which each code is designed

Software     MD   Ensemble NVT   Thermostat            Architecture   License
NAMD         ✔    ✔              Langevin              GPU            Free
Gromacs      ✔    ✔              Nosé-Hoover Chains    CPU            Free
HOOMD-blue   ✔    ✔              Nosé-Hoover Chains    CPU            Free
ACEMD        ✔    ✔              Langevin              GPU            Commercial
LAMMPS       ✔    ✔              Nosé-Hoover Chains    CPU            Free
Amber        ✔    ✔              Nosé-Hoover Chains    CPU            Free
As shown in Table 18.1, the works analyzed use several thermostats to perform the molecular dynamics, the most common being Langevin (Harvey et al. 2009), whose routines are programmed for GPUs, and Nosé-Hoover chains, which are designed to run in CPU environments. Unlike Gromacs, HOOMD-blue, LAMMPS and AMBER, our proposal implements the Nosé-Hoover chains making use of graphics processing units (GPUs), so that this technology allows the massive handling of parallel data. On the other hand, by proposing the use of GPUs we rely on conventional computers, which makes it possible to create a robust computational infrastructure in which it is not necessary to have a cluster of computers to obtain the statistical information generated by the molecular dynamics, increasing the capacity of a desktop or laptop personal computer thanks to the improvements in its architecture and a flexible programming model for the massive handling of data in parallel.
18.3 Models and Methods 18.3.1 Molecular Dynamics Molecular dynamics (Tuckerman 2013) is a computational simulation technique that studies the behavior of a system of many particles by calculating its evolution in time and averaging a quantity of interest over a long enough time. For this it is necessary to numerically integrate the equations of motion of a system of N particles through Newton's second law:
F_i = m_i (d²r_i / dt²),  i = 1, …, N (18.1)
where Fi are the forces acting on each particle due to a pair potential U(r ij ), that is:
Fig. 18.1 Interactions in the atom-atom model between different molecules where the atom a = 1 of molecule i interacts with the atoms of molecule j and so on with the other atoms
F_i = Σ_{j≠i}^{N} F_ij (18.2)
F_ij is the force between particles i and j that is expressed in terms of the potential:
Σ_{j≠i}^{N} F_ij = − Σ_{j≠i}^{N} ∇U(r_ij) (18.3)
So, we have:
Σ_{j≠i}^{N} F_ij = − Σ_{j≠i}^{N} (dU(r_ij) / dr_ij) (r_ij / r_ij) (18.4)
where r ij = |r i − r j | is the relative distance between the centers of the 2 particles. The heart of a simulation with molecular dynamics depends on an adequate description of the system in terms of interaction potential (Fig. 18.1).
18.3.2 Initial Conditions As a first step in the simulation of any system using MD, it is necessary to specify the initial positions of the particles that make up the system. To do this, the face-centered cubic (FCC) lattice is used (Tuckerman and Martyna 2000). The number of atoms contained in such a lattice will be N = 4n³, with n a positive integer. This implies that the lattice can only be built with certain numbers of atoms: 32, 108, 256, 500, 864 and so on. Once the positions of all the atoms in the liquid have been assigned, their initial velocities must also be specified. The usual choice is to take these velocities at
random within a certain interval, distributed in a uniform way or by means of a Gaussian distribution.
18.3.3 Initial Velocity Distribution The initial velocity distribution is based on a Maxwell-Boltzmann distribution (Frenkel and Smit 2002):
P(v_{i,α}) = (m / (2π k_B T))^{1/2} exp(−m v_{i,α}² / (2 k_B T)) (18.5)
where v_{i,α} is the component α (= x, y, z) of the velocity of atom i. The distribution can be used to define the instantaneous temperature T(t) using the equipartition theorem,
⟨(1/2) m v_α²⟩ = (1/2) k_B T (18.6)
which relates the average kinetic energy to the temperature (⟨ ⟩ denotes the ensemble average). Equation (18.6) can be obtained directly from Eq. (18.5). Because the ensemble mean corresponds to the mean over all the velocities of the atoms, the instantaneous temperature T(t) is:
k_B T(t) = (1 / N_f) Σ_{i,α} m v_{i,α}² (18.7)
where N_f is the number of degrees of freedom. Therefore, Eq. (18.7) allows us to calculate the instantaneous temperature from the velocity distribution. It is also clear that, for a given velocity distribution, T(t) is not strictly equal to T. Although the velocities are generated using the distribution in Eq. (18.5) at temperature T, the molecular system contains only a finite number of atoms and the (instantaneous) temperature T(t) will deviate from T. To keep the temperature constant, the velocities can be rescaled according to:
v′_{i,α} = v_{i,α} (T / T(t))^{1/2} (18.8)
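Equations (18.7) and (18.8) translate directly into code. The sketch below is an illustration in reduced units (k_B = 1, m = 1); NumPy, the array shapes and the choice N_f = 3N − 3 (zero total momentum) are assumptions of the example, not details of the authors' implementation.

import numpy as np

def instantaneous_temperature(vel, mass=1.0, kb=1.0):
    """Eq. (18.7): k_B T(t) = (1/N_f) sum over i,alpha of m v_{i,alpha}^2 (here N_f = 3N - 3)."""
    n_atoms = vel.shape[0]
    n_f = 3 * n_atoms - 3
    return mass * np.sum(vel ** 2) / (n_f * kb)

def rescale_velocities(vel, target_t, mass=1.0, kb=1.0):
    """Eq. (18.8): v' = v * sqrt(T / T(t)), so that T(t) becomes exactly the target T."""
    t_now = instantaneous_temperature(vel, mass, kb)
    return vel * np.sqrt(target_t / t_now)

# Example: 2048 atoms drawn from a Maxwell-Boltzmann distribution at T = 2.0
rng = np.random.default_rng(0)
v = rng.normal(0.0, np.sqrt(2.0), size=(2048, 3))
v -= v.mean(axis=0)                              # remove centre-of-mass drift
v = rescale_velocities(v, 2.0)
print(round(instantaneous_temperature(v), 6))    # 2.0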
It is easy to show that the instantaneous temperature after scaling T (t) ≡ T. If no rescaling is carried out, the relative temperature fluctuations in the N atom system are given by (for the system of approximately 1000 atoms the fluctuations in T (t) are ~3%).
ΔT(t) / ⟨T(t)⟩ = (⟨T²(t)⟩ − ⟨T(t)⟩²)^{1/2} / ⟨T(t)⟩ ∼ N^{−1/2} (18.9)
18.3.4 Periodic Boundary Conditions A special feature when performing an MD simulation is the number of particles that make up the system to be modeled, especially when performing simulations involving hundreds or thousands of atoms. The computation time of MD programs grows rapidly with the number of atoms to be modeled because of the evaluation of the forces between the atoms, so it is necessary to keep this number as small as possible. The problem, however, is that such a small system (compared to the number of particles in a mole, of the order of 10²³) is not representative of the bulk of a liquid, since the system is dominated by surface effects. This is solved by applying the so-called periodic boundary conditions (Frenkel and Smit 2002). With this technique the cube in which the system is located, the primary cell, is surrounded by exact replicas in all directions, which are called image cells, forming an infinite lattice. The image cells contain the same atoms as the primary cell and, during a simulation, each of the atoms in the image cells moves in the same way as the atoms in the primary cell. Thus, if an atom of the primary cell leaves it through one side, its image on the opposite side enters the primary cell (Fig. 18.2).
Fig. 18.2 Periodic boundary conditions in a periodic two-dimensional system. The shaded cell corresponds to the central cell
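In practice the periodic boundary conditions are usually applied through coordinate wrapping and the minimum-image convention; the following sketch, assuming a cubic cell of side L in reduced units, illustrates both operations.

import numpy as np

def wrap_positions(pos, box_length):
    """Put every particle back into the primary cell [0, L)."""
    return pos - box_length * np.floor(pos / box_length)

def minimum_image(r_ij, box_length):
    """Replace a separation vector by its nearest periodic image."""
    return r_ij - box_length * np.round(r_ij / box_length)

# Example: two particles near opposite faces of a box of side 10
box = 10.0
r1, r2 = np.array([0.5, 0.0, 0.0]), np.array([9.5, 0.0, 0.0])
print(minimum_image(r1 - r2, box))  # [1. 0. 0.] -> only 1.0 apart through the boundary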
18.3.5 Interaction Potential The success of an MD simulation will depend on the use of a suitable potential model that contains the essential physics of the system (Johnson et al. 1993). The quality of the results will depend on how faithfully it represents the real interactions between the particles.
18.3.5.1 Potential of Lennard-Jones
It describes the interactions between the atoms of the liquid and the Van der Waals type interactions between atoms of different molecules (Johnson et al. 1993). For the system analyzed in the present study the interaction between atom a in molecule i and atom b in molecule j is represented by the following expression:
U_LJ = Σ_{i=1}^{N−1} Σ_{j>i}^{N} Σ_{a=1}^{N_i} Σ_{b=1}^{N_j} u_LJ(r_iajb) = Σ_{i=1}^{N−1} Σ_{j>i}^{N} Σ_{a=1}^{N_i} Σ_{b=1}^{N_j} 4ε_ab [ (σ_ab / r_iajb)^12 − (σ_ab / r_iajb)^6 ] (18.10)
The term (σ_ab / r_iajb)^12 describes the repulsion and the term (σ_ab / r_iajb)^6 describes the attraction between the atoms; σ_ab is the measure of the diameter of the interacting atoms, ε_ab is the parameter that represents the measure of the attraction between the sites in different molecules, and r_iajb is the distance between site a in molecule i and site b in molecule j (Fig. 18.3).
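As a concrete illustration of Eq. (18.10), the sketch below evaluates the Lennard-Jones energy and the magnitude of the central force for a single pair of sites in reduced units; the cutoff handling is an assumption of this example and does not reproduce the authors' GPU kernels.

def lj_pair(r, sigma=1.0, epsilon=1.0, r_cut=4.0):
    """Return (u, f): pair energy 4e[(s/r)^12 - (s/r)^6] and force magnitude
    -dU/dr = 24e[2(s/r)^12 - (s/r)^6]/r, both set to zero beyond the cutoff."""
    if r >= r_cut:
        return 0.0, 0.0
    sr6 = (sigma / r) ** 6
    sr12 = sr6 * sr6
    u = 4.0 * epsilon * (sr12 - sr6)
    f = 24.0 * epsilon * (2.0 * sr12 - sr6) / r
    return u, f

# Example: the energy minimum lies at r = 2^(1/6) sigma, where the force vanishes
u_min, f_min = lj_pair(2.0 ** (1.0 / 6.0))
print(round(u_min, 6), round(f_min, 6))  # -1.0 0.0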
18.3.6 Canonical Ensemble (NVT) The canonical ensemble, or NVT (Hu and Sinnott 2004), is known as the constant-temperature ensemble. It is a system where particles contained in a fixed volume are kept in contact with their surroundings in such a way that they can exchange heat, but not matter. The energy is now not a constant, but fluctuates around an average value. In order to simulate an NVT ensemble, a large number of methods have been proposed (Toxvaerd 1980; Nosé 1984). One of the most used is the Nosé-Hoover chain thermostat (Martyna et al. 1992), where a chain of thermostats is connected to build a Nosé-Hoover chain; however, this chaining makes the evaluation that needs to be done in the simulation computationally expensive. An efficient way to keep the temperature constant in a molecular dynamics simulation is the methodology of the so-called extended phase space proposed by Andersen (1980). In this methodology the positions and momenta of the particles are complemented by additional variables in the phase space that control the temperature fluctuations. The methods widely used to perform the simulation of an NVT ensemble that are
Fig. 18.3 Lennard-Jones potential. The blue line represents the attraction of atoms and the red line their repulsion
based on the extended-system scheme are the Nosé-Hoover (NH) thermostat (Nosé 1984) and the Nosé-Hoover chain (NHC) thermostat. The NHC methodology is a generalization of the NH method and was proposed by Martyna, Klein and Tuckerman (MTK) (1992). The idea is to couple the thermostat of the actual physical system to a chain of thermostats and to incorporate these degrees of freedom into the extended Hamiltonian. Only the first thermostat, already defined by NH, interacts with the real system, and the others are coupled to one another. The phase space generated by the NHC is formed by an additional set of extended variables that evolve in both their positions and momenta. In this way a point in phase space is complemented by these extended variables in the following way:
Γ(t) = ({p_i}, {r_i}, {p_η}, {η}) (18.11)
where p_η are the momenta and η the positions of the extended variables. The equations of motion proposed by MTK to simulate a system in the canonical ensemble are:
ṙ_i = p_i / m_i
ṗ_i = F_i − (p_η1 / Q_1) p_i
η̇_k = p_ηk / Q_k,  k = 1, …, M
ṗ_ηk = G_k − (p_η(k+1) / Q_(k+1)) p_ηk,  k = 1, …, M − 1
ṗ_ηM = G_M (18.12)
where the Q_k are the masses of the thermostats and the G's are the forces of the bath:
G_1 = Σ_{i=1}^{N} p_i² / m_i − 3N k_B T
G_k = p_η(k−1)² / Q_(k−1) − k_B T (18.13)
The temperature is controlled by these forces, which in turn drive the dynamics of the thermostats in the chain. Setting M = 1 is equivalent to having the original NH thermostat, and eliminating the thermostats leads to the NVE ensemble. The physics incorporated in Eq. (18.12) is based on the fact that the term −(p_η1 / Q_1) p_i acts as a kind of dynamic friction force. This friction force regulates the kinetic energy so that its average is the correct canonical value. In a similar way, the bath variable k + 1 serves to modulate the fluctuations of the k-th variable, so that each bath variable is driven to have its own canonical average. The extended system conserves the quantity
H′ = H(p, r) + Σ_{k=1}^{M} p_ηk² / (2Q_k) + d N k_B T η_1 + k_B T Σ_{k=2}^{M} η_k (18.14)
where H(p, r) is the Hamiltonian of the physical system and d is the dimensionality. According to the work of Tuckerman et al. (2006), applying the corresponding operator algebra we can arrive at the following scheme (Fig. 18.4):
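Since the quantity in Eq. (18.14) is conserved by the NHC dynamics, monitoring it along a run is a standard consistency check of the integrator. The sketch below is only illustrative (reduced units, d = 3, hypothetical variable names), not part of the authors' code.

import numpy as np

def nhc_conserved_energy(kinetic, potential, p_eta, q, eta, n_atoms, temp, kb=1.0, d=3):
    """Eq. (18.14): H' = H(p, r) + sum_k p_eta_k^2 / (2 Q_k)
                         + d N k_B T eta_1 + k_B T sum_{k>=2} eta_k."""
    h_phys = kinetic + potential
    h_bath = np.sum(np.asarray(p_eta) ** 2 / (2.0 * np.asarray(q)))
    h_bath += d * n_atoms * kb * temp * eta[0] + kb * temp * np.sum(eta[1:])
    return h_phys + h_bath

# Example: a chain of M = 4 thermostats; in a correct run this value stays constant in time
print(nhc_conserved_energy(3072.0, -9000.0, [0.1, 0.0, 0.0, 0.0],
                           [10.0, 1.0, 1.0, 1.0], np.zeros(4), 2048, 2.0))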
Fig. 18.4 NVT algorithm incorporating NHC thermostats. The NVE ensemble is located in the central part
In these equations we can see the implementation of the NVT algorithm in a molecular dynamics code. The central part corresponds to the NVE ensemble. Thermo-update is the routine where the NHC thermostats are incorporated; it is applied at the beginning and at the end of each step of the dynamics. The equations programmed in the Thermo-update routine, written in terms of velocities, correspond to scaling the particle velocities using the extended variables of the first thermostat, which is the one coupled to the particles or atoms of the system.
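A minimal sketch of the structure of such a Thermo-update routine is shown below. It performs a single Trotter pass without the Suzuki-Yoshida sub-cycling used in production codes, works in reduced units, and its variable names are assumptions of this illustration rather than the routine actually programmed for the GPU.

import numpy as np

def thermo_update(vel, v_xi, xi, q, dt, temp, n_f, mass=1.0, kb=1.0):
    """Half-step NHC update: thermostat forces G_k, chain update, velocity scaling.
    vel: particle velocities; v_xi, xi: thermostat velocities/positions; q: masses Q_k."""
    m_chain = len(v_xi)
    kin2 = mass * np.sum(vel ** 2)                      # twice the kinetic energy
    g = np.zeros(m_chain)
    g[0] = (kin2 - n_f * kb * temp) / q[0]
    for k in range(1, m_chain):
        g[k] = (q[k - 1] * v_xi[k - 1] ** 2 - kb * temp) / q[k]
    v_xi[-1] += 0.25 * dt * g[-1]                       # sweep from the last thermostat down
    for k in range(m_chain - 2, -1, -1):
        aa = np.exp(-0.125 * dt * v_xi[k + 1])
        v_xi[k] = v_xi[k] * aa * aa + 0.25 * dt * g[k] * aa
    scale = np.exp(-0.5 * dt * v_xi[0])                 # coupling to the physical system
    vel *= scale
    kin2 *= scale * scale
    xi += 0.5 * dt * v_xi                               # thermostat positions
    g[0] = (kin2 - n_f * kb * temp) / q[0]              # second sweep, from the first thermostat up
    for k in range(m_chain - 1):
        aa = np.exp(-0.125 * dt * v_xi[k + 1])
        v_xi[k] = v_xi[k] * aa * aa + 0.25 * dt * g[k] * aa
        g[k + 1] = (q[k] * v_xi[k] ** 2 - kb * temp) / q[k + 1]
    v_xi[-1] += 0.25 * dt * g[-1]
    return vel, v_xi, xi

# Example: couple 2048 particle velocities to a 4-thermostat chain (reduced units)
vel = np.random.default_rng(2).normal(0.0, np.sqrt(2.0), size=(2048, 3))
v_xi, xi, q = np.zeros(4), np.zeros(4), np.array([10.0, 1.0, 1.0, 1.0])
vel, v_xi, xi = thermo_update(vel, v_xi, xi, q, dt=0.002, temp=2.0, n_f=3 * 2048 - 3)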
18.4 Approach To achieve a satisfactory result, a processing approach using graphics processing units (GPUs) is proposed. The methodology used in the development of this tool focuses on molecular simulation methods, specifically the molecular dynamics method, which is based on numerically solving the classical equations of motion. To develop the molecular dynamics methodology the following is needed: (1) initial configuration of the study system; (2) molecular dynamics module; (3) module for the calculation of the properties; and (4) presentation of results. The algorithm and the processing diagram of this tool, which integrates the Nosé-Hoover chain thermostat into the molecular dynamics calculation cycle using graphics processing cards, are shown in Figs. 18.5 and 18.6.
Fig. 18.5 Algorithm of the application of the Nosé-Hoover chain thermostat
Fig. 18.6 Processing diagram of a simulation with molecular dynamics and big data analytics
To ensure the efficiency of the algorithm, a simulation was performed to generate the statistics of the thermodynamic properties in equilibrium for a Lennard-Jones fluid (Johnson et al. 1993) (Figs. 18.5 and 18.6).
18.4.1 Initial Configuration Module In this module the initial configuration of the system is built by providing the temperature, the density and the size of the cell, with the purpose of obtaining the number of particles to study. To generate the initial positions and velocities of the system, a Maxwell-Boltzmann distribution is used (Frenkel and Smit 2002). The number of particles contained will be N = 4n³, with n a positive integer. This implies that the lattice can only be built with certain numbers of atoms: 32, 108, 256, 500, 864 and so on. For our cases it was initialized with 2048, 10,976 and 23,328 particles.
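The sketch below illustrates this module under simple assumptions (reduced units, NumPy): an FCC lattice of N = 4n³ sites is generated and the initial velocities are drawn from the Maxwell-Boltzmann distribution of Eq. (18.5); the function names are hypothetical, not those of the actual tool.

import numpy as np

def fcc_positions(n_cells, box_length):
    """Build N = 4 n^3 positions by replicating the 4-site FCC basis over n^3 unit cells."""
    basis = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]])
    a = box_length / n_cells                              # lattice constant
    cells = np.array([[i, j, k] for i in range(n_cells)
                                for j in range(n_cells)
                                for k in range(n_cells)], dtype=float)
    return ((cells[:, None, :] + basis[None, :, :]) * a).reshape(-1, 3)

def maxwell_boltzmann_velocities(n_atoms, temp, mass=1.0, kb=1.0, seed=0):
    """Draw velocities from Eq. (18.5) and remove the centre-of-mass motion."""
    rng = np.random.default_rng(seed)
    vel = rng.normal(0.0, np.sqrt(kb * temp / mass), size=(n_atoms, 3))
    return vel - vel.mean(axis=0)

# Example: n = 8 gives the 2048-particle system of the first case study at density 0.7
n = 8
box = (4 * n ** 3 / 0.7) ** (1.0 / 3.0)
pos = fcc_positions(n, box)
vel = maxwell_boltzmann_velocities(len(pos), temp=2.0)
print(pos.shape, vel.shape)   # (2048, 3) (2048, 3)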
18.4.2 Molecular Dynamics Module The calculation of the positions and velocities is the heart of the molecular dynamics, so in this module the thermodynamic properties of interest for the canonical NVT ensemble (constant number of particles, volume and temperature) are calculated, for which the following steps are performed:
1. Nosé-Hoover chain thermostat. In this step the thermostat implemented by Nosé-Hoover chains is calculated; in this case 4 chains are used.
2. Positions and velocities. In this step the positions and velocities are calculated through the velocity Verlet algorithm (Frenkel and Smit 2002).
3. Review of the Verlet list (Frenkel and Smit 2002). This step efficiently maintains a list of all particles within a given cut-off distance from each other, also called a neighbor list.
4. Lennard-Jones forces. In this step the equations of motion of an N-particle system are numerically integrated through Newton's second law.
5. Velocities. The velocities are recalculated.
6. Nosé-Hoover chain thermostat. The thermostat is recalculated with the 4 Nosé-Hoover chains.
The application of these tasks forms the cycle of the molecular dynamics, which is executed for a given number of steps; this entails the largest computational effort, so this module makes use of graphics processing card (GPU) technology.
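To make the ordering of this cycle concrete, the following self-contained sketch implements its inner part (steps 2-5) for a small system with a naive O(N²) force evaluation; the NHC updates of steps 1 and 6 would wrap this core as in the Thermo-update sketch given earlier. It is a serial illustration in reduced units, not the GPU implementation.

import numpy as np

def lj_forces(pos, box, r_cut=4.0):
    """Naive O(N^2) Lennard-Jones forces with minimum-image PBC (reduced units)."""
    n = len(pos)
    forces = np.zeros_like(pos)
    for i in range(n - 1):
        rij = pos[i] - pos[i + 1:]
        rij -= box * np.round(rij / box)                  # minimum image
        r2 = np.sum(rij ** 2, axis=1)
        mask = r2 < r_cut ** 2
        inv6 = 1.0 / r2[mask] ** 3
        fmag = 24.0 * (2.0 * inv6 ** 2 - inv6) / r2[mask]
        fij = fmag[:, None] * rij[mask]
        forces[i] += fij.sum(axis=0)
        forces[i + 1:][mask] -= fij                       # Newton's third law
    return forces

def velocity_verlet_step(pos, vel, forces, dt, box):
    """Steps 2-5 of the cycle: half kick, drift, new forces, second half kick."""
    vel = vel + 0.5 * dt * forces
    pos = (pos + dt * vel) % box
    forces = lj_forces(pos, box)
    vel = vel + 0.5 * dt * forces
    return pos, vel, forces

# Example: 27 particles on a slightly perturbed cubic grid in a periodic box
rng = np.random.default_rng(1)
box = 6.0
grid = np.arange(3) * 2.0 + 1.0
pos = np.array([[x, y, z] for x in grid for y in grid for z in grid])
pos += 0.05 * rng.standard_normal(pos.shape)
vel = np.zeros_like(pos)
f = lj_forces(pos, box)
for _ in range(200):
    pos, vel, f = velocity_verlet_step(pos, vel, f, 0.002, box)
print(np.round(pos[0], 3))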
18.4.3 Properties Calculation Module with Big Data and Data Analytics This module is responsible for calculating the thermodynamic properties of the system of interest, such as temperature, kinetic and potential energy, as well as for generating the three-dimensional model of the positions and movements of the particles. These are stored in dataset files for visualization with supporting applications (Rosales-Morales et al. 2020) in the results presentation module. One of the main advantages is the generation of the dataset of positions and velocities of the study system (Rodríguez-Mazahua et al. 2016), in order to be able to perform Big Data analytics (Mital et al. 2015), which is the process of examining large datasets—or Big Data—that include a variety of data types (Golchha 2015) to help make informed decisions. This dataset is stored using Apache Cassandra™ technology, a transactional and scalable distributed database system that offers high availability and is capable of supporting huge data sets in clusters with thousands of nodes deployed in multiple data centers (Chebotko et al. 2015), without a single point of failure. For this reason the Big Data modeling focuses on the persistence of positions and velocities and on the data queries it is desired to satisfy, sometimes sacrificing consistency and introducing redundancy in the data for greater access and writing speed. Below are the logical and physical diagrams of the model implemented in Cassandra™, as well as the CQL statements obtained to store the data of the NVT statistical ensemble generated during the simulation. Since modeling for Cassandra™ is query-oriented, the queries to be supported by the system are considered during the creation of the logical diagram, which is illustrated in Fig. 18.7 in the format of Chebotko et al. (2015). In the modeling for Cassandra™, during the transition from the logical to the physical diagram, two main changes are made: the assignment of data types and the
Fig. 18.7 Big data logical diagram
optimizations that are deemed necessary. Figure 18.8 shows the physical diagram of the non-relational database. In the logical diagram, the Trayectory table has the capacity to store more than 31 million values, so, in theory, Cassandra™ operates without any inconvenience, which will allow storing the statistics generated by the NVT ensemble and generating the dataset for the visualization of the positions and velocities. For this reason, no optimization was done on the tables that will be used to store the information. To complete the implementation of the non-relational database model in Cassandra™, the master keyspace was defined, which contains the definitions of the types and tables used by our system. The CQL statements used are shown in Fig. 18.9. Similarly, Apache Spark™ is used, which is a unified analytical engine for large-scale data processing that achieves high performance for batch or streaming data, using a state-of-the-art DAG (Directed Acyclic Graph) task scheduler, a query optimizer and a physical execution engine (Zaharia et al. 2010). In this way a form of advanced analytics is achieved, involving complex applications with elements such as predictive models and statistical algorithms, executed on high-performance analysis systems.
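As an illustration of how a simulation run could feed this schema, the sketch below uses the DataStax Python driver to persist one frame of positions into the atom_positions table defined in Fig. 18.9; the contact point, keyspace name and the use of synchronous single-row inserts are simplifying assumptions (a real run would batch or write asynchronously), and a running Cassandra instance is required.

from cassandra.cluster import Cluster

# Connect to the cluster and to the keyspace holding the tables defined in Fig. 18.9
cluster = Cluster(["127.0.0.1"])              # placeholder contact point
session = cluster.connect("md_keyspace")      # hypothetical keyspace name

insert_pos = session.prepare(
    "INSERT INTO atom_positions (Id_pos, Id_atom, Pos_X, Pos_Y, Pos_Z, No_step) "
    "VALUES (?, ?, ?, ?, ?, ?)")

def store_frame(step, positions):
    """positions: iterable of (x, y, z) per atom for one MD step."""
    for atom_id, (x, y, z) in enumerate(positions):
        session.execute(insert_pos, (step, atom_id, x, y, z, step))

store_frame(0, [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)])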
Fig. 18.8 Big data physical diagram
18.4.4 Results Presentation Module The visualization of the calculated thermodynamic properties is carried out with the support of freely available applications such as Xmgrace (Turner 2005), which shows the graphs of the generated statistics, and VMD (Humphrey et al. 1996), which allows us to visualize the 3D model of the study system with the purpose of observing the movements of the positions calculated for the particles.
18.5 Case Studies: Calculation of the Thermodynamic Properties of a Lennard-Jones Fluid System of 2048, 10,976 and 23,328 Particles 18.5.1 Case Study 1: Lennard-Jones Fluid System of 2048 Particles These data correspond to a fluid of LJ particles simulated in a canonical ensemble. The temperature and density conditions were 2.0 and 0.7 (in reduced units), while the number of particles in the system is 2048. Other simulation parameters are the time step Δt = 0.002,
Fig. 18.9 CQL statements used in the creation of the database schema for Cassandra™
CREATE TABLE manage_atoms ( Id_atom INT, Number_a INT, sigma FLOAT, epsilon FLOAT, masa FLOAT, PRIMARY KEY ((Id_atom)) ); CREATE TABLE simulation_config ( Id_simulation INT, steps INT, temperature FLOAT, Dt FLOAT, Rcut FLOAT, Rlist FLOAT, ensamble INT, PRIMARY KEY ((Id_simulation)) ); CREATE TABLE atom_positions ( Id_pos INT, Id_atom INT, Pos_X FLOAT, Pos_Y FLOAT, Pos_Z FLOAT, No_step INT, PRIMARY KEY ((Id_pos, Id_atom)) ); CREATE TABLE atom_velocities ( Id_vel INT, Id_atom INT, Vel_X FLOAT, Vel_Y FLOAT, Vel_Z FLOAT, No_step INT, PRIMARY KEY ((Id_vel, Id_atom)) );
Fig. 18.9 (continued)
CREATE TABLE trayectory ( Id_film INT, Id_step INT, Id_pos INT, Id_vel INT, No_atoms INT, PBC_X FLOAT, PBC_Y FLOAT, PBC_Z FLOAT, PRIMARY KEY ((Id_film, Id_pos, Id_step, Id_vel)) );
characteristic time scale of the particles τ_T = 0.2, cutoff radius r_C = 4.0σ and Verlet radius r_V = 4.5σ. The reference units selected are: size of the particles σ = 3.405 Å, mass of the particles m = 39.95 g mol⁻¹ and depth of the potential ε/k_B = 119.8 K. Simulations were equilibrated for 20,000 steps and the thermodynamic properties, energy and temperature, were averaged. The production stage was of 40,000 MD steps. We used the velocity Verlet integration algorithm and the NHC thermostat to keep the temperature constant in the NVT ensemble. The use case was developed in two stages, which are described below:
1. Initial Configuration: Using the initial configuration module, the system configuration was made for 2048 particles with temperature 2.0 and density 0.7, obtaining the initial positions and velocities for the study system. Using VMD, the initial system is visualized as shown in Fig. 18.10.
2. Molecular Dynamics Calculation: With the initial configuration, an input file is created containing the configuration parameters, as well as the number of steps to be calculated for the dynamics, in this case 20,000 for the equilibration and 40,000 for the production. The calculation of the dynamics is carried out in two phases:
(a) Calibration of the system
The system must be balanced so that the calculated thermodynamic properties are constant over time. The following results are obtained for the thermodynamic properties and the 3D representation model of the positions and velocities, which shows the change with respect to the initial configuration. Below are the results obtained with the properties calculation module for their review and analysis (Figs. 18.11 and 18.12).
(b) Production dynamics
Once the system is balanced, the simulation of the dynamics is carried out for a greater number of steps in order to obtain the results of interest for the study. Figures 18.13 and 18.14 show the obtained results.
Fig. 18.10 Initial configuration of a 2048 particle LJ fluid system with temperature 2.0 and density 0.7
Fig. 18.11 Representation of: A Kinetic energy; B Potential energy and, C Temperature calculated to balance the system. Note the initial values of each of them and later the constant values
Fig. 18.12 3D model of the new positions and velocities
Fig. 18.13 Representation of: A Kinetic energy; B Potential energy and, C Calculated temperature in a production dynamic with 40,000 steps. In each one of them, the constant conservation of the calculated values is noted, characteristic of the NVT ensemble
Fig. 18.14 3D model of the new positions and velocities
18.5.2 Case Study 2: Lennard-Jones Fluid System of 10,976 Particles These data correspond to a fluid of LJ particles simulated in a canonical ensemble. The temperature and density conditions were 2.0 and 0.7 (in reduced units), while the number of particles in the system is 10,976. Other simulation parameters are the time step Δt = 0.002, characteristic time scale of the particles τ_T = 0.2, cutoff radius r_C = 4.0σ and Verlet radius r_V = 4.5σ. The reference units selected are: size of the particles σ = 3.405 Å, mass of the particles m = 39.95 g mol⁻¹ and depth of the potential ε/k_B = 119.8 K. Simulations were equilibrated for 20,000 steps and the thermodynamic properties,
Fig. 18.15 Initial configuration of a 10,976 particle LJ fluid system with temperature 2.0 and density 0.7
energy and temperature, were averaged. The production stage was of 40,000 MD steps. We used the velocity Verlet integration algorithm and the NHC thermostat to keep the temperature constant in the NVT ensemble. The use case was developed in two stages, which are described below:
1. Initial Configuration: Using the initial configuration module, the system configuration was made for 10,976 particles with temperature 2.0 and density 0.7, obtaining the initial positions and velocities for the study system. Using VMD, the initial system is visualized as shown in Fig. 18.15.
2. Molecular Dynamics Calculation: With the initial configuration, an input file is created containing the configuration parameters, as well as the number of steps to be calculated for the dynamics, in this case 20,000 for the equilibration and 40,000 for the production. The calculation of the dynamics is carried out in two phases:
(c) Calibration of the system
The system must be balanced so that the calculated thermodynamic properties are constant over time. The following results are obtained for the thermodynamic properties and the 3D representation model of the positions and velocities, which shows the change with respect to the initial configuration. Below are the results obtained with the properties calculation module for their review and analysis (Figs. 18.16 and 18.17).
(d) Production dynamics
Fig. 18.16 Representation of: A Kinetic energy; B Potential energy and, C Temperature calculated to balance the system. Note the initial values of each of them and later the constant values Fig. 18.17 3D model of the new positions and velocities
Fig. 18.18 Representation of: A Kinetic energy; B Potential energy and, C Calculated temperature in a production dynamic with 40,000 steps. In each one of them, the constant conservation of the calculated values is noted, characteristic of the NVT ensemble
Once the system is balanced, the simulation of the dynamics is carried out for a greater number of steps in order to obtain the results of interest for the study. Figures 18.18 and 18.19 show the obtained results.
18.5.3 Case Study 3: Lennard-Jones Fluid System of 23,328 Particles These data correspond to a fluid of LJ particles simulated in a canonical ensemble. The temperature and density conditions were 2.0 and 0.7 (in reduced units), while the number of particles in the system is 23,328. Other simulation parameters are the time step Δt = 0.002, characteristic time scale of the particles τ_T = 0.2, cutoff radius r_C = 4.0σ and Verlet radius r_V = 4.5σ. The reference units selected are: size of the particles σ = 3.405 Å, mass of the particles m = 39.95 g mol⁻¹ and depth of the potential ε/k_B = 119.8 K. Simulations were equilibrated for 20,000 steps and the thermodynamic properties, energy and temperature, were averaged. The production stage was of 40,000 MD steps. We used the velocity Verlet integration algorithm and the NHC thermostat to keep the temperature constant in the NVT ensemble. The use case was developed in two stages, which are described below:
Fig. 18.19 3D model of the new positions and velocities
1. Initial Configuration: Using the initial configuration module, the system configuration was made for 23,328 particles with temperature 2.0 and density 0.7, obtaining the initial positions and velocities for the study system. Using VMD, the initial system is visualized as shown in Fig. 18.20.
Fig. 18.20 Initial configuration of a 23,328 particle LJ fluid system with temperature 2.0 and density 0.7
2. Molecular Dynamics Calculation: With the initial configuration, an input file is created containing the configuration parameters, as well as the number of steps to be calculated for the dynamics, in this case 20,000 for the equilibration and 40,000 for the production. The calculation of the dynamics is carried out in two phases:
(e) Calibration of the system
The system must be balanced so that the calculated thermodynamic properties are constant over time. The following results are obtained for the thermodynamic properties and the 3D representation model of the positions and velocities, which shows the change with respect to the initial configuration. Below are the results obtained with the properties calculation module for their review and analysis (Figs. 18.21 and 18.22).
(f) Production dynamics
Once the system is balanced, the simulation of the dynamics is carried out for a greater number of steps in order to obtain the results of interest for the study. Figures 18.23 and 18.24 show the obtained results.
Fig. 18.21 Representation of: A Kinetic energy; B Potential energy and, C Temperature calculated to balance the system. Note the initial values of each of them and later the constant values
Fig. 18.22 3D model of the new positions and velocities
Fig. 18.23 Representation of: A Kinetic energy; B Potential energy and, C Calculated temperature in a production dynamic with 40,000 steps. In each one of them, the constant conservation of the calculated values is noted, characteristic of the NVT ensemble
Fig. 18.24 3D model of the new positions and velocities
The molecular dynamics simulation carried out with the proposed tool aims to demonstrate that the information generated by the algorithm is of great relevance for researchers in the field of molecular dynamics simulation, since the statistics generated allow us to observe that the main properties, such as the temperature, remain constant over time.
18.6 Conclusions and Future Work The results obtained, in which it is observed that the temperature is kept constant, are the main validation of the NVT canonical ensemble. For this reason, it was validated that the use of graphics processing unit technology is viable for the programming of applications that need to make massive use of computing time to model systems with a larger number of particles and to generate knowledge of the behavior of those systems. The calculated properties made it possible to review the results generated by the simulation through free tools and, mainly, to generate the 3D model with a visualization of the behavior of the positions and velocities. As future work, we contemplate incorporating the simulation of diatomic particles for a better detail of the thermodynamic properties in Lennard-Jones fluid systems, as well as increasing the number of particles, which will allow the modeling of larger systems that in the future will include the study of proteins, which contain thousands of atoms, as in the case of diabetes: understanding the behavior and functioning of their structures for the design or improvement of new proteins and their inclusion in the foods incorporated into the diet of people who suffer from this chronic degenerative disease; or, in the case of COVID-19/SARS-CoV-2, to generate new
drugs that help in the prevention of this type of disease, achieving an improvement in the quality of life. According to the above, the application of molecular dynamics modeling and simulation can have a potential impact on any industrial sector where analysis of the behavior and innovation of a product or process is required and where this depends on the chemical and electronic control of the physical properties of the material that one wants to design or develop. In this way, the application of molecular modeling and simulation helps to solve industrial problems of great relevance for the country, for example, in the oil and gas industry to determine properties of many systems of interest, or in the pharmaceutical industry for the design and development of new drugs that help in the mitigation of new diseases. Acknowledgments All simulations reported in this study and the development of our HIMD code were performed using the supercomputer OLINKA, located at the Laboratory of Molecular Bioengineering at Multiscale (LMBM), at the Universidad Autónoma del Estado de México. L.G.N. and F.V.M. acknowledge CONACyT México for supporting their postdoctoral stays at Tecnológico Nacional de México. J.L.S.C. thanks CONACYT for the research position under the Cátedras-CONACYT program. This research work was sponsored by the National Council for Science and Technology (CONACYT) and the Secretariat of Public Education (SEP). The authors are grateful to Tecnológico Nacional de México (TNM) for supporting this work.
Chapter 19
Determination of Competitive Management Perception in Family Business Leaders Using Data Mining
Ángel Rodrigo Vélez-Bedoya, Liliana Adriana Mendoza-Saboyá, and Jenny Lorena Luna-Eraso
Abstract This work seeks to determine the competitive management perception of family business leaders, in order to establish working assumptions for new research and to propose improvement and consolidation initiatives for these types of companies. This non-probabilistic, intentional study applied an instrument with 10 dimensions and 94 variables to a sample of 133 family business leaders from an intermediate city and a large city in Colombia. Data processing was carried out in the Python programming language using supervised machine learning algorithms and statistical techniques such as Cronbach's alpha, the KMO, Levene and Bartlett tests, discriminant analysis and decision trees. The results allow us to identify four main components in 19 variables: Management and technology, Quality management, Compensation, and Country competitiveness.
Keywords Family business · Competitiveness · Perception · Data mining · Machine learning
19.1 Introduction
In recent years, family businesses have undergone an accelerated process of technological modernization, organizational development and strategic consolidation in order to increase their productivity and therefore their competitiveness (Savchina et al. 2016). Three facts can explain this phenomenon of modernization: first, the challenges inherent to the process of globalization and internationalization of economies,
which has required SMEs to update themselves in different areas to maintain their stability and permanence, considering the "family effect" on company performance (Dyer 2006); second, the growing interest of governments in entrepreneurship in SMEs, reflected in the creation of sources of financing, the development of training programs and the assurance of legal conditions that facilitate their technological and competitive development (UNCTAD 2005; OECD 2015; World Bank 2017; ERASMUS 2018); and third, the entrepreneurs' acceptance of external advice, training in operational areas and the appropriation of modern management and information tools (Espinoza Aguiló and Espinoza Aguiló 2012), actions that strengthen productivity, increase the competitive capacity of family businesses (Romero 2006) and generate a strategic vision.
Family businesses are vital in all economies worldwide, as they contribute directly to a country's development (Fried Schnitman and Rotenberg 2009); in the West, records of them date back to the eleventh century (Macías Ramirez 2011). Today, their contribution to the GDP of nations is a matter of great strategic importance (Sanguino et al. 2018), hence the growing need for an in-depth study of their competitiveness and of the alternatives and instruments available to face different challenges (Fried Schnitman 2011). The conflicts faced by these organizations (IESE Business School 2011) must also be examined, as well as the problems of generational replacement (Cámara de Gipuzcoa 2018), to ensure their growth and sustainability.
In Colombia, in-depth work has been carried out on the family business (Gallo Fernández 2009), especially regarding its competitiveness problems and specifically the conflict between capital, company and family. However, science and technology policies have been timid regarding the appropriation of solid processes for the benefit of this type of company. The scope of existing initiatives is still limited by many factors, among them the insufficiency of State resources destined to this sector and the lack of knowledge of the instruments created to support these entrepreneurs. This problem, which has been documented academically, has led to the generation of relevant research and training processes that help solve business problems and mitigate the shortage of initiatives promoting the integration of university and company in order to increase the competitiveness of family SMEs (Vélez Bedoya and Beltrán Duque 2005). It also highlights the ambiguity and imprecision of the information available for making assertive decisions that guide a coherent strategy to increase the competitiveness and productivity of these organizations. Furthermore, there is still a need to consolidate research methodologies that make it possible not only to understand the problems that these companies are experiencing, but also to develop innovative and pertinent proposals that can be applied in companies to favor their competitive and sustainable growth (Vélez 2005; Vélez Montes et al. 2008; Vélez Bedoya and Mendoza 2009; Vélez Bedoya and Rueda Prieto 2017).
The family business is an actor in the economic development of continents such as Europe, Asia and North America and of emerging regions such as Latin America. In Colombia, 86.5% of companies are family businesses, with a failure rate of 87% (CONFECÁMARAS 2018). The competitiveness of the family business is determined by challenges in the national and international environment such as the
scientific and technological lag, which reflects the country's low capacity to identify, produce, disseminate, use and integrate knowledge (Vélez Bedoya and Rueda Hernández 2017), as well as Colombia's investment in Science, Technology and Innovation of just 0.5% of gross domestic product (GDP) in 2017 (DNP 2017).
The present work determines, through machine learning analysis techniques, the perception of competitiveness of 133 family entrepreneurs in two cities in Colombia, to whom a 10-dimension, 94-variable questionnaire was applied. The results indicate that: first, there are 4 main components associated with 19 variables, which reflect 73.4% of the behavior of the sample; second, according to the discriminant analysis, there are 25 independent variables associated with the dimensions of leadership and government, family, and family values, and 64 dependent variables associated with the dimensions of strategy and planning, ICTs, human talent management, quality management and the competitiveness approach; third, according to the decision tree analysis, the managerial styles of the founding family member, the first generation and the second generation differ in terms of the variables of the management policy component. These results made it possible to validate theoretical developments of family business management models which indicate that the competitiveness of these organizations is conditioned by the capacities developed by their management groups to overcome the conflict between family, business and property (Tagiuri and Davis 1992); the firm, the property, the family and the management (Donckels and Frohlich 1991); business, family, property, management and succession (Amat Salas 2004); property, company and family from an evolutionary perspective (Gersick et al. 1997); integration with the environment (Ussman et al. 2001); and the strategic approach (Sharma et al. 2003).
This work allows for the identification of critical factors in the understanding of management practices in this type of company, facilitates the development of new research hypotheses about the problems that lead to business failure, and contributes to the construction of programs and intervention models that result in the growth and sustainability of these important companies. This study of the perceptions of competitiveness in family businesses, from a methodological perspective of applying machine learning techniques, enables systemic and structural heuristics. In the remainder of this chapter, a review of the literature on family business competitiveness issues is presented, followed by the methodological approach used for the analysis; finally, the results and conclusions are communicated.
19.2 Literature Review
The family business: values, conflicts and challenges. A number of factors, chief among them the overcoming of the conflict between family interests, capital and work, determine the competitiveness of family businesses. In this sense, family and business concerns, strong leadership and a culture of continuous change are key success factors (Pounder 2015). Second are the dynamic capacities to generate strategic processes in response to the challenges imposed by the global economy, which will
enable the firm to develop theoretical and practical resilience (Beech et al. 2020). The recurring conflict between family, business and property is relevant, since these are issues that involve, in turn, major problems such as the generational change that determines the management style, and the mediation of emotions and their effects both among family members and among third parties (Efendy 2018). In this sense, it is necessary to generate dynamic capacities so that these organizations can overcome the challenges of the global economy (Weimann et al. 2020) and incorporate R&D processes in order to increase their strategic performance indicators (Babelytė-Labanauskė and Nedzinskas 2017).
The literature on family businesses shows the theoretical advances of recent years: first, one of the great findings is the effect on the company of full control of the property by a family (Van Essen et al. 2015); second, non-observable problems of heterogeneity and endogeneity in relation to the company's performance (Sacristán-Navarro et al. 2011); third, the results of common goals for company leaders (Afonso Alves and Matias Gama 2020); fourth, the level of trust among family members (Eddleston et al. 2010; Ayranci 2017); fifth, the benefits of human capital (Calabrò et al. 2020); sixth, the benefits for social capital formation, because the family and the company are two sources generating support networks (Shi et al. 2015); seventh, the negative effects identified alongside the problems inherent in every family (Martín Castejón et al. 2014); eighth, the hiring of staff without taking merit into account, simply considering family relationships (Howorth and Kemp 2019); and ninth, the lack of management control, or empiricism, with the resulting conflicts (Castoro and Krawchuk 2020).
The family business is overwhelmed by situations such as the lack of clear rules that resolve conflicts of coexistence, order and hierarchy (Lucero Bringas 2017), the mixing of company and partner accounts with personal accounts, which can lead to the company's bankruptcy (Corey-Edstrom 2012), nepotism that acts to the detriment of the engagement of employees (Sugundan et al. 2018; Tabor and Vardaman 2020), the lack of succession planning that threatens continuity (Soto et al. 2018) and the "de-characterization of the organization" that leads to the loss of market positioning and space in its environment, as illustrated by the Steinberg family case (Kets de Vries 1996), cited by Kets de Vries et al. (2007).
Various studies on the problems of family businesses analyze the organizational architecture, the development of corporate culture, marketing strategies, human resource practices (Smith 2018; Castro 2010) and the impact of family businesses on society (Rahman et al. 2017). The archetypes of clan, hierarchy, adhocracy and market organizational culture (Cameron and Quinn 1999) directly affect the performance of companies (Baykal 2019). Clan culture motivates and satisfies employees by reducing absenteeism and staff turnover (Homburger 2015, cited by Kaupp 2018), while adhocratic and market cultures influence the performance of firms (Duréndez et al. 2011). Hierarchical culture has negative effects on performance, customer satisfaction, adaptation to market needs and the image of the company and the product (Felipe et al. 2017). The culture of innovation, on the contrary, increases internal efficiency in relation to the quality of products, services, processes and the organization of tasks (Chen et al. 2018).
Furthermore, it improves customer satisfaction, adaptation to the market,
increases the market share, profitability and productivity, and performance in general (Tedla 2016). This review finds that most of the theoretical models on family businesses pose the interrelation between three main elements: family, company and property, which allows the generation of relevant learning about productivity, competitiveness and sustainability (Botero et al. 2016). One of the most critical challenges for the cultural transformation of family businesses is the consensual planning of the generational change, since this is fundamental to the stability of the management and the durability of the company, although it is the growth of the company that regulates this process (Fang et al. 2016). Theoretical and practical strategic issues are the successor's experience gained outside the family business, the successor's professional development plan, their professional competencies, the transmission of policies and strategies from leader to successor, individual ownership and the concentration of the patrimony, the extra-familiar partners, the knowledge of the goals and the participation in the definition of strategies (Gómez et al. 2010). In this sense, the generational change in leadership in this type of company is currently an object of study and intervention (Lozano Posso 2011). These topics have received attention from researchers because the stability and development of family businesses play out in this cultural process, making it a variable that deserves to be studied.
Strategy and planning. In the administration of family businesses, the formalization of strategic management (Ritson 2011) entails limits and difficulties, as the professionalization of management is constrained by the action of the founder and their knowledge framework (Chua et al. 2009). Thus, environment analysis can become an unsystematic practice without support from business intelligence tools and sector competitive analysis (Chrisman et al. 2013), which would allow the firm to achieve the high levels of agility necessary to face the complexity of the environment (Felipe et al. 2016). In this sense, "corporate strategic management refers to the process of establishing and implementing strategies for competitive survival with the ultimate goal of maintaining a competitive advantage" (Kang and Na 2020). On the other hand, the analysis of the operational capacity of the family business is often not possible due to the absence of self-criticism on the part of the family members who lead it (Levinson 1971; Whidya et al. 2017). The statement of mission, vision, objectives and strategy, as well as the assurance of deployment, monitoring and control processes (Canan et al. 2016), are necessary so that the family business can respond effectively to the expectations of its stakeholders (Sharma et al. 2013); strategic marketing management must lead the family business to generate positioning processes that impact growth and financial sustainability (Kang and Na 2020; Fauzia Jabeen and Dixon 2018). The use of management indicators and management control tools is part of the strategic action of the company in terms of consolidating its investment portfolios (Harvey 2009; Oro and Facin Lavarda 2019).
ICTs. Information and communication technologies support the production process through innovation, although family participation in ownership, management and government can affect this important objective (De Massis et al. 2013).
Information technologies are widely recognized as strategic instruments due to their ability to modify the structure and business models of organizations. These resources contribute to the competitiveness of the company by increasing the efficiency of its processes, facilitating the implementation of a flexible and dynamic organizational architecture, and providing the necessary means to manage resources and deploy synergies with the internal capacities of the organization (Sanchez et al. 2016). ICTs are important in the management of inventories, purchase orders, and logistics and distribution, understood as the "science (and art) that allows the required products to arrive at the intended place in adequate quantity and conditions and at the right time to satisfy market demands" (Martí Espinosa 2016). Gonzalez et al. (2017) consider that information and communication technologies offer mechanisms and procedures for inventory control that take into account the supply chain and internal logistics, allowing companies to be at the forefront with better processes, which ultimately increase customer satisfaction. ICTs contribute to the improvement of financial management to the extent that they allow entrepreneurs to overcome the traditional paradigms of investment, return and insurance risk (Hagelina et al. 2006; Visser and Van Scheers 2018), since rapid modernization requires the use of various information technologies to improve understanding of the situation and to be more efficient in identifying risks in strategic decision-making (Pol Lim 2017). With regard to human resource management, ICTs allow the consolidation of the skills and abilities of the staff to achieve greater efficiency in the use of resources, in sales, in the use of management indicators, document management, customer service, recruiting, critical analysis and data management, inventory control tools, career development and human capital management (Profiles 2020). In the context of family businesses, the entry of new generations makes it possible to develop ICT training processes (Cesaroni et al. 2010); these generational changes in management have also allowed the use of cloud computing technology (CCT) to boost financial benefits, develop software, use accessible and flexible technological tools, improve computer security and take advantage of social media opportunities (Attaran and Woods 2018).
Human talent management. The human management process, from a strategic point of view, focuses on the life cycle of the collaborator, from recruitment through training and development to separation (Bratton and Gold 1999). Critical aspects of this process in family businesses are salary allocation, job rotation, occupational health, organizational culture and performance evaluation (Ogbonna and Harris 2000; Rodriguez and Walters 2017). With regard to owners and managers, it is necessary to generate processes of training, autonomy, job rotation, compensation and leadership development (Azoury et al. 2013). This makes sense because responsibilities, values and attitudes are supported by ethics and social responsibility (Hasnah et al. 2015) and are essential elements of the resilience necessary to overcome times of crisis (DeCiantis and Lansberg 2020).
Quality management system. A quality management system enables the company to account for how it achieves results (Farinha et al. 2016), build trust with consumers, stay in business and ensure sustainable development (Siva et al. 2016).
Its principles, conceived as customer focus, leadership, engagement of people, improvement, evidence-based decision making and relationship management, "are a set of fundamental beliefs, norms, rules and values that are accepted as true and can be used as a basis for quality management" (ISO 2015). This field is based on the philosophy of W. Edwards Deming, who conceived the alignment of production and service systems with customer expectations, reducing variability and continuously improving (Kenyon and Sen 2015). A quality management system implies that the documentation of processes and procedures is carried out with managers in order to generate audits and corrective and preventive actions (Rainforest Alliance 2020). In this system, the owner family is required to be committed and to participate actively, since the outcomes of the system must be the fulfillment of the commitments to all stakeholders and, therefore, the assurance of the quality system itself, based on the sustainability of the business through an increase in its relative market share.
Competitiveness approach. The competitiveness approach according to the WEF model includes a platform made up of several dimensions (WEF 2020). A company is competitive if the country has the institutional fabric necessary to develop business; the infrastructure necessary for efficient operation; stable macroeconomic conditions; health and education systems that allow people not only to enjoy full physical condition but also adequate capacities to contribute to productivity with value; goods, labor and financial markets with high quality standards; the technological disposition to operate businesses and to progressively sophisticate them; and, finally, an adequate innovation environment that allows satisfying the needs and exceeding the expectations of consumers and society in general (Porter 1990).
19.3 Proposed Approach
19.3.1 Data Preprocessing
The study is based on a survey of family business management from which 94 questions are constructed, associated with the following dimensions:
• Direction and governance
• Family
• Family values
• Strategy and planning
• Information technologies
• Human talent management
• Quality management
• Competitive approach
These dimensions are composed of 75 dependent variables and 2 independent variables: Origin of the company and generation that runs the company. The survey
was applied to a population of 133 companies that operate in two cities in Colombia. The variables are the result of answers to questions registered on a Likert scale between 1 and 5, with 5 being the highest perception (Ankur et al. 2015). Data processing took place in Python with the help of the libraries pandas, numpy and scikit-learn, which provide training algorithms based on multivariate descriptive and inferential statistics.
In the first place, the research is based on the observation of multiple variables that, together, describe the perception of the determinants of competitiveness in family businesses. Since no a priori assumptions were imposed, an unsupervised algorithm, Principal Component Analysis (PCA), is used (Navarro Céspedes et al. 2010) in order to find patterns in the variables that describe this perception without defining any predetermined attributes (Kotsiantis 2007). This dimensionality reduction process was validated through the Cronbach's alpha (Snedecor and Cochran 1989), KMO, Levene and Bartlett statistical tests, which indicated a good fit. From the analysis, the main components that describe perception are identified, and for this reason it is appropriate to use them with the total sample.
Second, for the selection of the supervised classification algorithm (Devi et al. 2020), a workflow management process is carried out (Kotsiantis 2007). The evaluated algorithms produced the following accuracy results for the company-origin variable: Logistic Regression, 0.407; Support Vector Machine, 0.556; Decision Tree, 0.593; Linear Discriminant Analysis, 0.407; and Quadratic Discriminant Analysis, 0.148. For the generational change variable, the accuracy was: Logistic Regression, 0.630; Support Vector Machine, 0.667; and Decision Tree, 0.481. The Support Vector Machine (SVM) is therefore selected as the classification algorithm for the analysis of generational change.
As the next step, the training algorithm is refined by evaluating the hyperparameters of the SVM algorithm and applying the cross-validation method with 5 partitions (Liu et al. 2009; Weerts et al. 2020). To find the hyperparameters for training the classifier of the firm's origin, 816 combinations are tested with the following parameters: criterion = [gini, mse, entropy]; max_depth = [3, 8, 12, 15]; and pca__n_components = [1, …, 68]. The most appropriate combination, confirmed by recalculation, was [gini, 12, 12], with a mean precision of 0.547 within plus or minus two standard deviations (±0.328). In the case of the generational change classification, 231 combinations are tested with the following parameters: kernel = [rbf, sigmoid, linear, polynomial]; gamma = [2, 1, 0.1, 1e−2, 1e−3, 1e−4, 1e−5]; and C = [1e−8, 1e−6, 1e−4, 1e−2, 2.5e−2, 1e−1, 1e+2, 25, 50, 100, 1000]. The combination of hyperparameters that offers the best training for founder-run companies is [rbf, 25, 0.1], with a mean recall of 0.58 within plus or minus two standard deviations (±0.115). To classify second-generation companies, the best combination was [sigmoid, 1, 400], with a mean recall of 0.578 within plus or minus two standard deviations (±0.245). To classify third-generation companies, the best combination was [sigmoid, 1, 10], with a mean recall of 0.815 within plus or minus two standard deviations (±0.245). The processing was done with the GridSearchCV utility of scikit-learn (Pedregosa et al. 2011).
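To make the tuning workflow concrete, here is a minimal, hypothetical scikit-learn sketch of a PCA plus SVM pipeline tuned with GridSearchCV under 5-fold cross-validation; the synthetic Likert matrix, the label values and the reduced parameter grid are placeholders rather than the study's actual data or grid.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Stand-in for the survey: 133 companies, 67 independent Likert items (values 1-5).
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 6, size=(133, 67)).astype(float)
y = rng.choice(["founder", "second_generation", "third_generation"], size=133)

pipe = Pipeline([
    ("scale", StandardScaler()),   # put all Likert items on a common scale
    ("pca", PCA()),                # dimensionality reduction before classification
    ("svc", SVC()),                # the classifier selected in the chapter
])

param_grid = {
    "pca__n_components": [4, 10, 25],
    "svc__kernel": ["rbf", "sigmoid", "linear"],
    "svc__gamma": [1.0, 0.1, 1e-2, 1e-3],
    "svc__C": [1e-2, 1.0, 25, 100],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

By default GridSearchCV refits the pipeline on the full sample with the best combination, which mirrors the recalculation step described above.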
After this selection of algorithms, each classifier was trained. In the case of the decision tree algorithm, accuracies of 44.4, 55.3 and 88.8% were obtained for companies originating in Family Entrepreneurship, Economic Need and Economic Solvency, respectively (Gupta et al. 2017). The corresponding confusion matrices are [[4 8], [7 8]]; [[12 4], [8 3]]; and [[24 2], [1 0]], and training accuracy was in the range 0.943–0.994. From Figs. 19.1, 19.2 and 19.3, it can be seen that the number of appropriate nodes for each origin was 8 for Family Entrepreneurship, 7 for Economic Need and 4 for Economic Solvency. According to the results of the decision tree technique, the origin that the classifier distinguishes least well is Economic Solvency.
Fig. 19.1 Family entrepreneurship
Fig. 19.2 Financial need
Fig. 19.3 Economic solvency
In the same way, after training for the classification of companies according to their generational change, the accuracy is 48.14% with a confusion matrix of [[13 1 4], [7 0 0], [2 0 0]]. This reveals the weakness of the sample in terms of the quantity of data available to achieve adequate training for this classification. Nevertheless, it is interesting to review the results of this pilot exercise as preliminary evidence for future studies.
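The per-origin classifiers and confusion matrices reported above can be reproduced with a sketch along these lines; the synthetic data, the binary "Economic Solvency versus the rest" label and the split proportions are illustrative assumptions, not the chapter's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Stand-in survey matrix and a one-vs-rest label for one origin (1 = Economic Solvency).
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 6, size=(133, 67)).astype(float)
y = (rng.random(133) < 0.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(criterion="gini", max_depth=12, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("test accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("training accuracy:", round(clf.score(X_train, y_train), 3))
```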
19.3.2 Identifying the Main Components
In order to increase the knowledge of the available data and to reduce the dimensionality of management in family businesses, an algorithm was used to find the main components (Ankur et al. 2015). The general behavior of the sample can be explained by 25 components whose eigenvalues are greater than 1. However, as can be seen in Fig. 19.4, the first 4 components explain 76.2% of the variance and the fifth component contributes 2.6 percentage points, for a total of 78.8%. Component 5 is associated with 5 variables related to the dimensions Management and governance, Family, and Family values of the family business; it is therefore appropriate to consider these as dependent, or response, variables. In this sense, the variables that make up the remaining dimensions are taken as independent variables, analyzed through the behavior of the 4 main components. After defining the independent variables, the new coordinates are found for each of the companies in the population. The coefficients of the first five companies in the 67 dimensions are partially shown in Table 19.1, which contains the linear decomposition based on the original variables. The coefficients of the first components are greater than those of the last components, which confirms that the selection of the first 4 components gathers 73.4% of the variance of the company sample in the two cities.
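A minimal sketch of this component-selection step follows, assuming standardized survey items so that the eigenvalue-greater-than-one (Kaiser) criterion applies to the correlation matrix; the synthetic data are placeholders for the independent survey variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in for the 67 independent survey variables of the 133 companies.
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 6, size=(133, 67)).astype(float)

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

eigenvalues = pca.explained_variance_          # eigenvalues of the standardized covariance matrix
n_kaiser = int(np.sum(eigenvalues > 1.0))      # components retained under the Kaiser criterion
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("components with eigenvalue > 1:", n_kaiser)
print("variance explained by the first 5 components:", np.round(cumulative[:5], 3))
```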
Fig. 19.4 Variance contribution of each component
Table 19.1 Linear decomposition based on the original variables
Once the 4 main components are defined, factor analysis with varimax rotation is carried out to determine the linear combinations of the main factors for each company. These combinations are explained by 14 variables with a KMO measure greater than 0.7, a good medium level, which indicates that the remaining variables are not significant for describing the behavior of the phenomenon under analysis. The selected components have the following structure: component 1 is associated with 9 variables related to information technology; component 2 is associated with 2 variables related to quality management systems; component 3 is associated with 2 variables related to human talent management; and component 4 is associated with 1 variable related to the competitiveness approach. In this way, the 4 components represent 73% of the phenomenon through these variables, according to Table 19.2.
Table 19.2 Main components composition (variables include I8: Family participation; H7: Occupational health; J2: Macroeconomic stability; G3: Budget, accounting and treasury)
The test statistics yield the following results: KMO = 0.82; Levene's test gives a statistic of 0.844; Bartlett's test gives a statistic of 15,966.6 with p = 0.0; and Cronbach's alpha is 0.675, with a confidence interval of [0.592, 0.749]. The cumulative contribution of the selected components is 0.48, 0.64, 0.69 and 0.73, respectively.
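The rotation and the reported test statistics can be reproduced with standard Python tooling; the sketch below assumes the third-party factor_analyzer package for the varimax rotation and the KMO and Bartlett tests, computes Cronbach's alpha directly from its definition, and uses synthetic placeholder items instead of the 14 retained survey variables.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo, calculate_bartlett_sphericity

# Stand-in for the 14 retained Likert items of the 133 companies.
rng = np.random.default_rng(seed=0)
items = pd.DataFrame(rng.integers(1, 6, size=(133, 14)).astype(float))

kmo_per_item, kmo_total = calculate_kmo(items)
chi_square, bartlett_p = calculate_bartlett_sphericity(items)
levene_stat, levene_p = levene(*[items[col] for col in items.columns])

def cronbach_alpha(df: pd.DataFrame) -> float:
    """Classical Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = df.shape[1]
    return k / (k - 1) * (1 - df.var(ddof=1).sum() / df.sum(axis=1).var(ddof=1))

fa = FactorAnalyzer(n_factors=4, rotation="varimax")
fa.fit(items)
loadings = pd.DataFrame(fa.loadings_, index=items.columns)  # variable-to-component loadings

print(round(kmo_total, 3), round(chi_square, 1), round(bartlett_p, 4), round(cronbach_alpha(items), 3))
```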
19.3.3 Data Analysis After Dimension Reduction
The main components identified can be conceptualized as follows:
Component 1: Management and technology, which includes managerial practice supported by technological tools.
Component 2: Quality management, which includes adequate management of processes and the increase of relative market share.
Component 3: Compensation, which includes a salary consistent with effort and bonuses for achievements.
Component 4: Country competitiveness, which includes macroeconomic stability that guarantees business action.
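The panels discussed below plot pairs of component scores for each company, colored by group; a hypothetical matplotlib sketch of this kind of panel follows, with random scores and group labels standing in for the actual factor scores and survey groups.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in factor scores (133 companies x 4 components) and origin labels.
rng = np.random.default_rng(seed=0)
scores = rng.normal(size=(133, 4))
origin = rng.choice(["Family Entrepreneurship", "Economic Need", "Economic Solvency"], size=133)

fig, ax = plt.subplots()
for group in np.unique(origin):
    mask = origin == group
    # One illustrative panel: component 1 (management and technology) vs component 3 (compensation).
    ax.scatter(scores[mask, 0], scores[mask, 2], label=group)

ax.axhline(0.0, color="grey", linewidth=0.5)
ax.axvline(0.0, color="grey", linewidth=0.5)
ax.set_xlabel("Management and technology")
ax.set_ylabel("Compensation")
ax.legend()
plt.show()
```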
In panel A, according to Fig. 19.5, it is observed that entrepreneurs show a positive perception of management and technology, independent of how they manage compensation. Family businesses with an undetermined stance on management and technology consider their remuneration to be positive, whereas companies established out of economic necessity may be open to management and technology but do not report adequate compensation.
Fig. 19.5 Panel A origin
Fig. 19.6 Panel B origin
In panel C, according to Fig. 19.7, regarding the country competitiveness component, companies tend to consider themselves indifferent. However, it is observed that those who are not indifferent to this dimension of the country’s competitiveness, their perception is negative. In particular, family entrepreneurship companies show a positive correlation between the perception of management and technology and the competitiveness of the country. In panel D, according to Fig. 19.8, it can be observed that the companies in the sample have a tendency to perceive themselves with low compensation and to be indifferent to their quality management. In particular, family startup companies that have better compensation are companies that exhibit a positive or negative perception of their quality management. Panel E, according to Fig. 19.9, identifies a great dispersion of family entrepreneurship companies, on the positive quadrant of the compensation component and on the Fig. 19.7 Panel C origin
Fig. 19.8 Panel D origin
Fig. 19.9 Panel E origin
Panel E, according to Fig. 19.9, identifies a great dispersion of family entrepreneurship companies across the positive quadrant of the compensation component and the negative quadrant of the country's competitiveness. Among the companies originating from economic necessity and economic solvency that are not indifferent to the competitiveness of the country, there is a positive association between the competitiveness of the country and compensation; in other words, the entrepreneur perceives that if one of the two dimensions improves, the other will improve as well. In panel F, according to Fig. 19.10, the concentration of the sample in the center of the Cartesian plane shows indifference to the components of quality management and country competitiveness. However, companies with their origin in family entrepreneurship that perceive the competitiveness of the country as low tend to consider quality management important for their companies. The generational change variable behaves as follows with respect to the new dimensionality of the phenomenon. Panel A, according to Fig. 19.11, shows that second- and third-generation companies are entirely positive about management and technology in their companies.
Fig. 19.10 Panel F origin
Fig. 19.11 Panel A generational change
In contrast to the companies in which the founders are active in management, this segment shows greater dispersion along the positive axes of the component. For the group of companies managed by the second and third generations, compensation is perceived negatively. It is notable that founders with a lower perception of management and technology rate compensation better than those who stand out in management and technology. In panel B, according to Fig. 19.12, companies with their founders in management that do not show a strong perception of management and technology display a positive perception of quality management.
Fig. 19.12 Panel B generational change
There is also a small, well-differentiated group of founder-run companies that perceive neither management and technology nor quality management positively. Panel C, according to Fig. 19.13, identifies a group of founder-run companies that do not have a positive perception of the country's competitiveness but do show some degree of positive perception of management and technology. Although these companies lie in the negative quadrant of the country competitiveness component, a positive relationship is identified between the perception of management and technology and the competitiveness of the country in companies managed by their founders. Second- and third-generation companies are extremely indifferent to the degree of competitiveness of the country and very positive about management and technology.
Fig. 19.13 Panel C generational change
Fig. 19.14 Panel D generational change
In panel D, according to Fig. 19.14, most companies across the three generations are indifferent to quality management and perceive low compensation. However, there is a small group of third-generation companies that perceive the compensation component positively, exhibiting less indifference than is common for their generation. Likewise, some founder-run companies that are not indifferent to quality management are identified, whose compensation is, to some degree, better or worse than that of companies of their generation that are indifferent to quality management. Panel E, according to Fig. 19.15, shows that the companies run by the second generation are divided into two groups. On one side are the companies in which remuneration is low and which are indifferent to macroeconomic stability. On the other side are second-generation companies with a perception of high remuneration whose perception of the macroeconomic environment is not indifferent, that is, it is either positive or negative; fewer companies make up this second group. However, second-generation companies, like the founders' companies, show greater dispersion between the two components.
Fig. 19.15 Panel E generational change
Fig. 19.16 Panel F generational change
In these generations of companies, no specific trend is observed; they are companies that consider that macroeconomic stability is not good and whose remuneration is neither very low nor very high. In panel F, according to Fig. 19.16, most of the sample is indifferent in its perception of the quality management and macroeconomic stability components. However, there is a group of founder-run companies whose perception of macroeconomic stability is negative and whose perception of quality management is positive or negative, but not indifferent. As can be seen in this section, both the origin of the company and the generation in charge of the family business lead to different perceptions of the main components identified. In training processes, these findings are of great value, since they establish starting points for proposals to change and transform ways of thinking, mental models and management practices. On the other hand, the determination of perception allows working hypotheses to mature in understanding the phenomenon of generational change and the ideals of administration and competitiveness.
19.3.4 Making Decision Trees
In order to identify the management variables that characterize the origin of the company, the classification algorithm is applied, as a post-processing step, to each of the origins of the company: Family Entrepreneurship, Economic Need and Economic Solvency. However, no patterns were identified for classifying the origins of Family Entrepreneurship and Economic Need. The solvency tree graph (Fig. 19.17) illustrates the pattern of companies originating from Economic Solvency.
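One way to read the determining variables off a fitted tree is sketched here with scikit-learn's export_text and feature importances; the synthetic training matrix, the placeholder item codes and the binary solvency label are illustrative assumptions rather than the chapter's data or code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in for the 106 training companies and their survey items.
rng = np.random.default_rng(seed=0)
X = rng.integers(1, 6, size=(106, 67)).astype(float)
feature_names = [f"item_{i}" for i in range(X.shape[1])]       # placeholders for codes such as I8, H7, J2, G3
y = (rng.random(106) < 0.15).astype(int)                       # 1 = Economic Solvency origin (illustrative)

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0).fit(X, y)

print(export_text(tree, feature_names=feature_names))          # human-readable split rules
top_variables = np.argsort(tree.feature_importances_)[::-1][:5]
print([feature_names[i] for i in top_variables])               # variables that most characterize this origin
```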
Fig. 19.17 Tree for economic solvency
Table 19.3 shows that, of the 106 companies referenced for training, the machine classifies as companies with a solvency origin 16 companies with a perception of