LNBIP 402
Marinos Themistocleous Maria Papadaki Muhammad Mustafa Kamal (Eds.)
Information Systems 17th European, Mediterranean, and Middle Eastern Conference, EMCIS 2020 Dubai, United Arab Emirates, November 25–26, 2020 Proceedings
Lecture Notes in Business Information Processing

Series Editors
Wil van der Aalst, RWTH Aachen University, Aachen, Germany
John Mylopoulos, University of Trento, Trento, Italy
Michael Rosemann, Queensland University of Technology, Brisbane, QLD, Australia
Michael J. Shaw, University of Illinois, Urbana-Champaign, IL, USA
Clemens Szyperski, Microsoft Research, Redmond, WA, USA
More information about this series at http://www.springer.com/series/7911
Marinos Themistocleous • Maria Papadaki • Muhammad Mustafa Kamal (Eds.)
Information Systems 17th European, Mediterranean, and Middle Eastern Conference, EMCIS 2020 Dubai, United Arab Emirates, November 25–26, 2020 Proceedings
Editors
Marinos Themistocleous (Department of Digital Innovation, School of Business, University of Nicosia, Nicosia, Cyprus)
Maria Papadaki (British University in Dubai, Dubai, United Arab Emirates)
Muhammad Mustafa Kamal (School of Strategy and Leadership, Coventry University, Coventry, UK)
ISSN 1865-1348  ISSN 1865-1356 (electronic)
Lecture Notes in Business Information Processing
ISBN 978-3-030-63395-0  ISBN 978-3-030-63396-7 (eBook)
https://doi.org/10.1007/978-3-030-63396-7

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.
Preface
The European, Mediterranean, and Middle Eastern Conference on Information Systems (EMCIS) is an annual research event addressing the discipline of Information Systems (IS) from a regional as well as a global perspective. EMCIS has successfully helped bring together researchers from around the world in a friendly atmosphere conducive to the free exchange of innovative ideas. EMCIS is one of the premier conferences in Europe and the Middle Eastern region for IS academics and professionals, covering technical, organizational, business, and social issues in the application of information technology. EMCIS is dedicated to the definition and establishment of IS as a discipline of high impact for IS professionals and practitioners. It focuses on approaches that facilitate the identification of innovative research of significant relevance to the IS discipline, following sound research methodologies that lead to results of measurable impact.

Unlike previous years, EMCIS 2020 was held online due to the worldwide COVID-19 pandemic. The health and safety of all participants was the conference's top priority, considering the expanding spread of COVID-19. The pandemic, the travel bans, the movement and gathering restrictions issued by many governments, as well as the restrictions on staff mobility imposed by many universities and organizations, led EMCIS to run the conference online.

This year, we received 161 papers from 37 countries across all continents, and 54 of them were accepted, with an overall acceptance rate of 34.8%. Portugal led the table with the most submitted papers, followed by Sweden, Greece, France, Poland, Tunisia, Cyprus, and the UK. All papers were submitted through the easyacademia.org online review system. Track chairs assigned reviewers and the papers were sent for double-blind review. The papers were reviewed by members of the Conference Committee and/or external reviewers. Track chairs submitted 7 papers, and each of these was reviewed by a member of the EMCIS Executive Committee and one member of the International Committee. The conference chairs submitted 1 paper in total, which was reviewed by two senior external reviewers.

Overall, 54 papers were accepted for EMCIS 2020 as full papers, submitted to the following tracks:
• Big Data and Analytics (6 papers)
• Blockchain Technology and Applications (8 papers)
• Digital Services and Social Media (7 papers)
• Digital Government (4 papers)
• Emerging Computing Technologies and Trends for Business Process Management (2 papers)
• Enterprise Systems (3 papers)
• Information Systems Security and Information Privacy Protection (5 papers)
• Healthcare Information Systems (3 papers)
• Management and Organisational Issues in Information Systems (10 papers)
• IT Governance and Alignment (3 papers)
• Innovative Research Projects (3 papers)
The papers were accepted for their theoretical and practical excellence and for the promising results they present. We hope that readers will find the papers interesting, and we remain open to a productive discussion that will improve the body of knowledge in the field of IS.

October 2020
Marinos Themistocleous Maria Papadaki Muhammad Mustafa Kamal
Organization
Conference Chairs
Maria Papadaki (The British University in Dubai, UAE)
Marinos Themistocleous (University of Nicosia, Cyprus)
Conference Executive Committee
Muhammad Mustafa Kamal, Program Chair (Coventry University, UK)
Vincenzo Morabito, Publications Chair (Bocconi University, Italy)
Paulo da Cunha, Publications Chair (University of Coimbra, Portugal)
Gianluigi Viscusi, Public Relations Chair (Imperial College Business School, UK)
International Committee
Piotr Soja (Cracow University of Economics, Poland)
Angelika Kokkinaki (University of Nicosia, Cyprus)
Peter Love (Curtin University, Australia)
Paulo Melo (University of Coimbra, Portugal)
Heinz Roland Weistroffer (Virginia Commonwealth University, USA)
Yiannis Charalambidis (University of the Aegean, Greece)
Vishanth Weerakkody (University of Bradford, UK)
Gail Corbitt (California State University, USA)
Miguel Mira da Silva (University of Lisbon, Portugal)
Lasse Berntzen (University of South-Eastern Norway, Norway)
Marijn Janssen (Delft University of Technology, The Netherlands)
Stanisław Wrycza (University of Gdańsk, Poland)
Kamel Ghorab (Al Hosn University, UAE)
Hemin Jiang (University of Science and Technology of China, China)
Luning Liu (Harbin Institute of Technology, China)
Slim Kallel (University of Sfax, Tunisia)
Walid Gaaloul (Télécom SudParis, France)
Mohamed Sellami (Télécom SudParis, France)
Celina M. Olszak (University of Economics in Katowice, Poland)
Flora Malamateniou (University of Piraeus, Greece)
Andriana Prentza (University of Piraeus, Greece)
Inas Ezz (Sadat Academy for Management Sciences (SAMS), Egypt)
Ibrahim Osman (American University of Beirut, Lebanon)
Przemysław Lech (University of Gdańsk, Poland)
Euripidis N. Loukis (University of the Aegean, Greece)
Mariusz Grabowski (Cracow University of Economics, Poland)
Małgorzata Pańkowska (University of Economics in Katowice, Poland)
António Trigo (Coimbra Business School, Portugal)
Catarina Ferreira da Silva (ISCTE – Lisbon University Institute, Portugal)
Aggeliki Tsohou (Ionian University, Greece)
Paweł Wołoszyn (Cracow University of Economics, Poland)
Sofiane Tebboune (Manchester Metropolitan University, UK)
Fletcher Glancy (Miami University, USA)
Aurelio Ravarini (Università Carlo Cattaneo, Italy)
Wafi Al-Karaghouli (Brunel University London, UK)
Ricardo Jimenes Peris (Universidad Politécnica de Madrid (UPM), Spain)
Federico Pigni (Grenoble Ecole de Management, France)
Paulo Henrique de Souza Bermejo (Universidade Federal de Lavras, Brazil)
May Seitanidi (University of Kent, UK)
Sevgi Özkan (Middle East Technical University, Turkey)
Demosthenis Kyriazis (University of Piraeus, Greece)
Karim Al-Yafi (Qatar University, Qatar)
Manar Abu Talib (Zayed University, UAE)
Alan Serrano (Brunel University London, UK)
Steve Jones (Conwy County Borough, UK)
Tillal Eldabi (Ahlia University, Bahrain)
Carsten Brockmann (Capgemini, Germany)
Ella Kolkowska (Örebro University, Sweden)
Grażyna Paliwoda-Pękosz (Cracow University of Economics, Poland)
Heidi Gautschi (IMD Business School, Switzerland)
Janusz Stal (Cracow University of Economics, Poland)
Koumaditis Konstantinos (Aarhus University, Denmark)
Chinello Francesco (Aarhus University, Denmark)
Pacchierotti Claudio (University of Rennes, France)
Milena Krumova (Technical University of Sofia, Bulgaria)
Klitos Christodoulou (University of Nicosia, Cyprus)
Elias Iosif (University of Nicosia, Cyprus)
Charalampos Alexopoulos (University of the Aegean, Greece)
Przemysław Lech (University of Gdańsk, Poland)
Contents
Big Data and Analytics

Towards Designing Conceptual Data Models for Big Data Warehouses: The Genomics Case . . . 3
João Galvão, Ana Leon, Carlos Costa, Maribel Yasmina Santos, and Óscar Pastor López

Automating Data Integration in Adaptive and Data-Intensive Information Systems . . . 20
João Galvão, Ana Leon, Carlos Costa, Maribel Yasmina Santos, and Óscar Pastor López

Raising the Interoperability of Cultural Datasets: The Romanian Cultural Heritage Case Study . . . 35
Ilie Cristian Dorobăț and Vlad Posea

An Inspection and Logging System for Complex Event Processing in Bosch's Industry 4.0 Movement . . . 49
Carina Andrade, Maria Cardoso, Carlos Costa, and Maribel Yasmina Santos

DECIDE: A New Decisional Big Data Methodology for a Better Data Governance . . . 63
Mohamed Mehdi Ben Aissa, Lilia Sfaxi, and Riadh Robbana

Towards the Machine Learning Algorithms in Telecommunications Business Environment . . . 79
Moisés Loma-Osorio de Andrés, Aneta Poniszewska-Marańda, and Luis Alfonso Hernández Gómez
Blockchain Technology and Applications

Blockchain Technology for Hospitality Industry . . . 99
Abhirup Khanna, Anushree Sah, Tanupriya Choudhury, and Piyush Maheshwari

Blockchain in Smart Energy Grids: A Market Analysis . . . 113
Evgenia Kapassa, Marinos Themistocleous, Jorge Rueda Quintanilla, Marios Touloupos, and Maria Papadaki
Leadership Uniformity in Raft Consensus Algorithm . . . 125
Elias Iosif, Klitos Christodoulou, Marios Touloupou, and Antonios Inglezakis

Positive and Negative Searches Related to the Bitcoin Ecosystem: Relationship with Bitcoin Price . . . 137
Ifigenia Georgiou, Athanasia Georgiadi, and Svetlana Sapuric

LOKI Vote: A Blockchain-Based Coercion Resistant E-Voting Protocol . . . 151
Marwa Chaieb and Souheib Yousfi

Blockchain for Smart Cities: A Systematic Literature Review . . . 169
Ifigenia Georgiou, Juan Geoffrey Nell, and Angelika I. Kokkinaki

Blockchain in Digital Government: Research Needs Identification . . . 188
Demetrios Sarantis, Charalampos Alexopoulos, Yannis Charalabidis, Zoi Lachana, and Michalis Loutsaris

An Exploratory Study of the Adoption of Blockchain Technology Among Australian Organizations: A Theoretical Model . . . 205
Saleem Malik, Mehmood Chadhar, Madhu Chetty, and Savanid Vatanasakdakul
Digital Government

Analyzing a Frugal Digital Transformation of a Widely Used Simple Public Service in Greece . . . 223
Sophia Loukadounou, Vasiliki Koutsona, and Euripidis Loukis

Why are Rankings of 'Smart Cities' Lacking? An Analysis of Two Decades of e-Government Benchmarking . . . 238
Mariusz Luterek

Citizens' Perceptions of Mobile Tax Filing Services . . . 256
Tinyiko Hlomela and Tendani Mawela

Knowledge Graphs for Public Service Description: The Case of Getting a Passport in Greece . . . 270
Promikyridis Rafail and Tambouris Efthimios
Digital Services and Social Media

e-Commerce Websites and the Phenomenon of Dropshipping: Evaluation Criteria and Model . . . 289
Jacek Winiarski and Bartosz Marcinkowski
E-Learning Improves Accounting Education: Case of the Higher Education Sector of Bahrain . . . 301
Abdalmuttaleb M. A. Musleh Al-Sartawi

Influence of Website Design on E-Trust and Positive Word of Mouth Intentions in E-Commerce Fashion Websites . . . 316
Pedro Manuel do Espírito Santo and António Trigo

When Persuasive Technology Gets Dark? . . . 331
Tobias Nyström and Agnis Stibe

Influential Nodes Prediction Based on the Structural and Semantic Aspects of Social Media . . . 346
Nesrine Hafiene, Wafa Karoui, and Lotfi Ben Romdhane

Determinants of the Intention to Use Online P2P Platforms from the Seller's Perspective . . . 360
Nuno Fortes, Adriana Pires, and Pedro Manuel do Espírito Santo

Social Media Impact on Academic Performance: Lessons Learned from Cameroon . . . 370
Josue Kuika Watat, Gideon Mekonnen Jonathan, Frank Wilson Ntsafack Dongmo, and Nour El Houda Zine El Abidine
Emerging Computing Technologies and Trends for Business Process Management

Towards Applying Deep Learning to the Internet of Things: A Model and a Framework . . . 383
Samaa Elnagar and Kweku-Muata Osei-Bryson

HapiFabric: A Teleconsultation Framework Based on Hyperledger Fabric . . . 399
Hossain Kordestani, Kamel Barkaoui, and Wagdy Zahran
Enterprise Systems

Evaluating the Utility of Human-Machine User Interfaces Using Balanced Score Cards . . . 417
Saulo Silva and Orlando Belo

Enterprise Systems, ICT Capabilities and Business Analytics Adoption – An Empirical Investigation . . . 433
Niki Kyriakou, Euripidis Loukis, and Michail Marios Chatzianastasiadis

Evaluation of Cloud Business Intelligence Prior to Adoption: The Voice of Small Business Enterprises in a South African Township . . . 449
Moses Moyo and Marianne Loock
Healthcare Information Systems

Hospital Information Systems: Measuring End-User Satisfaction . . . 463
Fotis Kitsios, Maria Kamariotou, Vicky Manthou, and Afroditi Batsara

Performance Evaluation of ANOVA and RFE Algorithms for Classifying Microarray Dataset Using SVM . . . 480
Sulaiman Olaniyi Abdulsalam, Abubakar Adamu Mohammed, Jumoke Falilat Ajao, Ronke S. Babatunde, Roseline Oluwaseun Ogundokun, Chiebuka T. Nnodim, and Micheal Olaolu Arowolo

Telemedicine in Shipping Made Easy - Shipping eHealth Solutions . . . 493
Eleni-Emmanouela Koumantaki, Ioannis Filippopoulos, Angelika Kokkinaki, Chrysoula Liakou, and Yiannis Kiouvrekis
Information Systems Security and Information Privacy Protection

Game-Based Information Security/Privacy Education and Awareness: Theory and Practice . . . 509
Stylianos Karagiannis, Thanos Papaioannou, Emmanouil Magkos, and Aggeliki Tsohou

Big Data Analytics in Healthcare Applications: Privacy Implications for Individuals and Groups and Mitigation Strategies . . . 526
Paola Mavriki and Maria Karyda

A Multiple Algorithm Approach to Textural Features Extraction in Offline Signature Recognition . . . 541
Jide Kehinde Adeniyi, Tinuke Omolewa Oladele, Noah Oluwatobi Akande, Roseline Oluwaseun Ogundokun, and Tunde Taiwo Adeniyi

Modified Least Significant Bit Technique for Securing Medical Images . . . 553
Roseline Oluwaseun Ogundokun, Oluwakemi Christiana Abikoye, Sanjay Misra, and Joseph Bamidele Awotunde

A New Text Independent Speaker Recognition System with Short Utterances Using SVM . . . 566
Rania Chakroun and Mondher Frikha
Innovative Research Projects

Artificial Intelligence for Air Safety . . . 577
Rajesh Gandadharan Pillai, Poonam Devrakhyani, Sathvik Shetty, and Deepak Munji
A Creative Information System Based on the SCAMPER Technique . . . 595
Rute Lopes, Pedro Malta, Henrique Mamede, and Vitor Santos

Using Knowledge Graphs and Cognitive Approaches for Literature Review Analysis: A Framework . . . 607
Samaa Elnagar and Kweku-Muata Osei-Bryson
IT Governance and Alignment

The Influence of Cloud Computing on IT Governance in a Swedish Municipality . . . 623
Parisa Aasi, Jovana Nikic, Melisa Li, and Lazar Rusu

Cultural Barriers in Digital Transformation in a Public Organization: A Case Study of a Sri-Lankan Organization . . . 640
Lazar Rusu, Prasanna B.L. Balasuriya, and Ousman Bah

Strategic Alignment During Digital Transformation . . . 657
Gideon Mekonnen Jonathan and Josue Kuika Watat
Management and Organisational Issues in Information Systems

A Chief Information Officer (CIO) Framework for Managing the Fourth Industrial Revolution (4IR): An Exploratory Research Synthesis . . . 673
Joseph George and Grant Royd Howard

A Change and Constancy Management Approach for Managing the Unintended Negative Consequences of Organizational and IT Change . . . 683
Grant Royd Howard

Evaluating the Impacts of IoT Implementation on Inter-organisational Value Co-creation in the Chinese Construction Industry . . . 698
Zhen Sun and Sulafa Badi

Enhancing Decision-Making in New Product Development: Forecasting Technologies Revenues Using a Multidimensional Neural Network . . . 715
Marie Saade, Maroun Jneid, and Imad Saleh

The Effects of Outsourcing on Performance Management in SMEs . . . 730
Eisa Hareb Alneyadi and Khalid Almarri

The Attitude of Consumer Towards a Brand Source: Context of UAE . . . 742
Omer Aftab and Khalid Almarri

Assessing the Success of the University Information System: A User Multi-group Perspective . . . 754
Mariusz Grabowski, Jan Madej, and Adam Sagan
IS Project Management Success in Developing Countries . . . 769
João Varajão, António Trigo, Isabel Moura, and José Luís Pereira

An Iterative Information System Design Process Towards Sustainability . . . 781
Tobias Nyström and Moyen Mustaquim

Extensive Use of RFID in Shipping . . . 796
Anna Karanika, Ioannis Filippopoulos, Angelika Kokkinaki, Panagiotis Efstathiadis, Ioannis Tsilikas, and Yiannis Kiouvrekis
Author Index . . . 807
Big Data and Analytics
Towards Designing Conceptual Data Models for Big Data Warehouses: The Genomics Case

João Galvão (1), Ana Leon (2), Carlos Costa (1), Maribel Yasmina Santos (1), and Óscar Pastor López (2)

(1) ALGORITMI Research Centre, University of Minho, Guimarães, Portugal
{joao.galvao,carlos.costa,maribel}@dsi.uminho.pt
(2) Research Center on Software Production Methods (PROS), Universitat Politècnica de València, Valencia, Spain
{aleon,opastor}@pros.upv.es
Abstract. Data Warehousing applied in Big Data contexts has been an emergent topic of research, as traditional Data Warehousing technologies are unable to deal with Big Data characteristics and challenges. The methods used in this field are already well systematized and adopted by practitioners, while research in Big Data Warehousing is only starting to provide some guidance on how to model such complex systems. This work contributes to the process of designing conceptual data models for Big Data Warehouses proposing a method based on rules and design patterns, which aims to gather the information of a certain application domain mapped in a relational conceptual model. A complex domain that can benefit from this work is Genomics, characterized by an increasing heterogeneity, both in terms of content and data structure. Moreover, the challenges for collecting and analyzing genome data under a unified perspective have become a bottleneck for the scientific community, reason why standardized analytical repositories such as a Big Genome Warehouse can be of high value to the community. In the demonstration case presented here, a genomics relational model is merged with the proposed Big Data Warehouse Conceptual Metamodel to obtain the Big Genome Warehouse Conceptual Model, showing that the design rules and patterns can be applied having a relational conceptual model as starting point.

Keywords: Big Data Warehousing · Big data modelling · Conceptual modeling
1 Introduction

Analytical contexts have been highly influenced by Big Data where new challenges arise both in terms of data modeling approaches and technological concerns that must be considered. Traditional Data Warehousing (DWing) systems lost the capacity to handle data with different characteristics, such as high data volumes, produced at high speed and considering different data varieties. To overcome those challenges, as organizations still need structured data repositories supporting decision making tasks, data warehouses are now implemented using Big Data technologies [1, 2].
Due to its novelty, research in Big Data Warehousing (BDWing) has been quite scarce, with some works based on unstructured approaches and use case technology-driven solutions [2–5]. The work of [2] overcomes these practices by proposing a structured approach that includes guidelines for the design and implementation of Big Data Warehouses (BDWs). This paper formalizes the conceptual modeling of BDWs by defining a Big Data Warehouse Conceptual Metamodel (BDWCMT) with the constructs made available in the data modeling approach of [2], and proposes a method that includes patterns and rules to guide practitioners from the BDWCMT to the design of a specific Big Data Warehouse Conceptual Model (BDWCM).

As demonstration case, and due to the complexity of this application domain, the proposed modeling method is applied to Genomics with the aim to implement the Big Genome Warehouse System. This is intended to be implemented using Big Data tools and technologies in the Hadoop Ecosystem, using Hive as the main storage technology, as this is considered the de facto standard for DWing in Big Data. This physical implementation in Hive must consider background knowledge inherited from the Big Genome Warehouse Conceptual Model (BGWCM) and from the Human Genome Conceptual Model (HGCM). Both models comply with specific constructs that follow UML Class Diagram Metamodels, assuming that a system is represented by a model that conforms to a metamodel [6].

The HGCM is the result of a research work on a complex domain where the use of conceptual models has been proved to be a feasible solution for the integration of data coming from a heterogeneous and disperse set of genomic sources. One of the most challenging problems in the genomic domain is the identification of DNA variants that could be a potential cause of disease. The huge amounts of available data, characterized by their heterogeneity, both in terms of content and structure, as well as the problems for collecting and analyzing them under a unified perspective, have become a bottleneck for the scientific community. In [7], the authors face this problem by presenting a conceptual model that provides the required unified perspective to collect, structure and analyze the key concepts of the domain under a well-grounded ontological basis. Using this background knowledge of the HGCM with the key concepts in this application domain, namely the main identified entities and their relationships, the BGWCM modeled here must conform to the BDWCMT. The aim of this paper is to show how to move from the BDWCMT to the BGWCM using the knowledge explicitly available in the HGCM. A simplification of the HGCM is used in this paper, intended to solve a specific task and ease the validation process of this approach. The details about the physical implementation of the BGWCM in Hive are out of the scope of this paper.

The method for data modeling of BDWs proposed in this paper follows an iterative and goal-driven approach that performs a mapping between the conceptual model of the domain (HGCM) and the BDWCMT. This method considers the main data modeling constructs and proposes the data modeling rules and the data modeling patterns for implementing BDWs. This work is evaluated with the identification of the BGWCM.

This paper is structured as follows. Section 2 presents the related work. Section 3 formalizes the BDWCMT, its main constructs and their characteristics. Section 4 addresses the proposed data modeling rules and patterns, which are instantiated to the Genomics case.
Section 5 summarizes the presented work and outlines future work.
2 Related Work

Data models are essential in information systems design and development as they ensure that data needs are properly considered [4]. In a traditional organizational environment, relational data models are quite popular and strictly consider the business requirements. However, in an organizational context making use of Big Data, the ability to process data increases with the use of flexible schemas, and thus the data modeling methods change significantly [8, 9], as the database schemas can change during application runtime according to the analytical needs of the organization [4, 9]. Taking this into consideration, BDWs are significantly different from traditional data warehouses, since schemas must be based on new logical models that allow more flexibility and scalability, hence the emergence of new design and modeling proposals for BDWs [2].

In Big Data, there are multiple challenges when addressing multidimensional data, namely the capability to ensure schema-less or dynamic schema changes, huge numbers of dimensions and high cardinality, recommendations for automatic partitioning and materialization, or real-time processing [3, 10]. The works of [4] and [11] propose an almost automatic design methodology using the key-value model to represent a multidimensional schema at the logical level, instead of applying the traditional star/snowflake schemas. Moreover, a multidimensional model is provided by a graph-oriented representation as the basis for the conceptual design of this methodology, aiming at the construction of attribute trees representing facts related to the integrated data source and automatically remodeling these trees based on the restrictions resulting from the requirements analysis phase. Still in the NoSQL realm, the work of [5] proposes three types of translations of a conceptual model to a columnar model, showing the implementation of columnar data warehouses in NoSQL.

Additionally, there are works focused on OLAP-oriented technologies for Big Data, with Hive being a popular example. The work of [12] proposes a set of rules/guidelines for transforming a traditional dimensional model [13] into a Hive tabular data model for BDWs, adjusting the table's grain to the domain requirements. A context-independent design and implementation approach for BDWs has been addressed in [2, 14], where several design patterns for modeling performant BDWs were evaluated, aimed at advancing decision making with huge amounts of data, collected at high velocity and with different degrees of heterogeneity. In these works, a data modeling method is proposed, supporting mixed and complex analytical workloads (e.g., streaming analysis, ad hoc querying, data visualization, data mining). Research in this area is still relatively ambiguous and at an early stage, lacking common approaches [15], reason why this paper proposes a more straightforward rationale for modelling BDWs, supported by the constructs of [2].
Fig. 1. The BDW’s conceptual metamodel
3 The Big Data Warehouse Conceptual Metamodel

In [2], the set of constructs for modelling a BDW was proposed without any formalization in a conceptual metamodel that clearly states how those constructs are organized and how they complement each other in this data system. Extending the work of [2], those constructs are used in this paper to propose the BDWCMT, a metamodel that formalizes the elements used to model a BDW (Fig. 1).

Each object in a BDW can be classified as an Analytical Object (AO), a Complementary Analytical Object (CAO), a Materialized Object (MO) or a Special Object (SO). The AOs are subjects of interest in an analytical context, for instance Sales or Products. Usually, objects start as AOs, but as the data modelling approach proceeds, they can turn into CAOs if they are shared by several AOs and if they comply with the characteristics and guidelines presented later in this section. In these cases, the AOs outsource the descriptive families to the CAOs. Drawing a parallel with Kimball's Dimensional Modeling [13], this type of object is similar to the conformed dimensions of a dimensional data model, but capable of combining the concept of aggregated facts for dimensions as well, due to the analytical value of CAOs. The MOs store the results of time-consuming queries, aiming to improve the BDW performance by providing pre-aggregated results to the several data consumers, instead of processing heavy aggregations multiple times. Finally, the SOs are of three types: Spatial, Date and Time. They are used to standardize concepts like Date, Time and Space, ensuring that the attributes related to those concepts have the same meaning and format across the BDW.

The several objects, excluding the SOs as these are seen as a particular case, include analytical and/or descriptive families, which in their turn include analytical and descriptive attributes, respectively. The attributes include atomic or collection values. The atomic values can be used as (or as part of) the partition, bucketing and/or granularity keys of an object. The records of the objects can be mutable or immutable,
allowing or avoiding update operations. These several constructs included in the conceptual metamodel are described in Table 1. Extending the descriptive families, outsourced descriptive families allow relationships between AOs and CAOs. These are useful for having a flexible data modelling approach that enhances the performance of the BDW [14].
Table 1. Description of the BDW's conceptual metamodel constructs

Object
• Represents: Constructs used in the conceptual modeling of a BDW.
• Characteristics: Highly performant structures to provide analytical value to decision support scenarios.
• Includes: Analytical, complementary analytical, materialized, and special (date, time and spatial) objects.

Analytical Object
• Represents: Isolated object of a subject of interest for analytical purposes.
• Characteristics: Highly denormalized and autonomous structures able to answer queries without the constant need of joins with other data structures.
• Examples: Sales, purchases, inventory management, customer complaints, among others.
• Includes: Analytical families, descriptive families, records, granularity key, partition key, bucketing (or clustering) key.

Analytical Family (including Analytical Attributes)
• Represents: A set of attributes with numeric values that can be analyzed using different descriptive attributes (e.g., grouped or filtered by).
• Characteristics: Logical representation of a set of indicators or measures (analytical attributes) relevant for analytical purposes. Can include factual (numeric evidence of something) or predictive (an estimative or a prediction of what could happen) attributes.
• Examples: Sold quantity, discount value and sold value.
• Includes: Analytical families include analytical attributes.

Descriptive Family (including Descriptive Attributes)
• Represents: A set of descriptive values that are used to interpret analytical attributes by different perspectives, using aggregation or filtering operations, for example.
• Characteristics: A descriptive family is a logical representation of a set of attributes usually used to add meaning to a numeric indicator.
• Examples: Customer name, product description and discount type.
• Includes: Descriptive families include descriptive attributes.

Record
• Represents: The set of values for the attributes of an occurrence of an analytical object.
• Characteristics: Can be mutable (allow updates) or immutable records (forbid updates).
• Examples: The values that characterize the purchase of a product, by a customer, on a store, with a factual and/or predicted quantity.
• Includes: Atomic values (integer, float, double, string, or varchar) or collections (complex structures like arrays, maps, or JSON).

Granularity Key
• Represents: The level of detail of records to be stored in an analytical object.
• Characteristics: Is defined using one or more descriptive attributes that uniquely identify a record. It may not need to be physically implemented in a data system as a primary key.
• Examples: Sales order, product identifier, among others.
• Includes: One or more descriptive attributes that uniquely identify a record.

Partition Key
• Represents: The physical partitioning scheme applied to the data, fragmenting the analytical objects into more manageable parts that can be accessed individually.
• Characteristics: Is defined using one or more descriptive attributes (although analytical attributes can also be used) that form the partition key.
• Examples: Time and/or geospatial attributes are the most useful ones, as data is typically loaded and filtered in hourly/daily/monthly batches for specific regions or countries.
• Includes: One or more descriptive attributes that form the partition key.

Bucketing Key
• Represents: The physical clustering applied to the data, grouping records of an analytical object.
• Characteristics: Is defined using one or more descriptive attributes that form the bucketing key.
• Examples: Attributes such as products or customers distributing the data by similar volumes.
• Includes: One or more descriptive attributes that form the bucketing key.

Complementary Analytical Object
• Represents: Object that complements other analytical objects, providing an autonomous structure with analytical value that is used to complement the different analytical perspectives provided by the analytical objects.
• Characteristics: Object whose granularity key (whole or part of it) is used by other analytical objects, meaning that a join between two or more objects is possible.
• Examples: Customer account, product, supplier, among others.
• Includes: Analytical families, descriptive families, records, granularity key, partition key, bucketing (or clustering) key.

Materialized Object
• Represents: Object that includes an aggregation of the records of an analytical or complementary analytical object, based on frequent access patterns to the data.
• Characteristics: Enhances the performance of frequent queries by performing a pre-aggregation of the data and pre-computing time-consuming joins between large objects.
• Examples: Views on any analytical, complementary analytical and special objects.
• Includes: Can be created based on any analytical or complementary analytical objects.

Special Object (Time, Date and Spatial Object)
• Represents: Objects that include several temporal and/or spatial attributes that complement the analytical objects (or complementary analytical objects).
• Characteristics: Use standard time, date and spatial representations in autonomous objects, avoiding the increase of the size of the analytical or complementary analytical objects.
• Examples: Time: hour, minute, second; Date: day, month, year; Spatial: city, country.
• Includes: Descriptive families, records, and granularity keys.
Table 2 summarizes a set of guidelines and good practices proposed in [2] for the use of outsourced descriptive families and nested attributes, taking into consideration the domain requirements and helping practitioners to identify contexts where they can be useful.
Table 2. Guidelines for outsourced descriptive families and nested attributes

Outsourced Descriptive Family
• The descriptive family is frequently included in other analytical objects.
• The descriptive family has low cardinality, i.e., its distinct records will form a low volume CAO that easily fits into memory, enabling the capability to perform map/broadcast joins in SQL-on-Hadoop engines.
• The data ingestion frequency of the resulting CAO is equivalent to the other AOs it is related to.
• The CAO resulting from the outsourced descriptive family can provide analytical value by itself.
• The records of the CAOs formed by the outsourced descriptive families are recommended to be immutable.

Nesting Attributes (in a Collection)
• Avoid nested attributes in a collection if there is the need to perform heavy aggregations on that data.
• Avoid nested attributes in a collection when using filtering operations based on nested values.
• Nested attributes included in a collection are not meant to grow rapidly.
• Estimate the collection initial size and its potential growth before adopting nested attributes.
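As a purely illustrative sketch of how these constructs could be materialized, the following PySpark snippet issues hypothetical Hive DDL for an analytical object with a partition key, a bucketing key and a nested collection, plus a complementary analytical object obtained from an outsourced descriptive family. The table and column names (sales_ao, product_cao, etc.) are assumptions made for the example and are not the schema of the demonstration case; the same statements could also be run directly in the Hive CLI or Beeline.

```python
from pyspark.sql import SparkSession

# Illustrative sketch only: hypothetical table/column names, not the actual BGW schema.
spark = SparkSession.builder.appName("bdw-constructs").enableHiveSupport().getOrCreate()

# Analytical Object: denormalized, with a granularity key (sale_id), a partition key
# (sale_date), a bucketing key (customer_id) and a nested collection for detail records.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_ao (
        sale_id        STRING,   -- granularity key
        customer_id    STRING,   -- bucketing key
        customer_name  STRING,   -- descriptive family (denormalized)
        sold_quantity  DOUBLE,   -- analytical family
        sold_value     DOUBLE,
        items          ARRAY<STRUCT<product_id:STRING, quantity:DOUBLE>>   -- nested collection
    )
    PARTITIONED BY (sale_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Complementary Analytical Object: an outsourced descriptive family with low cardinality,
# small enough for map/broadcast joins with the analytical objects that reference it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS product_cao (
        product_id          STRING,   -- granularity key shared with sales_ao
        product_description STRING,
        discount_type       STRING
    )
    STORED AS ORC
""")
```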
4 The Big Genome Warehouse Conceptual Model

The method presented in this paper is based on rules and design patterns that aim to gather the information of a certain application domain mapped in a relational conceptual model, merging it with a BDWCMT in order to obtain a BDWCM. In this work, the method is presented and demonstrated in the Genomics application domain.
4.1 Data Modeling Rules
The data modeling rules are based on a goal-driven approach that identifies the analytical value of the entities present in the conceptual model of the domain and characterizes the data volume and querying frequency of those entities, as this information is later used for applying the data modeling patterns. The three data modeling rules are:
R1. Entities with High Analytical Value. Identification of the main entities of the domain, pointing out those queried for decision support and providing the main business or analytical indicators. In this process, it is relevant to consider that the approach is goal-driven, so an entity may have a relevant analytical value in a particular domain or business process, being identified as a key entity by this rule, but in other contexts it may only be used for providing contextual information, such as who or what. From the data modelling perspective, these entities usually receive multiple relationships with the M cardinality (M:1, many-to-one), integrating several concepts of the domain.

R2. Entities with High Cardinality. Characterization of entities with high data volume, helping in the process of identifying the entities that are candidates for outsourced descriptive families, since high cardinality entities are good candidates for denormalization processes, avoiding joins with huge amounts of data. This high cardinality classification of the entities cannot be exclusively based on row counting processes, requiring additional knowledge from the domain. The data engineer, with the help of the domain expert, should estimate data growth rates based on a deep knowledge of the application domain.

R3. Entities with Frequent Access Patterns. Characterization of the entities with frequent access patterns which, combined with R2, points to entities that are candidates for MOs, increasing the overall performance of the BDW system.

Taking into consideration the HGCM available at [7] and the constructs and guidelines presented above, the data modeling rules were applied, classifying a subset of the entities of this domain according to R1, R2 and R3 (Fig. 2).
Fig. 2. The human genome conceptual model. Adapted from [7].
4.2 Data Modeling Patterns
The data modeling patterns take into consideration that a BDW is built with the goal of supporting decision-making tasks and that those tasks highly depend on the identified goals or main analytical activities. This data modeling method has the flexibility that is needed in a Big Data context, allowing the evolution of the models when: i) new business processes/data sources are identified; ii) new data is available for the existing processes/data sources; or iii) new data requirements change the classification of the entities in terms of design rules.

As described in Sect. 3, one of the main constructs in the BDWCMT is the concept of AOs. These are highly denormalized and autonomous structures able to answer queries without the constant need of joining different structures. Often based on flat structures, for better performance [14], these completely or mostly flat structures significantly increase the storage size of the BDW, a problem that has even more impact when multiple AOs share the same descriptive families. To face this balance between data volume and processing performance, the proposed data modeling patterns allow for the identification of data models that are: i) highly flexible, as the data engineers have instruments that guide the modeling process, without limiting the human decisions; ii) highly performant, identifying objects that answer the main domain questions considering both data volume and performance concerns; and iii) highly relevant, providing different analytical views on the data under analysis.

The design patterns take into consideration the need to identify the different objects in the BDW, their type (AOs, CAOs, MOs or SOs), and the descriptive and analytical families included in those objects. Considering a traditional relational context in which data is highly normalized and each entity details a specific set of attributes with some level of detail, a BDW uses the same data but denormalizes the data structures as much as possible, without compromising the BDW sustainability in terms of storage space or its usability in terms of performance. In this process, data at different levels of detail can be stored, making available objects that may answer more detailed queries, while others can support aggregated and very performant answers to more general questions (Fig. 3).
Fig. 3. Levels of detail for the several entities
In Fig. 3, the three entities available in the domain are possible AOs for the BDW and all of them could be included in this data repository using a similar data model (highly normalized). However, the BDW must be aligned with the analytical queries and must consider storage and performance concerns. As an example for a BDW, two possible data models are depicted in Fig. 4.
Fig. 4. Examples of possible data models
In Fig. 4, Example 1 includes Chromosome as an AO that outsourced a descriptive family, included in a CAO named Assembly, and nested the descriptive family of Gene into a collection, in accordance with the Chromosome granularity. Each object has its own granularity and the denormalization must respect that granularity. Example 2 includes a fully denormalized AO called Gene, with Chromosome and Assembly as denormalized descriptive families of this object.

Based on the assumption that all the entities present in a data model, such as a relational-based one, are candidate or possible objects in a BDW, the design patterns are:

P1. Analytical Objects. Entities classified as of type R1 are identified as possible AOs due to their high analytical value.

P2. Complementary Analytical Objects. Entities can be outsourced to CAOs if they comply with the best practices summarized in Table 2, if they are not classified as of type R2, and if they maintain relationships of cardinality 1:M (one-to-many) with more than one entity of the domain.

P3. Descriptive Families. Entities not identified by the design patterns P1 and P2 are candidates to be denormalized as descriptive families of AOs or nested into a collection of AOs, in accordance with their granularity. In Fig. 4, Example 1, Chromosome includes Gene as a collection, as a chromosome includes multiple genes, whereas in Example 2 Gene denormalizes Chromosome and Assembly, as a gene maintains a unitary relationship with a chromosome and an assembly.

P4. Special Objects. Entities including temporal and/or spatial attributes point to the need for SOs that include the calendar, temporal or spatial descriptive attributes relevant in the application domain.

P5. Materialized Objects. Entities of type R3 can be, in addition to the previous patterns, labeled as possible MOs with aggregates usually used in analytical tasks.
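As an illustration only, the following sketch shows one possible in-memory representation of the P1 and P2 decisions. The entity names follow the demonstration case, but the rule tags, the relationship counts and the helper function are simplified assumptions made for the example; they are not part of the proposed method itself, which relies on the data engineer and the domain expert.

```python
# Illustrative sketch: a toy representation of the R1/R2 tags and of patterns P1 and P2.
r1_entities = {"Variation_Databank", "Statistical_Evidence", "Variation",
               "Variation_Phenotype", "Frequency", "Gene"}          # high analytical value (R1)
r2_entities = set()                                                  # high-cardinality entities (assumed empty here)
one_to_many = {"Databank": 2, "Variation": 2, "Chromosome": 2}       # assumed 1:M relationship counts

def apply_patterns(entity, provides_analytical_value):
    """P2 first: shared entities not tagged R2 and with analytical value become CAOs;
    otherwise P1 keeps R1 entities as AOs; everything else is denormalized or nested (P3)."""
    if one_to_many.get(entity, 0) > 1 and entity not in r2_entities and provides_analytical_value:
        return "CAO"
    if entity in r1_entities:
        return "AO"
    return "descriptive family / nested collection (P3)"

for entity, has_value in [("Variation", True), ("Databank", True),
                          ("Chromosome", False), ("Gene", True)]:
    print(entity, "->", apply_patterns(entity, has_value))
# Variation -> CAO, Databank -> CAO,
# Chromosome -> descriptive family / nested collection (P3), Gene -> AO
```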
Considering the defined patterns and the domain knowledge expressed in the HGCM presented in Fig. 2, which already includes the classification of the entities considering the data modeling rules, it is now possible to integrate the specific characteristics of a BDW with the domain knowledge of the human genome in order to propose the conceptual model of the Big Genome Warehouse (BGWCM). Figure 5 presents a synopsis of the approach that guides practitioners to apply the method and obtain the BGWCM. The entities identified in the domain knowledge are mapped against themselves in a matrix, to identify their common relationships. It also maps the design rules and patterns with the entities of the domain. The application of the data modeling patterns gives a first overview of the main objects of the BDW, which are refined in successive iterations as the data modeling method proceeds.
Fig. 5. Applying the design rules and patterns to the HGCM (Color figure online)
In the first step, the application of pattern P1 allows the identification of 6 analytical objects (Variation_Databank, Statistical_Evidence, Variation, Variation_Phenotype, Frequency, Gene). The second step identifies the CAOs, starting by choosing the entities that have more than one 1:M relationship to other entities. In this case, the possible CAOs are Databank, Variation and Chromosome, but only Databank and Variation are classified as CAOs, since Chromosome alone cannot provide analytical value, complying with the best practices presented in Table 2. Note that this classification can change if the size of the object makes joins inefficient.
Although Databank was not classified by R1, domain experts may show interest in knowing the databanks that are not used in the study of a variation. This is possible with a query that checks the records of Databank not included in Variation_Databank, for instance. In this step, one object previously classified as an AO, Variation, is now reclassified as a CAO due to its use by other objects. With the identification of AOs and CAOs, the entities without these classifications are candidates to be denormalized into the AOs or CAOs previously identified (P3), in accordance with their granularity. The fourth step identifies the entities that have relationships with the SOs, namely Date, Time or Spatial. In this case, Bibliography_Reference, Assembly and Databank_Version are identified as having relationships with the Date object (P4). In P5, and due to its frequent access pattern, one object, Variation Aggregates, is identified as a possible candidate for an additional materialized object that answers frequent queries of the application domain. Following the design patterns, the BGWCM is obtained (Fig. 6). With P1 and P2, 5 AOs (Variation_Databank, Statistical_Evidence, Variation_Phenotype, Frequency, Gene) and 2 CAOs (Databank, Variation) were identified. Now, the relationships of the entities are analyzed.
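For instance, assuming the BGWCM objects are materialized as Hive tables named databank and variation_databank joined through a databank_id attribute (hypothetical names chosen for this sketch only), the query over Databank and Variation_Databank mentioned above could be expressed as an anti join:

```python
from pyspark.sql import SparkSession

# Illustrative sketch: hypothetical table and column names, not the actual BGW schema.
spark = SparkSession.builder.appName("bgw-queries").enableHiveSupport().getOrCreate()

# Databanks that are not used in the study of any variation.
unused_databanks = spark.sql("""
    SELECT d.*
    FROM databank d
    LEFT ANTI JOIN variation_databank vd
      ON d.databank_id = vd.databank_id
""")
unused_databanks.show()
```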
Fig. 6. The big genome warehouse conceptual model (Color figure online)
Starting by Databank, this entity has relationships with Variation_Databank, Bibliography_Reference and Databank_Version, linking Databank and Variation_Databank. Taking into consideration the domain knowledge, Databank_Version is an entity with a set of properties of Databank which also complies with the nested attributes best practices presented in Table 2. For this reason, Databank_Version is nested and included in the model as a collection of the Databank. These decisions are left to the data engineer, as the approach is meant to be flexible enough to accommodate the data analysis requirements in the data model that best suits those analytical needs, while complying with the defined data modeling rules and patterns. Variation_Databank does not have any additional relationship, besides the ones inherited from the application domain, outsourcing the descriptive families of [...]

[...] 0 AND SCx > 0 AND SCy > 0).

5.1 Analysis and Discussion of Results
The intersection between the manually mapped graph and the Similarity Graph is presented in Fig. 4. The result is a set of subgraphs containing 31 nodes and 36 relationships, extending the one presented in Fig. 3, by adding relationships with full lines that represent the relationships that were manually and automatically detected, and dashed lines that represent the relationships that were only manually detected.

Fig. 4. Graph with manual and automatic mapping relationships

Using this approach, the relevant entities were identified, even in a complex context where the names of the attributes are very different; the matched pairs were those classified as similar and approved using the domain restrictions. Take, for example, the name of the variant: Variant name, snpid, SNPS and ID. By comparing the manually detected relationships (36) with those automatically detected (28), a match rate of 77.7% was achieved, representing a noteworthy result for a first approach. Other particularly difficult attributes, such as those associated with the phenotype, were partially identified.

The first analysis of the graph points out that some of the missed relationships could be inferred by transitivity. For example, if Variant name is similar to SNPS and SNPS is similar to ID, would not Variant name be similar to ID? Also, after sharing the results with the domain expert, the pairs (Reference, Minor allele (ALL)) and (Alt, Minor allele (ALL)), which were not manually identified, were considered required to do the integration of the different datasets.

To better understand the identified false positives, further analyses were made in order to identify the False Positive Challenges (FPC). In this context, CM (Content Measures) stands for all content similarity measures considered in this work (JI, SCx and SCy). All pairs with HS = 100 and CM = 0 that were automatically classified were also manually classified, but there are cases with HS values around 70%–80% that alone cannot be used to say that two attributes are similar. For example, the pair of attributes Disease and diseaseType, with HS = 78% and CM = 0, do not match, as one is the name of the disease and the other its type (FPC1). Also, comparisons with attributes of low cardinality tend to increase the number of false positives (FPC2). Higher thresholds for the metrics reduce the number of false positives, but also increase the number of unmatched pairs (FPC3).

Furthermore, the analysis of results showed the Unmatched Pair Challenges (UPC) identified in this domain: i) some attributes contain more information than needed. For instance, the attribute STRONGEST SNP-RISK ALLELE has values like rs144573434-T, including both an alternative allele and a variant identifier. This situation was noticed by the human domain expert when performing the manual mapping, who identified that the attribute is referring to the allele (UPC1); ii) the ranges of values for the attributes are different. For example, the variant identifiers of two datasets associated to different types of diseases or chromosomes (UPC2); iii) the free writing of attributes, with values that do not match (UPC3); iv) the lack of patterns or standards to represent the data, which needs prefixes or any other coding (UPC4).

A set of suggestions for future work is made next in order to overcome the identified challenges: (FPC1) The addition of a new dimension of analysis for the HS, like the use of a semantic analysis using word dictionaries, as those could improve the
reliability of the HS; (FPC2) Include in the Content Similarity some basic statistics such as the number of distinct values, frequency distribution, among others; (FPC3) Identify relevant thresholds for filtering the obtained results; (UPC1) Apply rules for data cleaning; (UPC2) Analyze the syntax of the possible values for the attributes with Frequent Pattern Mining techniques (for instance, if two attributes have different values like rs123456 and rs654321, they do not automatically match, but they share the same prefix/syntax, rs, meaning that they are likely similar); (UPC3) Identify additional string content measures suitable for Big Data contexts, able to analyze if there is a match between the intersection of the characters of two strings, detecting similarities between two different strings with the same meaning (like late-onset Alzheimer disease and Alzheimer disease); (UPC4) Identify a set of rules for data cleaning and transformation using frequent pattern detection. Those suggestions for possible improvements could be applied to the proposed method without the need to adapt it or extend it, just by adding those new metrics or rules.
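As a rough illustration of the measures discussed above, the following sketch computes a header similarity score and a Jaccard index for a pair of attributes and applies a simple "header similarity or content evidence" acceptance rule. The normalized Levenshtein ratio, the fixed threshold and the function names are stand-ins chosen for the example, not the exact formulation used in this work.

```python
# Simplified stand-ins for the header similarity (HS) and one content measure (JI).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def header_similarity(x: str, y: str) -> float:
    """Normalized edit-distance similarity between two attribute names, in [0, 100]."""
    x, y = x.lower(), y.lower()
    longest = max(len(x), len(y)) or 1
    return 100.0 * (1 - levenshtein(x, y) / longest)

def jaccard(values_x: set, values_y: set) -> float:
    """Jaccard index between the sets of distinct values of two attributes."""
    union = values_x | values_y
    return len(values_x & values_y) / len(union) if union else 0.0

def candidate_match(name_x, name_y, values_x, values_y, hs_threshold=80.0):
    hs = header_similarity(name_x, name_y)
    ji = jaccard(values_x, values_y)
    # Accept when the names are very similar or when there is content evidence (here only JI).
    return hs >= hs_threshold or ji > 0

print(candidate_match("Variant name", "SNPS", {"rs123", "rs456"}, {"rs123", "rs789"}))  # True (shared content)
```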
5.2 Conclusions
This paper highlighted the challenge of the manual effort needed for data integration tasks, namely in highly dynamic domains, due to the variability of the available repositories. This work proposes a method capable of automating data integration tasks in adaptive and data-intensive information systems. The instantiation of the proposed method was implemented in Apache Spark to support Big Data contexts. The evaluation was made by comparing the manual and automatic mapping of the attributes present in four datasets from a particularly complex and dynamic domain: genomics. Besides being satisfactory and showing a high matching rate, the results highlighted some challenges that need to be further addressed, like the occurrence of false positives and the threshold that can be considered to automatically have a certain degree of confidence in the obtained results. Some improvements were identified for future work, aiming to increase the match rate, highlighting the concept of transitivity to infer the missed relationships, the use of word dictionaries to add a semantic dimension of analysis, and the syntax analysis. With these future improvements, this method will be applied in other contexts, such as manufacturing, to better understand how the method handles data from other domains.

Acknowledgements. This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2019, the Doctoral scholarship PD/BDE/135100/2017 and European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 039479; Funding Reference: POCI-01-0247-FEDER-039479]. We also thank both the Spanish State Research Agency and the Generalitat Valenciana under the projects DataME TIN2016-80811-P, ACIF/2018/171, and PROMETEO/2018/176. Icons made by Freepik, from www.flaticon.com.
Raising the Interoperability of Cultural Datasets: The Romanian Cultural Heritage Case Study

Ilie Cristian Dorobăț and Vlad Posea

"Politehnica" University of Bucharest, Bucharest, Romania
[email protected], [email protected]
Abstract. By means of digital libraries, entire archives can be made available to users a click away; but because representing, as accurately and as widely as possible, the events in which Cultural Heritage has been involved plays an essential role in understanding the past, metadata aggregators must use data models that satisfy these information demands. We present the workflow behind the eCHO Framework, a framework which allows users, on the one hand, to sanitize, normalize and interconnect data represented according to the LIDO XML Schema and, on the other hand, to take advantage of the representation of metadata through the event-centric approach of the Europeana Data Model. Finally, we present the applicability of this framework in identifying the time periods in which the most important events involving the Romanian Cultural Heritage took place, statistics that can be extended to the purpose of identifying the lifestyle of the population.
Keywords: Cultural heritage · Digital libraries · Linked data · LIDO · EDM
1 Motivation

Nowadays, when users need more accessible sources of information, digitalization is gaining more and more interest in every domain. Although digital libraries cannot be considered a substitute for heritage collections, they are an important bridge between cultural heritage institutions and information consumers. The usefulness of digital libraries can be found in the behavior of regular users, who need to be informed about certain cultural heritage objects (CHOs) that have caught their attention, as well as of experts in the field, who need more complex and more accessible information sources. By identifying the benefits of digitalizing the administered heritage, cultural institutions have directed their resources towards adopting Linked Data. Unfortunately, for producing Linked Data, the smaller institutions depend on data aggregators such as Europeana [1]. Europeana1 is the largest digital library in Europe, encompassing over 60 million descriptions of CHOs hosted by different European cultural institutions, aggregating metadata acquired from more than 3,000 museums, galleries, libraries and
1 https://www.europeana.eu/portal/en
archives, hence offering an impressive collection of metadata regarding artworks, artefacts, books, films and music from the National Thesaurus of the European Union member states. For this, it uses its own ontology, the Europeana Data Model (EDM), based on which metadata are represented as a graph, being accessible not only through a user-friendly interface, but also via a programmatic one, by means of a SPARQL endpoint. For representing metadata, EDM provides two approaches, namely: i) the object-centric approach, in which the main focus is on the cultural object itself; ii) the event-centric approach, which allows emphasizing the relationships between the described CHOs. Sadly, Europeana officially supports only the first approach [2], diminishing the power of digital representation of the CHOs, which, by their specific nature, carry historical connections that might be useful for understanding the past. Because a wide part of the European cultural institutions (National Digital Library of Finland2 [3], Athena Plus3, German Digital Library4, Digi Cult5, etc.) that have digitalized their collections have used the Lightweight Information Describing Objects XML Schema (LIDO) [4], the Enhancing the Digital Representation of Cultural Heritage Objects Framework (eCHO)6 has been developed to facilitate the migration of the smaller institutions to Linked Data, rendering the analysis process simpler, focused more on querying the resources than on creating connections.

The current research captures the study done to extend the level of digitalization of the cultural institutions which host CHOs, beginning with the presentation of the most relevant projects developed in this direction, followed in Sect. 3 by the presentation of the research direction. Section 4 presents the used datasets, Sect. 5 is dedicated to the presentation of the workflow behind the eCHO Framework, followed by a short discussion of how data can be represented through the event-centric approach of EDM. Section 7 describes the most important challenges encountered during development, followed by the analysis of the time periods in which the most important events involving Romanian CHOs took place, while the final section is reserved for the final remarks and the directions of further development.
2 Related Work

During the CIDOC 2012 Enriching Cultural Heritage conference, Tsalapati et al. presented a scientific study [5] which, on the one hand, highlights the premises and stages identified by the authors for facilitating the transition of datasets structured according to LIDO into Linked Data, and on the other hand, exposes the sets of preliminary experiments regarding the interlinking of internal resources with external
2 https://www.digime.fi/en
3 http://www.athenaplus.eu
4 https://www.deutsche-digitale-bibliothek.de
5 http://www.digicult-verbund.de
6 https://github.com/iliedorobat/enriching-cultural-heritage-metadata
data sources like DBpedia or Eurostat. During the study, the authors analyzed the possibility of directly mapping the LIDO data structure to properties and classes specific to certain reputable ontologies such as CIDOC CRM7 and EDM. Thus, starting from the Linked Data principles, in order to translate a dataset represented through a regular data structure, it is mandatory that the defined resources receive a unique identifier, which can be extracted from the already existing data or, when necessary, newly created. The process continues with the identification of the LIDO elements which can be mapped following the chosen data model and, for those for which this is not possible, the definition of new properties and classes specific to the LIDO namespace, by extending the already existing ones, has been suggested. In addition, for a more detailed representation of the resources, an experiment is also conducted of mapping these elements to instances of the corresponding resources of the DBpedia knowledge base.

In [1, 4] the authors describe the direction that the Amsterdam Museum adopted towards the implementation of a data model which facilitates the interlinking of the hosted collections with other sources of data. Hence, starting from its own digital data management system, Adlib Museum8, it made the transition towards EDM. As a result, the datasets benefited from an improvement in data quality through the increase of consistency and interoperability between datasets supplied by different institutions.

The Metadata Interoperability system (MINT)9 provides services for harvesting, mapping and translation of metadata. The aggregation process begins with loading the metadata records in XML or CSV serialization, through one of the available HTTP, FTP or OAI-PMH protocols. After the loading of the datasets is finalized, the users use the visual mapping editor for mapping the records. The visual editor also has a navigation system both for the structure and data of the input schema and for the target one.

CARARE10 is an aggregation service developed as an answer to the need to translate the datasets regarding monuments, buildings, landscape areas and their representations into a unique format, using the CARARE Metadata Schema as the representation method. The system consists of two main components, namely: i) the MINT services used for data mapping and ingestion; ii) the Monument Repository (MORE). The latter is a service intended to store the metadata aggregated by providers, facilitating the automation of the metadata translation into the format used by Europeana and making it available for harvesting via an OAI-PMH target [6].
3 Problem Statement

The digitalization of collections is a continuous process which allows cultural institutions to migrate physical exhibitions to the virtual environment. For this, cultural institutions must follow the changes in the domain standards and update their datasets accordingly.
7 http://www.cidoc-crm.org
8 https://www.axiell.com/solutions/product/adlib
9 http://mint-projects.image.ntua.gr/museu
10 https://pro.carare.eu/doku.php?id=start
3.1 Research Questions
i. Which is the solution for the complete automation of the process of translating metadata represented through LIDO into Linked Data?
ii. Which is the solution for the normalization of timestamps?
iii. How can we conserve the events in which cultural goods have been involved, using EDM as the reference model?
iv. How can the data represented through LIDO be correlated to the new representation model?
3.2 Paper's Contribution
The present paper aims at offering an overview of the way in which, in the digitalization context, cultural heritage can benefit from the advantages of migrating data into Linked Data. Furthermore, the main challenges encountered are analyzed and, in the end, a practical example is presented of the use of the eCHO Framework for the classification of cultural objects according to the different events in which they have been involved and according to the century. During this process, data are sanitized, normalized, and interconnected; a specific element of this framework is the use of the event-centric approach for the representation of the events in which cultural goods have been involved. Thus, unlike the Europeana portal, in which digital representations are targeted towards describing the cultural objects themselves, users will also have the possibility of conserving the events in which a given cultural object has been involved.
3.3 Research Limitations
An aspect which must be taken into consideration is that the eCHO Framework currently allows only the normalization of timestamps expressed in the Romanian language. Therefore, for the normalization of timestamps in other languages, either another model must be added, or this process must be externalized through a service. Furthermore, the support offered by the framework is only in the direction of translating the records represented through LIDO into Linked Data, using EDM's event-centric approach.
4 Data Source

INP is a Romanian National Public Entity whose objectives, according to the Government Decision no. 593/2011, are to administer historical monuments, to manage their restoration process, to ensure the legislative framework for the protection of historical monuments, as well as to create national databases for archaeological heritage, cultural heritage, intangible cultural heritage and for the associated information resources.
The present research uses the datasets made available by INP on the public Romanian open data portal data.gov.ro11. These datasets are represented according to the LIDO XML Schema, describing CHOs from a wide range of domains such as Archaeology; Art; Decorative art; Documents; Ethnography; The history of science and technology; History; Metalinguistics; Numismatics; Natural science.
5 Methodology Overview

The present section is reserved for the presentation of the eCHO workflow, followed by the motivation for the implementation of Linked Data. Figure 1 depicts the process of translating the metadata from LIDO to Linked Data, beginning with the parsing and storage of the metadata. For this, a standalone component was developed, the LIDO Parser12, so that this component can be used independently, not only in this translation process, but also in any other operations and analyses which users might need. Once the metadata has been successfully parsed and stored, the process continues with the data curation. In this stage, the calendar dates and the time periods are normalized to a common pattern, the data used for the creation of URIs is sanitized, and links are established between concepts and other vocabularies. Finally, the prepared data is mapped to EDM and users are able to load and query the generated set of subject-predicate-object triples in the semantic data store.
Fig. 1. The eCHO framework workflow [7].
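To make the first stage of this workflow concrete, the snippet below is a minimal Python sketch of parsing one LIDO record and extracting the title and the lido:displayDate values that later feed the normalization stage. The XML fragment is a simplified, hypothetical record (real LIDO records nest these elements inside several wrapper elements), and only lido:displayDate is taken directly from the text; the other element names and values are illustrative.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical LIDO fragment; real records use wrappers such as
# lido:descriptiveMetadata and lido:eventWrap around these elements.
LIDO_SAMPLE = """
<lido:lido xmlns:lido="http://www.lido-schema.org">
  <lido:appellationValue>Ceramic vessel</lido:appellationValue>
  <lido:event>
    <lido:displayDate>secolele IV - VII</lido:displayDate>
  </lido:event>
</lido:lido>
"""

NS = {"lido": "http://www.lido-schema.org"}

def parse_record(xml_text: str) -> dict:
    """Extract the fields needed by the curation stage from one LIDO record."""
    root = ET.fromstring(xml_text)
    title = root.findtext("lido:appellationValue", default="", namespaces=NS)
    dates = [e.text for e in root.iter("{http://www.lido-schema.org}displayDate")]
    return {"title": title, "display_dates": dates}

print(parse_record(LIDO_SAMPLE))
# {'title': 'Ceramic vessel', 'display_dates': ['secolele IV - VII']}
```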
Even though LIDO is a standard which online museum content creators can use to represent metadata describing museum objects, facilitating the sharing and aggregation of collections by providing a machine-readable solution for representing museum objects, an ontology offers more advantages in the representation and reuse of data, namely [8, 9]:
11 https://data.gov.ro/dataset?organization=institutul-national-al-patrimoniului
12 https://github.com/iliedorobat/LIDO-Parser
a) the represented knowledge may include semantic structures which detail and define their meaning;
b) the resources may be searched by using a SPARQL endpoint (see the sketch after this list);
c) concepts defined in other ontologies may be reused.
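The following minimal Python sketch illustrates advantage b) by sending a SPARQL query to a remote endpoint. The endpoint address, the availability of the service and the exact shape of the returned data are assumptions for this sketch; SPARQLWrapper is only one possible client library.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL is an assumption; Europeana has historically exposed a public
# SPARQL endpoint, but its availability and address may change.
sparql = SPARQLWrapper("http://sparql.europeana.eu/")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX edm: <http://www.europeana.eu/schemas/edm/>
    SELECT ?cho WHERE { ?cho a edm:ProvidedCHO } LIMIT 5
""")

# Print the URIs of a handful of provided cultural heritage objects.
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["cho"]["value"])
```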
6 Knowledge Base Representation

To represent the knowledge base, EDM makes available two approaches, namely: the object-centric and the event-centric. As its name suggests, the first approach highlights the description of the CHOs using a series of statements which provide direct links between the described CHO and its characteristics. The latter approach provides a higher level of representation of knowledge by shifting the perspective from the description of the CHOs towards the characterization of the various events in which the CHOs have been involved. Also, EDM allows the coexistence of the two approaches, which increases the level of detail. From a practical point of view, the generation of RDF graphs involves generating every RDF graph entity as an instance of the classes which EDM uses for representing objects from the real world. Specifically, for the semantic representation of the real-world objects, EDM considers the differences between the physical object and its digital representations, providing the following three core classes:

a) edm:ProvidedCHO for describing the characteristics of the physical object;
b) edm:WebResource for describing the virtual representation of the represented physical object;
c) ore:Aggregation for connecting the physical objects to their digital representations.

Likewise, for ensuring a higher degree of detail, EDM provides the following set of contextual classes as well, through which wider descriptions can be made:

a) edm:Agent for describing a private citizen;
b) foaf:Organization for describing a legal person (an institution, a governmental or non-governmental organization);
c) edm:Event, which implements the event-centric approach itself, used for describing an event in which an actor or an object has been involved;
d) edm:Place for describing spatial places;
e) edm:TimeSpan for describing time periods and dates;
f) skos:Concept for describing various concepts such as thesauri, classifications, etc.
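As an illustration of how these classes can be combined in an event-centric description, the following minimal Python sketch (using the rdflib library) builds a tiny graph with an edm:ProvidedCHO, an ore:Aggregation and an edm:Event linked to an edm:TimeSpan, and then queries it with SPARQL. The properties edm:aggregatedCHO, edm:wasPresentAt and edm:occurredAt come from the EDM specification, but the example URIs, labels and mapping choices are illustrative assumptions, not the actual output of the eCHO Framework.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import SKOS

EDM = Namespace("http://www.europeana.eu/schemas/edm/")
ORE = Namespace("http://www.openarchives.org/ore/terms/")

g = Graph()
g.bind("edm", EDM)
g.bind("ore", ORE)
g.bind("skos", SKOS)

# Hypothetical URIs for one CHO, its aggregation, an event and a time span.
cho = URIRef("http://example.org/cho/vessel-001")
agg = URIRef("http://example.org/aggregation/vessel-001")
event = URIRef("http://example.org/event/creation/vessel-001")
span = URIRef("http://example.org/timespan/centuries-4-7")

# Core classes: the physical object and the aggregation tying it to its digital form.
g.add((cho, RDF.type, EDM.ProvidedCHO))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, EDM.aggregatedCHO, cho))

# Event-centric part: the CHO was present at an event that occurred in a time span.
g.add((event, RDF.type, EDM.Event))
g.add((cho, EDM.wasPresentAt, event))
g.add((event, EDM.occurredAt, span))
g.add((span, RDF.type, EDM.TimeSpan))
g.add((span, SKOS.prefLabel, Literal("secolele IV - VII", lang="ro")))

# The graph can then be searched with SPARQL, e.g. listing CHOs and their time spans.
q = """
PREFIX edm: <http://www.europeana.eu/schemas/edm/>
SELECT ?cho ?span WHERE {
    ?cho edm:wasPresentAt ?event .
    ?event edm:occurredAt ?span .
}
"""
for row in g.query(q):
    print(row.cho, row.span)
```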
7 Data Challenges

Even though LIDO is a very popular standard in the field, used for the digital representation of cultural objects from different domains such as architecture, history, history of technology, etc., the management of the metadata might be cumbersome, as an XML schema is nothing more than a set of document syntax constraint specifications and does not provide a method for representing the connections between the resources. Therefore, the transition from the representation of data in XML format to a representation based on an ontology is a natural stage in increasing the degree of data representation. Nevertheless, as in any other operation that improves the data representation process by translating data into a structure superior to the one previously used, different situations were encountered during the translation process which required high attention. Amongst the most important of such situations, hereafter called challenges, we list: i) timestamp normalization; ii) assigning the URIs for resources; iii) identifying similar resources; iv) handling the cardinality constraints; v) extending the EDM vocabulary.
7.1 Timestamp Normalization
A particular case of URI definition is encountered in the lido:displayDate record type, through which calendar dates and time periods associated with the events in which the CHOs have been involved are represented. Although, ideally, processing this type of data should not involve significant effort, in practice, because the initial data collection was done by human operators without any nomenclator, these data have different shapes. An overview can be found in Table 1, which depicts the five types of time periods identified, starting with values which do not express any type of time period at all, and continuing with timestamps.
Table 1. Types of time periods and their forms.

Type of time period | Example of time periods*
unknown             | 189-45; dinastia xxv; nesemnat; grupa a iv-a; 1(838); 173 [1] etc.
epoch               | pleistocen; epoca de bronz; renaștere etc.
dates (a)           | YMD: 1881-08-31; 1857 mai 10 etc.; DMY: 09.11.1518; 1 noiembrie 1624 etc.; MY: ianuarie 632 etc.
timespans (b)       | centuries: s:; sec; sec.; secol; secolele; etc.; millenniums: mil; mil.; mileniul; mileniului; mileniile
years (c)           |

(a) interval, simple date, separated by dash, dot or semicolon.
(b) interval, simple timespan, Arabic numbers, Roman numbers, parts of timespan*: ½; ¼; primul sfert; prima jumatate, etc.
(c) 2–4 digits.
* the values are mentioned in the reference language – Romanian.
As can be noticed, in the case of calendar dates, these can be found in the format year-month-day (YMD), day-month-year (DMY), or month-year (MY); all these formats have different forms depending on the separator used and on the way the month is expressed. The biggest challenge is represented by the processing of time
periods expressed in centuries and millenniums, because these forms of representing time periods are found in various formats. For example, there are cases where Roman numbers as well as Arabic numbers are used for their representation, or the terms “century” and “millennium” are not used uniformly, different forms such as “sec”, “secol”, “secolele”, etc. being used (all these forms are mentioned in the reference language – Romanian). Thus, in order to treat these cases, the use of regular expressions (regexes) has been embraced, through which similar data structures can be identified and regularized so that the URIs use only one shape of expressing timespans. Figure 2 depicts an example of a regex, namely the regex used to identify chunks which describe periods of time expressed in centuries, using three different colors to highlight each portion of the expression. The blue section identifies the sequences of characters through which the “century” word can be expressed in Romanian (e.g.: “sec”, “secol”, “secole”, “secolele”, etc.) and the presence of the prefix “al”, which is used in Romanian to express ordinal numbers. The green section depicts the part of the regex that recognizes both the punctuation marks which could be accidentally added by users and the interval separator “-”. The last section identifies the Roman numerals, the suffix “lea” used in Romanian to express ordinal numbers, and the sequences __AD__ (Anno Domini) and __BC__ (Before Christ). The latter are the result of pre-processing the input using another set of rules which can identify this information. Thus, we can identify a large series of chunks such as “sec. iv a hr.”, “sec. iv p. chr.”, “sec. iv d chr.”, “secolul iv”, “secolele iv - vii”, etc. (“a hr.”, “p.hr.” and “d hr.” are the unprocessed values of the sequences __AD__ and __BC__). This step of pre-processing is necessary to avoid the use of regexes when querying the data storages, for the following reasons: i) the operations which involve regexes are known to be time consuming, especially considering that for the normalization of timestamps a large set of regexes needs to be applied; ii) users might omit the use of some regexes, which would lead to alterations of the results.
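The following minimal Python sketch illustrates the kind of regex-based normalization described above. The simplified pattern, the canonical "century:N[-M]" output shape and the Roman-numeral handling are illustrative assumptions for this sketch and are far less complete than the set of rules implemented in the framework.

```python
import re

# Simplified, hypothetical pattern: a Romanian "century" word, an optional "al",
# then one or two Roman numerals (possibly an interval), with punctuation noise.
CENTURY_RE = re.compile(
    r"sec\w*[\., ]*(?:al[\. ]+)?"
    r"(?P<start>[ivxlcdm]+)(?:-lea)?"
    r"(?:[\.,;\- ]+(?P<end>[ivxlcdm]+)(?:-lea)?)?",
    re.IGNORECASE,
)

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(token: str) -> int:
    """Convert a Roman numeral such as 'iv' or 'VII' to an integer."""
    values = [ROMAN[c] for c in token.lower()]
    return sum(-v if i + 1 < len(values) and v < values[i + 1] else v
               for i, v in enumerate(values))

def normalize_century(chunk: str) -> str:
    """Map free-text century expressions to a canonical 'century:N[-M]' shape."""
    match = CENTURY_RE.search(chunk)
    if not match:
        return "unknown"
    start = roman_to_int(match.group("start"))
    end = match.group("end")
    return f"century:{start}" + (f"-{roman_to_int(end)}" if end else "")

for raw in ["secolul iv", "secolele iv - vii", "sec. al X-lea"]:
    print(raw, "->", normalize_century(raw))
# secolul iv -> century:4
# secolele iv - vii -> century:4-7
# sec. al X-lea -> century:10
```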
( (sec[\w]*) ([\., ]+(al[\. ]+){0,1})* ) [\.,;\?!\- ]* ( ( ?