Cinderella's Stick: A Fairy Tale for Digital Preservation 331998487X, 9783319984872

This book explains the main problems related to digital preservation using examples based on a modern version of the wel

English Pages 270 [254] Year 2018

Table of contents:
Testimonials
Preface
Style and Structure of the Book
How to Use This Book
Beyond This Book
Disclaimer
Acknowledgments
About the Authors
Contents
Chapter 1: A Few Words About Digital Preservation and Book Overview
1.1 A Few Words About Digital Preservation
1.2 Historical Notes
1.3 Standardization
1.4 Analyzing Digital Preservation and Book Overview
1.4.1 The Notion of Pattern
1.4.2 The Notion of Digital Preservation Pattern
1.4.3 The Patterns in This Book
1.4.4 Detailed Listing of Patterns and Roadmap
1.5 Links and References
Chapter 2: The Fairy Tale of Cinderella
2.1 The Plot
Chapter 3: Daphne (A Modern Cinderella)
3.1 Episode
Chapter 4: Reading the Contents of the USB Stick
4.1 Episode
4.2 Technical Background
4.2.1 Storage Media
4.2.2 Durability of Storage Media
4.2.3 Accessing Storage Media
4.2.4 Cloud Storage
4.2.5 Cases of Bit Preservation
4.3 Pattern: Storage Media-Durability and Access
4.4 Questions and Exercises
4.5 Links and References
4.5.1 Readings
4.5.2 Tools and Systems
Chapter 5: First Contact with the Contents of the USB Stick
5.1 Episode
5.2 Technical Background
5.2.1 Metadata in General
5.2.2 File System
5.2.3 File Name Extensions
5.2.4 File Signature
5.2.5 Files' Metadata
5.2.6 Metadata Extraction, Transformation, and Enrichment
5.3 Pattern: Metadata for Digital Files and File Systems
5.4 Questions and Exercises
5.5 Links and References
5.5.1 Readings
5.5.2 Tools and Systems
5.5.3 Other Resources
Chapter 6: The File Poem.html: On Reading Characters
6.1 Episode
6.2 Technical Background
6.2.1 Character Encoding
6.2.2 HTML
6.2.3 Character Semantics
6.2.4 Parsing
6.3 Pattern: Text and Symbol Encoding
6.4 Questions and Exercises
6.5 Links and References
6.5.1 Readings
6.5.2 Tools and Systems
6.5.3 Other Resources
Chapter 7: The File MyPlace.png: On Getting the Provenance of a Digital Object
7.1 Episode
7.2 Technical Background
7.2.1 Formats for Images
7.2.2 Exif Metadata
7.2.3 PDF
7.2.4 Provenance
7.3 Pattern: Provenance and Context of Digital Photographs
7.4 Questions and Exercises
7.5 Links and References
7.5.1 Readings
7.5.2 Tools and Systems
Chapter 8: The File todo.csv: On Understanding Data Values
8.1 Episode
8.2 Technical Background
8.2.1 NetCDF
8.2.2 Semantic Web
8.2.3 Linked Open Data (LOD)
8.2.3.1 Linked Data in HTML
8.2.3.2 On Producing RDF and Linked Data
8.2.4 Other Technologies (and Their Comparison with Semantic Technologies)
8.3 Pattern: Interpretation of Data Values
8.4 Questions and Exercises
8.5 Links and References
8.5.1 Readings
8.5.2 Tools and Systems
8.5.3 Other Resources
Chapter 9: The File destroyAll.exe: On Executing Proprietary Software
9.1 Episode
9.2 Technical Background
9.2.1 Executable Files
9.2.2 Termination, Decidability, Tractability
9.2.3 Code Injection
9.2.4 Antivirus Software
9.2.5 Software Emulation and Virtual Machines
9.3 Pattern: Safety and Dependencies of Executables
9.4 Questions and Exercises
9.5 Links and References
9.5.1 Readings
9.5.2 Tools and Systems
Chapter 10: The File MyMusic.class: On Decompiling Software
10.1 Episode
10.2 Technical Background
10.2.1 Constructing and Executing Computer Programs (Compilers, Interpreters, and Decompilers)
10.2.1.1 Compilers
10.2.1.2 Interpreters
10.2.1.3 Decompilers
10.2.2 Java (Programming Language)
10.2.3 Maven (Software Build Automation Tools)
10.3 Pattern: Software Decompiling
10.4 Questions and Exercises
10.5 Links and References
10.5.1 Readings
10.5.2 Tools and Systems
Chapter 11: The File yyy.java: On Compiling and Running Software
11.1 Episode
11.2 Technical Background
11.2.1 Software Runtime Dependencies
11.2.2 Software Documentation
11.2.3 IP Addresses and DNS
11.3 Pattern: External Behavioral (Runtime) Dependencies
11.4 Questions and Exercises
11.5 Links and References
11.5.1 Readings
11.5.2 Tools and Systems
11.5.3 Other Resources
Chapter 12: The File myFriendsBook.war: On Running Web Applications
12.1 Episode
12.2 Technical Background
12.2.1 WAR Files
12.2.2 Cloud Computing
12.2.3 The Case of MIT Scratch
12.3 Pattern: The Execution of a Web Application
12.4 Questions and Exercises
12.5 Links and References
12.5.1 Readings
12.5.2 Tools and Systems
Chapter 13: The File roulette.BAS: On Running Obsolete Software
13.1 Episode
13.2 Technical Background
13.2.1 Amstrad 464 and Commodore 64
13.2.2 BASIC
13.2.3 Pascal
13.2.4 The Aging of Programming Languages
13.3 Pattern: Software Written in an Obsolete Programming Language
13.4 Questions and Exercises
13.5 Links and References
13.5.1 Readings
13.5.2 Tools and Systems
Chapter 14: The Folder myExperiment: On Verifying and Reproducing Data
14.1 Episode
14.2 Technical Background
14.2.1 HTML and Remotely Fetched Images
14.2.2 Web Archiving and Web Citation
14.2.3 Proposals for Changing the Scientific Publishing Method
14.2.4 Trustworthy Digital Repositories
14.2.5 The Data-Information-Knowledge-Wisdom Hierarchy
14.2.6 Gödel's Incompleteness Theorems
14.3 Pattern: Reproducibility of Scientific Results
14.4 Questions and Exercises
14.5 Links and References
14.5.1 Readings
14.5.2 Other Resources
Chapter 15: The File MyContacts.con: On Reading Unknown Digital Resources
15.1 Episode
15.2 Technical Background
15.2.1 Format Recognition: JHOVE
15.2.2 Preservation-Friendly File Formats
15.2.3 Object Serialization and Storage
15.3 Pattern: Proprietary Format Recognition
15.4 Questions and Exercises
15.5 Links and References
15.5.1 Readings
15.5.2 Tools and Systems
Chapter 16: The File SecretMeeting.txt: On Authenticity Checking
16.1 Episode
16.2 Technical Background
16.2.1 Technologies Related to Authenticity
16.2.1.1 Checksums
16.2.1.2 Digital Signatures
16.2.1.3 HTTPS
16.2.1.4 Web Server-Side Authentication and Client-Side Authentication
16.2.1.5 Quantum Cryptography
16.2.1.6 Bitcoin
16.2.2 Processes for Authenticity Assessment
16.2.3 Copyright and Licensing
16.3 Pattern: Authenticity Assessment
16.4 Questions and Exercises
16.5 Links and References
16.5.1 Readings
16.5.2 Tools and Systems
Chapter 17: The Personal Archive of Robert: On Preservation Planning
17.1 Episode
17.2 Technical Background
17.2.1 Moore's Law
17.2.2 Storage Space and Kolmogorov Complexity
17.2.3 Compression-Related Risks
17.2.4 Preservation Planning
17.2.5 On Selecting What to Preserve
17.2.6 Value of Information
17.2.7 Data Management Plan
17.2.8 Backup and Data Replication Against Hardware Media Failures
17.2.9 Version Control
17.2.10 Blog Preservation
17.3 Pattern: Preservation Planning
17.4 Questions and Exercises
17.5 Links and References
17.5.1 Readings
17.5.2 Tools and Systems
17.5.3 Projects
Chapter 18: The Meta-Pattern: Toward a Common Umbrella
18.1 Episode
18.2 Technical Background
18.2.1 Patterns and Task Performability
18.2.2 Interoperability Strategies
18.2.3 Migration, Emulation, and Dependency Management
18.2.4 Requirements on Reasoning Services
18.2.5 Modeling Tasks and Their Dependencies
18.2.5.1 The Research Prototype Epimenides
18.2.6 General Methodology for Applying the Dependency Management Approach
18.2.6.1 Layering Tasks
18.2.7 Case Study: 5-Star LOD and Task Performability
18.2.8 Case: Blog Preservation
18.2.9 On Information Identity
18.3 The Big Picture
18.3.1 The FAIR Data Principles for Scientific Data
18.3.2 Systems for Digital Preservation
18.4 Questions and Exercises
18.5 Links and References
18.5.1 Readings
18.5.2 Tools and Systems
Chapter 19: How Robert Eventually Found Daphne
19.1 Episode
Chapter 20: Daphne's Dream
20.1 Episode
20.2 Questions and Exercises
Chapter 21: Epilogue
21.1 Synopsis of Episodes
Index
Yannis Tzitzikas · Yannis Marketakis

Cinderella’s Stick: A Fairy Tale for Digital Preservation

Yannis Tzitzikas
Computer Science Department, University of Crete
Vassilika Vouton, Heraklion, Greece

Yannis Marketakis
Institute of Computer Science, Foundation for Research and Technology – Hellas
Vassilika Vouton, Heraklion, Greece

ISBN 978-3-319-98487-2
ISBN 978-3-319-98488-9 (eBook)
https://doi.org/10.1007/978-3-319-98488-9

Library of Congress Control Number: 2018956320

© Springer Nature Switzerland AG 2018

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© Cover illustration: Korina Doerr

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Testimonials

“An out of the box approach to describe the main issues generated by the obsolescence of the digital material and its surroundings as well as the methods and actions for digital preservation. The exploration of the digital preservation space is given through a nice fairy tale that facilitates the understanding of advanced concepts and complicated computer science methods. The chapters of the book correspond to correlated patterns (particular digital preservation problems and the corresponding policies for their resolution), creating paths (sets of policies) to confront issues in an integrated way. A very creative combination of homogeneity, modularity and fiction!”
Christos Papatheodorou (Professor, Department of Archives, Library Science, and Museology, Ionian University)

“A wide range of digital preservation actions are covered in the book in a comprehensive and creative way. Comprehensive, as it provides in-depth coverage of computer science solutions for a wide range of digital preservation problems. Creative, as it uses the imagery of the long and winding road of a fairy tale towards a happy ending.”
René van Horik (Data Archiving and Networked Services, DANS-KNAW)

“It is a well-structured, easy-to-read book, ideal for understanding basic terms and aspects of digital preservation, even suitable for people with other than technical background. I really liked the concept of pattern!”
Katerina Lenaki (University of Crete Library)

“A good start for those who want to get a global view of Digital Preservation and get basic information about where they can find further material.”
Panos Georgiou (University of Patras Library & Information Center)

“Where a fairy tale of the past meets a modern version of it, where contemporary Digital Preservation topics addressed to scientists and researchers touch on the everyday life of the average user of new technologies, a narrative with elements of the myth that could nevertheless be true, unfolds with examples the technical background to solve the problems that arise. An enjoyable reading, undisturbed interest, stimulates curiosity and sheds plenty of light thanks to its original style and its perfect structure.”
Tonia Dellaporta (French Language and Literature Teacher)

To Titos and Tonia
Yannis Tzitzikas

To Katerina and Marianna
Yannis Marketakis

Preface

This book aims to explain the main problems related to what is called digital preservation through examples in the context of a fairy tale. Digital preservation is the endeavor of preserving digital material against loss, corruption, hardware/software technology changes, and changes in the knowledge of the community. It is important not only for the long run. The better we understand and deal with the problem of digital preservation, the better the interoperability of digital content that we enjoy today.

The book is addressed to those who would like to understand these problems, even if they lack the technical background. However, the book also aims to explain (to some degree) the technical background, to provide references to the main approaches (or solutions) that currently exist for tackling these problems, and to provide questions and exercises appropriate for computer engineers and scientists. To this end, it includes examples and links related to state-of-the-art technologies from various areas including metadata, Linked Data and Semantic Web, emulation, software engineering, cryptography, blockchain, information identity, intellectual property rights, knowledge representation, and reasoning.

This book can be useful to:
• Persons involved in the management and curation of digital content in digital libraries, archives, and museums. The examples provided can give them a taste of the problems and risks, as well as pointers to existing technologies or strategies that can mitigate them.
• Engineers who use (or design) tools or information systems for digital preservation. The questions and exercises in this book can be an opportunity for practice and even a source of inspiration. They are related to many areas of computer science.
• Engineers and researchers of computer science for explaining the main issues at stake in digital preservation.
• Software designers and engineers for aiding them to understand the digital preservation problem, hoping that this will raise awareness and will positively affect the design of future information systems.
• Citizens in general due to the increase in the volume and the diversity of the digital objects in our personal archives that already comprise digital artifacts related to almost every aspect of our lives. We believe that the examples in this book can give the reader a basic understanding of the related issues and risks, and this understanding could help them plan actions for better preserving their digital artifacts.

Style and Structure of the Book

The book starts by giving a modern version of the well-known fairy tale “Cinderella.” This story is used for gluing together the examples that are given in the chapters and making the reading more interesting. The structure of the book is modular. Each chapter consists of two parts: the episode and the technical background. The episodes narrate the story of the modern version of the fairy tale in chronological order, exactly as in the fairy tale. Apart from the story itself, each episode is related to one or more digital preservation problems, which are discussed in the technical background section of the chapter.

For revealing the more general and abstract formulation of these problems, we use the notion of pattern. Each pattern has a name, a short description of the problem, a narrative describing one attempt to solve the problem, a description of what could have been done for avoiding or just alleviating this problem, some lessons drawn, and, finally, links to related patterns described in other chapters of the book. In addition, at the end of each chapter, the reader will find links, references, and exercises.

In comparison to other books on digital preservation, this one aims at being short and concise, with concrete examples, while providing links and references for those readers who want to learn more about each topic. We understand that many of the technologies that are mentioned in the technical sections will change and evolve over time. However, we believe that the questions raised in this book and the factorization of the general problem through tasks will remain largely stable.

How to Use This Book

Beyond those mentioned in the previous section, the reader can also use the index (at the end of the book) to easily locate a particular concept or technology. If the reader’s objective is just to grasp the problems related to digital preservation, they can ignore the questions and exercises that are given in each chapter. However, a computer engineer or scientist can also try to answer the questions and solve the exercises. Some of these exercises refer to particular digital files, which the reader can find on the website of the book. These exercises are marked with a dedicated symbol. Most programmatic exercises presuppose knowledge of the Java programming language. Finally, the reader can also visit the website of the book for more technical material and up-to-date publications, research results, and other findings.

Beyond This Book

The contents of Cinderella’s “real USB stick” are given on the book’s website (http://www.cinderella-stick.com) maintained by the authors, where the reader can also find links to various tools, some of which are referenced in the book, and updated content. We gladly welcome further feedback, corrections, and suggestions on the book, which may be sent to all the authors at [email protected].

Disclaimer

This book, apart from technical material, contains a fairy tale. The latter is a work of fiction and the names, characters, organizations, institutes, businesses, places, events, and incidents that are mentioned in the fairy tale are either the products of the author’s imagination or used in a fictitious manner. Any resemblance to actual organizations, institutes, businesses, persons living or dead, or actual events is purely coincidental. The book contains links to related resources and sites in the World Wide Web. The authors and the publisher are not responsible for the durability and accuracy of these links, nor for the contents or operation of any site or resource referenced.

Acknowledgments

Warm thanks to all those who have contributed to this book, specifically to:
• Yannis Kargakis (FORTH-ICS) who has been involved in the discussions about this book from the beginning (June 2014). Yannis has contributed in several phases of the preparation of this book and he has coauthored three chapters.
• Nicolas Spyratos (Professor Emeritus of Computer Science at the University of Paris-South, France) for his valuable comments on making the structure of the book more clear from the beginning.
• Jeffrey van der Hoeven and Barbara Sierman (KB National Library of the Netherlands) for carefully reviewing the entire book and suggesting useful links to include.
• Christos Papatheodorou (Professor, Department of Archives, Library Science, and Museology, Ionian University) for his positive and warm feedback.
• René van Horik (Senior Project Manager and Researcher DANS-KNAW, the Netherlands) for carefully reviewing the manuscript and suggesting issues that are worth mentioning.
• Katerina Lenaki (University of Crete Library, Master on Public Administration, Open Data trainer) for reading the manuscript and suggesting clarifications and improvements.
• Panos Georgiou (University of Patras Library & Information Center) for reading the manuscript, suggesting improvements, and encouraging us.
• Nikos Minadakis (FORTH-ICS) for providing us feedback and suggestions.
• Noni Rizopoulou (Instructor and Course Coordinator of Technical Communication in English, Computer Science Department, University of Crete) for proofreading the manuscript.
• Alison Manganas (FORTH-ICS) for her suggestions in improving the language.
• Tonia Dellaporta (French Language and Literature Teacher) for reviewing the manuscript and suggesting a few improvements.
• Nikos Tzitzikas for various general comments and his positive feedback.
• Korina Doerr (FORTH-ICS) for the illustration of the cover page.

Finally, we would like to thank Ralf Gerstner (Springer) for his help and valuable suggestions. We would like to thank FORTH for providing us a stimulating research environment and for the opportunity it gave us to participate in very interesting EU projects on digital preservation and related topics. We would also like to thank our families for their support while writing this book and for the many hours they’ve let us spend working on this book, mainly during weekends and vacations, and for our countless conversations that have been a great inspiration.

About the Authors

Yannis Tzitzikas is Associate Professor of Information Systems in the Computer Science Department of the University of Crete (Greece), and Affiliated Researcher in the Information Systems Lab (ISL) at FORTH-ICS (Greece) where he coordinates the Semantic Access and Retrieval group (http://www.ics.forth.gr/isl/sar). He completed his undergraduate and graduate studies (MSc, PhD) in the Computer Science Department at the University of Crete and has been ERCIM postdoctoral fellow at ISTI-CNR (Pisa, Italy) and at VTT Technical Research Centre of Finland, and postdoctoral fellow at the University of Namur (Belgium). His research focuses on Semantic Data Management, Exploratory Search, and Digital Preservation. Over the last years, he has had an active role in several EU projects (KP-Lab, iMarine, BlueBRIDGE), including the digital preservation-related projects CASPAR, SCIDIP-ES, and APARSEN NoE, where he was the leader of the work package on interoperability. He has published more than 120 papers in refereed international conferences and journals, including prestigious journals and venues (ACM Transactions on the Web, VLDB Journal, IEEE Transactions on Knowledge and Data Engineering, JIIS, JDAPD, ISWC), and he has received two best paper awards.

Yannis Marketakis works as an R&D Engineer in the Information System Laboratory at FORTH-ICS (Greece). He received a BSc in computer science and an MSc in Information Systems from the Computer Science Department of the University of Crete (Greece). His main interests include: information systems, conceptual modeling, knowledge representation using Semantic Web technologies, data integration, and object-oriented languages. He has been involved in several EU and national projects (including iMarine, BlueBRIDGE, VRE4EIC, and others) as well as in the digital-preservation-related EU projects CASPAR and SCIDIP-ES. He has participated as an author in more than 35 scientific publications.

Contents

1

A Few Words About Digital Preservation and Book Overview . . . . 1.1 A Few Words About Digital Preservation . . . . . . . . . . . . . . . . . . 1.2 Historical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Analyzing Digital Preservation and Book Overview . . . . . . . . . . 1.4.1 The Notion of Pattern . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.2 The Notion of Digital Preservation Pattern . . . . . . . . . . 1.4.3 The Patterns in This Book . . . . . . . . . . . . . . . . . . . . . . 1.4.4 Detailed Listing of Patterns and Roadmap . . . . . . . . . . 1.5 Links and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1 1 3 4 5 6 6 7 8 11

2

The Fairy Tale of Cinderella . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 The Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13 14

3

Daphne (A Modern Cinderella) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Episode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

15 15

4

Reading the Contents of the USB Stick . . . . . . . . . . . . . . . . . . . . . . 4.1 Episode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Storage Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 Durability of Storage Media . . . . . . . . . . . . . . . . . . . . 4.2.3 Accessing Storage Media . . . . . . . . . . . . . . . . . . . . . . 4.2.4 Cloud Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.5 Cases of Bit Preservation . . . . . . . . . . . . . . . . . . . . . . 4.3 Pattern: Storage Media—Durability and Access . . . . . . . . . . . . . 4.4 Questions and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Links and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.1 Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.2 Tools and Systems . . . . . . . . . . . . . . . . . . . . . . . . . . .

21 21 22 23 23 24 24 25 26 26 27 27 28

xv

xvi

Contents

First Contact with the Contents of the USB Stick . . . . . . . . . . . . . . 5.1 Episode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Metadata in General . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.2 File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.3 File Name Extensions . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.4 File Signature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.5 Files’ Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.6 Metadata Extraction, Transformation, and Enrichment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Pattern: Metadata for Digital Files and File Systems . . . . . . . . . . 5.4 Questions and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5 Links and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.1 Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.2 Tools and Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.3 Other Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

29 29 31 31 32 33 33 34

6

The File Poem.html: On Reading Characters . . . . . . . . . . . . . . . . . 6.1 Episode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.1 Character Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.2 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.3 Character Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . 6.2.4 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.3 Pattern: Text and Symbol Encoding . . . . . . . . . . . . . . . . . . . . . . 6.4 Questions and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5 Links and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.1 Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.2 Tools and Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.5.3 Other Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

39 39 43 43 44 44 45 46 46 47 47 47 47

7

The File MyPlace.png: On Getting the Provenance of a Digital Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Episode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.1 Formats for Images . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.2 Exif Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.3 PDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.2.4 Provenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3 Pattern: Provenance and Context of Digital Photographs . . . . . . . 7.4 Questions and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5 Links and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.1 Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.2 Tools and Systems . . . . . . . . . . . . . . . . . . . . . . . . . . .

49 49 51 51 51 53 53 56 57 58 58 59

5

34 36 36 37 37 38 38

Contents

xvii

The File todo.csv: On Understanding Data Values . . . . . . . . . . . . . 8.1 Episode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.1 NetCDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.2 Semantic Web . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.3 Linked Open Data (LOD) . . . . . . . . . . . . . . . . . . . . . . 8.2.3.1 Linked Data in HTML . . . . . . . . . . . . . . . . 8.2.3.2 On Producing RDF and Linked Data . . . . . . 8.2.4 Other Technologies (and Their Comparison with Semantic Technologies) . . . . . . . . . . . . . . . . . . . . . . . 8.3 Pattern: Interpretation of Data Values . . . . . . . . . . . . . . . . . . . . . 8.4 Questions and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5 Links and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5.1 Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5.2 Tools and Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.5.3 Other Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

61 61 62 63 63 65 65 67

9

The File destroyAll.exe: On Executing Proprietary Software . . . . . 9.1 Episode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2.1 Executable Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2.2 Termination, Decidability, Tractability . . . . . . . . . . . . . 9.2.3 Code Injection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2.4 Antivirus Software . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.2.5 Software Emulation and Virtual Machines . . . . . . . . . . 9.3 Pattern: Safety and Dependencies of Executables . . . . . . . . . . . . 9.4 Questions and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5 Links and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5.1 Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.5.2 Tools and Systems . . . . . . . . . . . . . . . . . . . . . . . . . . .

83 83 86 86 87 87 88 89 90 91 91 91 92

10

The File MyMusic.class: On Decompiling Software
10.1 Episode
10.2 Technical Background
10.2.1 Constructing and Executing Computer Programs (Compilers, Interpreters, and Decompilers)
10.2.1.1 Compilers
10.2.1.2 Interpreters
10.2.1.3 Decompilers
10.2.2 Java (Programming Language)
10.2.3 Maven (Software Build Automation Tools)
10.3 Pattern: Software Decompiling
10.4 Questions and Exercises
10.5 Links and References
10.5.1 Readings
10.5.2 Tools and Systems

11

The File yyy.java: On Compiling and Running Software
11.1 Episode
11.2 Technical Background
11.2.1 Software Runtime Dependencies
11.2.2 Software Documentation
11.2.3 IP Addresses and DNS
11.3 Pattern: External Behavioral (Runtime) Dependencies
11.4 Questions and Exercises
11.5 Links and References
11.5.1 Readings
11.5.2 Tools and Systems
11.5.3 Other Resources

12

The File myFriendsBook.war: On Running Web Applications
12.1 Episode
12.2 Technical Background
12.2.1 WAR Files
12.2.2 Cloud Computing
12.2.3 The Case of MIT Scratch
12.3 Pattern: The Execution of a Web Application
12.4 Questions and Exercises
12.5 Links and References
12.5.1 Readings
12.5.2 Tools and Systems

13

The File roulette.BAS: On Running Obsolete Software
13.1 Episode
13.2 Technical Background
13.2.1 Amstrad 464 and Commodore 64
13.2.2 BASIC
13.2.3 Pascal
13.2.4 The Aging of Programming Languages
13.3 Pattern: Software Written in an Obsolete Programming Language
13.4 Questions and Exercises
13.5 Links and References
13.5.1 Readings
13.5.2 Tools and Systems

14

The Folder myExperiment: On Verifying and Reproducing Data
14.1 Episode
14.2 Technical Background
14.2.1 HTML and Remotely Fetched Images
14.2.2 Web Archiving and Web Citation
14.2.3 Proposals for Changing the Scientific Publishing Method
14.2.4 Trustworthy Digital Repositories
14.2.5 The Data–Information–Knowledge–Wisdom Hierarchy
14.2.6 Gödel’s Incompleteness Theorems
14.3 Pattern: Reproducibility of Scientific Results
14.4 Questions and Exercises
14.5 Links and References
14.5.1 Readings
14.5.2 Other Resources

15

The File MyContacts.con: On Reading Unknown Digital Resources
15.1 Episode
15.2 Technical Background
15.2.1 Format Recognition: JHOVE
15.2.2 Preservation-Friendly File Formats
15.2.3 Object Serialization and Storage
15.3 Pattern: Proprietary Format Recognition
15.4 Questions and Exercises
15.5 Links and References
15.5.1 Readings
15.5.2 Tools and Systems

16

The File SecretMeeting.txt: On Authenticity Checking
16.1 Episode
16.2 Technical Background
16.2.1 Technologies Related to Authenticity
16.2.1.1 Checksums
16.2.1.2 Digital Signatures
16.2.1.3 HTTPS
16.2.1.4 Web Server-Side Authentication and Client-Side Authentication
16.2.1.5 Quantum Cryptography
16.2.1.6 Bitcoin
16.2.2 Processes for Authenticity Assessment
16.2.3 Copyright and Licensing
16.3 Pattern: Authenticity Assessment
16.4 Questions and Exercises
16.5 Links and References
16.5.1 Readings
16.5.2 Tools and Systems

17

The Personal Archive of Robert: On Preservation Planning
17.1 Episode
17.2 Technical Background
17.2.1 Moore’s Law
17.2.2 Storage Space and Kolmogorov Complexity
17.2.3 Compression-Related Risks
17.2.4 Preservation Planning
17.2.5 On Selecting What to Preserve
17.2.6 Value of Information
17.2.7 Data Management Plan
17.2.8 Backup and Data Replication Against Hardware Media Failures
17.2.9 Version Control
17.2.10 Blog Preservation
17.3 Pattern: Preservation Planning
17.4 Questions and Exercises
17.5 Links and References
17.5.1 Readings
17.5.2 Tools and Systems
17.5.3 Projects

18

The Meta-Pattern: Toward a Common Umbrella
18.1 Episode
18.2 Technical Background
18.2.1 Patterns and Task Performability
18.2.2 Interoperability Strategies
18.2.3 Migration, Emulation, and Dependency Management
18.2.4 Requirements on Reasoning Services
18.2.5 Modeling Tasks and Their Dependencies
18.2.5.1 The Research Prototype Epimenides
18.2.6 General Methodology for Applying the Dependency Management Approach
18.2.6.1 Layering Tasks
18.2.7 Case Study: 5-Star LOD and Task Performability
18.2.8 Case: Blog Preservation
18.2.9 On Information Identity
18.3 The Big Picture
18.3.1 The FAIR Data Principles for Scientific Data
18.3.2 Systems for Digital Preservation
18.4 Questions and Exercises
18.5 Links and References
18.5.1 Readings
18.5.2 Tools and Systems

19

How Robert Eventually Found Daphne
19.1 Episode

20

Daphne’s Dream
20.1 Episode
20.2 Questions and Exercises

21

Epilogue
21.1 Synopsis of Episodes

Index

Chapter 1

A Few Words About Digital Preservation and Book Overview

1.1 A Few Words About Digital Preservation

We live in a digital world. Nowadays, nearly everyone works and communicates using computers and smart devices. We communicate digitally through email and voice platforms; we read electronic newspapers; we capture countless photographs and videos of family and friends in digital form; we listen to digitally encoded music; and we use computers for almost all of our activities, from daily shopping to complex computations and experiments. In brief, modern society and the economy increasingly depend on an overwhelming quantity of information that is available only in digital form. As a result, the world produces huge amounts of digital information (as Fig. 1.1 illustrates). Moreover, information that previously existed in analogue form (e.g., on paper) is now largely digitized.

Inevitably, the volume and diversity of the digital objects that libraries, archives, and companies maintain are constantly increasing. The same is true of the “personal archives” of citizens (which comprise photographs, videos, and various digital artifacts related to their studies, work, hobbies, etc.). It is therefore important to ensure that these digital objects remain functional, usable, and intelligible in the future. However, as Heraclitus pointed out, “Everything flows, nothing stands still.” The preservation of digital information within an unstable and rapidly evolving technological (and social) environment is therefore not a trivial problem. We can describe the main objective of digital preservation in one sentence: Digital material has to be preserved against loss, corruption, hardware/software technology changes, and advances in the knowledge of the community.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_1


Fig. 1.1 Overview of information that is produced and exchanged every minute (The data were collected in late 2017 and were taken from James, Josh. Data Never Sleeps 5.0. Domo. 2018-05-03. URL:https://www.domo.com/blog/data-never-sleeps-5/. Accessed 2018-05-03. (Archived by WebCite® at http://www.webcitation.org/6z8YtB8ao))

As noted in (Chen 2001), we can identify a paradox: “On the one hand, we want to maintain digital information intact as it was created; on the other, we want to access this information dynamically and with the most advanced tools.” It is not hard to recognize that the notions of time and evolution are important, which is why digital preservation has been termed “interoperability with the future.” Preserving only the bits of digital objects is not enough. We should also try to preserve their integrity, accessibility, provenance, authenticity, and intelligibility (by human or artificial agents). A longer definition of digital preservation, as commonly given in encyclopedias, follows:

In library and archival science, digital preservation is a formal endeavor to ensure that digital information of continuing value remains accessible and usable. It involves planning, resource allocation, and application of preservation methods and technologies. It combines policies, strategies, and actions to ensure access to reformatted and “born-digital” content, regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time.

Unfortunately, we are not that good at digital preservation. Even scientific data generated in the context of expensive space programs get lost or cannot be decoded, while more than half of the citizens who responded to a survey said that they had lost files on their personal computer. A collection of real stories that provide evidence of why digital preservation is important, or of “what could go wrong,” is available on Barbara Sierman’s website, titled “Atlas of Digital Damages.” [1]

[1] The URL of this website, as well as other similar resources, can be found in the references at the end of the chapter.

1.2 Historical Notes

The Antikythera mechanism (Freeth et al. 2006), often described as the world’s first computer, is an ancient Greek analogue computer and orrery used to predict astronomical positions and eclipses for calendar and astrological purposes. [2] It contained at least 37 gear wheels, enabling it to follow the movements of the moon and the sun through the zodiac; to predict eclipses; to model the irregular orbit of the moon, where the moon’s velocity is higher at perigee than at apogee; and to track the four-year cycle of the ancient Olympic Games. The instrument is believed to have been designed and constructed by Greek scientists, and detailed imaging of the mechanism suggests that it dates back to 150–100 BC. The artifact was recovered in 1901 from the Antikythera wreck off the Greek island of Antikythera. It took more than 2000 years for the mechanism to be found and reassembled, and only with the latest technology (x-ray tomography and high-resolution surface scanning) did we manage (around 2012) to understand what the mechanism did and how.

Correspondingly, one could argue that if we just preserve the bits of our current and past digital artifacts (data or software), then in the future we will be able to access and fully understand all these artifacts with the aid of the technology available at that time (similarly to what happened with the Antikythera mechanism). Unfortunately, this is not the case. The current methods for storing and representing digital artifacts are immeasurably more sensitive and vulnerable than the bronze gears of the ancient Greeks or their marble inscriptions. And we should not forget that the writings of the ancient Greek philosophers are available today thanks to successive handmade copies made until the arrival of the printing press, not because we found and preserved the original manuscripts. As regards digitally encoded information, we could say that the need to preserve it arose as soon as modern computers appeared.

Digital preservation was recognized as an important issue in the 1990s. As an indication, Fig. 1.2 shows the frequency of the term “digital preservation” in five million books (published up to 2008) as computed by the Google Ngram Viewer (© 2013 Google). However, we should note that other related keywords are used more frequently, such as “metadata” and “interoperability.”

Fig. 1.2 Occurrence of the term “digital preservation” up to 2008

Nowadays, we produce around 2.5 quintillion (2.5 × 10^18) bytes of data every day, and this amount is constantly increasing. To illustrate the volume of data produced every day, consider that a Boeing 787 aircraft generates approximately half a terabyte of data per flight, an Airbus A350 aircraft produces approximately 2.5 terabytes per day, and an Airbus A380 aircraft is equipped with 10,000 sensors in each wing. [3] Furthermore, in the time it took to read this sentence, NASA collected approximately 1.7 gigabytes of data from all its currently active missions. This wealth of data is of paramount importance, and it should be properly preserved for future exploitation (consider the case of the data generated by, and subsequently lost from, the NASA Viking landers sent to Mars [4]). The average person’s level of interaction with digital resources is increasing as well: it is expected to increase 20-fold in the next 10 years, reaching one human–data interaction every 18 seconds on average. To conclude, the amount of digital resources produced nowadays is huge, and their preservation is very significant to our society, which is increasingly dependent on them.

[2] An interesting video from BBC about this mechanism: https://www.youtube.com/watch?v=g_Z0eGit-mI

[3] More details about these facts can be found in Marr (2015), Finnegan (2013), and Skytland (2012).

[4] More information about this incident can be found in the “Atlas of Digital Damages,” listed in the links and references at the end of this chapter.

1.3 Standardization

It is not hard to see that standardization is closely related to digital preservation. Standardization is the process of developing technical standards based on the consensus of different parties, including standards organizations, firms, users, interest groups, and governments, in order to aid compatibility and interoperability. A technical standard is an established norm or requirement related to technical systems, usually in the form of a formal document that establishes uniform engineering or technical criteria, methods, processes, and practices. Standards can be distinguished into de facto standards (i.e., those followed by informal convention or dominant usage), de jure standards (i.e., those that are parts of legally binding contracts, laws, or regulations), and voluntary standards (i.e., those available for people to consider for use). Standards became highly important at the onset of the Industrial Revolution, which precipitated the need for interchangeable parts and high-precision machine tools. The first industrially practical screw-cutting lathe was developed in 1800 by the English engineer Henry Maudslay, and it allowed the standardization of screw thread sizes. In 1946, delegates from 25 countries agreed to create the International Organization for Standardization (ISO). In general, each country or economy has a recognized National Standards Body, which can be a public or private sector organization, e.g., the American National Standards Institute (ANSI).

For information exchange, there are several specifications that govern the operation and interaction of devices and software on the Internet, but they are not always referred to as formal standards; e.g., the W3C (World Wide Web Consortium) publishes “Recommendations” and the IETF (Internet Engineering Task Force) publishes “Requests for Comments” (RFCs). Apart from the hundreds (or thousands) of standards that govern how our digital ecosystem operates (we will encounter some of them in the next chapters), there are also “high-level” standards related to digital preservation (such as OAIS, PREMIS, and others). An indicative list of standards (both generic ones that can support digital preservation and ones specific to digital preservation) can be found in the links and references at the end of this chapter.

1.4 Analyzing Digital Preservation and Book Overview

We can analyze the problem of digital preservation according to:
(a) The type of digital artifacts (e.g., documents, HTML pages, images, software, source code, and data)
(b) The task to be performed on them (e.g., read, edit, run)

Fig. 1.3 Types of digital artifacts and tasks

An indicative list of types is shown in Fig. 1.3 (left), while an indicative list of tasks is shown in Fig. 1.3 (right). A type–task pair, e.g., “Video–Render,” corresponds to a digital preservation objective; in this example, the objective is to preserve the ability to render videos. Obviously, there are dependencies between the listed tasks; e.g., we cannot render a video if we have failed to preserve its bits, i.e., if the objective related to the pair “Video–Retrieve Bits” has not been achieved. One approach to analyzing the problem of digital preservation would be to analyze all combinations of types and tasks, i.e., |Types| × |Tasks| cases. However, not all cases make sense; for instance, the task “Run” is applicable to software, not to textual files. Moreover, not all cases are distinct. This is the rationale for approaching the problem through the notion of pattern. A pattern is a frequently occurring problem, and, essentially, it corresponds to one or more type–task pairs.
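
The counting argument above can be made concrete with a small sketch. The type and task lists below are abbreviated, hypothetical stand-ins for the indicative lists of Fig. 1.3, and the applicability and prerequisite rules (APPLICABLE, REQUIRES) are assumptions made only for illustration.

```python
from itertools import product

# Hypothetical, abbreviated lists; Fig. 1.3 gives the indicative ones.
TYPES = ["Text", "Image", "Video", "Software"]
TASKS = ["Retrieve Bits", "Render", "Run"]

# Assumed applicability rules, for illustration only.
APPLICABLE = {
    "Retrieve Bits": set(TYPES),           # every artifact has bits to retrieve
    "Render": {"Text", "Image", "Video"},
    "Run": {"Software"},                   # "Run" applies to software, not to text
}

# Assumed task prerequisites: rendering or running presupposes retrieved bits.
REQUIRES = {"Render": ["Retrieve Bits"], "Run": ["Retrieve Bits"]}


def all_pairs():
    """Every type-task combination: |Types| * |Tasks| cases."""
    return list(product(TYPES, TASKS))


def meaningful_pairs():
    """Only the combinations in which the task applies to the type."""
    return [(t, k) for t, k in all_pairs() if t in APPLICABLE[k]]


def achievable(pair, achieved):
    """A pair's objective can be met only if its prerequisite pairs were met."""
    artifact_type, task = pair
    return all((artifact_type, dep) in achieved for dep in REQUIRES.get(task, []))


if __name__ == "__main__":
    print(len(all_pairs()), "combinations,", len(meaningful_pairs()), "meaningful")
    print(achievable(("Video", "Render"), achieved=set()))
    print(achievable(("Video", "Render"), achieved={("Video", "Retrieve Bits")}))
```

With these assumed lists, 12 combinations are enumerated but only 8 are meaningful, and “Video–Render” becomes achievable only once “Video–Retrieve Bits” has been achieved.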

1.4.1 The Notion of Pattern

The notion of pattern originated as an architectural concept introduced by Christopher Alexander. In 1987, Kent Beck and Ward Cunningham began experimenting with the idea of applying patterns to programming. Design patterns subsequently gained popularity in computer science, and we now have patterns for software design, interface design, secure design, service-oriented architecture, etc. For instance, in software engineering, a design pattern is a general, repeatable solution to a commonly occurring problem in software design. A design pattern is not a finished design that can be transformed directly into code; it is a description or template for how to solve a problem that can be used in many different situations.

1.4.2 The Notion of Digital Preservation Pattern

In this book, we introduce and use the notion of digital preservation pattern. A digital preservation pattern, hereafter just pattern, corresponds to a commonly occurring problem in digital preservation. The term “commonly occurring problem” refers to distinct type–task pairs as described previously (e.g., the problem of properly representing the textual symbols that appear in text documents). Specifically, each pattern has an identifier, a short description of the problem, the type–task pair, a narrative describing one attempt to solve the problem (related to the plot of the fairy tale), a description of what could have been done to avoid or at least alleviate the problem, some lessons learnt, and, finally, links to related patterns that appear in other chapters of the book. Moreover, each pattern is accompanied by relevant technical background, links to related resources or tools, and questions and exercises. Instead of analyzing the problem of digital preservation in a top-down (and inevitably superficial) manner, we believe that the notion of pattern allows the reader to encounter frequently occurring problems through concrete cases and to reflect on them. The absence of a top-down conceptual analysis is offset by the last part of the book, where a meta-pattern is introduced that “embraces” the patterns through a task-based perspective of digital preservation.
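
The ingredients of a pattern listed above could be captured as a simple record. The following is a minimal sketch: the class name and field names are hypothetical choices made for illustration, not notation used in the book, and the example values for P3 are those given in Tables 1.1 and 1.2.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Pattern:
    """One digital preservation pattern (field names are illustrative)."""

    identifier: str                         # e.g., "P3"
    problem: str                            # short description of the problem
    type_task_pairs: list[tuple[str, str]]  # the type-task pair(s) it covers
    narrative: str = ""                     # the attempt made in the fairy tale
    prevention: str = ""                    # what could have avoided or alleviated it
    lessons_learnt: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)  # ids of related patterns


# Example values taken from Tables 1.1 and 1.2; the free-text fields are left empty.
p3 = Pattern(
    identifier="P3",
    problem="Text and symbol encoding",
    type_task_pairs=[("Text", "Represent the textual symbols")],
    related=["P2", "P5", "P11"],
)

if __name__ == "__main__":
    print(p3.identifier, p3.type_task_pairs, p3.related)
```

Keeping the type–task pairs as a list reflects the remark that a pattern corresponds to one or more such pairs.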

1.4.3 The Patterns in This Book

A list of the patterns that appear in this book is given in Table 1.1, and an overview is given in Fig. 1.4. The table describes the indicative types and tasks, as well as the corresponding chapter of the book that elaborates each pattern. Patterns are presented progressively: those in the first chapters are simpler, and those in subsequent chapters become more complex or higher level.

Table 1.1 Index of the patterns

Pattern Id and chapter | Types of digital artifacts | Tasks on digital artifacts | Pattern’s name
P1 (Chap. 4) | Any digital object | Retrieve bits | Storage media: durability and access
P2 (Chap. 5) | Any digital object | Retrieve the metadata | Metadata for digital files and file systems
P3 (Chap. 6) | Text | Represent the textual symbols | Text and symbol encoding
P4 (Chap. 7) | Image, video, text | Get provenance information | Provenance and context of digital photographs
P5 (Chap. 8) | Data collection | Perceive the data | Interpretation of data values
P6 (Chap. 9) | Software | Run, discover software dependencies, verify execution safety | Executables: safety, dependencies
P7 (Chap. 10) | Software | Decompile | Software decompiling
P8 (Chap. 11) | Software | Compile, find, and retrieve dependencies | External behavioral dependencies
P9 (Chap. 12) | Web application | Execute | Web application execution
P10 (Chap. 13) | Software (obsolete) | Run, perceive | Understand and run software written in an obsolete programming language
P11 (Chap. 14) | Scientific article | Reproduce scientific results/experiments | Reproducibility of scientific results
P12 (Chap. 15) | Digital object with proprietary or unknown format | Recognize, view the contents | (Proprietary) Format recognition
P13 (Chap. 16) | Text | Understand the semantics and the context, assert authenticity | Authenticity assessment
P14 (Chap. 17) | Collection of digital objects | All | Preservation planning
MP (Chap. 18) | Any digital object | All | Meta-pattern

Fig. 1.4 Overview of the patterns

1.4.4 Detailed Listing of Patterns and Roadmap

In this section, we provide a detailed listing of the patterns presented in this book. The reader may skip this section and consult it at a later stage, while reading the book or after reading it. It can be used as a roadmap by readers and practitioners. The detailed listing is given in Table 1.2, where each row corresponds to one pattern. The first column gives the pattern identifier, the second contains the problem’s name, the third shows the corresponding artifact (in most cases a digital file) and the chapter number, and the fourth (Related patterns) shows the identifiers of other related patterns. Patterns can be organized in a graph-like structure, in the sense that some patterns have, as a prerequisite, a task related to a previous pattern. To avoid repetition, only the “immediate” previous (prerequisite) and next (subsequent) patterns are listed in the table; for brevity, they are referred to as “iPrevious” and “iNext,” respectively. The last column of the table describes the desired task in each pattern; this task-based perspective of digital preservation will be made clearer in Chap. 18, where a meta-pattern is introduced.

Table 1.2 A detailed index of the patterns

Id | Problem name | Chapter – File/Artifact | Related patterns | Desired task
P1 | Storage media: durability and access | Chap. 4. The entire USB stick | iPrevious: (none); iNext: P2, P6, P14 | Read bits
P2 | Metadata for digital files and file systems | Chap. 5. The file system of the USB stick | iPrevious: P1; iNext: P3, P4, P5, P14 | Read file system
P3 | Text and symbol encoding | Chap. 6. Poem.html | iPrevious: P2; iNext: P5, P11 | Read meaningful symbols
P4 | Provenance and context of digital photographs | Chap. 7. MyPlace.png | iPrevious: P2; iNext: P11, P13 | Where (location)? When (date)? Who (actor)? Why (goal)?
P5 | Interpretation of data values | Chap. 8. Todo.csv | iPrevious: P2, P3; iNext: P12 | Read meaningful attribute-value pairs
P6 | Executables: safety, dependencies | Chap. 9. destroyAll.exe | iPrevious: P1; iNext: P7, P8, P9, P10, P12 | Is it safe? Get dependencies
P7 | Software decompiling | Chap. 10. MyMusic.class | iPrevious: P6; iNext: P11 | Get source code
P8 | External behavioral dependencies | Chap. 11. yyy.java | iPrevious: P6; iNext: (none) | Get information about the assumed ecosystem and expected behavior
P9 | Web application execution | Chap. 12. myFriendsBook.war | iPrevious: P6; iNext: (none) | Get dependencies
P10 | Understand and run obsolete software | Chap. 13. roulette.BAS | iPrevious: P6; iNext: (none) | Run it. Get required dependencies
P11 | Reproducibility of scientific results | Chap. 14. myExperiment | iPrevious: P3, P4, P7; iNext: (none) | Provenance questions, redo, compare with what is reported; trust questions (is it real/valid/authentic?)
P12 | (Proprietary) Format recognition | Chap. 15. MyContacts.con | iPrevious: P3, P4, P5, P6; iNext: (none) | Find spec and software about this format
P13 | Authenticity assessment | Chap. 16. SecretMeeting.txt | iPrevious: P4; iNext: P14 | Provenance questions (P4’s tasks plus), trust questions (is it real/valid/authentic?)
P14 | Preservation planning | Chap. 17. The personal archive of Robert | iPrevious: P1, P2, P13; iNext: (none) | Select what should be preserved, what are the required actions, what is their cost
MP | Meta-pattern | Chap. 18. It applies to various files/artifacts | It generalizes several of the previous patterns |

Fig. 1.5 Graphic representation of the dependencies among the presented patterns

Figure 1.5 illustrates all patterns and their relations, where an arrow Px → Py means that pattern Py presupposes pattern Px. Finally, Chap. 18 attempts to generalize and describe a more general “meta-pattern.” Chapter 18 is the most technical chapter of the book, and it is intended for those who would like to deepen their knowledge of the subject.
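
The prerequisite relation among patterns can be explored programmatically. The sketch below is only an illustration: it transcribes the “iPrevious” entries of Table 1.2 into a dictionary (the names I_PREVIOUS, all_prerequisites, and reading_order are hypothetical) and uses Python’s standard graphlib module to derive an ordering in which every pattern follows the patterns it presupposes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Immediate prerequisites, transcribed from the "iPrevious" entries of Table 1.2.
I_PREVIOUS = {
    "P1": [],
    "P2": ["P1"],
    "P3": ["P2"],
    "P4": ["P2"],
    "P5": ["P2", "P3"],
    "P6": ["P1"],
    "P7": ["P6"],
    "P8": ["P6"],
    "P9": ["P6"],
    "P10": ["P6"],
    "P11": ["P3", "P4", "P7"],
    "P12": ["P3", "P4", "P5", "P6"],
    "P13": ["P4"],
    "P14": ["P1", "P2", "P13"],
}


def all_prerequisites(pattern):
    """Return every pattern that `pattern` presupposes, directly or indirectly."""
    seen = set()
    stack = list(I_PREVIOUS[pattern])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(I_PREVIOUS[current])
    return seen


def reading_order():
    """One ordering in which each pattern comes after all of its prerequisites."""
    # TopologicalSorter expects a mapping from each node to its predecessors.
    return list(TopologicalSorter(I_PREVIOUS).static_order())


if __name__ == "__main__":
    print(sorted(all_prerequisites("P11")))
    print(reading_order())
```

For example, all_prerequisites("P11") gathers P3, P4, and P7 together with their own prerequisites P1, P2, and P6.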

1.5 Links and References

Probably the most cited paper in the area of digital preservation is the following:
• Hedstrom, M. (1997). Digital preservation: A time bomb for digital libraries. Computers and the Humanities, 31(3), Springer.

Other References
• Chen, S. S. (2001). The paradox of digital preservation. Computer, 34(3), pp. 24–28.
• Freeth, T., Bitsakis, Y., Moussas, X., Seiradakis, J. H., Tselikas, A., Mangou, H., et al. (2006). Decoding the ancient Greek astronomical calculator known as the Antikythera Mechanism. Nature, 444(7119), p. 587.
• Reinsel, D., Gantz, J., & Rydning, J. (2017). Data age 2025: The evolution of data to life-critical. Don’t Focus on Big Data; Focus on the Data That’s Big.

Series of Conferences Related to Digital Preservation
Topics and works related to digital preservation appear in various conferences and journals. Below we list a few:
• International Conference on Theory and Practice of Digital Libraries (TPDL)
• International Conference on Digital Preservation (iPres) series of conferences
• International Journal on Digital Libraries (IJDL) – Springer Berlin Heidelberg
• International Journal of Digital Curation (IJDC) – Digital Curation Centre

There are several books on digital preservation:
• Gladney, H. (2007). Preserving digital information. Springer Science & Business Media.
• Giaretta, D. (Ed.). (2011). Advanced Digital Preservation. Springer.
• Jones, M., & Beagrie, N. (2008). Preservation Management of Digital Materials: The handbook. (This book is now maintained and updated by the Digital Preservation Coalition (DPC) and it is accessible through https://www.dpconline.org/docs/digital-preservation-handbook/299-digital-preservation-handbook/file)
• Brown, A. (2013). Practical Digital Preservation: A How-To Guide for Organizations of Any Size. Facet Publishing.
• Palfrey, J. (2015). BiblioTech: Why Libraries Matter More Than Ever in the Age of Google. Basic Books.

Standards for Digital Preservation


• ISO 14721:2012 Space data and information transfer systems – Open archival information system (OAIS) – Reference model https://www.iso.org/standard/57284.html
• ISO/TR 18492:2005 Long-term preservation of electronic document-based information https://www.iso.org/standard/38716.html
• ISO 16363:2012 Space data and information transfer systems – Audit and certification of trustworthy digital repositories https://www.iso.org/standard/56510.html
• ISO 20652:2006 Space data and information transfer systems – Producer-archive interface – Methodology abstract standard https://www.iso.org/standard/39577.html
• ISO 15489-1:2016 Information and documentation – Records management – Part 1: Concepts and principles https://www.iso.org/standard/62542.html
• METS Metadata Encoding and Transmission Standard http://www.loc.gov/standards/mets/
• PREMIS Data Dictionary for Preservation Metadata http://www.loc.gov/standards/premis/

Collection of Real-World Stories

• Atlas of Digital Damages, Barbara Sierman. http://www.atlasofdigitaldamages.info
• Marr, B. (2015). That’s data science: Airbus puts 10,000 sensors in every single wing! Data Science Central. 2015-04-09. URL: https://www.datasciencecentral.com/profiles/blogs/that-s-data-science-airbus-puts-10-000-sensors-in-every-single. Accessed May 31, 2018. (Archived by WebCite® at http://www.webcitation.org/6zot1Nw2o)
• Finnegan, M. (2013). Boeing 787s to create half a terabyte of data per flight, says Virgin Atlantic. Computer World UK. March 6, 2013. URL: https://www.computerworlduk.com/data/boeing-787s-create-half-terabyte-of-data-per-flight-says-virgin-atlantic-3433595/. Accessed May 31, 2018. (Archived by WebCite® at http://www.webcitation.org/6zotU67Sc)
• Skytland, N. (2012). What is NASA doing with big data today? Open NASA. 2012-10-04. URL: https://open.nasa.gov/blog/what-is-nasa-doing-with-big-data-today/. Accessed May 31, 2018. (Archived by WebCite® at http://www.webcitation.org/6zouBdVg8)
• Ju, S. B. (2008). Dedicated efforts to preserve the annals of the Joseon Dynasty. Koreana. URL: http://koreana.kf.or.kr/pdf_file/2008/2008_AUTUMN_E016.pdf. Accessed: June 1, 2018. (Archived by WebCite® at http://www.webcitation.org/6zqWr3R5j)

Courses on Digital Preservation

• A catalog of various courses on digital preservation is available in the APARSEN deliverable D43.2 Report on Launch of Digital Preservation Training Portal for VCoE (urn:nbn:de:101-2014051612). https://doi.org/10.5281/zenodo.1256425

Chapter 2: The Fairy Tale of Cinderella

Cinderella is one of the most popular fairy tales, and the majority of parents have narrated it to their children. It has many versions and different names in different countries. More or less, all versions talk about a little girl who is mistreated by her stepmother and stepsisters, until she finds her Fairy Godmother, who magically transforms her into a beautiful princess who goes to the prince’s ball, attracts his attention, and then hurries away from the palace. After that, the prince starts looking for her everywhere in the country, using as a clue the only thing that she left behind: one of her glass slippers.

Of course, that is the most well-known version of the Cinderella story, but it is certainly not the only one. The earliest known version of the Cinderella story is probably the ancient Greek tale of Rhodopis [1], recorded by Strabo in the first century BC. There are many more variants of the story all around the world, among them Ye Xian (China), Shakuntala (India), and Tấm and Cám (Vietnam). The first written version of the story was published in Naples by Giambattista Basile in 1634 (known as Cenerentola). The story was later retold, along with others, by Charles Perrault in 1697 (known as Cendrillon) and the Brothers Grimm in 1812 (known as Aschenputtel). Below we provide the short version of Cinderella by Charles Perrault (as described in Wikipedia [2]), which is the most popular version as we presently know it and as it has been captured in cinema. Readers that know the fairy tale can skip this chapter.

[1] Wikipedia. Rhodopis. URL: https://en.wikipedia.org/wiki/Rhodopis. Accessed: 2018-05-03. (Archived by WebCite® at http://www.webcitation.org/6z8ZL2m3K)
[2] Wikipedia. Cinderella. URL: https://en.wikipedia.org/wiki/Cinderella. Accessed: 2018-05-03. (Archived by WebCite® at http://www.webcitation.org/6z8ZRlsR8)

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_2

2.1 The Plot

Once upon a time, there was a wealthy widower who took a proud and haughty woman as his second wife. She had two daughters, who were equally vain and selfish. The gentleman had a beautiful young daughter, a girl of unparalleled kindness and sweet temper. The man’s daughter was forced into servitude, where she was made to work day and night doing menial chores. After the girl’s chores were done for the day, she would curl up near the fireplace in an effort to stay warm. She would often arise covered in cinders, giving rise to the mocking nickname “Cinderella” by her stepsisters. Cinderella bore the abuse patiently and dared not tell her father, who would have scolded her. One day, the Prince invited all the young ladies in the land to a royal ball, planning to choose a wife. The two stepsisters gleefully planned their wardrobes for the ball, and taunted Cinderella by telling her that maids were not invited to the ball. As the sisters departed for the ball, Cinderella cried in despair. Her Fairy Godmother magically appeared and immediately began to transform Cinderella from house servant to the young lady she was by birth, all in an effort to get Cinderella to the ball. She turned a pumpkin into a golden carriage, mice into horses, a rat into a coachman, and lizards into footmen. She then turned Cinderella’s rags into a beautiful jeweled gown, complete with a delicate pair of glass slippers. The Godmother told her to enjoy the ball, but warned that she had to return before midnight, when the spells would be broken. At the ball, the entire court was entranced by Cinderella, especially the Prince. At this first ball, Cinderella remembered to leave before midnight. Back home, Cinderella graciously thanked her Godmother. She then greeted the stepsisters, who had not recognized her earlier and who talked of nothing but the beautiful girl at the ball. Another ball was held the next evening, and Cinderella again attended it with her Godmother’s help. 
The Prince had become even more infatuated, and Cinderella in turn became so enchanted by him she lost track of time and left only at the final stroke of midnight, losing in her haste one of her glass slippers on the steps of the palace. The Prince chased her, but outside the palace, the guards saw only a simple country girl leave. The Prince pocketed the slipper and vowed to find and marry the girl to whom it belonged. Meanwhile, Cinderella kept the other slipper, which did not disappear when the spell was broken. The Prince tried the slipper on all the women in the kingdom. When the Prince arrived at Cinderella’s home, the stepsisters tried in vain to win over the prince. Cinderella asked if she might try, while the stepsisters taunted her. Naturally, the slipper fit perfectly, and Cinderella produced the other slipper for good measure. Cinderella’s stepfamily pleaded for forgiveness, and Cinderella agreed. Cinderella married the Prince and her stepsisters were married to two handsome gentlemen of the royal court, and they lived happily ever after.

Chapter 3: Daphne (A Modern Cinderella)

3.1 Episode

In our modern version of the fairy tale, Cinderella is a young undergraduate student of computer science called Daphne. The metaphor for Cinderella’s lost shoe is a . . . USB stick. But first things first. . .

Daphne is an undergraduate student of the Computer Science Department (CSD) at the University of Crete. She is currently in the fourth year of her studies, and since she is in the last semester, she decided to visit Harvetton University, one of the most prestigious universities, for 5 months to attend some courses there through a student exchange program. She arrived in Harvetton in late January and she was planning to return to Crete in the middle of June. Shortly after arriving, she joined the XLab of the Institute of Computer Science and Mathematics, a well-known institute whose members included two post-doctoral researchers, ten PhD students, and only two undergraduate students. Daphne was an outstanding student; her grades from CSD were exceptional and this was the main reason Harvetton accepted her application. Although she was very interested in the activities of the host lab, she never undertook anything significant, possibly because she was the newest member of the lab and she was going to leave in a few months. She participated in the plenary group meetings, but she was assigned only tasks of minor difficulty or importance. She participated in the testing and evaluation procedures of the various tools developed by other members of the lab, and she provided feedback on ongoing papers and reports. In other words, Daphne was the “servitor” of the lab.

In the first days of April, the Chief Executive Officer (CEO) of MicroConnect (the biggest computer company worldwide) announced his retirement. The new CEO would be selected through a competition and had to be a woman, something that MicroConnect had never done before. 
By the end of May, MicroConnect would organize the competition and the winner would run MicroConnect for the next


3 years. If the outcome of this 3-year period was positive then the CEO could continue running the company. The news spread quickly across every university. All graduate and undergraduate female students wanted to try their luck and all the top-ranked universities wanted one of their students to be the next CEO of MicroConnect. In order to keep the number of registrations for the competition to a reasonable size, MicroConnect announced that only persons with at least one diploma in computer science, or a related field, who were at most 30 years old, could apply for the competition.

When Daphne heard the news about the competition, she thought that it would be a great chance to test her skills. She loved challenges, but she couldn’t register for the competition since she had not obtained her degree yet. The irony was that although she had fulfilled all requirements for graduation, she hadn’t applied for graduation because she could visit Harvetton only as an undergraduate student. “Too bad that I cannot participate in the competition. I wish I could,” she thought. In a few days almost every female student of the institute had already registered for the competition and had started preparing. Some labs even organized study groups to prepare their members for the upcoming competition. So apart from her courses and her work in the lab, Daphne now also had to take care of the tasks of her lab fellows, because they had to prepare for the competition.

The competition was going to last 12 h, starting from 12:00 on May 22 and ending at 23:59. On the morning of that day, May 22, Daphne happened to meet Grace. She was the manager of the administration office at Harvetton and she had been appointed to be responsible for the organization of the competition on behalf of MicroConnect. Daphne had initially met Grace when she first came to Harvetton University and wanted to enroll in her courses there. 
She went to Grace’s office to get her credentials for the e-services of the university. They started talking and Grace was amazed by the fact that Daphne was from Crete. Grace used to visit the Greek island almost every summer and soon enough they started organizing their future excursions in the southern parts of Crete for the upcoming summer.

That morning Grace noticed that Daphne was kind of sad and asked her if there was something wrong. Daphne explained that she was very sad because she was only being assigned low-priority, tedious, low-importance tasks in the lab. She felt underprivileged because she could not do anything creative and prove that she had the skills and really deserved to be an equal member of the lab. And now there was this competition, in which almost all the other members of the lab were going to participate and she could not. Daphne told Grace that it was a pity she could not register for the competition, since that would be a great opportunity for her to test her skills. Grace told her that she had some things to do and asked Daphne to wait for her for a while. After 15 min she returned with a card that looked like a credit card. “Take this card. You can use it to take part in the competition. I registered you in the competition using a name that doesn’t exist. This will allow you to join the competition in an examination center far away from here; therefore, probably no one will recognize you. However, the code in the card will be valid only until one hour after the end of the competition. After that it will automatically be removed from the system, as if it had never existed, so that no one will ever realize what happened. Please make sure that


you leave that place by 01:00, otherwise your card will no longer be valid and you will not be able to open the doors to get out. Don’t forget it because I will also be in trouble. Do you understand?” she asked her. Daphne was left astonished. She couldn’t believe that such a great chance was given to her. She thanked Grace and promised her that she would leave the examination room right after the end of the competition. It was already half past ten in the morning. The competition was starting in one and a half hours and she knew that she should be in the examination room 30 min earlier. Since she was totally unprepared for the competition, she decided to copy various folders from her laptop to her USB stick (containing material from her studies and research) so that she could check them if needed during the competition and went directly to the subway. Thirty minutes before the start of the competition Daphne was already in the examination room and was sitting at a computer. She plugged the USB stick into the computer and started processing the contents of the copied folders from her laptop, so she would know where to find everything, should she need to do so during the examination. At 12 o’clock she received the topics of the examination. The topics seemed familiar to her. She had studied similar topics during her studies at the university and she had also done some related exercises. “Luckily I have my notes with me” she thought and started working. Ten minutes before the deadline of the competition (23:50), she submitted her solution to the online system. After each submission, an automatic check would be performed for validating the submitted answers. The first stage of this check would last only a few minutes and the competitors would be able to see their results. The solutions that would pass this first check would proceed to the second round of checking, which would last 12 h. 
The top ten solutions of the second round would enter the third evaluation round: an evaluation by a committee that comprised the current CEO of MicroConnect, Robert Johnson, the MicroConnect technical director, James Geek, and the Berkeley computer science professor Dimitris Papachristos. This committee had been appointed to announce the winner of the competition. “Grace told me that my card is valid until 01:00. I have some time until then so I’ll wait to see if my solution passes the first evaluation round,” Daphne thought, and waited for the first round results. After 15 min, a light in front of her seat turned green, an indication that her solution had passed the first evaluation round. The supervisor in the examination room was Bob McCarthy, a senior developer of MicroConnect’s technical team. He saw the green light and started talking to her about the topic. He congratulated her and started asking her about her name, her education, if and where she worked, and much more. Daphne avoided answering and steered the conversation back to the competition. Whenever Bob asked something about her, she did not answer and instead asked about irrelevant things, such as the timetable of the subway.

After a while Bob wishes her good luck for the next evaluation rounds and leaves. Daphne notices that the clock shows 00:55 and remembers what Grace had told her. She starts packing her stuff quickly, because her card will be useless in 5 min and she will not be able to open the door to leave, and starts walking toward the exit. She takes the card


out of her pocket and uses it to open the door. A brief sound and a green light indicate that the door is unlocked. At that moment she notices that in her hurry she has left her USB stick connected to the computer she was sitting at. “I do not have enough time to go back and get it; it is already 1 o’clock. Never mind, I had only copied stuff from my laptop, so I will not miss anything if I leave it here,” she thinks, and starts walking toward the subway.

Daphne is at home. Despite her fatigue, she feels happy and delighted with what she managed to deliver. Although at first the competition’s topic seemed easy, soon she realized that it was not that simple. She had to use her imagination to reduce the problem to one for which a fast algorithm existed, an algorithm that she had actually implemented in the context of a university course. She had found the corresponding code on her USB stick, but she had to improve that code a lot in order to make it applicable to the problem at hand; she had written that code when she was in the second year of her studies and the structure of the code was poor. She feels that the algorithm that she submitted is correct. Moreover, since the description of the topic stated that the sought solution would be tested on larger-scale problems, she made sure that the solution she implemented does not consume much memory. However, she is not sure whether her algorithm is the fastest possible. How could she be sure of that? Several times during her studies she was surprised by efficient algorithms that could solve problems that at first glance looked very difficult and time-consuming to solve. Suddenly, a thought strikes her that makes her feel anxious: “And what if they discover my identity through the contents of my stick? No way! Who will care about a forgotten USB stick? Besides, better solutions than mine could have been submitted. 
So many women up to 30 years old participated in the competition, women with PhDs and research on related topics.” However, despite her realism, Daphne could not refrain from dreaming that her solution was the best and that in some magical way the organizers would discover her identity and find her.

Two days later, the news about the competition spreads like wildfire. The organizers of the competition announce that only a few solutions passed the second evaluation round and that among them one solution has reached the maximum evaluation score. The press release also mentions that the organizers will first contact the winners and then announce their names publicly. What the organizers do not say is that they cannot find the contact information of the competitor that reached the highest score. They start checking the contact details of the competitors that passed the second evaluation round. All of them are valid except for the one that reached the highest score. It seems as if that competitor never existed. The only clue that they found was a USB stick that was connected to the computer that was used for submitting the winning solution. Moreover, there were a few people that clearly remembered a girl sitting at that particular computer, and Bob was one of them. The organizers feel desperate. Since they have to announce the winner soon, they have to find who submitted the winning solution. They decide to follow a rather desperate method. They leak that they have decided who the winner is and that the

winner had actually forgotten her USB stick connected to the computer in the examination room. The rumors spread quickly. The next day, the telephones in the contact center of the company wouldn’t stop ringing. By the end of that day, more than 100 participants had called, claiming that they had forgotten their USB stick in the examination room.

The rumors did not reach Daphne. She had scheduled a few days off in California to visit some relatives who had migrated there a few decades ago. She would stay with them for one week given that it would be her first visit to California. That week, Daphne completely forgot about the contest. She was touring all the time and she was barely connected to the Internet. Consequently, she did not read anything about MicroConnect’s leak.

Robert Johnson, the CEO of MicroConnect, didn’t know what to do. He personally checked the winning solution of the competition and he was amazed. “I definitely have to find and meet her,” he thought. He asked his partners to bring him the USB stick that was found connected to the computer in the examination room and decided to take a look at its contents to see if he would be able to find something about the girl. In our modern version of the fairy tale, Daphne’s USB stick is analogous to Cinderella’s glass slipper (Fig. 3.1).

Fig. 3.1 Cinderella’s glass slipper versus Daphne’s USB stick

Chapter 4: Reading the Contents of the USB Stick

4.1 Episode

May 25

Robert opens his laptop and connects the USB stick. He waits for a while but nothing happens. He tries to reconnect it, with no luck. “That’s strange,” he thinks. “I haven’t upgraded my laptop for 2 years. This could be the reason. Probably the USB protocol has changed and my laptop does not support it. I should discuss this with the technical department.” Although Robert was the CEO of one of the largest software companies, he did not upgrade his equipment like a maniac; he replaced his laptop only when he encountered a compatibility problem or slow responses. He calls his secretary immediately, asking her to summon the head of the technical department. “Mary, please call Scott from the technical department for me and tell him to come to my office as soon as possible. Please tell him to bring a new laptop.”

Scott is the senior engineer in MicroConnect and head of the technical Research and Development department of the company. He is one of the oldest members of the company, with expertise in both hardware and software systems. He is usually the person that demonstrates new products to Robert before they are made available on the market. Fifteen minutes later, Scott is in Robert’s office. Robert explains the situation to him, and Scott plugs the USB stick into a port of the new laptop. “It usually takes some seconds to recognize it but it should have done so by now. It is possible that the USB stick needs more power than the port can supply. Hang on a sec. I will bring a self-powered USB hub that takes its power from an external power supply unit.” Scott brings such a hub, connects its power supply cable to the electrical socket and plugs the other cable into the USB port. The LED indicates that the hub is powered. Then he plugs the stick into one of the hub’s ports. He waits for a few


seconds but nothing happens. “Hmm, this could be serious; the stick could be defective or destroyed. It is an old stick after all, look at its faded color.” While unplugging it from the USB port, he holds the stick somewhat carelessly and it loses its alignment with the plug. But then a familiar beep sounds. The laptop has just recognized the stick! A smile appears on their faces. “It doesn’t fit perfectly in the USB port and that’s the reason it was not recognized. If you try to pull it out carefully, you’ll notice that it works,” Scott said. Without further ado, Robert unplugs the USB stick from Scott’s laptop and connects it to his own computer. He tries to move it a little bit and pulls it the way Scott directed. He notices that the USB stick is now successfully recognized by his own laptop as well. “Luckily it’s working,” he says, relieved. “Scott, it would be better if we copy the contents of this USB stick onto a couple of others. Just to be sure because I really do not trust them at all.” Indeed Robert didn’t trust USB sticks because of his personal experiences. Around 10 years ago he had been in the Caribbean on vacation with his family where he copied all the photos they had taken with his digital camera onto a USB stick, just to free some space in the camera and then placed it in the sailing bag. When the family came back from their vacation he looked for the USB stick and couldn’t find it anywhere. He then found it by chance, a couple of years later when he decided to go sailing. His happiness couldn’t be expressed. Unfortunately, his happiness turned rather quickly into melancholy when he found out that the contents of the USB stick had been corrupted. He had lost forever all of the photos he had taken. Indeed, Daphne’s USB stick was very old and overworked. She bought it when she had started her studies at the university. 
Initially, she used it for storing her personal documents, photos, and music, but soon enough she started using it for her projects at the university. Although she had plenty of USB sticks, and some of them had much more capacity, this one was her lucky one. She always carried it with her and that’s why she had put it on her key ring.

Scott explains to Robert that the best and safest alternative is to upload the contents to a cloud storage service. Robert is convinced and they upload all the contents of the USB stick to GlobalDrive (a cloud storage service by MicroConnect).

4.2 Technical Background

In general, all digital storage media have a bounded life. The life of a storage medium is cut short by several factors. One is media durability, e.g., Robert’s lost photographs (see Sect. 4.2.2 for more details). Another is media usage and handling, e.g., Daphne’s USB stick was very old and overworked. Finally, another factor is media obsolescence, meaning that quite often it is not easy to find the hardware and software that are required to access an old storage medium. An introduction to storage media is given in Sect. 4.2.1. A discussion about the durability of storage media is given in Sect. 4.2.2. The complexity that lies in accessing storage media is briefly described in Sect. 4.2.3 for the case of a USB stick. Modern approaches for storage that hide this complexity from users are described in Sect. 4.2.4. Finally, cases that exhibit the complexity of bit preservation are described in Sect. 4.2.5.

4.2.1 Storage Media

The term storage media refers to the technological products used to store digital content. The main characteristic of storage media is that they are nonvolatile, meaning that they do not require constant power to preserve the stored information. Nowadays, there are several options for storing digital content. Storage media can be categorized with respect to many aspects: portability, ability to rewrite, random versus sequential access, etc. In essence, we can classify physical storage media into three main categories: (a) magnetic storage, (b) optical storage, and (c) flash memory.

Magnetic storage media include some of the oldest solutions for storing digital content, and they remain one of the most common types of storage used with computers. This category includes magnetic tapes, floppy disks, hard drives, etc.

Optical storage media use laser light for reading and writing data. This category includes CD-ROMs, DVDs, Blu-ray discs, etc.

Flash memory storage media rely on integrated circuits that do not need continuous power to retain the data. This type of storage is becoming more popular and has already started to replace magnetic storage solutions, as it becomes more efficient and reliable. Types of flash memory storage include memory sticks and cards, solid-state drives, etc.

4.2.2 Durability of Storage Media

Different types of storage media are based on different technologies and, therefore, have different characteristics, including durability. The durability of a storage medium depends on a number of factors, e.g., the quality with which the medium was manufactured, the environment in which it is kept (temperature, humidity, radiation, magnetic fields, etc.), or even the quality of the device that is used to read from or write to it. Table 4.1 lists some well-known storage media and for each one it shows the year of introduction and the expected life-span for the cases of regular use and infrequent use (or use with extreme care).
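As a rough illustration of how the figures of Table 4.1 might be used, the following Python sketch estimates the year after which a medium acquired in a given year is likely to have exceeded its expected life-span. The dictionary holds only a few rows copied from Table 4.1; the function name `expected_end_of_life` and the rule of simply adding the life-span to the acquisition year are our own simplifying assumptions, since actual durability also depends on the manufacturing quality and the environment mentioned above.

```python
# A few rows of Table 4.1: medium -> (years of regular use, years if unused).
LIFE_SPAN_YEARS = {
    "Hard drive (HDD)": (34, 100),
    "Compact disk": (3, 100),
    "DVD-R": (30, 100),
    "USB FLASH drive": (10, 75),
}


def expected_end_of_life(medium: str, year_acquired: int,
                         regular_use: bool = True) -> int:
    """Return the year after which the medium may no longer be readable.

    Simplifying assumption: the life-span is added to the acquisition year.
    """
    regular, unused = LIFE_SPAN_YEARS[medium]
    return year_acquired + (regular if regular_use else unused)


if __name__ == "__main__":
    # Hypothetical example: a USB stick acquired in 2014.
    print(expected_end_of_life("USB FLASH drive", 2014))         # 2024
    print(expected_end_of_life("USB FLASH drive", 2014, False))  # 2089
```

Such an estimate is only a planning aid: it suggests when the contents of a medium should be checked or migrated, not when the medium will actually fail.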

Table 4.1 Life-span of storage media

Storage media                  Year of introduction   Years of regular use   Years (if unused or used with extreme care)
Reel audio tape                1930                   10                     20
Hard drive (HDD)               1956                   34                     100
Cassette tape                  1965                   10                     20
Super 8 film                   1965                   70                     100
VHS tape                       1971                   5                      15
5″ floppy disk                 1976                   2                      30
3.5″ floppy disk               1982                   2                      15
Mini disk                      1982                   15                     50
Compact disk                   1982                   3                      100
Digital tape                   1987                   10                     30
ZIP disk                       1994                   2                      10
DVD                            1995                   30                     100
CD-RW                          1997                   3                      100
DVD-R                          1999                   30                     100
Blu-ray disc                   2001                   100                    150
Solid-state hard drive (SSD)   1997                   51                     100
Memory card                    1999                   2–5                    115
USB FLASH drive                2000                   10                     75

4.2.3 Accessing Storage Media

To access storage media, special hardware and software are required. Let’s examine the case of USB sticks. Universal Serial Bus (USB) is an industry standard developed in the mid-1990s that defines the cables, connectors, and communications protocols used in a bus for connection, communication, and power supply between computers and electronic devices. It was designed to standardize the connection of computer peripherals (including keyboards, pointing devices, digital cameras, printers, portable media players, disk drives, and network adapters) to personal computers, both to communicate and to supply electric power. It has become commonplace on other devices, such as smartphones, PDAs, and video game consoles. USB has effectively replaced a variety of earlier interfaces, such as serial and parallel ports, as well as separate power chargers for portable devices. Figure 4.1 shows the specifications of a USB plug.

4.2.4

Cloud Storage

Cloud storage offers a centralized way of storing data and enables access to them from any location. Cloud storage providers are responsible for offering technological solutions for storing data and making them available whenever the end user requests them. The data can be accessed either through web service APIs (Application Programming Interfaces) or through web-based content management systems. In effect, the provider offers an abstraction over the physical storage. To use this option, the user needs an Internet connection and must trust the storage provider. This approach hides the complexity from the users (individual persons or organizations), i.e., the effort of buying storage media, checking their state, replacing them, and so on. Cloud storage is part of “cloud computing,” which is also briefly discussed in Sect. 12.2.2.

Fig. 4.1 Pinout and general specifications of USB

4.2.5

Cases of Bit Preservation

Bit preservation can be a complex task if the volumes of data that have to be preserved are big. For example, the volumes of data that have to be preserved at CERN (Conseil Européen pour la Recherche Nucléaire) range from hundreds of petabytes (PB = 10^15 bytes) to hundreds of exabytes (EB = 10^18 bytes) (Berghaus et al. 2016). Bit preservation can be complex even if the data are not big. This is true for data stored in obsolete digital media. A case of preserving data from obsolete digital media, specifically from 8 inch disks, is described by de Vries et al. (2017). Issues with replication and backup for ensuring the preservation of bits are described in Sect. 17.2.8.
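A common building block of bit preservation is a fixity check: a digest is recorded for each file and recomputed later (or on a replica) to detect silent corruption or a bad copy. The following is a minimal sketch using Python's standard hashlib module. The function names `sha256_of` and `same_bits` and the chunk size are assumptions made for illustration; they do not come from any tool cited in this chapter.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that very large files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bits(original, replica):
    """Hypothetical helper: True if both files have identical digests."""
    return sha256_of(original) == sha256_of(replica)
```

In practice, the digest of the original would be stored alongside the replica (for example, after copying the stick to the cloud), so that the check can be repeated even when the original is no longer readable.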

4.3

Pattern: Storage Media - Durability and Access

Pattern ID: P1
Problem name: Storage Media: Durability and Access
The problem: Robert could not access the contents of the USB stick because of a hardware issue
Type of digital artifacts: Any digital artifact/object
Task on digital artifacts: Retrieve bits
What he did to tackle the problem: Robert just moved the USB stick slightly to fix the connection between the USB stick and the USB port of his computer
What could have been done to avoid this problem:
(a) Use a cap for protecting the USB plug from dirt, bumps, and liquids
(b) Copy the contents of the USB stick to a newer one
(c) Upload the contents of the USB stick to the cloud
Lesson learnt: All digital storage media have a bounded life. The life of storage media is cut short by at least three factors: (a) media durability (e.g., the lost contents of Robert’s USB stick), (b) media usage and handling (e.g., the connection problem was due to the overuse of Daphne’s stick), (c) media obsolescence (e.g., disk drives for floppy disks of 5¼ inches are hard to find nowadays). Data replication and exploitation of cloud storage services reduce the risk of “losing” digital objects because of media failures
Related patterns:
iPrevious: –
iNext:
• P2 (Metadata for digital files and file systems)
• P6 (Executables: safety, dependencies)
• P14 (Preservation planning)

4.4

Questions and Exercises

1. What is the life-span of a USB stick? Which factors affect this life-span?
2. Find what RAID is (in the context of storage media).
3. Find information and compare solid-state drives (SSD) with Blu-ray discs. Find the average life-span of all the storage media that you use for your personal digital material (USB sticks, hard drives, optical discs, etc.).
4. Add three columns “maximum storage space,” “storage space of a typical unit nowadays,” and “price (in US dollars) for the typical unit” to Table 4.1. Then, fill these columns, by searching on the Internet, with the corresponding information. Then add an additional column “US dollars per kilobyte” and fill it (it can be calculated based on the previous two values). By inspecting the derived values, find the storage media with the lowest and highest price per kilobyte. Also consider the years of regular use and add an additional column “US dollars per kilobyte per year.” Review the derived values; find the storage media with the lowest and highest values according to this measure.
5. If you are using cloud storage services, check if your provider offers any kind of guarantee if it fails to retrieve (or if it simply loses) one of your files. Also check if it offers any refund in that case. In addition, find what methods these cloud providers use for protecting the privacy of your files (encryption of the contents of files, hashing of files, etc.).
6. Do you have any obsolete storage media (audio compact cassettes, 8 mm tapes, floppy disks) in your home? If yes, take an inventory in the form of pairs: storage media type, quantity. Select one of these storage media types and then search the Internet for machines that can read such storage media types and convert the stored contents to current digital formats.
7. Write down in a table all the storage media that you possess. Add the year you purchased each one. Afterward, inspect the table for storage media that are close to their duration limit (if there are such media then you should consider replacing them).
8. Do the same as before for your optical storage media. Find the oldest CD or DVD that you have and check if you can read its contents.
9. Search the Internet for methods that check the health of your storage media (hint: some of them can be visually inspected).

4.5

Links and References

4.5.1

Readings

Infographics About Storage Media Life-Span
• The life-span of storage media. Crash Plan. 2018-05-03. URL: http://www.mikewirthart.com/projects/the-lifespan-of-storage-media/. Accessed May 3, 2018 (Archived by WebCite® at http://www.webcitation.org/6z8ZpPRKR)

About USB
• Axelson, J. (2015). USB complete: the developer’s guide. Lakeview Research LLC.

Cases of Bit Preservation
• Shiers, J., Berghaus, F. O., Cancio Melia, G., Blomer, J., Dallmeier-Tiessen, S., Ganis, G., et al. (2016). CERN services for long term data preservation (No. CERN-IT-Note-2016-004).
• de Vries, D., von Suchodoletz, D., & Meyer, W. (2017). A case study on retrieval of data from 8-inch disks. In 14th International Conference on Digital Preservation, Kyoto, Japan.

4.5.2

Tools and Systems

Most Popular Cloud Services
• Microsoft OneDrive (https://onedrive.live.com/)
• Dropbox (https://www.dropbox.com/)
• Google Drive (https://drive.google.com/drive/)
• Box (https://www.box.com/)
• iDrive (https://www.idrive.com/)
• Mega (https://mega.co.nz/)

Chapter 5

First Contact with the Contents of the USB Stick

5.1

Episode

Robert is sitting comfortably, looking curiously at the USB stick that is now correctly plugged into his laptop. He no longer worries about loss of access, since he has already copied all the contents of the USB stick to the cloud. He is looking at the icon on his screen that corresponds to the plugged stick (Fig. 5.1).

Fig. 5.1 How the operating system shows the plugged USB stick

“It seems that there is a lot of material here!” Robert thinks with excitement. “It will take me only a few minutes to find the owner. I will start with the properties of the USB stick.” He right-clicks on the icon and selects the option “Properties.” He stops for a while, wondering why this option is the last option of the pop-up menu. “I will think about it later,” he ponders and starts inspecting the tabbed card that has just appeared.

The first tab provides information about the used and free space measured in bytes, the same information that is displayed in the icon. The second tab contains buttons for activating some administrative tasks like error-checking, defragmentation, and backup. He realizes that these options have nothing to do with the “properties” of the USB stick. The third tab provides information about the hardware, and the only “new” information that he can see in that tab is a message that says that “the device is working properly.” The fourth tab contains information about network users that are permitted to view or update the contents of the USB stick. He clicks on the fifth tab, where he finds options for using the plugged device for boosting the speed of his system. “It is getting strange,” he thinks. The last tab contains options for customizing the appearance of the contents of the USB stick.

Robert is feeling a bit frustrated. “Where is the information about the owner of the stick? The properties should provide information about the owner of the USB stick, or at least information about the owners of the contained files.” He is feeling defeated not only because he hasn’t so far identified the owner of the stick, but also because the operating system has been designed by his company. “I should have been more involved in the design,” he thinks, but then he realizes that it is impossible to try, test, and provide feedback for all the software that his company makes. “I could increase my involvement in the design and testing, but then who would work on the vision, the strategy, and the risk management?” he wonders. “I had better rebalance the time I dedicate to creative and management activities,” he thinks, and a quote by Heraclitus comes to his mind: “Day by day, what you choose, what you think and what you do is who you become.”

Since he did not obtain anything useful from the “Properties” window, he double-clicks on the icon to open the contents of the USB stick. The operating system then opens a window showing the root level of the hierarchically organized files and folders that are stored in the USB stick (Fig. 5.2).

Fig. 5.2 The contents of the root folder of the USB stick

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_5


“There is a lot of material here. Where do we begin?” Robert wonders. With his mouse, he selects all files and folders in the root folder and then he right-clicks again and selects the option “Properties.” A window pops up that shows how many folders and files his selection contains. “Let’s keep some notes for discussion with the design team of the operating system,” Robert thinks and he starts writing. “Whenever I plug a USB stick in, I would like to immediately have an executive summary that provides information about the owner of the stick and an overview of its contents, e.g., what kind of material it contains, when it was written, what period it concerns, why this content was written, and where it has been used. Ideally, I would like to have this kind of information in different levels of detail, from abstract level to very detailed level.”

5.2

Technical Background

At first, we introduce the concept of metadata (in Sect. 5.2.1). Then, we describe file systems (in Sect. 5.2.2), the extensions of file names (in Sect. 5.2.3), file signatures (in Sect. 5.2.4), and files’ metadata (in Sect. 5.2.5). Finally, we describe a process for extracting, transforming, and enriching metadata from multiple files (in Sect. 5.2.6).

5.2.1

Metadata in General

In our scenario, Robert, while confronting the unknown USB stick, would like to get answers to various questions:
• What kind of material does the USB stick contain?
• When was the material written and/or what period does it concern?
• Why was this content written, and where has it been used?
• Who has created or processed the material?

This leads us to the notion of metadata. The word metadata is formed from the ancient and modern Greek word μετά, which means “after,” and the Latin word data. In brief, “metadata are data that describe other data.” More precisely, we could also say that metadata are structured and encoded data that describe features of information entities, aiming at their identification, recognition, discovery, evaluation, and management. However, sometimes it is hard to distinguish data from metadata because some data can be both data and metadata. For instance, the title of a document is part of the document but it can also be considered as metadata for the document. Moreover, data and metadata can interchange roles. For example, the text of a poem is data, but if the poem is the lyrics of a song, then the entire poem can be considered as metadata of the song’s sound file. We can also have meta-metadata. This is not only philosophical but also practical, as quite often we have to archive metadata that concern other metadata, e.g., for controlling the provenance of metadata when one wants to aggregate two documents.

We could, however, attempt to classify metadata into various categories according to various criteria. For instance, we can have metadata that describe the technical details of a file (e.g., a video of 1.1 Mbytes) or metadata that describe the contents of the file (e.g., a video showing a boy surfing). We can have metadata that are stable over time and metadata that may change over time (the title of a file may not change; however, the description of its contents may vary over time). Digital libraries sometimes classify metadata into three categories:
• Descriptive metadata: they carry information about the contents of a file, like MARC (Machine-Readable Cataloging) catalog entries and finding aids. They are leveraged in browsing and searching.
• Structural metadata: they carry information that connects one file to others to form a logical unit (e.g., information that is used to connect every image of a book with the others, information to connect all files of a software project).
• Administrative metadata: information that is useful for managing a file and for controlling access to it. This includes access rights, copyright, and information about the long-term preservation of the file.

As regards the physical storage of metadata, they can be stored separately from the data or embedded in the data. Indeed, there are several file types that support embedded metadata (e.g., PDF and various image file formats). One benefit of embedded metadata is that they “live” with the files as a single entity and “travel” with them. In other cases, metadata are stored separately from the files, as some file systems internally do and other packaging approaches do. The benefit of separately stored metadata, also called detached metadata, is that they consume less space (one can reuse the same metadata for a plethora of files). Moreover, one could change and enrich them (e.g., for harmonizing them or for replacing values with Uniform Resource Identifiers (URIs) whenever this is possible for avoiding ambiguity) without having to extract these metadata from each individual file and then store each of these files again. An approach for extracting embedded metadata and constructing an extensible knowledge base with these metadata is described in the subsequent sections.
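To make the idea of detached metadata more concrete, the following minimal Python sketch builds a single metadata record that is shared by several files and serializes it as JSON, so that it could be stored next to the files. The function name `build_detached_record`, the file names, and the metadata keys are all hypothetical and chosen only for illustration.

```python
import json


def build_detached_record(file_names, shared_metadata):
    """Build one metadata record that is reused for several files."""
    return {"files": sorted(file_names), "metadata": dict(shared_metadata)}


record = build_detached_record(
    ["photo1.jpg", "photo2.jpg"],                  # hypothetical file names
    {"creator": "unknown", "rights": "unknown"},   # illustrative values only
)

# The record can be edited or enriched later without rewriting the files.
record["metadata"]["creator"] = "to be identified"
sidecar_text = json.dumps(record, indent=2, sort_keys=True)
```

Because the record lives outside the files, changing the value of `creator` touches one small text document instead of every image.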

5.2.2

File System

A file system controls how data are stored and retrieved in storage devices and storage media. By separating the data into pieces (files), and giving each piece a name (file name), the data in the storage media can be separated and identified. File systems also allow hierarchical organization of files, i.e., in folders where each folder has a name and it can in turn contain files or other folders. A file system offers access to the contents of the files and folders and also to their metadata. There are plenty of file systems (FAT32, NTFS, ISO 9660, UDF, and others). They vary according to the storage media type (hard disks, optical disks) and the storage devices (e.g., only local or also remote). Some file systems are compatible, and some others are not, in the sense that special converters have to be used for migration from one file system to another (e.g., from FAT to NTFS).

5.2.3

File Name Extensions

A file name extension is a set of characters that is placed as a suffix in the file name, usually separated by a dot. File name extensions are used as a kind of metadata to indicate the type of contents of a file. The Windows Operating System (OS) uses such file name extensions to associate files with the programs/applications that can open them. Other operating systems (e.g., Linux) do not give them special treatment, in the sense that both the extension (the suffix) and the dot character are just ordinary symbols that can be used in the file name. The rules specifying what part of the file name is its extension and how long it can be are specified by each file system. Some operating systems exploit the file name extension by offering the user a context menu containing actions (corresponding to applications that can be invoked taking the current file as parameter) that make sense to apply to the file (e.g., to print). In network contexts, files are treated as streams of bits and, instead of file names or file name extensions, they are associated with the Internet media type of the stream, also called the MIME (Multipurpose Internet Mail Extensions) type or content type. The particular content type of a stream is indicated in a line that precedes the stream, e.g., “Content-type: text/plain” for textual content. In particular, an Internet media type is a two-part identifier for standardizing file formats across the Internet. IANA is the official authority that maintains the list of current Internet media types. There are tools that identify file types (like JHOVE), which are discussed in Sect. 15.2.1.
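The relation between a file name, its extension, and an Internet media type can be sketched with Python's standard library. The helper name `describe_name` is an assumption made for this example. Note that `mimetypes.guess_type` looks only at the name, never at the contents, so renaming a file changes the answer.

```python
import mimetypes
import os


def describe_name(file_name):
    """Split a file name into stem and extension and guess its media type."""
    stem, extension = os.path.splitext(file_name)
    # guess_type returns (media_type, content_encoding); it inspects the name only.
    media_type, _content_encoding = mimetypes.guess_type(file_name)
    return stem, extension, media_type


stem, extension, media_type = describe_name("poem.html")
```

For a name such as "poem.html", the extension is ".html" and the guessed media type is normally "text/html"; for an unknown extension the guess is `None`.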

5.2.4

File Signature

Whereas the extension of a file merely indicates its type, the file signature reveals the actual type of the file. The signature is part of the file header and contains a sequence of bytes that is characteristic of a given type. It is usually called a magic number and is a short sequence of bytes (usually 2–4), placed at the beginning of the file. The magic number of a file (of any type) can be revealed by opening the file with a text editor or a hex editor. In the first case, the magic number characters are shown with respect to the ISO-8859-1 encoding (footnote 1). In the case of a hex editor, the hexadecimal values of the bytes are shown. Figure 5.3 shows the magic numbers (in textual and hexadecimal formats) of three widely used file types: (a) a Windows executable file, (b) an image in JPG format, and (c) an mp3 music file.

Fig. 5.3 The magic number of well-known types of files
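Signature-based identification can be sketched in a few lines of Python. The table below contains three widely documented signatures for the file types mentioned above ("MZ" for Windows executables, FF D8 FF for JPEG, and "ID3" for mp3 files that begin with an ID3v2 tag); it is an illustrative subset, not a registry, and the function names are assumptions. An mp3 file without an ID3v2 tag would not be recognized by this sketch.

```python
# Illustrative subset of well-known signatures (magic numbers).
SIGNATURES = {
    b"MZ": "Windows executable",
    b"\xff\xd8\xff": "JPEG image",
    b"ID3": "MP3 audio (with ID3v2 tag)",
}


def identify_by_signature(data):
    """Return the label whose magic number prefixes data, or None."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return None


def identify_file(path, probe_size=8):
    """Read only the first bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify_by_signature(handle.read(probe_size))


jpeg_start = b"\xff\xd8\xff\xe0"
as_hex = jpeg_start[:3].hex()                   # what a hex editor shows
as_text = jpeg_start[:3].decode("iso-8859-1")   # what a text editor may show
```

Here `as_hex` is "ffd8ff" and `as_text` is "ÿØÿ", which corresponds to the two views (hexadecimal and ISO-8859-1 text) described above.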

5.2.5

Files’ Metadata

Typical metadata that file systems keep for each file are: size of the data contained, date and time of last modification, date and time of file creation, date and time of last access, the date and time the file’s metadata was changed, owner user or group ID, access permissions, and other file attributes (e.g., whether the file is read-only, executable, etc.). File systems usually store file names and other metadata of files separately from the contents of the files. Some file systems allow user-defined attributes. Some file systems also maintain multiple past revisions of a file.
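Most of the metadata listed above can be read programmatically. The sketch below uses Python's `os.stat`; the function name `file_system_metadata` and the dictionary keys are assumptions made for illustration. The meaning of some fields depends on the platform: for example, `st_ctime` is the time of the last metadata change on Unix-like systems but has traditionally been the creation time on Windows, and owner and group IDs are not meaningful on Windows.

```python
import os
import stat
from datetime import datetime, timezone


def file_system_metadata(path):
    """Collect the metadata that the file system keeps for one file."""
    info = os.stat(path)

    def as_utc(timestamp):
        return datetime.fromtimestamp(timestamp, tz=timezone.utc)

    return {
        "size_bytes": info.st_size,
        "modified": as_utc(info.st_mtime),
        "accessed": as_utc(info.st_atime),
        # Platform dependent: metadata change time (Unix) or creation time (Windows).
        "changed_or_created": as_utc(info.st_ctime),
        "owner_uid": info.st_uid,
        "group_gid": info.st_gid,
        "permissions": stat.filemode(info.st_mode),   # e.g. "-rw-r--r--"
        "read_only": not os.access(path, os.W_OK),
    }
```

None of these fields answers Robert's question about the person who owns the stick: the owner ID is only a number that is meaningful on the system that wrote the file.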

5.2.6

Metadata Extraction, Transformation, and Enrichment

Ideally, we would like to have this kind of information at different levels of detail, from an abstract level to a very detailed level. In other words, the questions at the beginning of this section should be answerable not only for individual files but also for groups of resources, e.g., for folders, sets of folders, an entire file system, or even a network of computers. A tool that can scan an entire file system (or the desired part of it), extract and harmonize the embedded metadata in all encountered files, and create a knowledge base (using Semantic Technologies) with these metadata is the system PreScan (described in Fig. 5.4). This tool (and such a process in general) requires:
(a) configurable scanning processes that can scan the desired parts of a file system (or network),
(b) automatic services for format identification and extraction of the embedded metadata,
(c) support for human-entered/edited metadata for adding extra metadata to an already scanned file (e.g., one might want to add extra metadata about provenance, digital rights, or the context of the objects),
(d) periodic re-scannings for ensuring the freshness of the metadata without losing the human-provided metadata if files change location,
(e) exploitation of existing external registries (like Pronom, GDFR) that contain information about file formats and versions of software for each file, and
(f) a simple and easy-to-use user interface for controlling the entire process.

PreScan is a system that supports the above processes and was developed in the context of the CASPAR project. As regards registries that contain information about file formats, software, and computing environments, apart from Pronom and GDFR that were mentioned earlier, we can mention the work of Thornton et al. (2017), which shows how the infrastructure of Wikidata meets the requirements for a technical registry of metadata.

Fig. 5.4 The system PreScan [Diagram: PreScan (Preservation Scanner) automatically creates ontology-based descriptions by extracting and transforming the embedded metadata of file systems. Embedded metadata are extracted from the file system (files containing documents, images, sheets, multimedia, etc.), transformed to ontology-based metadata, and complemented with human-provided metadata. The semantic repository of metadata has a schema layer (CIDOC CRM, CIDOC CRM Digital, COD (Core Ontology for Dependencies), and other domain-specific specializations) and a metadata layer (automatically extracted metadata, manually provided metadata, and COD resources, i.e., a registry of formats and dependencies).]

Footnote 1: More details about character encoding can be found in Sect. 6.2.1
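The first two requirements, scanning a part of a file system and summarizing what was found, can be illustrated with a short Python sketch. This is not PreScan's implementation: it collects only file-system metadata (no embedded metadata, no ontology), and the names `scan_tree` and `summarize` and the record keys are assumptions made for this example.

```python
import os
from collections import Counter


def scan_tree(root):
    """Yield one small metadata record per file found under root."""
    for folder, _subfolders, file_names in os.walk(root):
        for name in file_names:
            path = os.path.join(folder, name)
            info = os.stat(path)
            yield {
                "path": os.path.relpath(path, root),
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": info.st_size,
                "modified": info.st_mtime,
            }


def summarize(records):
    """Aggregate per-file records into a coarse overview of a collection."""
    records = list(records)
    by_extension = Counter(record["extension"] for record in records)
    return {
        "file_count": len(records),
        "total_bytes": sum(record["size_bytes"] for record in records),
        "files_per_extension": dict(by_extension),
        "oldest_modified": min((r["modified"] for r in records), default=None),
        "newest_modified": max((r["modified"] for r in records), default=None),
    }
```

Calling `summarize(scan_tree(path_of_stick))` gives a rough version of the "executive summary" that Robert asks for in the episode: how many files there are, of which kinds, and which period their modification dates span.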

5.3

Pattern: Metadata for Digital Files and File Systems

Pattern ID: P2
Problem name: Metadata for digital files and file systems
The problem: Robert could not answer questions of the form “what, who, when, why” from the digital files (and the entire collection of digital resources) of the USB stick
Type of digital artifacts: Any digital artifact/object
Task on digital artifacts: Retrieve the metadata of the digital object
What could have been done to avoid this problem: If metadata were available for each digital resource (and collections of digital resources), then Robert could tackle the problem
Lesson learnt: Metadata are required for correctly interpreting the creation and usage context of digital resources. It is also useful to have processes that can summarize large quantities of digital resources (including metadata). For example, suppose that we want to view and understand the metadata of thousands or millions of files
Related patterns:
iPrevious:
• P1 (Storage Media: durability and access)
iNext:
• P3 (Text and symbol encoding)
• P4 (Provenance and context of digital photographs)
• P5 (Interpretation of data values)
• P14 (Preservation planning)

5.4

Questions and Exercises

1. Select one file on your computer and try to answer the following questions:
a. What kind of material does it contain?
b. How to render its contents (what application would you use)?
c. When was it created?
d. What period does it concern?
e. Why was this content written?
f. Where has it been used?
2. How can you find the “owner” of a digital folder? What if the folder contains files of different provenance? How would you like such information to be summarized?
3. How credible are the metadata that are embedded in the files? Try changing the owner or author of a digital file and the date of its last edit.
4. Change the extension of a file to a known extension and try to open it:
a. Using the appropriate application for the new extension (if the extension is .pdf, try to open it using Adobe Acrobat)
b. Using the application of the previous (the original) extension
5. Experiment with tags. Add some tags to your files (from Properties → Details → Tags) and then try to search for files using the values of the tags (search using tag:value). This task is available for Windows OS users.
6. Select a file from your personal collection and save it using different file systems. Check its size in each file system (Hint: you can use a USB stick and try formatting it with different file systems).

5.5

Links and References

5.5.1

Readings

About File-Systems
• Agrawal, N., Bolosky, W. J., Douceur, J. R., & Lorch, J. R. (2007). A five-year study of file-system metadata. ACM Transactions on Storage (TOS), 3(3), p. 9.

About File Signatures and Extensions
• The global registry of file signatures. https://filesignatures.net

About Metadata
• Furrie, B. (2000). Understanding MARC bibliographic: Machine-readable cataloging. Cataloging Distribution Service, Library of Congress in collaboration with the Follett Software Company. https://www.loc.gov/marc/umb/
• Greenberg, J. (2003). Metadata and the World Wide Web. Encyclopedia of Library and Information Science, 3, pp. 1876–1888.
• Greenberg, J. (2005). Understanding metadata and metadata schemes. Cataloging & Classification Quarterly, 40(3–4), pp. 17–36.
• Bargmeyer, B. E., & Gillman, D. W. (2000). Metadata standards and metadata registries: An overview. In International Conference on Establishment Surveys II. Buffalo, New York.

About PreScan
• Marketakis, Y., Tzanakis, M., & Tzitzikas, Y. (2009). PreScan: towards automating the preservation of digital objects. In Proceedings of the international conference on management of emergent digital ecosystems (p. 60). ACM.

About Registries of Metadata Related to Digital Preservation
• Thornton, K., Cochrane, E., Ledoux, T., Caron, B., & Wilson, C. (2017). Modeling the Domain of Digital Preservation in Wikidata. In 14th International conference on digital preservation (iPRES 2017). Kyoto, Japan.

5.5.2

Tools and Systems

About Pronom
• https://www.nationalarchives.gov.uk/PRONOM/Default.aspx

About GDFR
• http://library.harvard.edu/preservation/digital-preservation_gdfr.html

About PreScan
• http://www.ics.forth.gr/isl/PreScan/

About Wikidata
• https://www.wikidata.org/

5.5.3

Other Resources

Series of Conferences and Journals
• Journal of Library Metadata
• International Journal of Metadata, Semantics and Ontologies

About Internet Media Types (MIME types)
• https://www.iana.org/assignments/media-types/media-types.xhtml

About File Systems
• http://www.forensics.nl/filesystems
• https://en.wikipedia.org/wiki/Comparison_of_file_systems

About File Extensions
• https://www.file-extensions.org

About Related EU Projects
• EU project CASPAR (Cultural, Artistic and Scientific knowledge Preservation, for Access and Retrieval), FP6-2005-IST-5, Project ID: 033572, 2006–2009, http://cordis.europa.eu/project/rcn/92920_en.html
• EU project SHAMAN (Sustaining Heritage Access through Multivalent ArchiviNg), FP7-ICT-2007-1, Project ID: 216736, 2007–2011, http://cordis.europa.eu/project/rcn/85468_en.html

Chapter 6

The File Poem.html: On Reading Characters

6.1

Episode

The properties menu did not reveal any information about the owner of the USB stick, or its contents. Therefore, Robert decides to take a look at the contents of the USB stick, hoping to find files that will allow him to discover the owner. People usually store various files that contain their personal information in USB sticks, including their photos and videos, their contacts, their notes, and many more, and Robert hopes to find some of these resources there.

The first file that Robert decides to open is a file with the name “poem.html”. He knows the extension of the file very well. It is a file in HTML format, the standard markup language for creating web pages, which was designed by the inventor of the World Wide Web (WWW), Sir Tim Berners-Lee. HTML files are textual files that contain instructions, in the form of HTML tags, about the proper rendering of their contents in web browsers. Robert remembers how simple web pages were during the early days of the Internet. Graphics and multimedia contents were limited, and web page designers organized the various contents of websites using HTML tags. As the years passed, web pages have become more modern; however, HTML is still in place.

From the name of the file, he infers that it contains a poem, which could provide general information about the owner of the USB stick (e.g., their origin, their culture). He opens the file with a web browser and sees a series of unknown and strange characters (Fig. 6.1). He knows that this is an issue related to the encoding of the characters and that, as a result, the web browser cannot properly “translate” the written symbols. He decides to look at the source code of this file in order to understand what is happening. This time, Robert opens the HTML file not with his browser but with a text editor that is installed in the operating system of his computer. He can see only a series of question marks “???” and some obscure symbols. He inspects the HTML tags and realizes that the HTML tag “meta” is missing from the HTML file. This tag defines the encoding (charset) that should be used for a proper rendering (i.e., the mapping of bits to symbols that humans can recognize).

Since he is determined to read the contents of the HTML file, he tries a rather unorthodox method: he tries, one by one, all the encodings that his browser supports. Browsers offer only some common encodings, although in fact there are hundreds of different ones. Fortunately, the browser that he uses includes the encoding “Greek ISO-8859-7”, which is the correct encoding of the HTML file. Eventually, the 12th encoding he selects is the appropriate one and Robert sees the content of the HTML file (Fig. 6.2). He recognizes that these are Greek symbols, so this is probably the right encoding. The text layout suggests that it is indeed a poem, but Robert does not know Greek. Having no idea what the poem says, Robert uses an automatic translation service to find the title of the poem (Ithaka) in English. Then he searches the web and finds a translation of the poem in English by Edmund Keeley and Philip Sherrard. The translation is shown in Fig. 6.3.

Fig. 6.1 What Robert saw when he opened poem.html with a browser

This chapter has been co-authored with Yannis Kargakis.
© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_6


Fig. 6.2 The contents of poem.html rendered using the symbol encoding “Greek ISO-8859-7”

As the title of the HTML file says, the writer of this poem is Constantine P. Cavafy. Robert searches the web and finds that Cavafy is an important Greek poet. Robert thinks that maybe the owner of the USB stick is from Greece. For this reason, he asks his secretary to bring him the list of all the participants of the contest, which contains the full details of each person, including their country of origin. This information has been derived from the system that managed the applications and it is in the form of an Excel file. He searches the list but there are no Greek participants. Grace had done good work. Robert is confused; he cannot understand what is happening, so he continues with the next file on the USB stick.

Ithaka
by Constantine P. Cavafy

As you set out for Ithaka
hope the voyage is a long one,
full of adventure, full of discovery.
Laistrygonians and Cyclops,
angry Poseidon—don’t be afraid of them:
you’ll never find things like that on your way
as long as you keep your thoughts raised high,
as long as a rare excitement
stirs your spirit and your body.
Laistrygonians and Cyclops,
wild Poseidon—you won’t encounter them
unless you bring them along inside your soul,
unless your soul sets them up in front of you.

Hope the voyage is a long one.
May there be many a summer morning when,
with what pleasure, what joy,
you come into harbors seen for the first time;
may you stop at Phoenician trading stations
to buy fine things,
mother of pearl and coral, amber and ebony,
sensual perfume of every kind—
as many sensual perfumes as you can;
and may you visit many Egyptian cities
to gather stores of knowledge from their scholars.

Keep Ithaka always in your mind.
Arriving there is what you are destined for.
But do not hurry the journey at all.
Better if it lasts for years,
so you are old by the time you reach the island,
wealthy with all you have gained on the way,
not expecting Ithaka to make you rich.

Ithaka gave you the marvelous journey.
Without her you would not have set out.
She has nothing left to give you now.

And if you find her poor, Ithaka won’t have fooled you.
Wise as you will have become, so full of experience,
you will have understood by then what these Ithakas mean.

Fig. 6.3 The poem “Ithaka” by Constantine P. Cavafy

6.2 Technical Background

Here we focus on characters. First, we discuss character encoding (in Sect. 6.2.1), and then we give a short introduction to HTML (in Sect. 6.2.2), although we will see more on HTML in the following chapters. Subsequently, we stress the issue of character semantics (in Sect. 6.2.3) and introduce the notion of parsing (in Sect. 6.2.4).

6.2.1 Character Encoding

A character encoding is essentially an association table that maps codes to characters. Modern encoding systems map each character to a number and define how these numbers are encoded as a stream of bytes. One of the earliest widely adopted character encoding standards was ASCII (American Standard Code for Information Interchange), which encodes 128 characters using 7 bits each. These characters include digits (0–9), lowercase and uppercase letters (a–z, A–Z), punctuation marks, and special control codes. Newer character encodings support more characters (e.g., ISO-8859-1 supports 256 different character codes). More recent encodings (e.g., UTF-8) support almost all characters and symbols. For example, various encodings of ‘Φ’ (Phi), the 21st letter of the Greek alphabet (the one that is used for the golden ratio and on other occasions in math and science), are shown in Fig. 6.4.

Appearance: Φ
Unicode number: U+03A6
HTML code: &#934;
HTML entity (named): &Phi;
UTF-8 (hex): 0xCE 0xA6 (cea6)
UTF-8 (binary): 11001110:10100110
UTF-16 (hex): 0x03A6 (03a6)
C/C++/Java source code: "\u03A6"

Fig. 6.4 Some encodings of Phi (Φ)

To display the contents of an HTML page correctly, we must know the character set used on the web page. For this reason, W3C recommends that the character encoding always be declared in a web document using a specific meta element within the head element of the document. Such a declaration allows a browser to recognize the encoding and properly interpret the contents of the web document. This approach is very important from a digital preservation perspective, since it reduces the risk of misinterpreting data.
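The values listed in Fig. 6.4 can be recomputed with a few lines of code. The following is a minimal sketch in Python (the language choice and the helper name `describe` are our own assumptions, not part of the original text); it derives the code point, the UTF-8 and UTF-16 byte sequences, and the numeric HTML reference of a single character.

```python
PHI = "\u03A6"  # the character shown in Fig. 6.4


def describe(ch):
    """Return several representations of one character (hypothetical helper)."""
    utf8 = ch.encode("utf-8")
    return {
        "code_point": f"U+{ord(ch):04X}",
        "html_numeric": f"&#{ord(ch)};",
        "utf8_hex": utf8.hex(),
        "utf8_binary": ":".join(f"{byte:08b}" for byte in utf8),
        "utf16_hex": ch.encode("utf-16-be").hex(),
    }


info = describe(PHI)
for name, value in info.items():
    print(f"{name:>12}: {value}")
```

The same character thus occupies two bytes in both UTF-8 and UTF-16, but the bytes differ, which is why a reader of the file has to know which encoding was used.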

6.2.2 HTML

HTML stands for HyperText Markup Language, which is the standard markup language used on the web. It was first released in 1993 and its latest version is HTML 5 (2014). It is a document file format whose crucial feature for the explosion of the web was its ability to contain links pointing to other web resources, e.g., web pages (identified and addressable through URLs). Apart from text and links, an HTML page can contain images and other kinds of objects, as well as scripts written in scripting languages like JavaScript. An HTML page can also contain Cascading Style Sheets (CSS) that define the intended look and layout of text and other material. Web browsers (like Chrome and Firefox) read HTML documents and render them into visible web pages. They can also run the embedded scripts.

Since the web has been central to the development of the Information Age and is the tool billions of people use daily, in the following chapters we discuss various topics related to the web and HTML, specifically:
• How structured data can be embedded in HTML pages (in Sect. 8.2.3.1)
• Issues related to the preservation of the ability to deploy web applications (in Sect. 12.2.1)
• HTML and remotely fetched images (in Sect. 14.2.1)
• Web archiving and citability (in Sect. 14.2.2)
• Web log (blog) and website preservation (in Sect. 17.2.10)
• Issues related to the identity of the information expressed in HTML (in Sect. 18.2.9)

6.2.3 Character Semantics

Even if the encoding of a symbol is clear, we may not be able to understand the symbol itself. An example from the nondigital world follows. The Phaistos Disc (also spelled Phaistos Disk or Phaestos Disc) is a disk of fired clay from the Minoan palace of Phaistos on the Greek island of Crete, possibly dating back to the middle or late Minoan Bronze Age (2nd millennium BC). It is about 15 cm (5.9 in.) in diameter and covered on both sides with a spiral of stamped symbols. Its purpose and meaning, and even its original geographical place of manufacture, remain disputed, making it one of the most famous mysteries of archaeology. The disk is now on display at the archaeological museum of Heraklion. The disk has 45 distinct symbols that were numbered by Arthur Evans from 01 to 45; some are shown on the right side in Fig. 6.5.

Fig. 6.5 Phaistos disk (sign images not reproduced)
No   Count
01   11
02   19
03   2
15   1
27   15

The same problem can occur in the digital world. A sequence of bits, say 101100, with no information about what symbol it encodes and to which language this symbol belongs, cannot be understood. The relationship between the digital representation of a symbol and the intended sensory impression is discussed later (Sect. 18.2.9).
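To make the point about uninterpreted bits concrete, the sketch below (Python; the variable names and the choice of candidate encodings are illustrative assumptions) reads the bit string 101100 from the text, and the two bytes 0xCE 0xA6 from Fig. 6.4, under several different conventions. Each convention yields a different "meaning" for the same bits.

```python
# The bit string mentioned in the text, read as a number and as an ASCII code.
bits = "101100"
as_number = int(bits, 2)
as_ascii = chr(as_number)

# Two bytes with no accompanying metadata, read under four conventions.
raw = bytes([0xCE, 0xA6])
interpretations = {
    "utf-8": raw.decode("utf-8"),
    "iso-8859-1": raw.decode("iso-8859-1"),
    "iso-8859-7": raw.decode("iso-8859-7"),
    "unsigned integer": int.from_bytes(raw, "big"),
}

print(f"{bits} as number: {as_number}, as ASCII: {as_ascii!r}")
for label, value in interpretations.items():
    print(f"{label:>17}: {value!r}")
```

None of these readings is wrong in itself; only external information (such as a declared character encoding) tells us which one the creator intended.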

6.2.4 Parsing

A web browser parses an HTML page in order to render the content represented in the page. Parsing is the process of analyzing a string of symbols, either in natural language or in a computer language, according to the rules of a formal grammar. Parsing is therefore indispensable for interpreting digital content in general, i.e., for understanding the format of a digital file and subsequently reading the encoded symbols and structure. This is true for text, as we have seen in this chapter, for images (as we shall see in Chap. 7), and for software code (as we shall see in subsequent chapters). Parsing can be considered the opposite of serialization (described in Sect. 15.2.3).
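As a small illustration of parsing, the following sketch uses the `html.parser` module of the Python standard library to read the head of an HTML document and pick out the declared character encoding and the title. The sample page and the class name `HeadInfo` are hypothetical; the actual contents of Poem.html are not given in the text.

```python
from html.parser import HTMLParser


class HeadInfo(HTMLParser):
    """Collects the declared charset and the title while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and "charset" in attributes:
            self.charset = attributes["charset"]
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# A made-up page that, unlike Robert's file, declares its encoding.
page = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Ithaka by Constantine P. Cavafy</title>
  </head>
  <body><p>As you set out for Ithaka</p></body>
</html>"""

info = HeadInfo()
info.feed(page)
print(info.charset, "|", info.title)
```

If the meta element were missing, `info.charset` would remain `None`, and a reader of the file would have to guess the encoding.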

6.3 Pattern: Text and Symbol Encoding

Pattern ID: P3
Problem name: Text and symbol encoding
The problem: Robert cannot understand the contents of the HTML file because the file does not contain any information about the symbol encoding that the HTML browser should use for rendering these symbols correctly
Type of digital artifacts: Text
Task on digital artifacts: Represent the textual symbols
What could have been done to avoid this problem: The HTML file could contain (in the HTML header) information about the symbol encoding that should be used for getting the intended symbols
Lessons learnt: Without knowing the symbols we cannot have any kind of information. In the current example, each symbol is associated with an image that allows a human that knows the alphabet to identify what alphabet symbol the image refers to
Related patterns:
iPrevious:
• P2 (Metadata for digital files and file systems)
iNext:
• P5 (Interpretation of data values)
• P11 (Reproducibility of scientific results)

6.4 Questions and Exercises

1. Find how many natural language encodings currently exist.
2. Check how many different encodings exist for the symbols of the Greek language.
3. Store one web page locally that contains non-English text, e.g., the page http://fr.wikipedia.org/wiki/Constantin_Cavafy. Open the saved file using a text editor, find the tag that specifies the encoding, and change it to an encoding of the Chinese language. Open the updated file with a web browser. If you have done it right, you will see the contents of the file as a series of Chinese characters.
4. Compare the English translation of this poem with the results of an automatic translation service, like Google Translate.
5. Check whether there are still inscriptions from human languages that have not yet been deciphered.

6.5 Links and References

6.5.1 Readings

About Character Sets
• Character sets by IANA (Internet Assigned Numbers Authority). http://www.iana.org/assignments/character-sets/character-sets.xhtml

6.5.2 Tools and Systems

About Java Programs and Unicode
• https://docs.oracle.com/javase/tutorial/i18n/text/unicode.html

About Character Encoding Converter Tools
• http://string-functions.com/encodedecode.aspx

6.5.3 Other Resources

About HTML
• http://www.w3.org/html/

About the Encoding of Symbols
• http://unicode.org/

Chapter 7

The File MyPlace.png: On Getting the Provenance of a Digital Object

7.1 Episode

“I have not been very lucky with poetry, but let’s see what else we have here,” Robert thinks. At that moment, he notices an image file named “MyPlace.png”. He opens the file using the default image application of his operating system and gets what is shown in Fig. 7.1.

Fig. 7.1 The file MyPlace.png

Robert was hoping to see a photograph of a landmark of the place where the mysterious girl lives or comes from. Instead, the photograph shows a small, and quite trivial, part of an old castle. It does not remind Robert of anything; therefore, he tries to search for additional information about this photo. By right-clicking the image and selecting Properties, he attempts to see the metadata of the photo, i.e., those embedded in the file named MyPlace.png. In addition, he decides to use specialized tools to extract the Exif information of the photograph, hoping that there will be something useful there, such as the geographical coordinates of the photo. Unfortunately, he realizes that the file does not contain any such metadata.

A new idea comes to Robert’s mind that makes him smile. He will try to find a similar photo on the web. He visits a search engine for images, selects the image on the USB stick, uploads it, and then searches for similar images. However, the search engine cannot find any relevant results. The results (shown in Fig. 7.2) seem entirely irrelevant.

Fig. 7.2 The results of searching for similar images

“This picture seems to have been processed by someone, perhaps by the mysterious young woman. If the file contained information about its provenance I would be able to find the original photo, which in turn would allow me to recognize the location,” Robert assumes. Robert suddenly remembers the cloud storage service used by GlobalDrive, which allows viewing previous versions of a file. How easy it would be to find the original file if the operating system itself, even on USB sticks, held all such information. “I should probably bring this issue up for discussion with the design team of the new version of the OperatingSystem,” he contemplates. Unable to find anything useful in the image, Robert decides to continue his investigation by inspecting the next file on the USB stick.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_7

7.2 Technical Background

First, we describe the formats used for digital images (in Sect. 7.2.1); next we focus on Exif metadata, which are used by photo cameras and related software (in Sect. 7.2.2); then we discuss PDF, a widely used format for portable documents that can include text, images, and metadata (in Sect. 7.2.3); and finally we discuss the general problem of provenance (in Sect. 7.2.4).

7.2.1 Formats for Images

Image file formats are standardized formats for storing digital images. An image is essentially a grid of pixels, each pixel represented by bits indicating its color. There are dozens of image file types (PNG, JPEG, and GIF are the most commonly used in the context of the Internet). Table 7.1¹ summarizes the main features of the most common image file formats.

An image file format may store an image in the form of uncompressed or compressed raster data or in vector format. Raster image formats (like JPEG/JFIF, Exif, TIFF, GIF, BMP, PNG) describe each individual pixel of a grid, while vector image formats (like CGM, SVG, 3D vector formats) contain a geometric description that can be rendered smoothly at any desired display size (however, they are rasterized at display time).

The image compression algorithms used can be lossless or lossy. With a lossless compression algorithm (like those in TIFF, GIF, BMP), we can decompress the compressed image and get back exactly the original. With a lossy compression algorithm (like those in JPEG), we can achieve a smaller file size, but decompression does not give back the original image. An example of size reduction of images in TIFF format (which uses lossless compression) is studied in May and Davies (2016), while an example of multimedia degradation due to lossy compression is given in Sect. 17.2.3.

There are also metafile formats that can include both raster and vector information (they are a kind of intermediate format), like WMF and EMF. Page description languages are formats used to describe the layout of a printed page containing text, objects, and images. Examples are PostScript, PDF (described in brief later in this chapter), and PCL.
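The difference between lossless and lossy compression can be demonstrated on a tiny synthetic image. The sketch below is written in Python and rests on stated assumptions: the 16 × 16 grayscale gradient is made up, Deflate (via the standard `zlib` module) stands in for the lossless algorithms mentioned above, and discarding the four least significant bits of each pixel is only a toy stand-in for a lossy step; it is not how JPEG actually works.

```python
import zlib

# A made-up 8-bit grayscale image: 16 rows of 16 pixels forming a gradient.
WIDTH, HEIGHT = 16, 16
pixels = bytes((x * 8 + y) % 256 for y in range(HEIGHT) for x in range(WIDTH))

# Lossless: compress with Deflate and decompress again.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)

# Toy lossy step: drop the 4 least significant bits of each pixel,
# then compress the simplified data.
quantized = bytes(p & 0xF0 for p in pixels)
packed_lossy = zlib.compress(quantized)
approximation = zlib.decompress(packed_lossy)

max_error = max(abs(a - b) for a, b in zip(pixels, approximation))
print("lossless round trip identical:", restored == pixels)
print("lossy round trip identical:   ", approximation == pixels)
print("largest per-pixel error:      ", max_error)
print("sizes (raw, lossless, lossy): ", len(pixels), len(packed), len(packed_lossy))
```

After the lossless round trip every pixel is recovered, whereas after the lossy step the discarded bits cannot be reconstructed from the file alone.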

7.2.2 Exif Metadata

Exif (Exchangeable image file format) is a file standard incorporated in the JPEG-writing software used in most digital cameras. It aims at representing and standardizing the exchange of images with image metadata between digital cameras and software for viewing and editing. The supported metadata include information about date, camera name, camera settings like shutter speed and exposure, the compression used, color information, and others. The Exif format also includes standard tags for location information. Nowadays, many cameras and smartphones are equipped with a GPS receiver that is exploited for storing location information in the Exif header when we shoot a picture. All these are embedded metadata, i.e., they are embedded within the image file itself. There are various tools for managing this information, like ExifTool and Opanda.

Table 7.1 Common image file formats

Name: TIFF (Tagged Image File Format)
Extension(s): .tif, .tiff
Bit-depth(s): 1-bit bitonal; 4- or 8-bit grayscale or palette color; up to 64-bit color
Compression: Uncompressed; Lossless: ITU-T.6, LZW, etc.; Lossy: JPEG
Standard/Proprietary: De facto standard
Web support: Plug-in or external application
Metadata support: Basic set of labeled tags

Name: GIF (Graphics Interchange Format)
Extension(s): .gif
Bit-depth(s): 1–8 bit bitonal, grayscale, or color
Compression: Lossless: LZW
Standard/Proprietary: De facto standard
Web support: Native since Microsoft Internet Explorer 3, Netscape Navigator
Metadata support: Free-text comment field

Name: JPEG (Joint Photographic Expert Group)/JFIF (JPEG File Interchange Format)
Extension(s): .jpeg, .jpg, .jif, .jfif
Bit-depth(s): 8-bit grayscale; 24-bit color
Compression: Lossy: JPEG
Standard/Proprietary: JPEG: ISO 10918-1/2; JFIF: de facto standard
Web support: Native since Microsoft Internet Explorer 2, Netscape Navigator 2
Metadata support: Free-text comment field

Name: JP2–JPX/JPEG 2000
Extension(s): .jp2, .jpx, .j2k, .j2c
Bit-depth(s): Supports up to 2^14 channels, each with 1–38 bits; gray or color
Compression: Uncompressed; Lossless/Lossy: Wavelet
Standard/Proprietary: ISO/IEC 15444 parts 1–6, 8–11
Web support: Plug-in
Metadata support: Basic set of labeled tags

Name: PNG (Portable Network Graphics)
Extension(s): .png
Bit-depth(s): 1–48-bit; 1/2/4/8-bit palette color or grayscale, 16-bit grayscale, 24/48-bit truecolor
Compression: Lossless: Deflate, an LZ77 derivative
Standard/Proprietary: ISO 15948 (anticipated)
Web support: Native since Microsoft Internet Explorer 4, Netscape Navigator 4.04 (but still incomplete)
Metadata support: Basic set of labeled tags plus user-defined tags

Name: PDF (Portable Document Format)
Extension(s): .pdf
Bit-depth(s): 4-bit grayscale; 8-bit color; up to 64-bit color support
Compression: Uncompressed; Lossless: ITU-T.6, LZW, JBIG; Lossy: JPEG
Standard/Proprietary: De facto standard
Web support: Plug-in or external application
Metadata support: Basic set of labeled tags

¹ Based on the table of common image file formats from Cornell University (see the references at the end of this chapter for more details).
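Returning to the Exif location tags described in this section: Exif stores a latitude or longitude as three rational numbers (degrees, minutes, seconds) together with a reference letter (N/S or E/W), in the tags GPSLatitude, GPSLatitudeRef, GPSLongitude, and GPSLongitudeRef. The sketch below (Python) converts such a value to decimal degrees. The function name and the sample coordinates are illustrative assumptions; they are not taken from MyPlace.png, which contains no such metadata, and reading the tags out of an actual file would additionally require an Exif tool or library.

```python
from fractions import Fraction


def dms_to_decimal(dms, ref):
    """Convert an Exif-style (degrees, minutes, seconds) triple to decimal degrees.

    dms: three (numerator, denominator) pairs, as Exif stores rationals.
    ref: "N", "S", "E", or "W".
    """
    degrees, minutes, seconds = (Fraction(num, den) for num, den in dms)
    value = degrees + minutes / 60 + seconds / 3600
    if ref in ("S", "W"):  # southern and western hemispheres are negative
        value = -value
    return float(value)


# Hypothetical values, chosen only to exercise the conversion.
latitude = dms_to_decimal(((35, 1), (20, 1), (42, 1)), "N")
longitude = dms_to_decimal(((25, 1), (8, 1), (1200, 100)), "E")
print(f"{latitude:.6f}, {longitude:.6f}")
```

Had the photograph carried such tags, coordinates of this form could have been pasted directly into a map service.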

7.2.3 PDF

Portable Document Format (PDF) is a widely used file format for representing documents in a manner independent of application software, hardware, and operating systems. It is currently the de facto standard for fixed-format electronic documents. It has been an open standard since 2008 (ISO 32000-1:2008). A PDF file encapsulates a complete description of a fixed-layout flat document, including the text, fonts, graphics, and other information needed to display it. It mainly combines three technologies: (a) a subset of the PostScript page description programming language, for generating the layout and graphics; (b) a font-embedding/replacement system to allow fonts to travel with the documents; and (c) a structured storage system to bundle these elements and any associated content into a single file, applying data compression where appropriate.

PDF files can contain metadata; they can contain a set of key/value fields such as author, title, subject, and creation and update dates. In addition, it is possible to add XML-standards-based extensible metadata to PDFs, as used in other file formats, enabling metadata to be attached to any stream in a document, such as information about embedded illustrations, using an extensible schema.

Note that PostScript is a page description language that runs in an interpreter to generate an image. It can handle graphics as well as standard features of programming languages, like if and loop commands. PDF is based on a simplified version of PostScript without flow-of-control features.

7.2.4 Provenance

Provenance (which is also known as pedigree or lineage) is the origin or the source from which something comes, and the history of subsequent versions and owners (also known in some fields as chain of custody). Provenance information is well understood in the context of art, where it refers to the documented history of an art object. The documented evidence of provenance for an object of fine art or an antique can help to establish that it has not been altered and is not a forgery or a reproduction, and helps prove the ownership of the object. Furthermore, the quality of the provenance can make a significant difference to its selling price in the market.

Provenance of digital objects can be of great importance as well. The provenance information of digital objects refers to the processes that led to the creation of the digital objects as well as the corresponding documentation. An indicative example demonstrating the importance of provenance information is the case of scientific products. Preserving the provenance of scientific products (e.g., measurements, experiments, algorithms used, evaluation processes) allows scientists to better understand the product, ensure its validity, and reproduce it. Compared to material objects (e.g., an antique painting), digital objects change frequently, can be copied without limit, are used as sources from which new digital objects are derived, and are altered by several users. Additionally, provenance information of digital objects (files or composite digital objects) maintained in the digital files or externally (in digital libraries and archives) is crucial for authenticity assessment, reproducibility, and accountability. Therefore, provenance information for digital objects has to be properly recorded and archived. To this end, we need conceptual models able to capture and represent provenance information. Essentially, the provenance information about a digital object should allow answering the following kinds of questions:
• Who was the creator (or the person responsible for the creation) of this object?
• What was the context of the creation of this object, i.e., what was the context of the corresponding observation (e.g., where and when a photo was taken), or what was the context of the corresponding experiment (e.g., for data produced by the Large Hadron Collider)?
• How was the outcome of the observation or experiment digitized?
• What is the derivation history of this digital object, i.e., was it produced by other digital object(s) and, if yes, how and by whom?

There are several conceptual models for representing provenance, such as OPM (Open Provenance Model), ProvDM, and Dublin Core. All of these contain some basic concepts like actors, activities, events, devices, information objects, and associations between them (e.g., actors carrying out activities or devices used in events). For instance, the ISO standard CIDOC CRM (ISO 21127:2006), which is a core ontology describing the underlying semantics of data schemata and structures from all museum disciplines and archives, has been extended, resulting in CRMdig, for better capturing the modeling and query requirements of digital objects. Figure 7.3 shows a basic conceptual model containing six concepts and nine associations (it is actually part of CRMdig).

Fig. 7.3 A small conceptual model for representing provenance

In our running example, suppose that the provenance of the file “MyPlace.png” is the following: On January 6, 2015, it was Daphne’s birthday and she went for a walk at Koules, a fortress located at the entrance of the old port of Heraklion, which was built by the Republic of Venice in the early sixteenth century. That day it was very windy and the waves were crashing furiously on the breakwaters at the port. She could not approach the castle; therefore, she decided to take a photograph of it. A few days later (on January 15), she decided to convert the format of the photograph into a different one in order to reduce its overall size, and 2 days after that (on January 17) she decided to produce a variation of the photo of Koules using a tool for digital image processing. First, she applied a sepia filter to the photo, and then she added some water drops on the left as a reminder of the strong waves of that day.

Figure 7.4 shows how we can represent all this information using CRMdig. This figure shows the RDF (Resource Description Framework) graph in the form of a UML (Unified Modeling Language²) object diagram. In this diagram, boxes represent objects, i.e., instances of classes. The name of each object’s class is shown in the upper part of the box (in bold). The name of each object (or attribute value) is shown in the bottom part of its box. The associations between these objects are shown as directed edges that connect the corresponding boxes. More about RDF and the representation of such RDF graphs in various syntaxes is given in Chap. 8.

Returning to our case, we should note that files usually keep a limited (in size and representational complexity) form of provenance (e.g., the fields author, last modified, etc.). A more complete provenance is often required, as in our scenario. However, it is sometimes difficult, for various reasons (e.g., storage space), to embed the full provenance into each particular file. One alternative is to rely on external files or systems that are sometimes used for documenting the provenance of files and digital objects in general. Of course, this approach raises the risk of losing the provenance information or creating inconsistencies if the external provenance resources are not properly “connected” to the digital objects they refer to. Another alternative that bypasses these issues is to encapsulate provenance information with the digital objects into a single resource, e.g., into an Archival Information Package (AIP). More information about Information Packages and OAIS can be found in Sect. 18.3.2.

² http://www.uml.org/

Fig. 7.4 The derivation history of “MyPlace.png”
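The derivation history narrated above can also be written down as plain data and queried. The following Python sketch is only an illustration under stated assumptions: the class `DerivationEvent`, its fields, and the intermediate object names (`photo_original`, `photo_converted`, `photo_sepia`) are hypothetical, and the structure is far simpler than CRMdig or any of the other models mentioned. It records each event with its actor, date, input, and output, and then answers the question "what is the derivation history of MyPlace.png?".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DerivationEvent:
    activity: str
    actor: str
    when: date
    source: Optional[str]  # None when the object was created from an observation
    result: str


# Events taken from the running example; intermediate names are hypothetical.
EVENTS = [
    DerivationEvent("photo capture at Koules", "Daphne", date(2015, 1, 6),
                    None, "photo_original"),
    DerivationEvent("format conversion", "Daphne", date(2015, 1, 15),
                    "photo_original", "photo_converted"),
    DerivationEvent("sepia filter", "Daphne", date(2015, 1, 17),
                    "photo_converted", "photo_sepia"),
    DerivationEvent("water drops added", "Daphne", date(2015, 1, 17),
                    "photo_sepia", "MyPlace.png"),
]


def derivation_history(target, events):
    """Walk back from target to its origin; return events oldest first."""
    produced_by = {event.result: event for event in events}
    history = []
    current = target
    while current is not None and current in produced_by:
        event = produced_by[current]
        history.append(event)
        current = event.source
    return list(reversed(history))


history = derivation_history("MyPlace.png", EVENTS)
for event in history:
    print(event.when.isoformat(), event.actor, event.activity, "->", event.result)
```

With such a record available, Robert could have traced the processed image back to the original photograph and its capture context in a single query.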

7.3 Pattern: Provenance and Context of Digital Photographs

Pattern ID: P4
Problem’s name: Provenance and context of digital photographs
The problem: Robert wants to find out the physical location of the image depicted in a digital photograph, and any other information that concerns this file, e.g., the person that took the photograph, the shooting date, the creator of the particular file, and the actor that possibly processed that file
Type of digital artifacts: Image, Video, Text
Task on digital artifacts: Get provenance information
What could have been done to avoid this problem: If every file contained the metadata about its context, Robert’s task would be less laborious and more successful
Lesson learnt: Metadata are not useless data that just enlarge the file size. Instead they can provide us with useful information about a file and its context
Related patterns:
iPrevious:
• P2 (Metadata for digital files and file systems)
iNext:
• P11 (Reproducibility of scientific results)
• P13 (Authenticity assessment)

7.4 Questions and Exercises

1. Change the extension of the file name from png to txt and try to open the file by clicking on it through your operating system. Then open an image viewer (e.g., Microsoft Paint), select File → Open from the menu, and select that file.
2. Use an image viewer and try to store the same image in a different format. Compare the sizes of the resulting files.
3. Rename one of the aforementioned files, just by deleting its extension. Find and use a file type identification tool like JHOVE (http://jhove.sourceforge.net/), DROID, or FileTypeIdentifier (https://code.google.com/p/filetype-identifier/) and check if that tool can identify the correct type of the file.
4. Try to find the embedded metadata in an image, e.g., through the corresponding option in the right-click menu.
5. Try to edit or add metadata to an existing image, e.g., through GeoSetter (http://www.geosetter.de/en/), Opanda (http://www.opanda.com/), ExifTool (https://en.wikipedia.org/wiki/ExifTool), PhotoMe (https://www.photome.de/), or any other tool you may find.
6. Select one photo of yourself. Then visit a web search engine like http://images.google.com/, https://www.tineye.com/, or http://www.imageraider.com/. Select that file as a query and try to find photographs that resemble your photo.
7. Try to upload a file to a cloud storage service (e.g., Dropbox, Google Drive, Box, Mega) and then upload newer versions of that file. Try to find the previous versions of that file in the cloud storage service.
8. See the pencil drawing shown in Fig. 7.5. It is based on a painting by a famous artist. Use the Internet and try to find the artist.


Fig. 7.5 A pencil drawing

7.5 Links and References

7.5.1 Readings

Papers on Modeling and Managing Provenance
• Freire, J., Koop, D., Santos, E., & Silva, C. (2008). Provenance for computational tasks: A survey. IEEE Computing in Science & Engineering, 10(3).
• Theodoridou, M., Tzitzikas, Y., Doerr, M., Marketakis, Y., & Melessanakis, V. (2010). Modeling and querying provenance by extending CIDOC CRM. Distributed and Parallel Databases, 27(2), 169–210.
• Strubulis, C., Flouris, G., Tzitzikas, Y., & Doerr, M. (2014). A case study on propagating and updating provenance information using the CIDOC CRM. International Journal on Digital Libraries, 15(1), 27–51.

On Image Compression
• May, P., & Davies, K. (2016). Practical analysis of TIFF file size reductions achievable through compression. 13th international conference on digital preservation, Bern.

Table with Common Image File Formats
• Cornell University Library/Research Department. (2018, May 3). Table: Common image file formats. http://preservationtutorial.library.cornell.edu/presentation/table7-1.html. Accessed May 3, 2018. (Archived by WebCite® at http://www.webcitation.org/6z8aEn5nZ)

7.5.2 Tools and Systems

Exif Tools
• ExifTool: http://www.sno.phy.queensu.ca/~phil/exiftool/
• Opanda EXif: http://opanda.com/en/iexif/

Image Search on the Web
• Google Images: https://images.google.com/
• Yahoo Image Search: https://images.search.yahoo.com/
• TinEye Reverse Image Search: https://www.tineye.com/

Conceptual Models for Provenance
• OPM: http://openprovenance.org/
• ProvDM: http://www.w3.org/TR/prov-dm/
• Dublin Core: http://dublincore.org/documents/dces/
• CIDOC CRM: http://www.cidoc-crm.org/
• CRMdig: http://www.ics.forth.gr/isl/index_main.php?l=e&c=656

Chapter 8

The File todo.csv: On Understanding Data Values

8.1 Episode

“Time has passed. I have to hurry to catch my son’s recital,” Robert thinks. “But let me see another file; I may be lucky enough to solve the mystery in a few minutes.” He looks over the big list of files while trying to guess which file might give him the information he is looking for. His gaze stops at a file named “todo.csv”. The thought that the file may have useful information makes him feel optimistic; however, Robert has no idea what the extension “.csv” stands for. After asking his colleagues, he understands that csv is an abbreviation of “comma separated values,” meaning that such a file is a text file comprising rows where each row contains a number of values separated by commas. He opens the file using a text editor and he gets what is shown in Fig. 8.1.

4/5, To buy tickets
7/7, To submit hy450
1/8, To travel

Fig. 8.1 The contents of the file todo.csv

Robert observes that the contents are divided into two columns. He wonders what the values of the first column represent. They could be ratios representing the number of resources or activities that have been carried out; e.g., the first line could describe four out of five tickets that have been bought already. They could also be dates that denote when the activities described in the second column should be carried out. But then, they are incomplete in the sense that there is no year information in them. Furthermore, Robert cannot understand whether these dates correspond to the past or to the future. What does 4/5 mean: the 4th of May or the 5th of April? The next two values of that column, i.e., 7/7 and 1/8, raise the same questions; consequently, Robert can only make hypotheses about their meaning.

He decides to focus on the descriptions of the second column. He realizes that they are not very helpful either. The first refers to ticket buying, the third refers to a trip departure, but there is no information about the destination of the trip. He looks at the second value, which contains quite a strange code, “hy450”. “What could this code mean?” he thinks. It could be anything. He conducts a quick web search using the word “hy450” as a query, but he does not get anything that seems useful. He decides to search again using different numbers. He tries “hy200” and “hy500”, but he does not get anything interesting.

Robert leans back in his chair. But why is the data still so incoherent and disconnected, he wonders. Then he recalls the case of Linked Data, a relatively recent method of publishing structured data so that it can be interlinked and become more useful through semantic queries. If the file todo.csv were web-accessible, then it would take 3 stars according to the 5-star deployment scheme for Open Data that was suggested by Tim Berners-Lee, the inventor of the web and Linked Data initiator. “If this file was ranked with 4 or 5 stars, then I would have fewer difficulties. At least the dates would be well-formed, and instead of the incomprehensible hy450, I would have a URI from which I could deduce much more. Without clear semantics about the contents, I cannot safely deduce certain facts from this file,” Robert thinks. He looks at his watch and realizes time has passed. “Run Robert, run,” he ponders, and leaves his office.

Robert’s hypothesis was correct. The first column of the file was indeed used for describing dates. Daphne used to organize her activities in that way. She was creating small documents in csv format for her planned or past activities. She was using several such files for organizing different activities, including the tasks she should carry out, the dates of her exams, her meetings, and many more. She was using different file names for categorizing them. This particular file contained tasks that required her action to be completed, and for this reason she used the file name todo. The file contained reminders for buying tickets to fly back home on May 4, for submitting an exercise for her course with course ID hy450 on July 7, and for catching her flight back to Greece on August 1.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_8
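Robert's difficulty with the first column can be reproduced mechanically. The sketch below (Python, standard library only) parses the three rows of Fig. 8.1 and lists every calendar date that each value could denote. The function name `candidate_dates` is our own, and the year is a pure assumption, since the file itself carries no year information.

```python
import csv
import io
from datetime import date

# The contents of todo.csv as shown in Fig. 8.1.
TODO_CSV = "4/5, To buy tickets\n7/7, To submit hy450\n1/8, To travel\n"

ASSUMED_YEAR = 2015  # assumption: the file does not state a year


def candidate_dates(value, year):
    """Return every date that 'a/b' could denote under the two common conventions."""
    first, second = (int(part) for part in value.split("/"))
    conventions = {
        "day/month": (second, first),  # (month, day)
        "month/day": (first, second),
    }
    readings = {}
    for label, (month, day) in conventions.items():
        try:
            readings[label] = date(year, month, day)
        except ValueError:
            pass  # this reading is not a valid calendar date
    return readings


rows = [
    (raw_date.strip(), task.strip())
    for raw_date, task in csv.reader(io.StringIO(TODO_CSV))
]

for raw_date, task in rows:
    options = sorted({d.isoformat() for d in candidate_dates(raw_date, ASSUMED_YEAR).values()})
    print(f"{raw_date!r} ({task}): {' or '.join(options)}")
```

Only the value 7/7 is unambiguous. Had Daphne written the dates in a self-describing form such as the ISO 8601 style that `isoformat()` produces (year-month-day), neither the convention nor the year would have been left to guesswork.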

8.2 Technical Background

Interpreting data correctly is a critical issue, and it is the subject of this section. We focus on data formats that aim to be as self-describing as possible in order to assist the interpretation of data. First we say a few words about NetCDF (Sect. 8.2.1), then we focus on the technology stack of the Semantic Web (Sect. 8.2.2), on which Linked Open Data is based (Sect. 8.2.3), and finally (Sect. 8.2.4) we compare these technologies with various other technologies (including EAST, DEDSL, and XFDU).

8.2.1 NetCDF

NetCDF (Network Common Data Form) is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented (multidimensional) scientific data. It is used as a common format by the geophysical data communities, e.g., in meteorology, climatology, and GIS applications. The file extension of NetCDF files is “.nc”. The formats aim to be “self-describing,” meaning that the header of a file describes the layout of the rest of the file, so that the file can be interpreted by both humans and machines. An example of a NetCDF file, and its comparison with other formats, is given in Sect. 8.2.4.

8.2.2 Semantic Web

The Semantic Web is an evolving extension of the WWW (World Wide Web) where content can be expressed not only in natural language but also in formal languages (e.g., RDF/S, OWL) that can be read and used by software agents, permitting them to find, share, and integrate information more easily. In general, we could say that the ultimate objective is the collaborative creation and evolution of a worldwide distributed graph. For readers who are familiar with relational databases, we could say that this graph resembles the structure of an entity relationship diagram.

To achieve the Semantic Web vision, several technologies have emerged in recent years. Many of them are international (W3C) standards. These technologies include:
• Knowledge representation languages (e.g., RDF/S, OWL) and formats for exchanging knowledge (including RDF/XML, Turtle, N-Triples, JSON-LD, and others)
• Rule languages and inference engines (e.g., SWRL)
• Query languages (e.g., SPARQL)
• Techniques for constructing mappings for integrating/harmonizing schemas and data
• Technologies for mining structured knowledge from texts
• Various APIs (Application Programming Interfaces)

Starting with the World Wide Web, where the content is represented as hypertext (i.e., pages containing texts and links to other pages), let’s now give an overview of the basic technologies on which the Semantic Web is based. The technological stack of the Semantic Web is depicted in Fig. 8.2.

Fig. 8.2 Semantic Web stack. Layers shown in the figure: User Interface and Applications; Trust; Logic; Proof; Query Language: SPARQL; Ontologies: OWL; Rules: SWRL; Schemas and Taxonomies: RDF Schema; Graph-based Data Model: RDF Model and Syntax; Structure: XML (Schema, Namespaces, Datatypes); Identifiers: URI/IRI; Character Set: Unicode

A brief description of the layers follows:
• At the lowest level we have the URIs (Uniform Resource Identifiers) for identifying and locating resources, and Unicode for representing characters from various natural languages. Nowadays (since 2005), instead of URI, we use the term IRI, standing for “Internationalized Resource Identifier.” While URIs are limited to a subset of the ASCII character set, IRIs may contain characters from the Universal Character Set (Unicode/ISO 10646), including Chinese or Japanese kanji, Korean, Cyrillic characters, and so forth.
• XML provides a syntax for writing and exchanging structured documents, but does not impose any constraints regarding the semantics of these documents. XML Schema offers a method to restrict the structure that XML documents can have. It also offers a rich and extensible set of data types that are exploited by RDF. For querying XML documents, we have XPath and XQuery.
• RDF (Resource Description Framework) is a structurally object-oriented model for representing objects (resources) and associations between them. It allows expressing content in the form of triples (subject, property, object). A set of triples forms a labeled graph, also known as a semantic network. These triples can be expressed and exchanged in various formats (e.g., TriG, N3, RDF/XML); some of them are based on XML (specifically RDF/XML).
• RDF Schema (RDFS) allows defining the vocabulary to be used in RDF and the semantic relationships between the elements of this vocabulary.
• OWL (Web Ontology Language) extends RDFS (and this is why it is often said that “OWL” covers “RDFS”) with the ability to specify transitive, inverse, and symmetric properties, existential and cardinality constraints, and more expressive domain and range constraints for properties. It therefore allows representing ontologies, i.e., formal specifications of certain domains that describe things, classes of things, and relationships more rigorously. Consequently, it is associated with a richer inference mechanism.
• For exploiting the structured content that has been represented using RDF/S or OWL there are query languages and rule languages. Specifically, SPARQL (SPARQL Protocol and RDF Query Language) is a query language for knowledge expressed in RDF/OWL. SWRL (Semantic Web Rule Language) allows expressing inference rules (essentially Horn rules). RIF is a rule interchange format.
• The layers Logic and Proof concern the enrichment of the expressiveness of the representation languages. Finally, the Trust layer concerns trust issues, e.g., digital signatures for proving that one particular person has written or agrees with a particular document or sentence, as well as mechanisms allowing users to define whom they trust (and so on), eventually yielding trust networks (Web of Trust).

Overall, we could say that the Semantic Web technologies are beneficial for digital preservation since the “connectivity” of data is useful in making the semantics of the data explicit and clear. This is the key point of the Linked Open Data initiative, which is essentially a method for publishing structured content in a way that enables connecting it.
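The triple model, the role of a query language, and the kind of inference that RDFS enables can be illustrated with a minimal sketch in plain Python. It does not use any RDF library, and all identifiers under `http://example.org/` are hypothetical names invented for this illustration. The `match` function plays the role of a very small SPARQL-like triple pattern, and `infer_types` applies a single RDFS-style rule (subclass membership).

```python
from typing import Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)

# Hypothetical namespace and resources, invented for illustration only.
EX = "http://example.org/"
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples: Set[Triple] = {
    (EX + "daphne", RDF_TYPE, EX + "Student"),
    (EX + "Student", SUBCLASS_OF, EX + "Person"),
    (EX + "daphne", EX + "attends", EX + "hy450"),
    (EX + "hy450", RDF_TYPE, EX + "Course"),
}


def match(graph: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield the triples that fit the pattern; None acts as a wildcard."""
    for triple in graph:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


def infer_types(graph: Iterable[Triple]) -> Set[Triple]:
    """Apply one RDFS-style rule until nothing new is derived:
    if x has type C and C is a subclass of D, then x has type D."""
    closed = set(graph)
    changed = True
    while changed:
        changed = False
        for (x, _, c) in list(match(closed, p=RDF_TYPE)):
            for (_, _, d) in list(match(closed, s=c, p=SUBCLASS_OF)):
                if (x, RDF_TYPE, d) not in closed:
                    closed.add((x, RDF_TYPE, d))
                    changed = True
    return closed


closed = infer_types(triples)
# "Who is a Person?" has no answer in the raw triples, only after inference.
persons = sorted(s for (s, _, _) in match(closed, p=RDF_TYPE, o=EX + "Person"))
print(persons)
```

Real systems delegate both steps to a triple store and a reasoner; the sketch only shows why explicit schema statements make more facts derivable from the same data.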

8.2.3 Linked Open Data (LOD)

Linked Data describes a method of publishing structured data so that it can be interlinked and become more useful through semantic queries. It builds upon standard web technologies such as HTTP, RDF, and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried. The inventor of the web and initiator of Linked Data, Sir Tim Berners-Lee, has suggested a 5-star deployment scheme, which allows rating data based on their openness; 5-star data are described in detail in Sect. 18.2.7.
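As a rough illustration of the 5-star idea, the following sketch encodes the scheme as it is commonly summarized (on the web under an open licence; structured data; non-proprietary format; URIs to denote things; links to other data). The function name and parameter names are assumptions made for this example, and the authoritative description remains the one in Sect. 18.2.7.

```python
def open_data_stars(on_web_with_open_licence: bool,
                    structured: bool,
                    non_proprietary_format: bool,
                    uses_uris: bool,
                    linked_to_other_data: bool) -> int:
    """Return the number of stars; each star presupposes the previous ones."""
    criteria = (
        on_web_with_open_licence,
        structured,
        non_proprietary_format,
        uses_uris,
        linked_to_other_data,
    )
    stars = 0
    for satisfied in criteria:
        if not satisfied:
            break  # the scheme is cumulative, so stop at the first unmet criterion
        stars += 1
    return stars


# todo.csv, if it were published on the web under an open licence:
# structured and in a non-proprietary format (CSV), but without URIs.
todo_csv_stars = open_data_stars(True, True, True, False, False)
print(todo_csv_stars)
```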

8.2.3.1 Linked Data in HTML

One way to add information about the meaning of the contents of a web page is to use RDFa, which stands for “RDF in attributes.” The main rationale behind it is that the web was initially built to be used by humans, and much of the rich and useful information on the web is inaccessible to machines. Humans can cope with all sorts of variations in layout, spelling, capitalization, color, position, and so on, and still absorb the intended meaning from a web page. RDFa provides the means to make implicit declarations explicit by using particular references (in the form of Uniform Resource Identifiers, URIs). Suppose there is a web page presenting the book you are reading. At some point there are the following lines1:

Cinderella’s Stick
Yannis Tzitzikas

1 This content is visible when looking at the HTML source of the web page.

Fig. 8.3 An example of HTML page with RDFa formatted data

A human could quite easily understand that Yannis Tzitzikas is a person. However, even for humans, it requires some effort to understand what the role of this person is: is he the writer, the publisher, a contributor, or a character that appears in the story of the book? For a machine, it is even more difficult to understand these details automatically. The same piece of information written using the RDFa specification is shown in Fig. 8.3.

Returning to our episode, todo.csv was actually an old to-do list of Daphne. On May 4, 2015 (the date corresponding to “4/5”), she wanted to buy tickets for her summer vacation, which would start on August 1 (the date corresponding to “1/8”). On July 7 of the same year (“7/7”), she had to submit an optional project for a university course with code “hy450”. If this to-do list were described according to the principles of Linked Data, it would be much easier to understand. Below we show such a description in RDF. First, Fig. 8.4 illustrates its representation in RDF in the form of a UML object diagram. The representation in RDF using the RDF/XML syntax is given in Fig. 8.5, while Fig. 8.6 shows an equivalent representation using the Turtle syntax. Both descriptions presuppose a header that defines the namespaces that are used. Namespaces are used to uniquely identify a set of resources, so that there is no ambiguity when resources having the same name but different origins (and probably different semantics) are mixed together. To improve the readability of resources that use namespaces, abbreviations (prefixes) are used. These are shown in Fig. 8.7. Apart from the basic namespaces of the Semantic Web (rdf, rdfs, owl, xml, xsd), the list contains crm, which corresponds to the ontology CIDOC CRM (CIDOC conceptual reference model, ISO 21127:2006); csd, which corresponds to a namespace related to the university of Daphne; and trp, which corresponds to a site that provides trip itineraries.
Fig. 8.4 The contents of todo.csv expressed in RDF

In the previous example, the vocabulary of the ontology CIDOC CRM was used for describing the contents of the file todo.csv precisely and less ambiguously. In general, various vocabularies can be used for this purpose. Another noteworthy initiative is Schema.org, a collaborative community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond. The vocabulary of Schema.org can be used with many different formats (including RDFa and JSON-LD), and millions of sites already use Schema.org to mark up their web resources. Semantic relations between the elements of these ontologies and schemas can be exploited for bridging the gaps between different ontologies and vocabularies. This bridging is related to the tasks that have to be carried out for semantically integrating data, including ontology matching, instance matching, and query federation.
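To make the RDFa idea of this section concrete, the following sketch shows how explicit attributes let a program recover statements from a page. The markup in `RDFA_SNIPPET` is an assumption written for this illustration using Schema.org terms (`Book`, `name`, `author`); it is not a reproduction of Fig. 8.3. The collector relies only on Python's standard `html.parser` module and handles just the flat case shown here, so it should not be mistaken for a complete RDFa processor.

```python
from html.parser import HTMLParser

# Assumed markup (not the book's figure): the same two lines of the web page,
# annotated with RDFa Lite attributes and the Schema.org vocabulary.
RDFA_SNIPPET = """
<div vocab="http://schema.org/" typeof="Book">
  <h2 property="name">Cinderella's Stick</h2>
  <h3 property="author">Yannis Tzitzikas</h3>
</div>
"""


class RdfaLiteCollector(HTMLParser):
    """Collect (property URI, text) pairs from flat RDFa Lite markup."""

    def __init__(self) -> None:
        super().__init__()
        self.vocab = None        # value of the enclosing vocab attribute
        self.item_type = None    # value of the enclosing typeof attribute
        self._property = None    # property of the element currently open
        self.statements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "vocab" in attributes:
            self.vocab = attributes["vocab"]
        if "typeof" in attributes:
            self.item_type = attributes["typeof"]
        self._property = attributes.get("property")

    def handle_data(self, data):
        text = data.strip()
        if self._property and text:
            self.statements.append(((self.vocab or "") + self._property, text))

    def handle_endtag(self, tag):
        self._property = None


collector = RdfaLiteCollector()
collector.feed(RDFA_SNIPPET)
collector.close()

print(collector.item_type)
for statement in collector.statements:
    print(statement)
```

With such attributes in place, the role of “Yannis Tzitzikas” no longer has to be guessed from layout: the page states that the text is the value of the `author` property of a `Book`.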

Fig. 8.5 The contents of todo.csv expressed in RDF using the RDF/XML format

Fig. 8.6 The contents of todo.csv expressed in RDF using the TURTLE format

@prefix rdf: .
@prefix rdfs: .
@prefix owl: .
@prefix xml: .
@prefix xsd: .
@prefix crm: .
@prefix csd: .
@prefix trp: .
@base .

Fig. 8.7 The namespaces (and their prefixes) of the running example

8.2.3.2 On Producing RDF and Linked Data

RDF and Linked Data can be derived from sources of various kinds: from sources of structured data (e.g., CSV files, spreadsheets, XML, relational databases), from sources of unstructured data (text), as well as from sources of mixed data (e.g., web pages). The most common way to transform data from various formats to RDF is to rely on mappings that describe how the source data correspond to RDF. There are mapping languages, and tools that rely on them, that enable the transition of data from (a) relational databases to RDF, like Direct Mapping (Berners-Lee 2013), R2RML (Das et al. 2009), and RML (Dimou et al. 2014), (b) XML data to RDF, like X3ML (Marketakis et al. 2015), (c) CSV data to RDF, like XLWrap (Langegger and Wöß 2009), and (d) tabular data to RDF, like OpenRefine. Furthermore, data can be extracted from texts and expressed in RDF using tools like DBpedia Spotlight (Mendes et al. 2011).

Such processes have been used to derive various corpora of data in RDF. For instance, DBpedia was derived by extracting knowledge from Wikipedia (Lehmann et al. 2015). An example of extracting embedded metadata from the files of a file system and transforming them to RDF is described by Marketakis et al. (2009). The aggregation and integration of data from various sources of the marine domain for producing big knowledge graphs in RDF is described by Tzitzikas et al. (2016). An example of the adoption of RDF in the context of archaeological research is described by Felicetti et al. (2015). A process for producing Linked Data from unstructured bibliographic data in Japan is described by Yoshiga and Tadaki (2017). Currently, the cloud of Linked Data contains a large number of datasets and triples. For instance, Rietveld et al. (2015) provide an index of 8 billion RDF triples from 600,000 RDF documents.

NetCDF-LD (https://binary-array-ld.github.io/netcdf-ld/) is an approach for constructing Linked Data descriptions using the metadata and structures found in NetCDF files (which were described in Sect. 8.2.1 and which we will encounter again in the next section). Essentially, it connects the contents of NetCDF files with external web-based resources (vocabularies, ontologies, other data collections).
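The notion of a mapping from tabular data to RDF can be sketched with a few lines of Python that turn the rows of todo.csv into N-Triples. Several things here are assumptions made for illustration: the comma-separated layout of `TODO_CSV`, the namespace `http://example.org/todo/`, and the property names `dueOn` and `description` (the book's own description uses CIDOC CRM instead). Note that the mapping cannot be written without two pieces of context that the file itself does not carry, namely the year and whether the day or the month comes first.

```python
import csv
import io
from datetime import date

# Assumed layout of todo.csv: a partial date, a comma, and a free-text task.
TODO_CSV = "4/5,To buy tickets\n7/7,To submit hy450\n1/8,To travel\n"

BASE = "http://example.org/todo/"  # hypothetical namespace
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def parse_partial_date(text: str, year: int, day_first: bool) -> date:
    """Complete a value such as '4/5' using context the file does not provide."""
    first, second = (int(part) for part in text.split("/"))
    day, month = (first, second) if day_first else (second, first)
    return date(year, month, day)


def rows_to_ntriples(csv_text: str, year: int, day_first: bool = True) -> list:
    """Map each CSV row to two N-Triples statements about a task resource."""
    lines = []
    rows = csv.reader(io.StringIO(csv_text))
    for number, (when, what) in enumerate(rows, start=1):
        subject = f"<{BASE}task{number}>"
        iso = parse_partial_date(when, year, day_first).isoformat()
        lines.append(f'{subject} <{BASE}dueOn> "{iso}"^^<{XSD_DATE}> .')
        # No escaping is done here; real task texts with quotes would need it.
        lines.append(f'{subject} <{BASE}description> "{what}" .')
    return lines


for line in rows_to_ntriples(TODO_CSV, year=2015, day_first=True):
    print(line)

# The same value read with the month first denotes a different day.
print(parse_partial_date("4/5", 2015, day_first=False))
```

Mapping languages such as R2RML or RML express the same kind of correspondence declaratively rather than in program code, which makes the mapping itself a preservable artifact.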

8.2.4 Other Technologies (and Their Comparison with Semantic Technologies)

To make the differences between semantic technologies and other alternative (and older) technologies clear, below we describe an example concerning a simple kind of scientific data (the example is taken from Marketakis and Tzitzikas 2009). Suppose that we want to preserve files containing temperature measurements from various locations. Each file can comprise an arbitrary number of lines. Each line contains three numerical values corresponding to the longitude, the latitude, and the measured temperature in degrees Celsius, respectively. Each line corresponds to the temperature at the area specified by the coordinates, as it was measured at a certain point in time. The time of the measurements is hardwired in the name of the file, e.g., a file named datafile20080903_12PM.txt contains measurements taken at 12 pm on September 3, 2008. Suppose that the contents of this file are as shown in Fig. 8.8. Also suppose that we have several such files, all having the same format, though each one contains measurements of different locations and at different times.

25.130 35.325 30.2
25.100 35.161 28.9
25.180 35.333 29.3

Fig. 8.8 The contents of the file datafile20080903_12PM.txt

An alternative solution for storing and exchanging these data is to serialize them in JSON (JavaScript Object Notation), a lightweight data-interchange format. It is a language-independent text format consisting of attribute–value pairs, which is simple for humans to read, write, and understand. The contents of the above example would then be written in JSON as shown in Fig. 8.9. The left part shows the actual data in JSON format and the right part shows a graphical representation. Although the JSON data contain some semantic information (i.e., a name for each number), much information is still missing (e.g., the measurement unit).

Fig. 8.9 The contents of the file datafile20080903_12PM in JSON format

Syntax Description
To document the structure of the file datafile20080903_12PM.txt we could use the EAST (Enhanced Ada SubseT) language. It was first standardized as ISO 15889:2000 and since then it has been revised twice, producing ISO 15889:2003 and ISO 15889:2011. Each data description record (DDR) of that language comprises two packages: one for the logical description and one for the physical description of the data. The former includes a logical description of all the described components, their size in bits, and their location within the set of the described data. The latter (physical part) includes a representation of some basic types defined in the logical description, which depend on the machine that generates the data: the organization of arrays (i.e., first-index-first or last-index-first) and the bit organization on the medium (high-order-first or low-order-first for big-endian or little-endian representation, respectively).


package logical_datafileX_description is
  type HORIZONTICAL_COORDINATE is range -90.00 .. 90.00
  for HORIZONTICAL_COORDINATE’size 64; --bits
  type VERTICAL_COORDINATE is range -180.00 .. 180.00
  for VERTICAL_COORDINATE’size 64; --bits
  type TEMPERATURE_TYPE is range -180.00 .. 180.00
  for TEMPERATURE_TYPE’size 16; --bits
  type MEASUREMENT_TUPLE is record
    LONGITUDE:VERTICAL_COORDINATE
    LATITUDE:HORIZONTICAL_COORDINATE
    MEASURED_TEMPERATURE:TEMPERATURE_TYPE
  end_record;
  for MEASUREMENT_TUPLE’s size use 144;
  type MEASUREMENT_BLOCK is array(1..1000) of MEASUREMENT_TUPLE;
  for MEASUREMENT_BLOCK’size use 144000;
  SOURCE_DATA:MEASUREMENT_BLOCK
end logical_datafileX_description is
package physical_datafileX_description is
end physical_datafileX_description;

Fig. 8.10 An example of an EAST description
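Why a physical description is needed in addition to the logical one can be sketched with Python's standard `struct` module. The sizes follow the listing above (64 + 64 + 16 = 144 bits per tuple), but the concrete encoding is an assumption made for this illustration: the two coordinates are stored as IEEE 754 doubles and the temperature as a signed 16-bit integer counting hundredths of a degree. EAST itself does not prescribe this encoding.

```python
import struct

# Assumed physical layouts for one MEASUREMENT_TUPLE (18 bytes = 144 bits):
# two 64-bit doubles followed by one signed 16-bit integer, no padding.
BIG_ENDIAN = ">ddh"      # high-order byte first
LITTLE_ENDIAN = "<ddh"   # low-order byte first


def pack_tuple(longitude: float, latitude: float,
               temperature_c: float, layout: str) -> bytes:
    """Encode one measurement; the temperature is stored in hundredths."""
    return struct.pack(layout, longitude, latitude, round(temperature_c * 100))


def unpack_tuple(raw: bytes, layout: str) -> tuple:
    """Decode one measurement according to the given byte order."""
    longitude, latitude, hundredths = struct.unpack(layout, raw)
    return longitude, latitude, hundredths / 100


record = pack_tuple(25.130, 35.325, 30.2, BIG_ENDIAN)

print(len(record) * 8)                      # size of the tuple in bits
print(unpack_tuple(record, BIG_ENDIAN))     # decoded with the right byte order
print(unpack_tuple(record, LITTLE_ENDIAN))  # same bytes, wrong byte order
```

The last line decodes the same 18 bytes into entirely different numbers, which is the situation the physical package of a DDR is meant to prevent.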

Figure 8.10 provides an example of a possible DDR describing our data file. It defines one type for each of the three columns (longitude, latitude, and temperature) because each one represents a different kind of data (the distinction between longitude and latitude is made only because of their different upper and lower bounds).

Semantic Description
We can provide a semantic description of the entities longitude, latitude, and temperature of the file datafile20080903_12PM.txt for clarifying the interpretation of the terms “longitude,” “latitude,” and “temperature.” One approach is to use the DEDSL (Data Entity Dictionary Specification Language). The abstract syntax of DEDSL is described in ISO 21961:2003. Figure 8.11 shows an example of a DEDSL description for our data file according to the implementation of DEDSL using XML. If we have another file with the same kind of information, we can reuse the same semantic descriptions.

Fig. 8.11 Example of a DEDSL description

EAST and DEDSL have been used mainly by CNES (Centre National d’Études Spatiales). CNES has been leading the activities toward the development of tools and APIs that exploit these standards under the DEBAT project (Development of EAST Based Access Tools).2 More specifically, the standards are exploited by the BEST workbench,3 a suite of tools that allow describing data and offer data simulation facilities as well.

Packaging
Packaging formats can be used for preparing a package that contains the data files plus their EAST and DEDSL descriptions. XFDU (XML Formatted Data Unit) is a standard file format developed by CCSDS (Consultative Committee for Space Data Systems) for packaging and conveying scientific data, aiming at facilitating information transfer and archiving. By adopting a packaging approach (like that of XFDU), we can also add extra information about the components of the package. For example, suppose we would like to add, for each file, information about the user who measured the temperatures, the characteristics of the thermometer, or the satellite (if the samplings were made from space). With XFDU this is easy, since we just have to add the necessary information to the package. For instance, one could describe such information using an ontology like CIDOC CRM. Such descriptions could be expressed in XML format or as an RDF/XML file. In both cases, the resulting file could be included in the package. Overall, the benefit of using XFDU is that we can package together heterogeneous artifacts (including data files, Java programs, provenance data) and deliver them to the user, or archive them, as a single (ideally self-describing) unit.

XFDU is the basis of the Standard Archive Format for Europe (SAFE). The European Space Agency (ESA) has created it as a standard format for archiving and conveying Earth observation data within ESA and cooperating agencies. SAFE is a profile of XFDU, i.e., it restricts the XFDU specifications for specific use in the Earth observation domain. To this end, ESA has developed a set of tools and APIs for managing these resources.
2 http://debat.c-s.fr/
3 https://logiciels.cnes.fr/en/node/16

Alternatively, we could create self-describing packages using NetCDF. To this end, we could create a single NetCDF file that contains the different dimensions of the measurements (i.e., the longitude, the latitude, and the temperature) together with the necessary semantic information. Figure 8.12 shows a textual representation of the structure of such a NetCDF file, which contains the temperature and geographical information of the running example. The left part of Fig. 8.13 shows the tabular values of the corresponding NetCDF file and the right part shows an indicative visualization using the open-source tool Panoply.

netcdf file:temperatureMeasurements-20080903.nc {
  dimensions:
    latitude = 3;
    longitude = 3;
    time = 1;
  variables:
    double temperature(time=1, latitude=3, longitude=3);
      :long_name = "surface temperature";
      :units = "Celsium degrees";
    float latitude(latitude=3);
      :units = "degrees_north";
      :axis = "Y";
      :standard_name = "latitude";
      :_CoordinateAxisType = "Lat";
    float longitude(longitude=3);
      :units = "degrees_east";
      :axis = "X";
      :standard_name = "longitude";
      :_CoordinateAxisType = "Lon";
    int time(time=1);
      :units = "Hours counting from 12 AM GMT";
}

Fig. 8.12 Textual representation of a NetCDF file

Fig. 8.13 Storing and presenting temperature information using NetCDF file

Semantic Technologies
Now we will describe, using the same running example, an alternative approach that is based on Semantic Web languages. When creating a new data file there is no need to create its DEDSL or EAST description every time. Instead, we can define an ontology (or use an existing one) expressed as an RDF Schema or as an OWL ontology, and then use the vocabulary of that ontology to represent the data. Figure 8.14 shows an indicative ontology for our running example expressed in RDF/S XML.

Fig. 8.14 Example of an RDF Schema for temperature measurement

A benefit of using Semantic Web languages is that data and their descriptions are tightly coupled. In contrast, EAST/DEDSL descriptions are represented as separate files, and for this reason packaging formats are important. In RDF, a data file itself defines the data type of each data element in the file. To clarify this point, consider the following line from our running example: 25.130 35.325 30.2 (Fig. 8.8). This line does not provide any information regarding what 25.130 might be. It could be the longitude, the latitude, or the temperature. In contrast, its RDF representation would be as shown in Fig. 8.15. Clearly, this piece of RDF can be understood without the existence of any other (EAST/DEDSL) files.

Another benefit of Semantic Web technologies is that, whenever we want to support more data types, we can extend the ontology. For example, past data files may contain only the longitude, the latitude, and the temperature, while current ones may also include the name of each location, or the thermometer used for the measurement. With DEDSL/EAST, two different kinds of descriptions have to be created and used in such cases. In the Semantic Web approach, we just have to extend the top ontology.


Fig. 8.15 The RDF representation of the temperature measurement
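The contrast drawn in this section between a bare line of numbers, a JSON record, and a self-describing set of triples can be sketched in Python as follows. The key names of the JSON records, the `ex:` property names, and the unit resource `ex:DegreeCelsius` are hypothetical choices made for this illustration; they are not the vocabulary of Fig. 8.9, Fig. 8.14, or Fig. 8.15.

```python
import json
import re
from datetime import datetime

FILE_NAME = "datafile20080903_12PM.txt"
FILE_TEXT = "25.130 35.325 30.2\n25.100 35.161 28.9\n25.180 35.333 29.3\n"


def parse_measurements(text: str) -> list:
    """Name the three columns; the order is knowledge external to the file."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        longitude, latitude, temperature = (float(v) for v in line.split())
        records.append({"longitude": longitude,
                        "latitude": latitude,
                        "temperature": temperature})
    return records


def measurement_time(file_name: str) -> datetime:
    """Recover the measurement time that is hardwired in the file name."""
    found = re.fullmatch(r"datafile(\d{4})(\d{2})(\d{2})_(\d{1,2})(AM|PM)\.txt",
                         file_name)
    if found is None:
        raise ValueError(f"unexpected file name: {file_name}")
    year, month, day, hour = (int(group) for group in found.groups()[:4])
    hour = hour % 12 + (12 if found.group(5) == "PM" else 0)
    return datetime(year, month, day, hour)


def to_triples(records: list, when: datetime) -> list:
    """Attach type, datatype, unit, and time to every value (hypothetical terms)."""
    triples = []
    for number, record in enumerate(records, start=1):
        subject = f"ex:measurement{number}"
        triples += [
            (subject, "rdf:type", "ex:TemperatureMeasurement"),
            (subject, "ex:longitude", f'"{record["longitude"]}"^^xsd:double'),
            (subject, "ex:latitude", f'"{record["latitude"]}"^^xsd:double'),
            (subject, "ex:temperature", f'"{record["temperature"]}"^^xsd:double'),
            (subject, "ex:unit", "ex:DegreeCelsius"),
            (subject, "ex:measuredAt", f'"{when.isoformat()}"^^xsd:dateTime'),
        ]
    return triples


records = parse_measurements(FILE_TEXT)
print(json.dumps(records[0]))            # names, but no unit and no time
triples = to_triples(records, measurement_time(FILE_NAME))
for triple in triples[:6]:
    print(triple)
```

The JSON record already says which number is the longitude, but only the last representation also states the unit and the time of measurement alongside each value.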

8.3 Pattern: Interpretation of Data Values

Pattern ID: P5

Problem’s name: Interpretation of data values

The problem: Robert did not manage to interpret a file containing data organized in rows and columns. Although the encoding of the symbols was not a problem, since he managed to read words of the English language (Latin characters), he could not understand the semantics of various numerical values and strings that do not correspond to English words. In particular, the file contained (a) values which were not complete (i.e., dates with no year information) or values that could have different interpretations in different regions, and (b) a word that did not correspond to a known English word or well-known acronym.

Type of digital artifacts: Data collection

Task on digital artifacts: Perceive the data

What could have been done to avoid this problem: Robert would have tackled this problem if the data file had been more informative, i.e., if (a) each column had a header indicating the semantics of the values of that column, (b) the dates were complete (including the year), and (c) instead of the unknown word (hy450) a resolvable reference (e.g., a URI) was used.

Lesson learnt: Metadata for files containing tabular data are essential for their interpretation. This includes
(a) information about file formats (e.g., CSV)
(b) information about the logical structure of the file (e.g., Latin characters organized in lines, where the values of each line are separated by commas)
(c) information about the data values themselves (complete dates, regional date formats)
(d) information about the semantics of the data values
(e) information about the context of the data (who, when, why, how)
There are various approaches for achieving the above objectives. One method is the Linked Open Data method for publishing and integrating data.

Related patterns:
iPrevious:
• P2 (Metadata for digital files and file systems)
• P3 (Text and symbol encoding)
iNext:
• P12 (Proprietary format recognition)

8.4 Questions and Exercises

1. Find catalogs that describe datasets that are published according to the principles of Linked Data.
2. Find an approximation of the number of datasets that are currently published according to the principles of Linked Data.
3. Find how Semantic technologies tackle the problems stemming from homonyms and synonyms of natural languages.
4. Express the contents of todo.csv in XML. Use xsd:date for the dates.
5. Express the contents of todo.csv in RDF. Search and find an appropriate vocabulary/ontology (apart from CIDOC CRM that is used in the book).
6. Enrich the previous file with triples that describe the provenance of the file. Assume that this file was written by Daphne on March 25, 2015.
7. Compare the resulting file with the original file (todo.csv) and observe the differences.

8.5 Links and References

8.5.1 Readings

About Semantic Web
• Antoniou, G., Groth, P., Van Harmelen, F., & Hoekstra, R. (2012). A semantic web primer (3rd ed.). MIT Press. ISBN 978-0-262-01828-9.

About Linked Open Data
• Heath, T., & Bizer, C. (2011). Linked data: evolving the web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology, 1(1), 1–136.
• Rietveld, L., Beek, W., & Schlobach, S. (2015, October). LOD lab: experiments at LOD scale. In International Semantic Web Conference (pp. 339–355). Cham: Springer.
• Berners-Lee, T. (2006). Linked data, 2006.

About the Comparison of EAST, DEDSL, and Semantic Technologies
• Marketakis, Y., & Tzitzikas, Y. (2009). Dependency management for digital preservation using Semantic Web technologies. International Journal on Digital Libraries, 10(4), 159–177.
• Tzitzikas, Y., Kargakis, Y., & Marketakis, Y. (2015). Assisting digital interoperability and preservation through advanced dependency reasoning. International Journal on Digital Libraries, 15(2–4), 103–127.

About Ontology Matching
• Shvaiko, P., & Euzenat, J. (2013). Ontology matching: state of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering, 25(1), 158–176.

About Instance Matching
• Nentwig, M., Hartung, M., Ngonga Ngomo, A. C., & Rahm, E. (2017). A survey of current link discovery frameworks. Semantic Web, 8(3), 419–436.

About Query Federation
• Saleem, M., Khan, Y., Hasnain, A., Ermilov, I., & Ngonga Ngomo, A. C. (2016). A fine-grained evaluation of SPARQL endpoint federation systems. Semantic Web, 7(5), 493–518.

About Producing RDF and Linked Data
• Yoshiga, N., & Tadaki, S. (2017). Semi-automated generation of linked data from unstructured bibliographic data for Japanese historical rare books. In 14th International Conference on Digital Preservation, Kyoto, Japan.
• Lee, T. B. (1998). Relational databases on the Semantic Web. Design Issues (published on the web).
• World Wide Web Consortium. (2012). R2RML: RDB to RDF mapping language.
• Dimou, A., Vander Sande, M., Colpaert, P., Verborgh, R., Mannens, E., & Van de Walle, R. (2014). RML: a generic language for integrated RDF mappings of heterogeneous data. In Proceedings of the 7th Workshop on Linked Data on the Web (LDOW2014), Seoul, Korea.
• Langegger, A., & Wöß, W. (2009). XLWrap – querying and integrating arbitrary spreadsheets with SPARQL. In International Semantic Web Conference (pp. 359–374). Berlin: Springer.
• Marketakis, Y., Minadakis, N., Kondylakis, H., Konsolaki, K., Samaritakis, G., Theodoridou, M., Flouris, G., et al. (2017). X3ML mapping framework for information integration in cultural heritage and beyond. International Journal on Digital Libraries, 18(4), 301–319.
• Mendes, P. N., Jakob, M., García-Silva, A., & Bizer, C. (2011). DBpedia spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (pp. 1–8). ACM.
• Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P. N., et al. (2015). DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 167–195.
• Marketakis, Y., Tzanakis, M., & Tzitzikas, Y. (2009). PreScan: towards automating the preservation of digital objects. In Proceedings of the International Conference on Management of Emergent Digital EcoSystems (p. 60). ACM.
• Tzitzikas, Y., Allocca, C., Bekiari, C., Marketakis, Y., Fafalios, P., Doerr, M., et al. (2016). Unifying heterogeneous and distributed information about marine species through the top level ontology MarineTLO. Program, 50(1), 16–40.
• Felicetti, A., Gerth, P., Meghini, C., & Theodoridou, M. (2015). Integrating heterogeneous coin datasets in the context of archaeological research. In Workshop for Extending, Mapping and Focusing the CRM (EMF-CRM) – Co-located with TPDL’2015 (pp. 13–27).

8.5.2 Tools and Systems

About NetCDF
• A Data Viewer for NetCDF is Panoply: https://www.giss.nasa.gov/tools/panoply/
• Several links are contained in https://www.unidata.ucar.edu/software/netcdf/

Transforming CSV Files to XML (based on https://help.ubuntu.com/community/Converting%20CSV%20to%20XML)
• Online Tools: There are many free online tools where you can load your CSV document and copy the resulting XML from another field, e.g.:
  – http://www.creativyst.com/Prod/15/
  – http://www.convertcsv.com/csv-to-xml.htm
  – http://www.freeformatter.com/csv-to-xml-converter.html
• Spreadsheet Applications: Such applications (e.g., OpenOffice, Gnumeric, or MS Excel) often allow you to export your data as “simple” XML natively, or with the help of a third-party plug-in.
• Graphical (GUI) Applications:
  – CSV2XML (http://www.jens-goedeke.eu/tools/csv2xml/)
  – XMLSpy XML Editor (https://www.altova.com/xmlspy.html)


Transforming CSV Files to RDF
• Various tools are mentioned in https://www.w3.org/wiki/ConverterToRdf

Working and Transforming CSV Files
• Tools for editing/cleaning/searching files in CSV format
  – CSVkit (https://csvkit.readthedocs.io)
  – CSVfix (http://csvfix.byethost5.com/csvfix.htm)
  – OpenRefine (http://openrefine.org/)

8.5.3 Other Resources

About Semantic Web Technologies
• RDF: http://www.w3.org/RDF/
• RDF Schema: http://www.w3.org/TR/rdf-schema/
• OWL: http://www.w3.org/TR/owl-features/
• RDFa: http://www.w3.org/TR/rdfa-syntax

About Linked Data
• Linked Data (W3C): https://www.w3.org/wiki/LinkedData
• LinkedData.org: http://linkeddata.org

About Languages for Describing Syntax/Semantic and Packaging
• EAST (Enhanced Ada SubseT) language
  – https://public.ccsds.org/Pubs/644x0b3.pdf
  – ISO 15889:2011
• DEDSL (Data Entity Dictionary Specification Language)
  – https://public.ccsds.org/Pubs/647x1b1.pdf
  – ISO 21961:2003
• XFDU (XML Formatted Data Unit)
  – https://public.ccsds.org/Pubs/661x0b1.pdf
  – ISO 13527:2010
• SAFE (Standard Archive Format for Europe)
  – http://earth.esa.int/SAFE/index.html


Well-Known Ontologies and Vocabularies
• Friend of a Friend, FOAF (http://www.foaf-project.org/)
• vCard Ontology (https://www.w3.org/TR/vcard-rdf/)
• Semantically Interlinked Online Communities Project, SIOC (http://sioc-project.org/)
• Music Ontology (http://musicontology.com)
• Data Catalog Vocabulary, DCAT (https://www.w3.org/TR/vocab-dcat/)
• Schema.org (http://schema.org)

Chapter 9

The File destroyAll.exe: On Executing Proprietary Software

9.1 Episode

May 26

The morning sun has just begun to warm MicroConnect’s headquarters when Robert enters his office and sits in his comfortable chair. Robert is happy, since everything went well with his son’s recital. A short power failure almost ruined the event, but fortunately it lasted only three seconds. He contemplates how important electricity has become in our everyday lives. Indeed, it seems to be one of the most important blessings that science has given to mankind, and even his job revolves around devices that require electricity to function properly. These thoughts make him stare at the objects in his office. Every single one of them requires electricity to function: the air conditioner, the lights, the television, the radio, his laptop...

While staring at his laptop he remembers that he has an unfinished job to do. Just yesterday he began exploring the contents of the USB stick in order to discover the mysterious girl. It’s time to abandon his philosophical thoughts about electricity. He turns on his laptop and opens the folder with the contents of the USB stick. The file labeled destroyAll.exe intrigues Robert. He knows that files with the extension “exe” are executable files. However, he wonders whether this file is indeed an executable. Although he is very curious and eager to double-click it, the file name makes him a bit reluctant. “I do not know what kind of application this is,” he thinks. “It could be a malicious application that deletes important files from my computer, making it useless, or it could even steal private information like usernames and passwords and then send them to a remote server at a future time when no suspicion would be raised,” the monologue continues.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_9


Because of these concerns, Robert decides to use an antivirus program to check whether the file is malicious. The analysis takes only a few seconds and indicates that nothing suspicious was identified. “But what if the program contains an instruction that deletes all of the files from my hard disk? I am not sure that an antivirus program would be able to detect such cases!” he thinks. “I run a big computer company; I could run this file on a brand new computer that has no data or anything of value and see what happens,” he thinks with relief.

The file extension indicates that the file is an executable for OperatingSystem, a popular and widely used operating system. “But for which version of OperatingSystem? What are the dependencies of this application? Does my system have these dependencies so that the program runs and behaves as expected?” There are several small tools and applications that analyze the dependencies of executable (.exe) files; however, Robert has never used any of them. He decides to download a free tool from the Internet to analyze the dependencies of the file destroyAll.exe. He loads the executable file, and a list of DLL (dynamic-link library) files appears in the dependency list. He is familiar with some of these files and understands that they contain commonly used libraries and functions. Since these dependencies reveal nothing about the purpose of the program, he concludes that the only way to understand what this executable does is to run it.

Since he does not want to risk his own computer, he asks his secretary to bring him a new laptop. Fifteen minutes later, a brand new laptop is brought to his office. He connects the USB stick and double-clicks the file destroyAll.exe. A message appears:

This file requires OperatingSystem v_6 to run

OperatingSystem v_6 is an older version of the operating system. He needs another machine that has the proper version in order to execute the mysterious application.
This awkward situation makes him realize that the company’s customers were sometimes justified in their complaints. The company’s call center often receives complaints about programs that can no longer be executed, especially after an update of the operating system, and Robert is aware of that. However, modern hardware and new functionalities require changes to core components of the operating system, which make old executable programs useless.

“No worries, I will use a virtual machine where I will install the required operating system,” he thinks. He has used the tool VM Virtualizer several times in the past for checking various applications that did not “run” in newer versions. The process of reinstalling the operating system is relatively easy with the image file (ISO) that he already has. He keeps the image files for all the different versions of all the operating systems that have been released since OperatingSystem v_3. Executing the unknown application in such a controlled environment seems the ideal solution.

Once the file has been copied to the virtual machine, he starts the execution by double-clicking on it. Fortunately for him, the application does not crash the virtual system. However, the execution terminates almost immediately, showing the following error message:

ERROR: cannot find config.properties file

The error message is informative enough; it is clear that this mysterious application requires an external file to function properly. Robert decides to create one. The error message does not contain any information about the path of the required file, so he has to guess. “If I were writing that program, I would put the properties file in the same folder as the executable file,” he thinks, and he creates a file named config.properties in the same folder as the executable. “If I’m lucky the application will continue its execution,” he thinks and tries to execute it again. The application terminates quickly again; however, this time the error message is slightly different:

ERROR: Cannot find the property “folder” in the properties file

It is clear to him that he should add something to the properties file to proceed. The new entry should be a folder. “What kind of folder? The path of a folder? And what about the contents of that folder?” Robert is more and more curious about this application. He creates a folder and copies into it a draft document that he quickly creates. Then he updates the properties file and executes the application again. This time there are no error messages; in fact, the program stops its execution rather quickly without any further messages or any observable impact. Robert is so disappointed that he does not even check what happened to the folder that he created. Had he checked, he would have seen that its contents had been deleted. Indeed, Daphne had written this small program to regularly delete the contents of particular folders on her computer containing temporary files, files downloaded from the Internet, her browsing history, and various other files of that kind. She was using this program not only on her personal computer but also on the university computers that she used, in order to erase her traces.

“I would like to have formal guarantees for each program, guarantees that ensure whether a program is safe or not. But I think we always come to the same issue: how to balance freedom with security,” Robert thinks.

9.2 Technical Background

This section discusses in brief executable files (in Sect. 9.2.1), introduces the fundamental concepts of program termination, decidability, and tractability (in Sect. 9.2.2), and then describes code injection (in Sect. 9.2.3) for enabling the understanding of software viruses and, therefore, antivirus software (in Sect. 9.2.4). Finally, it discusses software emulation and virtual machines (in Sect. 9.2.5).

9.2.1 Executable Files

An executable file or executable program is a file that causes a computer to perform indicated tasks according to encoded instructions, as opposed to a data file that must be parsed by a program to be meaningful. These instructions are usually machine code instructions for a physical CPU (central processing unit, i.e., processor). In a broader sense, a file that contains instructions (such as bytecode) for a software interpreter can also be considered a kind of executable file, and the same holds for a scripting language source file. Software-related concepts and tasks will be described in more detail in Chap. 10. In all cases, the common denominator is the algorithm, i.e., a step-by-step set of operations to be performed.

From a high-level view, executable files look like normal files. The difference is that they contain a set of instructions that uses the computer resources (through the operating system) to carry out a specific task. On the Windows operating system, compiled programs have an “.exe” file extension and are often referred to as “exe files.” On Macintosh computers, applications have an “.app” extension, which is short for application. Linux distributions do not require any extension for executable files. Despite their differences in the extensions (and the instructions they contain), one thing that native executable files have in common in all operating systems is that they contain binary machine code that is directly executable by the CPU.

Below we describe the typical structure of a Windows executable file. Windows executable files adopt the PE (Portable Executable) file format. Apart from exe files, the PE file format can describe other objects as well (e.g., DLL files, drivers, etc.). Typically, a PE file contains the following:
• DOS header: contains basic information about the executable. The first two letters of the header (also called the magic number) are MZ (see footnote 1).
• COFF header: contains machine-related information.
• PE header: contains information about the runtime of the process, e.g., the entry point address, the initial size of the heap and stack, etc.
• Data Directory table: contains pointers to special segments (e.g., import and export directory, security directory, etc.).
• Section table: describes the sections of the file, which hold the instructions and data of the executable.

Footnote 1: From the initials of Mark Zbikowski who created the first linker for DOS.
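The first parts of this layout can be inspected with a few lines of code. The following is a minimal sketch in Python, not a full PE parser: it checks the MZ magic number, follows the 4-byte pointer stored at offset 0x3C of the DOS header to the "PE\0\0" signature, and reads the first two fields of the COFF header (machine type and number of sections). The function names `describe_pe` and `build_demo_image` are hypothetical, and the demo buffer is an invented, non-runnable byte sequence that only mimics the header layout.

```python
import struct


def describe_pe(data: bytes) -> dict:
    """Read a few header fields from the bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ magic number: not a DOS/PE executable")
    # The 4-byte little-endian value at offset 0x3C points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The COFF header follows the signature: machine type, then section count.
    machine, section_count = struct.unpack_from("<HH", data, pe_offset + 4)
    return {"pe_offset": pe_offset, "machine": hex(machine), "sections": section_count}


def build_demo_image() -> bytes:
    """Hypothetical buffer that only mimics the header layout (assumption)."""
    dos_header = bytearray(0x40)
    dos_header[:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 0x40)  # PE signature right after
    coff_start = struct.pack("<HH", 0x014C, 3)      # 0x014C = Intel 386, 3 sections
    return bytes(dos_header) + b"PE\x00\x00" + coff_start


info = describe_pe(build_demo_image())
print(info)
```

For a real file, the same function could be applied to the bytes returned by `open(path, "rb").read()`; a file that merely carries the “.exe” extension without the expected headers would be rejected with a `ValueError`.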

9.2.2 Termination, Decidability, Tractability

In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running or continue to run forever. Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program–input pairs cannot exist. A key part of the proof was a mathematical definition of a computer and program, which became known as a Turing machine. The halting problem is undecidable over Turing machines.

For a given program, termination analysis is a special kind of program analysis that attempts to determine whether the evaluation of the program will definitely terminate. Since the halting problem is undecidable, no termination analysis can give a correct answer for every program; for some programs it must remain inconclusive.

In complexity theory, problems that can be solved in principle but have no polynomial-time algorithm are considered to be intractable: the time required for solving them grows so fast with the input size (e.g., exponentially) that they are practically unsolvable if the input is big.
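To make the limits of termination analysis concrete, here is a small Python sketch of the simplest possible analyser: it runs a program for a bounded number of steps and reports either "terminates" or "unknown". It is only an illustration under stated assumptions: programs are modelled as Python generators that yield once per step, and the names `countdown`, `spin`, and `analyse_termination` are hypothetical. The analyser never gives a wrong "terminates" answer, but it cannot distinguish a program that loops forever from one that simply needs more steps than the budget allows.

```python
def countdown(n):
    """A program that clearly terminates: one yield per executed step."""
    while n > 0:
        n -= 1
        yield


def spin():
    """A program that never terminates."""
    while True:
        yield


def analyse_termination(program, budget):
    """Run `program` for at most `budget` steps and report what was observed."""
    for steps, _ in enumerate(program, start=1):
        if steps > budget:
            return "unknown"  # may loop forever, or may just need more steps
    return "terminates"


print(analyse_termination(countdown(5), budget=100))    # terminates
print(analyse_termination(spin(), budget=100))          # unknown
print(analyse_termination(countdown(500), budget=100))  # unknown
```

The third call shows the price of the approach: a terminating program is reported as "unknown" because the budget was too small, and no finite budget removes this gap for all programs.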

9.2.3 Code Injection

Code injection is the addition of code into an application. The introduced code is capable of compromising privacy properties, security, and even data correctness. Code injection exploits vulnerabilities of executable files that allow an attacker to insert code into the appropriate section and change the course of the execution. Code injection may be used with good intentions. It could, for example, change the execution of problematic software that crashes, or offer new features and functionalities. Of course, it can also be used (and this is usually the case) for malicious purposes, such as retrieving, modifying, or erasing databases, installing malware or executing malevolent code on web servers, attacking web users, etc.

For example, a regular executable file has a structure like the one shown in Fig. 9.1 (left). In a virus-infected executable file, software code has been inserted, as shown in Fig. 9.1 (right). This code can do whatever its creator has decided, and this usually includes infecting other files, i.e., modifying the code of other executables in a similar way, so as to proliferate itself.

Fig. 9.1 The rough structure of a normal and of a virus-infected executable

Libraries, museums, and archives tend not to preserve malware, since it could destroy the data that these organizations are bound to protect. However, as discussed by Besser and Farbowitz (2016), computer viruses themselves could be the subject of preservation, since they are part of our digital lives. Moreover, and more importantly, their analysis and research could be valuable for preventing potential future digital disasters. To this end, the Malware Museum (https://archive.org/details/malwaremuseum) was launched in 2016, a digital, web-accessible museum that contains examples of viruses, mainly from the early PCs.

9.2.4 Antivirus Software

Antivirus software, also known as anti-malware software, is software used to detect and remove malicious software from a computer system, and also to prevent such malicious software from being installed. Antivirus software was originally developed to detect and remove computer viruses; however, with the proliferation of other kinds of malware, antivirus software started to provide protection from other computer threats. There are various malware identification methods:
• Signature-based: detects malware by comparing the contents of a file to its database of known malware signatures
• Heuristic-based: detects malware based on characteristics typically used in known malware code
• Behavioral-based: detects malware based on the behavioral fingerprint of the malware at runtime; however, this technique can detect (known or unknown) malware only after it has started performing its malicious actions
• Sandbox: similar to the behavioral-based detection technique, but instead of detecting the behavioral fingerprint at runtime on the real system, it executes the programs in a virtual environment
• Data mining-based: detects malware by applying data mining and machine learning algorithms to classify the behavior of a file (as either malicious or benign) using a series of features that are extracted from the file

Many other alternative methods exist; however, we should be aware that, as Frederick B. Cohen showed in 1987, there is no algorithm that can perfectly detect all possible viruses.
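The signature-based method from the list above can be sketched in a few lines of Python. This is a toy illustration, not how any particular antivirus product works: the signature names and byte patterns in `KNOWN_SIGNATURES` are invented for the example, and real products use far larger databases and more robust matching.

```python
# Hypothetical signature database: names and byte patterns are invented.
KNOWN_SIGNATURES = {
    "Demo.Alpha": bytes.fromhex("deadbeef"),
    "Demo.Beta": b"HYPOTHETICAL-PAYLOAD",
}


def scan(data: bytes, signatures=KNOWN_SIGNATURES) -> list:
    """Return the names of all known signatures that occur in `data`."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


clean = b"MZ" + b"\x00" * 32
infected = clean + bytes.fromhex("deadbeef") + b"\x90" * 8
variant = clean + bytes.fromhex("deadbeee") + b"\x90" * 8  # one byte differs

print(scan(clean))     # []
print(scan(infected))  # ['Demo.Alpha']
print(scan(variant))   # []
```

The last call shows the weakness of this method: a variant that differs in a single byte from the stored pattern is not recognized, which is one reason the other methods in the list exist.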

9.2.5 Software Emulation and Virtual Machines

Emulation is generally described as the imitation of a certain computer platform or program on another platform or program. It requires the creation of emulators, where an emulator is hardware or software, or both, that duplicates (or emulates) the functions of a first computer system (the guest) in a separate second computer system (the host), so that the emulated behavior closely resembles the behavior of the real system. Popular examples of emulators include QEMU, Dioscuri, and bwFLA.

Another related concept is that of the Universal Virtual Computer (UVC). It is a special form of emulation in which a hardware- and software-independent platform is implemented, files are migrated to the UVC internal representation format, and the whole platform can be easily emulated on newer computer systems. It is like an intermediate language for supporting emulation.

In computing, in general, a virtual machine (VM) is an emulation of a particular computer system (real or hypothetical). Its implementation may involve specialized hardware, software, or both. Virtual machines can implement the functionality of the targeted machine fully or partially. Full virtualization VMs enable the execution of a complete operating system (usually based on the virtualization of the underlying raw hardware). Other VMs can execute only a particular computer program by providing an abstracted and platform-independent program execution environment.

Another related concept is that of the software container. A software container offers a fully functional environment for building, deploying, and running software. Typically, software containers rely on the functionality of the host kernel and use resource isolation (CPU, memory, I/O, network, etc.), unlike virtual machines, which need a complete guest operating system to function properly. In practical terms, software containers allow applications to run in an isolated environment and help to ensure the runnability of the software, since the parameters and the configuration for executing it are preserved and interference from the rest of the system is limited. Docker is an indicative platform that offers this functionality. We elaborate on Software Emulation in more detail in Sect. 18.2.3.
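The idea of a VM that provides an abstracted, platform-independent execution environment can be illustrated with a toy stack machine written in Python. The instruction set (PUSH, ADD, MUL) and the function name `run` are hypothetical and chosen only for this sketch; real virtual machines such as the JVM are vastly richer, but the principle of a host program interpreting guest instructions is the same.

```python
def run(program):
    """Interpret a list of (opcode, argument) pairs on a tiny stack machine."""
    stack = []
    for opcode, argument in program:
        if opcode == "PUSH":
            stack.append(argument)
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack.pop()


# (2 + 3) * 4 expressed in the hypothetical instruction set
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]
print(run(program))  # 20
```

The guest program is just data for the host: it will produce the same result on any machine for which this small interpreter can be rewritten, which is the property that emulation-based preservation relies on.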

9.3 Pattern: Safety and Dependencies of Executables

Pattern ID: P6

Problem’s name: Executables: safety, dependencies

The problem: Robert encountered an executable file that raised the following questions: Is it safe to run? Will it run? How will it behave? Is it harmless? If it does not run, how can I make it run?

Type of digital artifacts: Software

Task on digital artifacts: Run, discover software dependencies, verify execution, safety

What could have been done to avoid this problem: It would be less laborious and risky for Robert if the file contained trusted metadata about what the program does. It would also be useful if every operating system, before running unknown software, could run it provisionally on a virtual machine and automatically perform a plethora of tests for assessing its safety. This includes checking if other members of the community (of the user) have performed such tests in the past and what the results were. For instance, in the case of Android applications, one uses sites like Google Play, which is essentially a browsable and searchable catalog of applications, that “promise” that each hosted application is trustworthy. Apart from these tests, the user that browses the catalog can see the ratings and the comments of other users and this helps the user decide whether to install it or not. However, this approach is not a panacea, in the sense that it presupposes that we fully trust Google, which maintains this catalog of Android applications.

Lesson learnt: The safety/security of running software is not easy to check. Science is not enough in the sense that we know that some fundamental tasks in the general (unrestricted) case are either undecidable or intractable. For instance, detectability of all possible viruses is not possible in theory. Termination is also undecidable in the general case. Restrictions (as regards the “freedom” of what software can, or is allowed, to do), testing (e.g., on virtual machines) and trust networks are important for enhancing safety, also for sharing the computational cost that is required for performing such tests.

Related patterns:
Previous:
• P1 (Storage Media: durability and access)
Next:
• P7 (Software decompiling)
• P8 (External behavioral dependencies)
• P9 (Web application execution)
• P10 (Understand and run software written in obsolete programming language)
• P12 (Proprietary format recognition)

9.4 Questions and Exercises

1. With an antivirus tool, check whether the file destroyAll.exe (found in the USB stick) is safe. You can find the file on the website of the book.
2. Find a method for running the executable file on your computer in a safe way.
3. Is it feasible to write a program that can check whether another program satisfies a given property, e.g., whether it terminates? (hint: Alan Turing)
4. In case you are interested in computer viruses and theory, search and read the paper by Chess and White (2000).
5. Find a tool that can mine and analyze the dependencies of Windows executable files (exe). Do the same, but this time, for executable files of different operating systems.
6. Suppose that you want to run destroyAll.exe on Mac OS. Find an appropriate emulator and try to run it.
7. Find a virtual machine running Windows OS (either XP or 7). You can also set up your own virtual machine running the aforementioned operating systems (using tools like VirtualBox, see footnote 2). Then try to run the different versions (found in the USB stick) of the file destroyAll.exe (hint: simulate Robert’s behavior).
8. Download and install Docker on your PC. Then execute the image marketak/cinderella-stick-greeting. (It contains an application that requires Linux and Java7+ to function properly.)

Footnote 2: https://www.virtualbox.org/

9.5 Links and References

9.5.1 Readings

About Executable Files
• Pietrek, M. (2002). Inside Windows: an in-depth look into the Win32 portable executable file format. MSDN Magazine, 17(2).

About Emulation
• Granger, S. (2000). Emulation as a digital preservation strategy.
• Granger, S. (2001, September). Digital preservation & emulation: From theory to practice. In ICHIM, 2, pp. 289–296.
• Von Suchodoletz, D., Rechert, K., Schröder, J., van der Hoeven, J., & Bibliotheek, K. (2010). Seven steps for reliable emulation strategies: solved problems and open issues. Proceedings of the 7th International conference on preservation of digital objects (iPRES’2010), p. 373.
• Rechert, K., Valizada, I., von Suchodoletz, D., & Latocha, J. (2012). bwFLA – a functional approach to digital preservation. PIK-Praxis der Informationsverarbeitung und Kommunikation, 35(4), pp. 259–267.

About Theory of Detectability of Viruses
• Cohen, F. (1987). Computer viruses: theory and experiments. Computers & Security, 6(1), pp. 22–35.
• Chess, D. M., & White, S. R. (2000, September). An undetectable computer virus. In Proceedings of Virus Bulletin Conference (Vol. 5).

About the Preservation of Software Viruses
• Besser, H., & Farbowitz, J. (2018). Why save a computer virus? The Conversation. URL: https://theconversation.com/why-save-a-computer-virus-56967. Accessed May 3, 2018. (Archived by WebCite® at http://www.webcitation.org/6z8aahypA).
• The Malware Museum, https://archive.org/details/malwaremuseum

About Automated Reasoning for Interoperability
• Tzitzikas, Y., Kargakis, Y., & Marketakis, Y. (2015). Assisting digital interoperability and preservation through advanced dependency reasoning. International Journal on Digital Libraries, 15(2–4), pp. 103–127.

9.5.2 Tools and Systems

About Emulators
• Bellard, F. (2005, April). QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track (Vol. 41, p. 46).
• Van der Hoeven, J., Lohman, B., & Verdegem, R. (2008). Emulation for digital preservation in practice: The results. International Journal of Digital Curation, 2(2).
• Tarkhani, Z., Brown, G., & Myers, S. (2017). Trustworthy and portable emulation platform for digital preservation. In 14th International conference on digital preservation, Kyoto, Japan.
• bwFLA. Legacy Environments at your Fingertips (http://eaas.uni-freiburg.de/).

About Universal Virtual Computer (UVC)
• Lorie, R. A. (2001, January). Long term preservation of digital information. In Proceedings of the 1st ACM/IEEE-CS joint conference on digital libraries (pp. 346–352). ACM.

About Java Virtual Machine
• Meyer, J., & Downing, T. (1997). Java virtual machine. O’Reilly & Associates, Inc.

About Software Management in Java
• JDK 8 Documentation. http://www.oracle.com/technetwork/java/javase/documentation/jdk8-doc-downloads-2133158.html
• Apache Maven Project. https://maven.apache.org/

About Dependency Management Services for Digital Preservation
• GapManager. http://athena.ics.forth.gr:9090/Applications/GapManager/
• Epimenides. http://www.ics.forth.gr/isl/epimenides

About Software Virtualization
• Docker (https://www.docker.com/).

Utilities for Identifying Software Dependencies
• DependencyWalker (http://www.dependencywalker.com/).

Chapter 10

The File MyMusic.class: On Decompiling Software

10.1 Episode

A sparrow perches on the window sill and pecks on the glass. Robert looks at it, surprised, but then it flies away. Robert’s look returns to his computer screen. His gaze falls on a file named MyMusic.class. “What is the extension .class?” he wonders. It could contain information about a music class that the mysterious girl is attending. He tries to open it with a text editor but its content is incomprehensible. “It’s better to ask an engineer,” he thinks. Having asked a few questions, he learns that a file with the extension “.class” is normally a program in binary form that was written in the Java programming language. Specifically, the file MyMusic.class should be the binary file produced by the Java compiler after compiling code written in Java. Such a file cannot be executed directly by the operating system; it can be executed in a Java virtual machine for that operating system. Robert looks on his computer and discovers that this Java software is already installed. After some tinkering, he discovers how to execute the “.class” file. He opens a console window, changes to the folder that contains the file “MyMusic.class”, and types the following command:

> java MyMusic

Unfortunately, he encounters yet another error; a window pops up that notifies him that the version of the file is not supported.

java.lang.UnsupportedClassVersionError: Bad version number in .class file

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_10


He searches for the error and sees that it is due to the version of Java that is installed on his computer. “How can I find out which is the correct version that I should use for running my file?” he exclaims. Robert decides to find and install the latest available version of Java, hoping that this version will support files that were compiled using older versions. “If this fails then I will try to find the specific version that is needed for this particular file. In the worst case, I could install and try out all previous versions of Java.” After having installed the latest version of Java, he tries to re-execute the binary file. This time a new error message appears:

java.lang.NoClassDefFoundError: org/jfugue/Player

This time the error informs him that a required library is missing. It immediately strikes him how much easier the life of every computer user would be if the dependencies of all digital objects were formally recorded and appropriately managed. By searching the Internet he finds the missing library. “What if this library requires other libraries that I do not have?” he wonders. Without delay, he consults Kevin, a colleague in the company who is an avid Java programmer. Kevin advises Robert to download the missing library from an appropriate repository (such as Maven), since such repositories also record the dependencies of the libraries and allow you to download all other required dependencies along with the requested library. After Kevin explains how to properly start the Java virtual machine and load the additional libraries, Robert manages to run the application. His office is instantly filled with music. However, the music is a bit strange. It does not sound like a song or a musical piece. It sounds more like an audio experiment. In any case, it is not helpful for Robert’s purposes. Then he has another idea.
He decides to “decompile” the .class file to get the Java source code that the mysterious girl had written, hoping that some comments in that source code would reveal the identity of the programmer. He searches for a Java decompiler and finds plenty of them on the Internet. He downloads the first one he can find; it is a lightweight application that offers a graphical user interface for decompiling Java classes. He hopes for two things: (a) that he will be able to decompile the file, since he is aware that many people do all sorts of things with their software to prevent others from accessing their source code (e.g., encryption of classes, obfuscation, etc.), and (b) that the names of the variables and methods or the comments of the source code (if there are any) will reveal some hints about the mysterious girl. He opens the decompiler and loads the “.class” file. He decompiles it and gets something similar to the code shown in Fig. 10.4. He goes through the contents of the entire file. Unfortunately, the resulting Java code does not have any comments. He thinks that even if there were comments originally, they might have been removed during the compilation, and he is right. Furthermore, the names of the variables, the classes, and the methods are not helpful at all. Robert feels disappointed and decides to relax by listening to some real music. The monophonic sound of MyMusic.class reminds him of a monophonic piece that he likes a lot, the prelude from Bach’s Cello Suite No. 1. He selects a version by Mstislav Leopoldovich Rostropovich and sits comfortably in his chair.

10.2 Technical Background

This section introduces basic notions of software engineering, specifically compilers, interpreters, and decompilers (in Sect. 10.2.1); then it discusses in brief the Java programming language (in Sect. 10.2.2); and, finally, it discusses a particular build tool (i.e., Maven) for software (in Sect. 10.2.3).

10.2.1 Constructing and Executing Computer Programs (Compilers, Interpreters, and Decompilers)

Figure 10.1 illustrates the concepts that will be introduced, with emphasis on compilers, interpreters, and decompilers. All these are essentially “translators” and their first step is to parse (recall Sect. 6.2.4) a digital file.

10.2.1.1 Compilers

A compiler is a computer program (or set of programs) that transforms source code written in a programming language (the source language) into another computer language (the target language, often having a binary form known as object code). The usual reason for converting source code is to create an executable program. The term compiler typically refers to programs that translate source code from a high-level programming language to a lower-level language (e.g., assembly language or machine code). In short, compilers are essentially translators.

10.2.1.2 Interpreters

An interpreter is a computer program that directly executes (i.e., performs) instructions written in a programming language, without previously compiling them into a machine language program. In general, an interpreter can either parse the source code and execute it directly, or translate the source code into some efficient intermediate representation and then execute it.
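The second strategy mentioned above (translate to an intermediate representation, then execute it) can be illustrated with a toy example. The following is a minimal sketch, not a real language implementation: the class name TinyInterpreter, its method names, and the tiny arithmetic "language" (space-separated integers and the operators + - * /) are all assumptions made for illustration. The source text is first translated into postfix order, which plays the role of the intermediate representation, and that representation is then executed with an operand stack.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy interpreter (hypothetical): infix arithmetic -> postfix form -> execution. */
class TinyInterpreter {

    /** Translates space-separated infix tokens (e.g. "1 + 2 * 3") into postfix order. */
    static List<String> toPostfix(String source) {
        List<String> output = new ArrayList<>();
        Deque<String> operators = new ArrayDeque<>();
        for (String token : source.trim().split("\\s+")) {
            if (isOperator(token)) {
                // Emit pending operators that bind at least as tightly (left associativity).
                while (!operators.isEmpty()
                        && precedence(operators.peek()) >= precedence(token)) {
                    output.add(operators.pop());
                }
                operators.push(token);
            } else {
                Integer.parseInt(token); // parsing step: rejects anything that is not an integer
                output.add(token);
            }
        }
        while (!operators.isEmpty()) {
            output.add(operators.pop());
        }
        return output;
    }

    /** Executes the postfix (intermediate) representation using an operand stack. */
    static int execute(List<String> postfix) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String token : postfix) {
            if (isOperator(token)) {
                int right = stack.pop();
                int left = stack.pop();
                stack.push(apply(token, left, right));
            } else {
                stack.push(Integer.parseInt(token));
            }
        }
        return stack.pop();
    }

    /** Translate, then execute: the two phases described in the text. */
    static int interpret(String source) {
        return execute(toPostfix(source));
    }

    private static boolean isOperator(String token) {
        return token.length() == 1 && "+-*/".contains(token);
    }

    private static int precedence(String operator) {
        return (operator.equals("*") || operator.equals("/")) ? 2 : 1;
    }

    private static int apply(String operator, int left, int right) {
        switch (operator) {
            case "+": return left + right;
            case "-": return left - right;
            case "*": return left * right;
            default:  return left / right; // "/"
        }
    }

    public static void main(String[] args) {
        System.out.println(toPostfix("1 + 2 * 3"));
        System.out.println(interpret("1 + 2 * 3"));
    }
}
```

A compiler for the same toy language would stop after the translation step and store the result in a file, leaving the execution to another program or to the hardware.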


Fig. 10.1 Source code, executable code, intermediate code, and related tasks

10.2.1.3 Decompilers

A decompiler is a computer program that performs the reverse operation to that of a compiler. It translates program code at a relatively low level of abstraction (usually designed to be computer-readable rather than human-readable) into a form having a higher level of abstraction (usually designed to be human-readable). Decompilers usually do not accurately reconstruct the original source code and can vary widely in the intelligibility of their outputs.

10.2.2 Java (Programming Language)

Java is a general-purpose, object-oriented, class-based, and concurrent computer programming language designed to have as few implementation dependencies as possible. It is intended to let application developers “write once, run anywhere” (WORA), meaning that compiled Java code can run on all platforms that support Java without the need for recompilation. Java source files (with a “.java” extension) are typically compiled to bytecode (files with a “.class” extension) that can run on any Java virtual machine (JVM) regardless of computer architecture. Nowadays, Java is one of the most popular programming languages in use, particularly for client–server web applications. An example of a simple Java program is shown in Fig. 10.2. The bytecode produced by compiling this file, as it appears when opened in a text editor, is shown in Fig. 10.3.
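Although bytecode is not meant to be read in a text editor, its first bytes are easy to inspect programmatically. According to the Java virtual machine specification, a class file starts with the four-byte magic number 0xCAFEBABE, followed by a two-byte minor version and a two-byte major version. The sketch below reads this header; the class name ClassFileHeader and its method names are hypothetical, and the mapping from major version to release name covers only the well-known correspondence (e.g., 48 for Java 1.4, 52 for Java 8).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper that inspects the first 8 bytes of a ".class" file. */
class ClassFileHeader {

    /** Every valid class file starts with these four bytes. */
    static final int MAGIC = 0xCAFEBABE;

    /** Returns true if the bytes begin with the class-file magic number. */
    static boolean looksLikeClassFile(byte[] bytes) {
        // ByteBuffer reads big-endian by default, as the class file format requires.
        return bytes.length >= 8 && ByteBuffer.wrap(bytes).getInt() == MAGIC;
    }

    /** Returns the major version, stored as an unsigned 16-bit value in bytes 6-7. */
    static int majorVersion(byte[] bytes) {
        if (!looksLikeClassFile(bytes)) {
            throw new IllegalArgumentException("not a Java class file");
        }
        return ByteBuffer.wrap(bytes).getShort(6) & 0xFFFF;
    }

    /** Maps a major version to a release name: 45..48 are Java 1.1..1.4, 49 is Java 5, and so on. */
    static String javaRelease(int major) {
        if (major < 45) {
            throw new IllegalArgumentException("unknown major version " + major);
        }
        return major <= 48 ? "Java 1." + (major - 44) : "Java " + (major - 44);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java ClassFileHeader <file.class>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        int major = majorVersion(bytes);
        System.out.println("major version " + major + " (" + javaRelease(major) + ")");
    }
}
```

The characters “Êþº¾” at the start of Fig. 10.3 are these magic bytes as a text editor renders them, and the “4” that follows is consistent with the byte 0x34 (decimal 52), which would correspond to Java 8.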


Fig. 10.2 Example of a Java program

By decompiling the bytecode, we can produce the Java code shown in Fig. 10.4. Apart from “.java” and “.class” files, another related and widely used file format is JAR, which is associated with the file extension “.jar”. It is an archive file format typically used to aggregate multiple Java class files and associated metadata and resources (text, images, etc.) into one file. This makes the distribution of application software or libraries on the Java platform easier. JAR files are built on the ZIP file format.
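Because JAR files are built on the ZIP format, the standard java.util.zip classes can open them. The following sketch lists the entries of a JAR; the class name JarLister, its method names, and the contents of the small demo archive it writes are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper that treats a JAR as an ordinary ZIP archive. */
class JarLister {

    /** Returns the names of all entries stored in the given archive. */
    static List<String> listEntries(Path jar) {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Writes a tiny demo archive to a temporary file; names and contents are made up. */
    static Path writeDemoJar() {
        try {
            Path jar = Files.createTempFile("demo", ".jar");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(jar))) {
                out.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
                out.write("Manifest-Version: 1.0\r\n".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
                out.putNextEntry(new ZipEntry("MyMusic.class"));
                // Placeholder bytes only, not a complete class file.
                out.write(new byte[] {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE});
                out.closeEntry();
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : listEntries(writeDemoJar())) {
            System.out.println(name);
        }
    }
}
```

The same listEntries method could be pointed at any “.jar” file, which is one way to approach the extraction exercise at the end of this chapter.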

Êþº¾ 4 # ()V Code LineNumberTable main ([Ljava/lang/String;)V SourceFile MyMusic.java Enjoy ! Musician " MyMusic java/lang/Object java/lang/System out Ljava/io/PrintStream; java/io/PrintStream println (Ljava/lang/String;)V playMusic *· ± 9 ² ¶ » Y· L+¶ ± ! "

Fig. 10.3 Bytecode as shown using a text editor

Fig. 10.4 The result of decompiling Java bytecode

10.2.3 Maven (Software Build Automation Tools)

Maven is a build automation tool used primarily for Java projects. Maven addresses two aspects of building software: firstly, it describes how software is built, and, secondly, it describes its dependencies. Dependencies refer to resources (e.g., classes, properties) that are required for building and properly running a software component. In the example of Fig. 10.2, the software program uses the Java class Player and some of its properties, which have been implemented as part of another software program (i.e., the JFugue API). So, in order to build and run it properly, we should have these dependencies in place as well.

Contrary to older tools, like Apache Ant, Maven uses conventions for the build procedure, and only exceptions need to be written down. An XML file, called POM (Project Object Model), describes the software project being built, its dependencies on other external modules and components, the build order, directories, and required plug-ins. Maven comes with predefined targets for performing certain tasks such as compilation of code and its packaging. Maven dynamically downloads Java libraries and Maven plug-ins from one or more repositories, such as the Maven 2 Central Repository, and stores them in a local cache. Figure 10.5 shows an indicative POM for a project using the JFugue library. The library is declared as a dependency in the project. Figure 10.6 is an enriched version of Fig. 10.1 that highlights the dependencies required for carrying out the corresponding tasks.

Fig. 10.5 An indicative POM for a Maven project using the JFugue library

[Diagram labels: Source Code, Intermediate Code, Executable Code, Virtual Machine, Operating System, Hardware; tasks: compile, decompile, interpret, execute; dependencies (of software)]

Fig. 10.6 Source code, executable code, intermediate code, and related tasks and dependencies
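To see why declared dependencies determine a build order, consider the following sketch. It is not Maven's actual algorithm or API; it is a generic depth-first topological ordering, and the class name BuildOrder as well as the module names in the example are hypothetical. Given a map from each module to the modules it depends on, it returns an order in which every module comes after all of its dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: derive a build order from declared dependencies. */
class BuildOrder {

    /** Returns the modules ordered so that each one appears after all of its dependencies. */
    static List<String> resolve(Map<String, List<String>> dependencies) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String module : dependencies.keySet()) {
            visit(module, dependencies, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String module, Map<String, List<String>> dependencies,
                              Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(module)) {
            return;
        }
        if (!inProgress.add(module)) {
            // Reaching a module that is still being processed means there is a cycle.
            throw new IllegalStateException("cyclic dependency involving " + module);
        }
        List<String> required = dependencies.getOrDefault(module, Collections.emptyList());
        for (String dependency : required) {
            visit(dependency, dependencies, done, inProgress, order);
        }
        inProgress.remove(module);
        done.add(module);
        order.add(module); // all dependencies of this module are already in the list
    }

    public static void main(String[] args) {
        Map<String, List<String>> dependencies = new LinkedHashMap<>();
        // Hypothetical module names, for illustration only.
        dependencies.put("my-music-app", Arrays.asList("music-library"));
        dependencies.put("music-library", Arrays.asList("audio-codec"));
        System.out.println(resolve(dependencies));
    }
}
```

A build tool additionally has to locate and download each dependency, which is where the repositories and the local cache mentioned above come in.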

10.3 Pattern: Software Decompiling

Number: P7
Problem’s name: Software Decompiling
The problem: Robert wanted to decompile the “.class” file (Java bytecode) to get the source code. Robert wanted the source code hoping that it would contain comments that are useful for identifying its author, i.e., the mysterious girl.
Type of digital artifacts: Software
Task on digital artifacts: Decompile
What could have been done to avoid this problem: The problem could have been simpler if an external catalog/service of PLs (Programming Languages) existed, allowing every user to straightforwardly carry out some basic tasks like decompile, edit, compile, and find the dependencies, since there are hundreds of programming languages. Furthermore, if each executable file was linkable to its provenance, then that would allow Robert to find the original source code, and, thus, see the included comments. In general, the ability to get the source code allows someone to inspect the code, to understand it, to reuse the code in different contexts, or to change and extend it for improving it (assuming that recompilation is possible).
Lesson learnt: The compilation of source code to bytecode or executable code results in loss of information that exists in the source code. Although that information is not useful for the execution of the software, it is useful for other reasons (as in our story).
Related patterns: Previous: P6 (Executables: safety, dependencies). Next: P11 (Reproducibility of scientific results). Related: P4 (Provenance and context).

10.4 Questions and Exercises

1. Suppose that you are given a particular “.class” file. How could you find which version of the Java compiler has been used for producing it? Use the web to answer this question.
2. Is it possible for “.class” files produced by old versions of the Java compiler to be loaded and run in newer versions of the Java Virtual Machine? Use the web to answer this question.
3. Exercise: Find, download, and install a Java decompiler and try to decompile the file “MyMusic.class” found in the USB stick.
4. Exercise: Change (as you wish to) the “.java” file that you derived in the previous exercise, then recompile it, and, finally, test that your modified program is executed as expected.
5. Exercise: Suppose that you want to “hide” the source code of a program that you have written. Learn what obfuscation is and find such a tool for Java. Apply it to the code you wrote in the previous exercise.
6. Exercise: For the same source code use an IDE (Integrated Development Environment) for Java (like Eclipse, IntelliJ, or Netbeans) and produce “.jar” files for different versions of Java (e.g., for versions 1.4 and 1.8).
7. Exercise: Find how you can extract the contents of the “.jar” file found in the USB stick.
8. Find whether decompilation is possible for other languages, like C, C++, and Python.

10.5 Links and References

10.5.1 Readings

About Java
• Lindholm, T., Yellin, F., Bracha, G., & Buckley, A. (2014). The Java virtual machine specification. Pearson Education.

10.5.2 Tools and Systems

About Java
• Java: https://www.java.com/en/
• Maven: https://maven.apache.org/
• Gradle: https://gradle.org
• Java decompiler: http://jd.benow.ca/

Chapter 11: The File yyy.java: On Compiling and Running Software

11.1 Episode

It’s just before noon and Robert walks quietly around the big oval meeting table in his office, arranging his thoughts from the meeting he just finished. He has opened wide all the windows to let the spring air fill the room. Standing next to the window, he takes two deep breaths and returns to his desk to continue searching for the identity of the mysterious woman. Robert encounters a file named yyy.java and thinks that this might be the source code of the previous class file (the one described in Chap. 10, MyMusic.class). Although the names of the files are different, this does not mean that the two files are unrelated. In fact, the file yyy.java could contain the source code of a Java class named MyMusic, whose compilation would indeed yield a file with the name MyMusic.class. He decides to open the file with a text editor, hoping to find information about the author of the code, but unfortunately there is no such information.

Since the file does not contain any comments about the author of the code, or its context, Robert decides to compile it. He does not know which compiler to use since there is no information about the appropriate compiler (e.g., the version) that is capable of compiling the source file. He searches his laptop and notices that a compiler for Java is already installed. Therefore, he tries to compile the source file. Unfortunately, he encounters several errors; most of them are related to some missing elements. He does not understand exactly what the problem is. Is the source code full of syntactic errors that prevent a successful compilation, or is his Java compiler too old for that particular code? “Let’s get a more recent compiler,” he thinks. He searches the Internet and quite easily finds and downloads the latest compiler for Java. “Let’s try again,” he thinks and issues the following at the command line:

> javac yyy.java

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_11

Fig. 11.1 Information about an IP address

IP Discovery Tool
128.180.123.1
Host: email.com
IP: 128.180.123.1
Hostname: 128.180.123.1-vm341
Organization: Email Services LLC
Region: California
Country: United States
Longitude: 122.28017
Latitude: 37.535788

This time the compilation is completed successfully without any error messages. He inspects the folder and finds three new files with the extension “.class”, namely, “A.class”, “B.class”, and “C.class”. Since there is no “MyMusic.class” file, the source code does not correspond to the “.class” file that he encountered previously. Then Robert tries to execute the produced class files using the Java virtual machine that is installed on his computer. He issues “java A” but he gets the following message:

class A does not have a main method

The same message appears when executing “java B”. He tries to execute “java C” and, yes, this time something happens. A frame opens; however, a dialog message indicates that the program requires some credentials for connecting to a remote server whose IP (Internet Protocol) address is shown. Robert attempts to guess the password. He tries some quite common passwords but none of them works. “It is practically impossible to find the password in this way. Let’s try to find information about the displayed IP address,” Robert thinks. He searches the Internet to find a service that provides information about the physical location of a device based on its IP address. There are a number of online services that can track and give information about a specific IP. The results are shown in Fig. 11.1. The first thing he notices is the country and the geographic coordinates, which indicate that the given IP address refers to a location in the USA. This could be a hint. Soon enough, he notices the name of the host that was assigned that IP address. It is email.com, a popular email service that offers free accounts to its users. Anyone can register for an email account and use it, and it is quite popular because it offers unlimited space for storing emails. He looks up the IP in another online service and the results are similar.
He understands that the source code he is looking at somehow connects to the service offered by email.com; however, without the credentials he is not able to do anything.

The program was written by Daphne. It would fetch her emails from the server of email.com and store them locally. She had developed it for archiving purposes: she wanted to have a local copy of her email organized in a folder structure that was convenient for her. The program connected to a remote proxy server that had a permanent IP address. The program did not contain the password. Instead, the program requested it at run time. The code comprised three classes: A, B, and C. The first two contained code that was irrelevant to the application. Daphne had used the latest (at that time) version of Java, and this program was actually an opportunity for her to learn the new features of that version. This is why class A and class B contained code for testing some features of the language. Only the last class, i.e., class C, contained the code that was doing the intended job, and contained a main method. Since the program was not finished, she had temporarily named it yyy.java.

11.2 Technical Background

This section discusses in brief software runtime dependencies (Sect. 11.2.1), software documentation (Sect. 11.2.2), and, finally, IP addresses and DNS (Sect. 11.2.3).

11.2.1 Software Runtime Dependencies

Software runtime dependencies are a specific case of dependencies that cannot be easily detected without executing a program or application. These dependencies are essential for the execution of a program; however, they are resolved only at runtime. This means that runtime dependencies will not prevent the source code from being compiled. To make this situation clear, assume that we have a component that exposes the contents of a database (e.g., a relational database). We call it the connector component; its source code is written in Java and requires the inclusion of the corresponding libraries (i.e., the JAR libraries that are responsible for connecting to the relational database, for parsing the results, etc.). Imagine now that we want to build an application that uses the connector component. To compile our application, it suffices to declare the connector component as a dependency (since we are using only its API). However, if the connector’s own underlying libraries are missing when we run our application, it will fail: our application asks for the contents using the connector methods and syntax, and the connector component in turn needs its underlying dependencies to fetch the contents. Software project management systems (e.g., Maven, Gradle, Apache Ivy) can be exploited to deal with these problems.

Furthermore, there are tools that analyze the dependencies of executable files. Figure 11.2 shows an indicative screenshot of the dependencies of the file destroyAll.exe (from Chap. 9). The list contains the DLL (dynamic-link library) files that are needed to execute the file. These files contain instructions that other programs can call upon to do certain things. This way, multiple programs can share the abilities programmed into a single file, and even do so simultaneously.
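The connector scenario above, in which code compiles but fails at runtime, can be reproduced in a few lines of Java. The sketch below loads a class by name, which is how database drivers are commonly located; the compiler never checks whether the named class exists. The class name RuntimeDependencyCheck and the driver name com.example.db.HypotheticalDriver are assumptions made for illustration.

```java
/** Hypothetical sketch: a dependency that is only checked when the program runs. */
class RuntimeDependencyCheck {

    /** Returns true if the named class can be loaded from the current classpath. */
    static boolean isAvailable(String className) {
        try {
            // The name is just a string, so the compiler cannot verify it.
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class, or something it needs, is missing at runtime.
            return false;
        }
    }

    public static void main(String[] args) {
        // Part of the Java platform itself, so it can always be loaded.
        System.out.println(isAvailable("java.lang.String"));
        // Made-up driver name; unless such a class is on the classpath, loading fails.
        System.out.println(isAvailable("com.example.db.HypotheticalDriver"));
    }
}
```

This file compiles on any machine with a Java compiler, whether or not the second class is present, which is exactly why such dependencies go unnoticed until execution.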


Fig. 11.2 The dependencies of a Windows OS executable file

11.2.2 Software Documentation

Software documentation usually takes the form of written reports that accompany a particular piece of software. These reports usually explain the requirements for executing a software application (i.e., system requirements), the necessary resources for executing it, what the objective of the software is, and, usually, what the processes for accomplishing it are. There are different types of software documentation (e.g., written reports, Javadoc comments). Returning to our example, the first lines of the file yyy.java could be as shown in Fig. 11.3.

The OpenAPI specification offers a standard way of documenting web services. It defines a standard, programming-language-agnostic interface for describing REST APIs. It allows discovering and understanding the features of a web service without requiring access to the source code or deploying and running the service. OpenAPI documentation is represented in either YAML or JSON format, which enables documentation exchange among developers.

/**
 * @author F
 * @version 2.0
 */

Fig. 11.3 The first lines of code that contain Javadoc comments

11.2.3 IP Addresses and DNS

An IP address (Internet Protocol address) is a numerical label that is assigned to devices participating in a computer network that uses the Internet Protocol for communication between its nodes. In Internet Protocol Version 4 (IPv4), an IP address is a 32-bit number. Due to the growth of the Internet, a new addressing system (IPv6), using 128 bits for the address, was developed in 1995. Although IP addresses are stored as binary numbers, they are usually displayed in human-readable notations, e.g., 208.77.188.166 (for IPv4) and 2001:db8:0:1234:0:567:1:1 (for IPv6).

The Domain Name System (DNS) is a hierarchical naming system for computers, services, or any resources connected to the Internet (or a private network). Essentially, it translates domain names (which are easier for humans to remember than IP addresses) into the numerical (binary) identifiers (IP addresses) associated with networking equipment for the purpose of locating and addressing these devices worldwide. For example, www.example.com translates to 208.77.188.166. DNS assigns domain names to groups of Internet users irrespective of each user’s physical location. Consequently, web hyperlinks can remain consistent and constant even if the current Internet routing arrangements change or the participant uses a mobile device. However, the web is a dynamic environment that constantly changes: new contents are uploaded every day and previously existing information is updated, or vanishes. The corresponding issue of preservation is discussed in Sect. 14.2.2 (on web archiving and citation).
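The difference between the human-readable notation and the underlying binary number can be made concrete with the standard java.net.InetAddress class. The following is a minimal sketch; the class name IpAddressBits and its method names are hypothetical. When InetAddress.getByName is given a literal IP address, it only checks the format of the address and does not contact a DNS server.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical sketch: IP literals as fixed-width binary numbers. */
class IpAddressBits {

    /** Parses a textual IP literal (a host name here would trigger a DNS lookup instead). */
    static InetAddress parseLiteral(String literal) {
        try {
            return InetAddress.getByName(literal);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not a valid IP literal: " + literal, e);
        }
    }

    /** Returns the size of the address in bits: 32 for IPv4, 128 for IPv6. */
    static int bitLength(String literal) {
        return parseLiteral(literal).getAddress().length * 8;
    }

    /** Interprets a dotted IPv4 literal as a single unsigned 32-bit number. */
    static long ipv4AsNumber(String literal) {
        byte[] octets = parseLiteral(literal).getAddress();
        if (octets.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + literal);
        }
        long value = 0;
        for (byte octet : octets) {
            value = (value << 8) | (octet & 0xFF); // append the next 8 bits
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(bitLength("208.77.188.166"));
        System.out.println(bitLength("2001:db8:0:1234:0:567:1:1"));
        System.out.println(ipv4AsNumber("208.77.188.166"));
    }
}
```

Passing a domain name such as www.example.com to the same InetAddress.getByName call would ask the resolver of the operating system to perform the DNS translation described above.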

11.3 Pattern: External Behavioral (Runtime) Dependencies

Pattern ID: P8
Problem’s name: External behavioral (runtime) dependencies
The problem: The source code did not mention what version of the compiler should be used. Although Robert eventually managed to compile the source code, the derived bytecode did not function properly because the program was attempting to connect to a remote source whose credentials were unknown to him.
Type of digital artifacts: Source code of software
Task on digital artifacts: Compile, find, and retrieve dependencies
What could have been done to avoid this problem: From a software engineering point of view, Robert’s task would be easier if software was always accompanied by: (a) documentation that explains its compile dependencies, its specification, and its intended behavior; (b) a description of the “proxy pattern” (see references) which was used, because it would provide adequate information about the expected functioning of the remote server.
Lesson learnt: A source code, apart from compile dependencies, can have runtime (behavioral) dependencies. A piece of software (either executable, or uncompiled source code) without proper documentation is (or eventually will become) useless.
Related patterns: Previous: P6 (Executables: safety, dependencies). Next: –

11.4 Questions and Exercises

1. Can you find the proper Java compiler (which Java version) for a given Java file?
2. How can you find the proper Java virtual machine, i.e., the proper JRE (Java Runtime Environment) version to use for a given class file?
3. Use http://www.whatsmyip.org/ to find your IP address and to see related information (about your browser, approximate location, etc.).
4. Try to find the IP address of the following URL: http://www.ics.forth.gr
5. Try to find the physical location of the following IP address: 173.194.203.103
6. Find what “dependency injection” is in software engineering.

11.5 Links and References

11.5.1 Readings

About Software Dependencies
• Dependency hell. https://en.wikipedia.org/wiki/Dependency_hell

About IP Addresses and DNS (Domain Name System)
• https://en.wikipedia.org/wiki/Domain_Name_System

11.5.2 Tools and Systems

About Lookup Services for IP Addresses
• http://www.whatsmyip.org/

11.5.3 Other Resources

About Proxy Pattern
• https://en.wikipedia.org/wiki/Proxy_pattern

About Software Documentation
• https://en.wikipedia.org/wiki/Javadoc
• https://github.com/OAI/OpenAPI-Specification

Chapter 12: The File myFriendsBook.war: On Running Web Applications

12.1 Episode

It’s afternoon. Robert has just returned from his short lunch. He is drinking some cold green tea as he usually does every afternoon. He continues browsing the contents of the USB stick. He stops when he finds a file named myFriendsBook.war. “So far I have been searching over various files but this should definitely lead me to her,” he thinks. He knows very well what FriendsBook is, but he does not know what this particular file extension means. Therefore, he starts searching the Internet for this file type. Soon enough, he realizes that it is a web application archive. He finds that in order to execute it, he has to download a proper “web container” and then deploy the application on that container (a web container is the component of a web server that interacts with Java servlets and is responsible for managing the life cycle of servlets, mapping a URL to a particular servlet, and ensuring that the URL requester has the correct access rights).

He promptly calls his technical administrator, Scott, to help him with the WAR file. They download the latest web server and wait for the application to be deployed. After a few seconds, Scott visits the home page of the web container through a web browser to see if it is running. The server seems to be running but when they try to connect to that particular web application they face an error message informing them that the application is not there! Scott opens the log files of the web server to check if errors have occurred. Indeed, there are some errors indicating that there are some incompatibility issues with the version of the web container they are using. They do not know which version is the most suitable, and the WAR file itself does not contain any particular information about this. Scott had installed the latest version and was expecting the application to work, since such software commonly offers backward compatibility so that it can host and correctly run web applications developed in the past.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_12


They download a previous version of the web container and try to deploy the web application again. After a few seconds a message appears, informing them that the server is running, and they open a web browser to visit the web application. They face the same situation: a blank web page informing them that there is no web application with the name they tried. This confuses them even more, especially Scott. He checks the log messages again but he cannot find any error. “The WAR file has been deployed successfully, which means that we should see its first page here,” he complains. “Perhaps it is a web application that does not have any web interface,” he says.

The file “myFriendsBook.war” was actually the bundle of a web application developed by Daphne and her friends as an alternative way to interact with FriendsBook. They didn’t like the original version of FriendsBook because whenever they connected they received a lot of useless information such as sponsored pages, friends that liked other posts, and much more. What they wanted was a minimal version of FriendsBook that would allow them to see and find posts from their friends and nothing else. With this motivation they created their own web application, which used their credentials for logging in to FriendsBook and then presented only the appropriate information by filtering out the unwanted items. To minimize the overall size of the project, Daphne and her friends decided to store some common files (images, CSS files, scripts in JavaScript) on a particular remote server. These files were downloaded when the application was deployed. Unfortunately for Robert and Scott, this server was down at that time due to a short power outage. Moreover, it was not easy for Robert and Scott to understand the cause of the problem because these errors were not described in the log files. Robert encourages Scott to try to deploy it one more time.
This time a different message appears in the log messages informing them that the required resources have been downloaded. The remote server that contained the common resources is up again. They try to open the web application with the web browser again and this time they see a screen asking them for their credentials. They don’t know how to proceed. They try entering some “common” usernames and passwords like admin, user, etc., with no luck. Robert thinks that they should try their FriendsBook credentials. The application is named myFriendsBook after all. Scott enters his credentials. The application connects to FriendsBook and displays on the web browser a minimal version of the original interface of FriendsBook through which Scott can see all of his friends and their posts without any advertisement or sponsored link. He explains to Robert that it is like a lite edition of FriendsBook. The web application is running perfectly, but this success does not provide them with any new information about the identity of Daphne.

Scott then has an idea. “The log files show that some resources have been downloaded. If we find the location in the source code, we will have a clue.” Robert agrees and they start extracting and inspecting the contents of the WAR file. The WAR file comprises several JSP files (with extension .jsp) and class files (with extension .class), along with some files in XML. They traverse the files and inspect their components until they encounter a file named “config.properties”. They open it and find a URL with the name “resourceURL”. The URL is the address of an FTP server containing the files that the web application requires in order to be deployed. They copy that URL to their web browser and a list of files is revealed. The files themselves are not helpful, since they are images, logos, and stylesheet files. Neither is the URL informative, since it is just an IP address. They try to find more information about the owner of the IP address, and for this reason they use several sites that offer services that return information about IPs around the world. They hope that it will be the IP address of the girl’s university. Unfortunately, the IP address belongs to a web hosting company based in Kiribati. “We are now in the middle of the Pacific Ocean,” Robert says, disappointed.

12.2 Technical Background

This section contains information about WAR files (in Sect. 12.2.1). A WAR file is a compressed file that contains a collection of other files that together constitute a web application. Then (in Sect. 12.2.2) it describes cloud computing, the practice of using a network of remote servers to store, manage, and process data, rather than a local server or a personal computer. Finally (in Sect. 12.2.3), it describes MIT Scratch, a visual programming language that can be used and accessed through an online multimedia authoring tool, and thus a particular kind of web application.

12.2.1 WAR Files

A web application contains all the resources that are required to run properly on a web container. For web applications that use Java, these resources are servlets, .class files, .jsp files, static HTML pages, images, XML files, properties files, and many more. The contents of the web application can be packaged into a single file, which is called a Web Application Archive (WAR). The clear benefit of WAR files is that they contain all the logic and the resources for the representation of the web application in a single module; however, they cannot be created incrementally, in the sense that even for making minor changes, the whole archive has to be regenerated.

WAR files have a specific directory structure, which is shown in Fig. 12.1. More specifically, the files that are stored in the archive are divided into two main categories: (a) files that are accessible only by the server and are not forwarded by any means to the client and (b) files that are sent to clients. The files that fall within category (a) are stored under the folder WEB-INF. This folder usually stores the web.xml file that contains the directives that configure the web application, the server-side classes and the dependent libraries, server-side configuration files, etc. Files that fall within category (b) include static HTML and JSP pages, images, text files (either structured or unstructured), etc.
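The two categories of files described above can be told apart from the entry names alone. The following sketch assumes only the rule stated in the text, namely that everything under WEB-INF is kept on the server; the class name WarLayout, its method names, and the example entry names are hypothetical, and a real web container applies further rules that are not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: split the entries of a WAR into server-only and client-visible files. */
class WarLayout {

    /** Category (a): entries under WEB-INF/ are never forwarded to the client. */
    static boolean isServerOnly(String entryName) {
        return entryName.startsWith("WEB-INF/");
    }

    /** Category (b): every file entry that is not under WEB-INF/. */
    static List<String> clientVisible(List<String> entryNames) {
        List<String> visible = new ArrayList<>();
        for (String name : entryNames) {
            boolean isDirectory = name.endsWith("/"); // ZIP directory entries end with a slash
            if (!isServerOnly(name) && !isDirectory) {
                visible.add(name);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Made-up entry names, for illustration only.
        List<String> entries = Arrays.asList(
                "index.jsp",
                "images/",
                "images/logo.png",
                "WEB-INF/web.xml",
                "WEB-INF/classes/config.properties",
                "WEB-INF/lib/helper.jar");
        System.out.println(clientVisible(entries));
    }
}
```

Since a WAR file is a compressed archive of other files, the entry names for a real archive can be obtained with any tool that lists archive contents and then passed to clientVisible.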


Fig. 12.1 The structure of a WAR file

Fig. 12.2 The deployment of a web application

In order to make the functionality of WAR files runnable, web containers are used. Web containers are responsible for deploying WAR files, revealing their contents through HTTP, and managing the life cycle of their contents. Some indicative and widely used web containers are Apache Tomcat, GlassFish, Jetty, etc. Figure 12.2 illustrates the typical deployment of a web application.

12.2.2 Cloud Computing

Cloud computing is defined as a type of computing that relies on sharing computing resources rather than having local servers or personal devices to handle applications. In cloud computing, the word cloud (also phrased as “the cloud”) is used as a metaphor for “the Internet,” so the phrase cloud computing means “a type of Internet-based computing,” where different services (such as servers, storage, and applications) are delivered to an organization’s computers and devices through the Internet. Cloud computing relies on the sharing of resources to achieve coherence and economies of scale. Cloud resources are also dynamically reallocated to maximize the use of computing power.

A key concept and technology for cloud computing is virtualization, i.e., the separation of a physical computing device into one or more “virtual” devices, each of which can be easily used and managed to perform computing tasks. Another one is the service-oriented architecture (SOA) for dividing problems into services that can be integrated to provide a solution. The cloud provides all of its resources as services, and uses standards and best practices from the domain of SOA to allow global and easy access to cloud services in a standardized way. Instead of buying and maintaining dedicated hardware to support their business needs, organizations can now move to the cloud, i.e., they use a shared cloud infrastructure for their needs and pay for the infrastructure according to a “pay as you go” model.

In recent years, as the volume of data increases, the processing of such data also becomes increasingly demanding in terms of resources. For this reason, several platforms like Google Cloud Platform and Amazon Web Services have been launched, offering cloud processing capabilities. These platforms offer access to a set of services with different objectives and functionalities, including media transcoding and streaming, registry and directory services, relational and NoSQL databases, and even remote Windows desktops.

12.2.3 The Case of MIT Scratch

In recent years, several frameworks for developing web applications have emerged. Among other things, these frameworks aim to minimize the source code that has to be written for a web application. In addition, there are tools that enable the design and implementation of web applications using graphical user interfaces. Scratch is a visual programming language, widely used nowadays even in primary schools in several countries to introduce students to programming. It was designed by Mitchel Resnick (to help children aged 8 and above learn programming) and developed by the MIT Media Lab in 2003. Through its website, users can play and view existing Scratch projects; the projects are “played” in a web browser (using the Flash Player). The notion of community is important in Scratch; any Scratch project can be made visible and accessible to all the members of a community. The source code of a project is open and visible (Creative Commons attribution and share-alike license). Moreover, a member can remix a project and create a new one (the provenance is recorded). Members can also “like” projects or leave comments. The website of Scratch receives millions of visits per month and it has more than 28 million registered users and over 32 million projects.¹

Scratch allows users to use event-driven programming with multiple active objects called sprites. In the program editor, whenever the user selects a sprite, commands can be chosen from the supported categories: Motion, Looks, Sound, Pen, Data, Events, Control, Sensing, Operators, and More (as shown in Fig. 12.3). The user selects the desired command from the corresponding category and places it in the desired position in the flow of control of the corresponding event handler using drag-and-drop operations. The current version of Scratch does not treat procedures as first-class structures and has limited file I/O options. Figure 12.4 shows a screen of the web editor with the code of one sprite, specifically of the “Ball” of the game “Inflate the Balloon” (accessible through https://scratch.mit.edu/projects/64513642/).

Fig. 12.3 The categories of commands of Scratch

Fig. 12.4 The program of the Ball (designed and implemented using MIT Scratch)

¹ Statistics were generated in May 2018.


Let us now discuss the maintainability of Scratch projects. A user can create a Scratch project that is hosted by the website of Scratch and not on the user’s machine. In this case, the programmer only has the credentials for connecting to the system through which they can create, change, publish, or unpublish projects. If the programmer wants to preserve their work, one approach would be to collect screenshots of the project editor (they can be numerous). In this way, they could in the future “visually read” their Scratch code. However, if the Scratch platform ceases to operate in the future, then the programmer (or the community of programmers) would have to re-implement the Scratch programming language (or an emulator of that language) to be able to execute their code again. There is also an offline Scratch 2.0 editor that allows users to work on projects without an Internet connection. In that case, the user can keep the local file that corresponds to their project. An alternative way to preserve the executability of a Scratch project, regardless of the future of Scratch, is the following: one can download the project as a file with extension “.sb2” (version Scratch 2.0) or “.sb” (version Scratch 1.4). There is a converter from SB2 to SB format. Then, from SB format, one can convert it to EXE (executable for Windows), APP (Mac OS X), or JAR (executable on any machine with a JRE).
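A downloaded project file can also be inspected directly, which helps when documenting what exactly has been preserved. The following is a minimal sketch in Python, assuming that an “.sb2” file is a ZIP archive that contains a “project.json” description whose sprites are listed under “children” with an “objName” field; the function name summarize_sb2 is hypothetical and the layout should be checked against the files at hand.

```python
import json
import zipfile


def summarize_sb2(path):
    """Return the archive entries and the sprite names of an .sb2 file."""
    with zipfile.ZipFile(path) as archive:
        entries = archive.namelist()
        # Assumption: the project description is stored as "project.json".
        project = json.loads(archive.read("project.json").decode("utf-8"))
    # Assumption: sprites are listed under "children" and carry "objName".
    sprites = [child["objName"]
               for child in project.get("children", [])
               if "objName" in child]
    return entries, sprites
```

Calling summarize_sb2("myProject.sb2") on a local copy would list the media files and sprite names, which could then be stored next to the file as descriptive metadata.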

12.3

Pattern: The Execution of a Web Application

Pattern ID: P9

Problem’s name: The execution of a web application

The problem: Robert wants to execute a web application, i.e., an application that (in comparison to executable files) has more deployment dependencies. Moreover, web applications typically require the availability of a web browser on the client side (this is a runtime dependency).

Type of digital artifacts: Web application

Task on digital artifacts: Execute

What could have been done to avoid this problem: The web application had some external dependencies (during the deployment of the web application, some resources were fetched from a remote FTP server). The programmer could have included all these dependencies in the WAR file to minimize the risk of failing to deploy the web application (in case of no network access, or if the FTP server is not reachable). These dependencies were not clearly documented in the WAR file. As regards the dependency of the WAR file on a particular version of the web container, the programmer could have removed this dependency by replacing the functionality that is bound to the particular version of the container with more generic functionality that would not cause these problems.

Lesson learnt: Web applications are more complex to deploy. To run successfully (as intended), apart from server-side dependencies (i.e., web servers and application containers), they require the availability of web browsers on the client side. Sometimes, specific versions are required. For instance, a web application may require a browser version higher than a particular one to support a particular feature of HTML, CSS, or JavaScript. In brief, a web application can have “deployment dependencies” (e.g., web servers and application containers, or remotely accessible resources) and “runtime dependencies” (web browsers and other resources). Finally, informative error messages and detailed logging can greatly aid in spotting the cause of problems in software engineering.

Related patterns: Previous: P6 (Executables: safety, dependencies). Next: –

12.4

Questions and Exercises

1. Open a .war file, e.g., the one in the book’s USB stick, using common compression/decompression software (like rar or winzip), and then inspect its contents, i.e., the files that it contains.
2. Install a servlet container (e.g., Apache Tomcat) and deploy the .war file of the USB stick.
3. The .war terminates because it attempts to connect to a remote service that does not exist. Try changing the implementation by adding a proxy for the remote server. The new version should not terminate if the remote service is not functioning or is inaccessible.
4. Try to upload, deploy, and test the .war file (or one of yours) in Google Cloud Platform.
5. Find guidelines for error messages.
6. Find guidelines and tips for proper application logging.
7. Find websites or services to trace the physical location of a particular IP address.
8. Search the web to assess whether there is any automatic method for converting Scratch projects to Android applications.

12.5

Links and References

12.5.1 Readings

About Java Server-Side Technologies
• Servlet Specification. http://download.oracle.com/otndocs/jcp/servlet-3.0-froth-JSpec/

About Tutorials for Web Programming
• Marty Hall’s website with tutorials about Java and Web technologies. http://www.coreservlets.com/

12.5.2 Tools and Systems

About Web Servers
• Yeager, N. J., & McGrath, R. E. (1996). Web server technology: The advanced guide for World Wide Web information providers. Morgan Kaufmann.
• Apache Tomcat. http://tomcat.apache.org/
• GlassFish. https://glassfish.java.net/
• Jetty. http://www.eclipse.org/jetty/

About Cloud-Based Hosting of Web Applications
• Google Cloud Platform. https://cloud.google.com/
• Amazon Web Services. https://aws.amazon.com/
• Microsoft Azure. http://azure.microsoft.com/en-us/
• Rackspace Cloud Hosting. http://www.rackspace.com/

About MIT Scratch
• https://scratch.mit.edu/

Chapter 13

The File roulette.BAS: On Running Obsolete Software

13.1

Episode

It’s late afternoon. The tall eucalyptus in the backyard of the building casts its shadow through the western window, and the shadow wavers over the green carpet of his office. Robert is already exhausted. Apart from his daily activities, for the past few days he has additionally been acting as an investigator trying to disclose the identity of the “suspect”. So far, he has collected only some weak clues, so he reopens the folder with the contents of the USB stick. A folder named game grabs Robert’s attention. He opens it and sees inside a single file named “roulette.BAS”. Robert thinks that it does not look like an executable game; nevertheless he tries to open it. Unfortunately for him, the operating system informs him that the extension .BAS is unknown and, therefore, there is no software installed on his computer that is appropriate for opening the file. The OS requests that Robert choose the software, but he does not know what the extension .BAS is. After an Internet search he realizes that this file contains code in an old programming language (PL) called BASIC. He finds that BASIC was introduced in 1964. Robert wonders: “What is a young programmer, under 30 years of age, in accordance with the criteria of the competition, doing with an outdated (obsolete) file that could have been written up to 50 years before?”

This was one of the first programs that Daphne’s uncle had written on his old Amstrad 464. Her uncle was a computer engineer, and in fact he was the one who influenced Daphne at an early age to study computer science.

Robert opens the file using a plain text editor. An excerpt of the contents is shown in Fig. 13.1. The word “Drachmas,” the currency of Greece until the euro replaced it in 2002, strongly suggests to Robert that the author of the code was probably Greek. After all, he already has some indications about it, mainly from the language that the file poem.html was written in (Chap. 6). Robert is curious. He would like to compile and run this file. But how? It is software written some decades ago.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_13


...
10 PRINT "Welcome. You have "
20 PRINT amnt
30 PRINT "drachmas"
40 PRINT "place your bet (Input number < 0 to stop here)"
50 INPUT bet
60 IF (bet < 0) GOTO 100
...
100 END

Fig. 13.1 Excerpt of a program in BASIC

I should probably find a “converter” to transform it to a programming language that is currently in use. But which one? Ideally, to a programming language that we use in MicroConnect. Robert has not written a line of code for years; therefore, his computer does not have a compiler or interpreter installed. Robert calls Scott again (the technical administrator) and asks him about the programming languages that the local team uses. Scott says that everyone can use C++ and Java. Therefore, the next challenge is to find a converter that can transform BASIC source code to C++ or Java source code. Then Robert can compile the resulting source code and eventually run the produced executable on his computer. He searches to check whether such a converter exists and he finds one. He applies it to roulette.BAS, which produces as output a file roulette.cc containing C++ source code. He now installs a compiler for C++ and compiles the file roulette.cc. A new executable file named roulette.exe is derived. He is now eager to run it. He double-clicks on the file. “Yes it runs!” he exclaims. The interface is character based. It reminds him of the old days of computing. He spends 20 minutes playing that game. It is a simple roulette game. Although the graphical user interface (GUI) is text based, the programmer had done a good job of giving the impression that there is a ball that spins. “How imaginative and inventive the programmers of the old times were!” he ponders.

The previous week Robert had bought a tablet for his 12-year-old son. The tablet ran the TabletOS operating system, a lightweight operating system for mobile devices. He wonders if it would be possible to port that application to TabletOS, so that his son could become acquainted with what the “GUIs” of past decades looked like. “Well, that is not going to be easy” he thinks. He decides to call Scott to his office. Scott tells him that since Robert has already ported the source code to C++, he could probably use an emulator that runs on TabletOS and emulates the operating system that Robert has installed on his computer. Scott says that he needs one hour to find and test such an emulator, if it exists. After an hour, Robert receives an email from Scott with details about how to download an appropriate emulator for his needs. Robert installs the emulator on the tablet, and then adds the executable that was produced after compiling the source code of the game. After a few minutes Robert exclaims “It works!” Fig. 13.2 shows the series of conversions and emulations that were required for making the old game run on the tablet.


Fig. 13.2 The series of conversions and emulations that are required (roulette.BAS → convert → roulette.cc → compile → roulette.exe → emulate)

Robert is impressed by the fact that tools and processes exist for making a program that was written decades ago in an obsolete programming language runnable. However, he realizes that without the aid of a very capable technician, like Scott, he would not have achieved it. He realizes that it would be a good idea if the entire process could somehow be automated. Although Robert managed to run an old program and install it on his tablet, he did not make any progress regarding the identification of the creator of the file, the probable owner of the USB stick. As a good and committed inspector, he has to continue inspecting the files of the USB stick.

13.2

Technical Background

This section provides information about two legendary computers, Amstrad 464 and Commodore 64 (in Sect. 13.2.1), briefly describes two ancestors of the current programming languages, specifically BASIC (in Sect. 13.2.2) and Pascal (in Sect. 13.2.3), and, finally, discusses the aging of programming languages (in Sect. 13.2.4).

13.2.1 Amstrad 464 and Commodore 64

Amstrad 464 (formally Amstrad CPC464, where CPC stands for Color Personal Computer) was a home computer produced by Amstrad between 1984 and 1990 that sold around 3 million units. It had 64 KB of memory (RAM) and an internal cassette tape deck. A photo is provided in Fig. 13.3 (left).


Fig. 13.3 Amstrad 464 (left) and Commodore 64 (right)

Amstrad 464 was based on a Zilog Z80A processor, clocked at 4 MHz. It had its own operating system and a BASIC interpreter built in as ROM. It was used mainly for games, but a lot of programmers learned to program using this machine. Amstrad-related magazines appeared during the 1980s, including publications in countries such as Britain, France, Spain, Germany, Denmark, Australia, and Greece. Currently, there are various emulators for this machine.

Commodore 64, also known as C64, was an 8-bit home computer introduced in 1982 by Commodore International. Commodore 64 was based on an MOS 6510 microprocessor with a clock speed ranging from 0.985 MHz to 1.023 MHz. It used 64 KB of RAM, of which 38 KB were available for BASIC (see next section) programs. It used several peripherals, including floppy drives, dot matrix printers, a mouse, a video monitor, etc. It is known as the best-selling computer model of all time. In fact, it has been listed in the Guinness World Records, with estimates of its sales ranging between 10 and 17 million units. Its production stopped in 1994.

13.2.2 BASIC

BASIC (an acronym for Beginner’s All-purpose Symbolic Instruction Code) is a family of high-level programming languages whose original version was designed by John G. Kemeny and Thomas E. Kurtz back in 1964. Versions of BASIC were used in the mid-1970s and 1980s by the microcomputers of that time. This allowed people to develop software on computers they could buy. BASIC influenced new languages like Microsoft’s Visual Basic. The built-in BASIC commands fall within the following categories: (a) commands for manipulating data (e.g., LET), (b) flow of control commands (e.g., IF-THEN-ELSE, FOR, WHILE, GOTO, etc.), (c) commands controlling I/O (e.g., LIST, PRINT, INPUT, etc.), (d) commands offering mathematical functions (e.g., ABS, EXP, LOG, etc.), and (e) generic commands. Figure 13.4 shows a program in BASIC for computing the first 50 Fibonacci numbers.

1000 REM Fibonacci Numbers
2010 CLS
2020 REM The array F will hold the Fibonacci numbers
2030 ARRAY F
2040 LET F[0] = 0
2050 LET F[1] = 1
2060 LET N = 1
2070 REM Compute the next Fibonacci number
2080 LET F[N+1] = F[N] + F[N-1]
2090 LET N = N + 1
2100 PRINT F[N];" ";
2110 REM Go to line 2080 until 50 numbers have been printed
2120 IF N < 50 THEN GOTO 2080

Fig. 13.4 Computing the Fibonacci numbers in BASIC

The Fibonacci numbers are the numbers in the integer sequence characterized by the fact that every number after the first two is the sum of the two preceding ones. If we assume that F(0) = 0 and F(1) = 1, then the k-th number is defined by F(k) = F(k−1) + F(k−2), yielding the sequence 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, and so on. A tiling with squares whose side lengths are successive Fibonacci numbers can be used for approximating the golden spiral; specifically, the spiral can be created by drawing circular arcs connecting the opposite corners of squares in the Fibonacci tiling, as is shown in Fig. 13.5 (showing squares of sizes 1, 1, 2, 3, 5, 8, 13, and 21).

Fig. 13.5 The Fibonacci spiral
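For comparison with a language in current use, the recurrence F(k) = F(k−1) + F(k−2) can be sketched in Python as follows. This is only an illustrative sketch; the choice of Python and the function name fibonacci are assumptions made here and are not taken from the figures of this chapter.

```python
def fibonacci(count):
    """Return the first `count` Fibonacci numbers, starting from F(0) = 0."""
    numbers = [0, 1]
    while len(numbers) < count:
        # Each new number is the sum of the two preceding ones.
        numbers.append(numbers[-1] + numbers[-2])
    return numbers[:count]


if __name__ == "__main__":
    print(" ".join(str(n) for n in fibonacci(50)))
```

Unlike the BASIC program, this sketch needs no line numbers or GOTO; the loop condition expresses directly when the computation stops.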

13.2.3 Pascal

Pascal is a procedural programming language published in 1970 by Niklaus Wirth. Its name was chosen in honor of the French mathematician and philosopher Blaise Pascal. Pascal has its roots in the ALGOL 60 language, but it also introduced concepts and mechanisms that enabled programmers to define their own complex structured data types, and made it easier to build dynamic and recursive data structures such as lists, trees, and graphs. Some of the key features of Pascal are as follows: (a) it is strongly typed; (b) it offers extensive error checking; (c) it offers a plethora of data types like arrays, records, files, and sets; (d) it offers a variety of programming structures; (e) it supports both structured programming, through functions and procedures, and object-oriented programming. There are several compilers and interpreters for Pascal, the most famous ones being Turbo Pascal, Delphi, Free Pascal, and GNU Pascal. Figure 13.6 shows a program in Pascal for computing the first 50 Fibonacci numbers, while Fig. 13.7 shows the corresponding program in the Java programming language.

Fig. 13.6 Computing the Fibonacci numbers in Pascal

Fig. 13.7 Computing the Fibonacci numbers in Java

13.2.4 The Aging of Programming Languages

Since the creation of the first computers, many programming languages (PLs) of various kinds (low level vs. high level, general-purpose vs. domain-specific) have been created. There are hundreds of PLs; the exact number is difficult to determine because it depends on the counting method (i.e., on whether the various dialects and versions of a given PL are counted as different languages).


As regards aging, some PLs, although quite old, are still widely used, e.g., the programming language C, which was developed between 1969 and 1973. However, some of these old PLs are less frequently used. For instance, COBOL (designed in 1959) is not currently used for developing new software, but only for maintaining software that is written in that language and is still operational. In general, the less a programming language is used today, the smaller the corresponding communities of vendors and developers are, and consequently the harder the preservation of software written in that PL becomes. The general issue of software aging was identified long ago, e.g., by Parnas (1994). There are various periodic reports that attempt to measure the popularity of programming languages. For instance, the five most popular PLs for May 2018 according to the TIOBE Index¹ are Java, C, C++, Python, and C#. COBOL appears in the 28th position.

We should also mention that software written in older programming languages and platforms can run today thanks to emulators (we have encountered emulators in Sect. 9.2.5 and we shall discuss them also in Sect. 18.2.3). As an indication, an online emulator for executing programs written in COBOL can be found at https://www.tutorialspoint.com/compile_cobol_online.php.

Nowadays, software development is usually based on collaborative cloud-based environments (e.g., GitHub, Bitbucket, and others) that, among other services, support versioning and software documentation services (versioning is described in Sect. 17.2.9). The usage of such systems is also beneficial for the objectives of long-term software preservation. Since they are cloud-based (recall Sect. 4.2.4), they promise better bit preservation. Their ability to retrieve older versions, as well as the code documentation services they offer (software documentation was discussed in Sect. 11.2.2), are important for achieving interoperability, as well as for tracing the provenance of software.

13.3

Pattern: Software Written in an Obsolete Programming Language

Pattern ID: P10

Problem’s name: Understand and run software written in an obsolete programming language

The problem: Robert wants to understand and run software written in an obsolete programming language. This sometimes requires transforming the old source code to source code of a different (and more modern) programming language. This task can be complicated.

Type of digital artifacts: Software (obsolete)

Task on digital artifacts: Run, perceive

What could have been done to avoid this problem: Robert could have used a system that can offer dependency management services and supports emulators and converters [like Epimenides (see Kargakis et al. 2015; Kargakis and Tzitzikas 2014), which is also described in Chap. 18]. If the knowledge base of such a system contained information about all converters, emulators, and the software installed on the computers of Robert’s company, that system would inform Robert as to whether it is possible (or not) to run the file on his computer and what steps he would have to carry out to achieve this objective.

Lesson learnt: The task of understanding and running software written in an obsolete programming language can be difficult. Automatic dependency reasoning could be adopted for solving such difficult (for humans) problems.

Related patterns: Previous: P6 (Executables: safety, dependencies). Next: –

¹ TIOBE Index for May 2018. URL: https://www.tiobe.com/tiobe-index/. Accessed June 8, 2018. (Archived by WebCite® at http://www.webcitation.org/701Rd8Dp2).

13.4

Questions and Exercises

1. Search (through a web search engine) and find programming languages that are now obsolete.
2. Search and find the ten most popular programming languages today.
3. Search and find a tutorial for the programming language BASIC for Amstrad 464, 1986.
4. Search through a web search engine and find a compiler, interpreter, or emulator of BASIC that can operate on your computer.
5. Search through a web search engine for tools that allow converting C++ source code to Java code and vice versa.
6. Suppose that you have a game written in JavaScript which is runnable through web browsers. Search for tools that allow converting this game to an Android application for running it on smart phones and tablets.
7. Install the BASIC programming language on your computer and try to execute the file roulette.BAS. You can find it on the website of the book.
8. Find an appropriate emulator for executing the source code of the file roulette.BAS.
9. Find an appropriate converter for converting the source code of the file roulette.BAS to your favorite programming language and then try to execute it.
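The automatic dependency reasoning suggested by pattern P10 can be illustrated as a path search over known conversion, compilation, and emulation steps. The sketch below is a toy illustration in Python under stated assumptions: the entries of STEPS, the artifact labels, and the function find_plan are all hypothetical, and the code does not reproduce how Epimenides actually reasons.

```python
from collections import deque

# Hypothetical knowledge base: (input artifact, output artifact, action).
# The tools are placeholders, not references to real products.
STEPS = [
    ("BASIC source", "C++ source", "convert with a BASIC-to-C++ converter"),
    ("C++ source", "desktop executable", "compile with a C++ compiler"),
    ("desktop executable", "runs on TabletOS", "run inside an emulator"),
    ("Java source", "JAR file", "compile with a Java compiler"),
]


def find_plan(start, goal, steps=STEPS):
    """Breadth-first search for a shortest chain of actions from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if state == goal:
            return plan
        for source, target, action in steps:
            if source == state and target not in seen:
                seen.add(target)
                queue.append((target, plan + [action]))
    return None  # no chain of known tools reaches the goal
```

With these assumed entries, find_plan("BASIC source", "runs on TabletOS") returns the three steps that Robert and Scott carried out by hand, while a request with an unknown starting point returns None, signalling that the knowledge base lacks a suitable converter or emulator.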

13.5

Links and References

13.5.1 Readings

About Amstrad CPC and BASIC
• Weidenauer, B. (1987). AMSTRAD PC1640 Technical Manual. Amstrad. URL: http://www.seasip.info/AmstradXT/1640tech/section1.html. Accessed May 3, 2018. (Archived by WebCite® at http://www.webcitation.org/6z8b7NviT)
• Kemeny, J. G., Kurtz, T. E., & Cochran, D. S. (1968). Basic: A manual for BASIC, the elementary algebraic language designed for use with the Dartmouth time sharing system. Dartmouth Publications. (Web link: http://bitsavers.informatik.uni-stuttgart.de/pdf/dartmouth/BASIC_Oct64.pdf)

About Epimenides System
• Kargakis, Y., Tzitzikas, Y., & van Horik, R. (2015). Epimenides: Interoperability reasoning for digital preservation. 2015-01-20. https://phaidra.univie.ac.at/detail_object/o:378066
• Kargakis, Y., & Tzitzikas, Y. (2014). Epimenides: An information system offering automated reasoning for the needs of digital preservation. In Proceedings of the 14th ACM/IEEE-CS joint conference on digital libraries (pp. 411–412). IEEE Press.

About Aging of Programming Languages and Software
• Parnas, D. L. (1994, May). Software aging. In Software Engineering, 1994. Proceedings. ICSE-16., 16th International Conference on (pp. 279–287). IEEE.

13.5.2 Tools and Systems

A List of Amstrad CPC Emulators
• http://www.cpcwiki.eu/index.php/Emulators

Programming in BASIC
• http://justbasic.com/
• http://www.freebasic.net/
• http://www.quitebasic.com/

BASIC PL Converters
• BaCon. http://www.basic-converter.org/
• https://code.google.com/p/vb6-to-java/
• JSBasic. http://www.codeproject.com/Articles/25069/JSBasic-A-BASIC-to-JavaScript-Compiler
• VARYCODE. https://varycode.com/

Chapter 14

The Folder myExperiment: On Verifying and Reproducing Data

14.1

Episode

It’s 11 pm. Robert is at home browsing a leaflet of local cultural events. His gaze stops at the picture of a sculpture depicting a naked girl. Who was the woman who posed? It is unknown, just like the woman that submitted the best solution in the contest of MicroConnect. Although Robert has not yet managed to recover the identity of the unknown woman, the fact that this afternoon he managed to run old software infused him with a feeling of optimism. Although he had long ceased working late in the evening on work issues, tonight he could not resist the temptation. He opens his laptop and continues browsing the contents of the USB stick. He focuses on a folder named myExperiment. That folder contains a subfolder named paper that contains a PDF file. Robert opens it. It is a scientific research paper; however, it is anonymized. It does not contain any information about the authors of the paper, or their affiliation and contact emails. This is a policy followed by several scientific conferences and journals. The scientific papers are submitted anonymized to avoid bias in the review process, and only if a paper gets accepted are the author names added in the published version. Even though the paper was anonymized, Robert finds the title and the summary of the paper appealing; therefore, he decides to read it. Daphne was preparing a paper for submission to a forthcoming research conference. The paper claimed that social networking services can keep records of the browsing history of their users. These services could also keep information about the browsing history from users that are not registered members of these social networks. In that paper, Daphne was experimenting with FriendsBook; specifically, she was analyzing all the web (HTTP GET) requests that had been issued from the network of her university for a period of 6 months. 
However, in order to ensure the anonymity of the paper, the name of the university, as well as other critical information (e.g., the IP addresses), was anonymized too. The experiments in the paper revealed that a large number of requests concerned HTML pages that included FriendsBook elements in their source code, e.g., a Like or Share button. Daphne exploited the API provided by FriendsBook and proved that the social network could archive the fact that a particular user visited a specific web page. The most interesting fact was that this information could be archived even if the user was not registered or signed in to FriendsBook.

Robert wonders whether that claim is true. He visits the official FriendsBook developers’ site in order to read more about the API that FriendsBook offers. He finds step-by-step instructions for social plug-ins (like, share, and send buttons, comments, embedded posts, etc.). There is a full code example about how to add a like button to a website, which Robert tries to understand. He concludes that the claims of the paper are correct; the observation about image requests is true and can be easily verified by anyone. However, there are some questions in his mind: Were the reported numbers and percentages correct? Were the data real and valid? Have the measurements been performed correctly? He soon realizes that these questions are applicable to most of the scientific papers and reports he has been through. He thinks, “Publishing only the results of a research work and discussing them is only half the job. The authors of such publications should guarantee that they also provide the required information (including the algorithms, the software, the datasets) for validating and reproducing the scientific results.”

He decides to dive deeply into the paper and experiment himself. He is determined to validate the results of the publication by repeating the experiments. He has already seen the algorithms in the paper and the references to the documentation pages of the public API that were used. He starts exploring those documentation pages, but he is astounded to see that all the web pages he finds are practically nonexistent. He compares the URLs of the web pages that are cited in the paper with the URLs he found earlier on the web and notices that they are different. He tries to figure out what is happening and understands that the remote API has changed. The paper used older references that no longer exist. He soon realizes that he has been in similar situations many times before, especially when he tried to visit an old bookmark of his browser or follow a link from an old email. He is frustrated because he understands that this dynamic feature of the Internet, where web resources can be moved or can disappear, could be an obstacle, especially for the reproducibility of scientific results.

Robert remembers the words of Jeff Collins, an old professor at his university. When Robert was studying, Mr. Collins was quite old, just a few years before his retirement. He may have had the reputation of a demanding and fussy professor, but many students liked him. He remembers that in a conversation they had in a lecture break, Mr. Collins said, “A conclusion is reckoned scientific only if someone else can verify it. If it is a mathematical proof then someone else must be able to check that all the steps of the procedure are correct and that there is no gap. If it is experimental work then all the conditions of the experiment must be described so that someone else can repeat the experiment and verify its results. Anything else is either conjecture or dogmatism. If a conjecture or opinion has not been scientifically studied, it does not necessarily mean it is not justified; we simply do not know whether it is true or not. And never forget the boundaries of our proofs; do not forget the work of Kurt Gödel.”

It’s rather late for such thoughts. Robert is exhausted and his eyelids feel heavy. He closes the laptop and goes to rest.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_14

14.2 Technical Background

The concept of provenance, which is important for the reproducibility of scientific results, was already discussed in Chap. 7. Here (in Sect. 14.2.1) we discuss HTML and remotely fetched images (since they are related to the plot) as well as issues related to web archiving and citation (in Sect. 14.2.2). Then we briefly discuss scientific publishing (in Sect. 14.2.3), the issue of trust in digital repositories (in Sect. 14.2.4), the data–information–knowledge–wisdom hierarchy, the recent trend toward lab notebooks (in Sect. 14.2.5), and, finally, (in Sect. 14.2.6) we discuss Kurt Gödel’s work as it has been mentioned in the episode.

14.2.1 HTML and Remotely Fetched Images

HTML was briefly discussed in Sect. 6.2.2. Now look at the contents shown in Fig. 14.1 and suppose they are stored in a file P1.html that is hosted on a website and that the URL of the page is http://www.example.org/P1.html. Now suppose that the author of that page would like to add at the bottom of the page a “like” button coming from a hypothetical social networking application, say FriendsBook. To do so, the author has to embed in the HTML code of the page additional code that is provided by FriendsBook. An indicative sketch of the required additions in the HTML code is shown in Fig. 14.2. Note that even though the above web page is hosted on the website http://www.example.org, whenever any user visits that page, the user’s browser will connect to FriendsBook, in this way allowing FriendsBook to make a record of the visit.

Fig. 14.1 The HTML code of a web page

14 The Folder myExperiment: On Verifying and Reproducing Data

Fig. 14.2 The HTML code enriched with FriendsBook like buttons
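
The point of Sect. 14.2.1, that a page hosted on one site can make the browser contact another site, can be illustrated with a minimal sketch. It lists the hosts other than the page's own that are referenced through src attributes. The class name, the host friendsbook.example, and the HTML snippet are assumptions made for illustration (they are not the code of Figs. 14.1 and 14.2), and the regular expression is a rough approximation rather than a full HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThirdPartyRequests {
    // Matches src="..." or src='...'; a rough approximation, not a full HTML parser.
    private static final Pattern SRC =
        Pattern.compile("src\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the hosts, other than the page's own host, that the page's src attributes point to. */
    static List<String> thirdPartyHosts(String html, String pageUrl) {
        URI page = URI.create(pageUrl);
        List<String> hosts = new ArrayList<>();
        Matcher m = SRC.matcher(html);
        while (m.find()) {
            try {
                // Relative references (e.g., "logo.png") resolve to the page's own host.
                String host = page.resolve(m.group(1)).getHost();
                if (host != null && !host.equalsIgnoreCase(page.getHost()) && !hosts.contains(host)) {
                    hosts.add(host);
                }
            } catch (IllegalArgumentException e) {
                // Skip values that are not valid URIs.
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Hypothetical page content with one local image and one remotely fetched script.
        String html = "<html><body><h1>P1</h1>"
            + "<img src=\"logo.png\">"
            + "<script src=\"https://friendsbook.example/sdk.js\"></script>"
            + "</body></html>";
        System.out.println(thirdPartyHosts(html, "http://www.example.org/P1.html"));
    }
}
```

Every host in the returned list receives a request whenever a browser renders the page, which is what makes the kind of record keeping described above possible.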

14.2.2 Web Archiving and Web Citation

The web is a dynamic environment that constantly changes. New content is uploaded every day and existing information is updated (a related discussion about the rate at which data are produced and uploaded on the web can be found in Sect. 1.1). Likewise, web contents can disappear at any time for various reasons (e.g., they are removed by users or moved to other locations). This creates particular problems for the citation and preservation of web contents, since they are often used as references (e.g., in scientific publications). There are two approaches that try to solve this problem: the construction and maintenance of web archives and the citation of particular versions of a web resource. The former mainly provides a solution for web resources that have “disappeared,” and the latter deals with dynamic web resources.

Web archiving is the process of harvesting contents from the World Wide Web (e.g., a website) and storing them in archives to ensure that they can be discovered even if the actual contents “disappear.” Archives use web crawlers that periodically collect and store contents from the web. As a result, they construct an archive of different snapshots (in time) of the web resource. Apart from the contents
of web-accessible information (including text, images, multimedia contents, etc.), they also store metadata about the collected contents. Specifically, the Internet Archive is an online digital library with the mission of providing universal access to all knowledge. The archive began in 1996, when the Internet was starting to grow, and nowadays it contains archives of web resources from various time periods. By the time these lines are published,¹ the library will have included approximately 280 billion web pages and millions of books, texts, and multimedia resources, and the estimated storage required for the Internet Archive is more than 30 petabytes. Since 2010, the Internet Archive has been running Worldwide Web Crawls of the global web. The crawls are initiated using sets of URLs called seed lists. In addition, several rules are applied to the logic of each crawl, defining things like the depth that the crawler will try to reach. As regards the frequency of updating, the archives are updated in every crawl; however, sites with dynamic content (e.g., news sites) are updated more frequently.

Web citation is similar to web archiving. The main difference is that it allows archiving a web resource on demand. Upon user request, it constructs a unique reference (i.e., a URL) that points to an archived copy of the web resource and ensures that it will be available for the long term, even if the actual resource changes or is removed. A special case describing the preservation of weblogs (blogs for short) can be found in Sect. 17.2.10.
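
The crawl logic mentioned above (a seed list plus a rule that limits the depth) can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the class name is hypothetical, the "web" is a small in-memory map from each URL to the URLs it links to, and nothing is actually fetched or stored. It does not describe how the Internet Archive's crawler is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SeedCrawl {
    /** Returns the pages reachable from the seeds within maxDepth link hops, in visiting order. */
    static Set<String> crawl(List<String> seeds, Map<String, List<String>> links, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String url : frontier) {
                if (!visited.add(url)) {
                    continue; // already harvested in this crawl
                }
                // A real crawler would fetch the page here and store a timestamped snapshot.
                for (String target : links.getOrDefault(url, List.of())) {
                    if (!visited.contains(target)) {
                        next.add(target);
                    }
                }
            }
            frontier = next; // pages one hop further away
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical link structure: home -> news, about; news -> story; story -> comments.
        Map<String, List<String>> web = Map.of(
            "http://example.org/", List.of("http://example.org/news", "http://example.org/about"),
            "http://example.org/news", List.of("http://example.org/story"),
            "http://example.org/story", List.of("http://example.org/comments"));
        System.out.println(crawl(List.of("http://example.org/"), web, 2));
    }
}
```

With a depth limit of 2, the page that is three hops away from the seed is not harvested; raising the limit trades storage and crawl time for coverage.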

14.2.3 Proposals for Changing the Scientific Publishing Method

The number of scientific articles published each year keeps increasing (Ware and Mabe 2015). Indicatively, around 2.5 million articles were published in 2014 and this is the average annual rate of new publications (Boon 2016). According to the current method of scientific publishing, scientists and researchers submit their work, in the form of articles, to scientific conferences and journals in their area. The submitted papers are reviewed by the members of the program committee, which comprises experienced and recognized scientists and researchers. In computer science, each submitted paper is usually evaluated by three reviewers. Papers are evaluated with respect to importance, originality, soundness, technical depth, and presentation quality. Each reviewer, apart from providing comments and constructive suggestions, makes an overall recommendation, usually on a scale of Strong Reject, Weak Reject, Borderline, Weak Accept, Strong Accept. The authors receive the reviews of their papers but they cannot see the names of the reviewers. The best

¹ End of 2018.

articles from those submitted are then selected by the chair(s) of the program committee for publication in the conference proceedings or in the journal. In general, paper reviewing is a laborious and time-consuming process. Those serving as reviewers do not have any tangible benefit. They do it mainly because they feel they can contribute to their area. Moreover, since they also submit papers, the load of reviewing has to be shared; so without scholarly peer reviews, the entire process is not sustainable.

There are several discussions and proposals for enhancing the reproducibility of research findings, and for improving the reviewing process as well as the justification for the significance of a research paper. Several reasons motivate this discussion. One of them is that the list of retracted scientific studies grows, and this is observed in many disciplines including biology, medicine, physics, and psychology. Specifically, there are published papers that are based on falsified or fabricated data or on fake experiments, or that contain manipulated images and charts. For checking the validity of the research results, research papers that are based on measurements, data, and analysis of data could be accompanied by the datasets used and the algorithms that were applied to these data. This would allow anyone in the community to check whether the dataset was credible and whether the analysis was performed correctly. Unfortunately, as Faria et al. (2013) show, fewer than half of the academic papers or web pages are archived. We should not forget that the advent of academic journals in the seventeenth century was the result of the societal demand for access to scientific knowledge. The path toward open science continues today.

In this context, the term Open Access refers to making research outputs accessible to all, without charges or other kinds of restrictions (for the impact of Open Access, see Tsakonas and Papatheodorou 2008; Antelman 2004). Moreover, the reviewing phase in scientific conferences and journals could be improved; currently, reviewers do not have any substantial incentive to review well, and their performance is not tracked or rewarded. As regards the significance/importance of research papers, one approach is to develop a kind of marketplace where the significance of a paper rises and falls based on its reception by the community.

Scientific data archiving refers to the long-term storage of scientific data and methods. Some journals have requirements about the data and the methods that are used to collect them. These data should be in a public archive, and there are already several scientific data archives for different scientific fields, e.g., the NCAR Research Data Archive for atmospheric and geosciences research, Dryad for medical publications, and the ESO/ST-ECF Science Archive Facility for astronomical data. Finally, we should mention that the issue of validity and reproducibility does not concern solely the scientific community; it concerns all public organizations, politicians, even the legal sector (e.g., see Spencer 2015).

14.2.4 Trustworthy Digital Repositories

The mission of trustworthy digital repositories is to provide reliable and long-term access to their digital resources for their designated communities. Of course, this means that such a repository should consider the threats and risks that exist, and this requires constant monitoring, planning, and maintenance. It becomes evident that trustworthiness is not a one-time accomplishment. It should be retained by undertaking a regular audit and certification cycle. Audit and certification refer to the formal process that is usually carried out and delivered by external service providers. It is a time-consuming process that aims at demonstrating to a wide audience that a product (i.e., a digital repository) complies with one or more particular standards. Audit and certification provide: (a) reassurance that anyone, besides the repository managers, can tell that the repository is digitally safe; (b) confidence that, besides bits, the semantics of the digital resources are also preserved, so that they will remain usable in the future; and (c) recommendations about areas that need improvement.

Trustworthy digital repositories are governed by a family of three ISO standards (which have been developed by the same international group):

• ISO 14721:2012 (open archival information system (OAIS), Reference model), which defines a framework for the understanding and increased awareness of archival concepts needed for long-term digital information preservation.
• ISO 16363:2012 (audit and certification of trustworthy digital repositories), which defines a repository as an organization that is responsible for digital preservation, and not just as a technical element. It comprises 109 metrics divided into three areas: organizational infrastructure, digital object management, and infrastructure and security risk management.
• ISO 16919:2014 (requirements for bodies providing audit and certification of candidate trustworthy digital repositories), which is meant primarily for those setting up and managing the organization performing the auditing and certification of digital repositories.

A related certification organization is CoreTrustSeal,² which offers a core-level certification. The preparations for certifying cultural-heritage-related repositories in the Netherlands are described by Sierman and Waterman (2017).

² https://www.coretrustseal.org/

Fig. 14.3 Data, information, knowledge, insight, wisdom

14.2.5 The Data–Information–Knowledge–Wisdom Hierarchy

Above we stressed that data are crucial for the credibility of scientific findings. Apart from the term “data,” various other related terms are encountered frequently, like “information” and “knowledge.” Various models have been introduced for distinguishing these concepts. The distinction, however, is not always clear as it is domain-, perspective-, and granularity-specific. Essentially, these categorizations (e.g., see Rowley 2007, for a review) aim at providing some kind of hierarchical organization. For instance, Fig. 14.3 provides an illustration of “Data,” “Information,” “Knowledge,” “Insight,” and “Wisdom.” Although this figure can be interpreted in various ways (as mentioned earlier), one possible interpretation, just for conveying the main idea, is as follows: We start from “Data,” which are raw elements with no meaning or connection. By adding a kind of meaning or interpretability (color in this illustration), we get “Information.” By connecting information together, we get “Knowledge” (i.e., connected and contextualized information). By analyzing the “knowledge network” (e.g., probabilistically), we get “Insight” (e.g., we distinguish the more important information). Finally, the connectivity of the more important information can be considered as a kind of “Wisdom.”

These distinctions sometimes correspond to data processing levels. We can use an example from ESA (European Space Agency). The GOME (Global Ozone Monitoring Experiment) dataset consists of data captured from sensors onboard the ESA ERS-2 (European Remote Sensing) satellite. The captured measurements are sent to a ground acquisition station (at the Kiruna Station), and transferred to an Archiving Facility (at ESA-ESRIN) for long-term preservation and to a Processing Facility (at DLR, the German Aerospace Center) for various data transformations that yield various kinds of products. Datasets are distinguished according to their processing level as Level 0 (raw data), Level 1 (radiances/reflectances), Level 2 (geophysical data such as trace gas amounts), and Level 3 (a mosaic composed of several Level 2 data with interpolation of data values to fill the satellite gaps), as shown in Fig. 14.4.

Fig. 14.4 An example of data (processing) levels of ESA
[Figure labels: Level 0 Raw Data; Level 1 Calibrated Radiances; Level 2 Atmospheric Trace Gas; Level 3 Global Maps; connectedness, understanding, value, applicability]

It is worth mentioning at this point the concept of “laboratory notebooks.” A laboratory notebook is a preliminary record of research used by scientists, engineers, and technicians to document research, experiments, and procedures performed in a laboratory. It serves as an organizational tool and a memory aid, and it can be used for protecting intellectual property. An electronic lab notebook (ELN for short) is a computer program designed to replace paper laboratory notebooks. ELNs enable interactive computing and are used for data analysis in research, education, journalism, and industry. Platforms for ELNs are essentially cloud-based solutions that enable groups of people to collaborate and share data and code. Emphasis is placed on combining text with live code to better support interactive data science and scientific computing, and thus reproducibility. One such system is Project Jupyter (recipient of the ACM Software System Award for 2017); at the time of writing, there are more than two million Jupyter notebooks on GitHub.

14.2.6 Gödel’s Incompleteness Theorems

Kurt Gödel (1906–1978) is considered, along with Aristotle, Alfred Tarski, and Gottlob Frege, one of the most significant logicians in history. His incompleteness theorems (published in 1931) were the first of several closely related theorems on the limitations of formal systems. Roughly, he demonstrated that in any consistent formal system powerful enough to express basic arithmetic there are statements that can be neither proved nor disproved within that system. Gödel’s incompleteness theorems are related to Tarski’s undefinability theorem on the formal undefinability of truth and to Turing’s theorem (that there is no general algorithm to solve the halting problem, as was mentioned in Sect. 9.2.2).

14.3 Pattern: Reproducibility of Scientific Results

Pattern ID: P11
Problem’s name: Reproducibility of scientific results
The problem: Robert would like to reproduce a scientific result; however, this is not possible because the data and the processes that were applied on the data are not available
Type of digital artifacts: Scientific article
Task on digital artifacts: Reproduce scientific results/experiments
What could have been done to avoid this problem: If the paper that Robert opened, like all published papers, had been accompanied by the related datasets and the algorithms used, and the URLs had been preserved, then Robert would have been able to reproduce the results
Lesson learnt: The performability of the task “reproducibility of scientific results” requires data, algorithms, and provenance. In general, reproducibility is of utmost importance in science, which in turn presupposes provenance and transparency. But this is not always easy because there are cases when this is incompatible with the anonymity that is necessary to protect privacy and ensure impartiality. Furthermore, web links are sensitive in terms of changes and removals. Platforms for electronic lab notebooks aim at providing a technical solution for tackling these problems in a more systematic and less laborious manner
Related patterns:
iPrevious:
• P3 (Text and symbol encoding)
• P4 (Provenance and context of digital photographs)
• P7 (Software decompiling)
iNext: –

14.4 Questions and Exercises

1. Use your web browser and see the HTTP requests that are issued by your computer. Hint: Via Chrome, hit Ctrl+Shift+I and go to the corresponding tab.
2. Use a web application (like http://supportdetails.com/) to see what information about you is maintained in the various websites that you visit.
3. Search the Internet for retracted scientific studies that were based on fabricated data or manipulated images.
4. Search the Internet for patents based on fraudulent papers.
5. Search the Internet for flawed data on global warming.
6. German Chancellor Angela Merkel, while speaking at a rally in the western German town of Meschede in May 2011, mentioned that southern Europeans are not working enough, while Germans are expected to bail them out. Find data for checking whether that statement was true at that time. Do the same for other statements from political persons.
7. The file in the USB stick 2014 Temperatures in Heraklion.txt contains temperature measurements in Heraklion Crete, Greece. Using Excel (or a program), compute the average temperature or the mean temperature. You will get 18.9, which is definitely wrong (it is not that cold in Crete). The erroneous value is based on one wrong data value. Spot it.
8. Search the Internet and check if there are scientific publishers who, apart from offering access services for research papers, also offer storage, access, and curation services for the data that were used in the research papers.

14.5 Links and References

14.5.1 Readings

About Academic Publishing
• Buneman, P., Khanna, S., Tajima, K., & Tan, W. C. (2004). Archiving scientific data. ACM Transactions on Database Systems (TODS), 29(1), 2–42.
• Ware, M., & Mabe, M. (2015). The STM report: an overview of scientific and scholarly journal publishing.
• Boon, S. (2016). 21st century science overload. Canadian Science Publishing.
• Faria, L., Akbik, A., Sierman, B., Ras, M., Ferreira, M., & Ramalho, J. C. (2013). Automatic preservation watch using information extraction on the web: a case study on semantic extraction of natural language for digital preservation. In iPRES 2013-10th International Conference on Preservation of Digital Objects (pp. 215–224). Biblioteca Nacional de Portugal (BNP).

About Retracted Scientific Results and Related Flaws
• Spencer, S. H. (2015). FBI admits flaws in hair analysis over decades. The Washington Times, 18 April 2015.
• Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.

About Reproducibility in E-Science
• Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature News, 533(7604), 452.
• Freire, J., Fuhr, N., & Rauber, A. (2016). Reproducibility of data-oriented experiments in e-Science (Dagstuhl Seminar 16041). In Dagstuhl Reports (Vol. 6, No. 1). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

About Open Access
• Tsakonas, G., & Papatheodorou, C. (2008). Exploring usefulness and usability in the evaluation of open access digital libraries. Information Processing & Management, 44(3), 1234–1250.
• Antelman, K. (2004). Do open-access articles have a greater research impact? College & Research Libraries, 65(5), 372–382.

About the DIKW Hierarchy
• Rowley, J. (2007). The wisdom hierarchy: representations of the DIKW hierarchy. Journal of Information Science, 33(2), 163–180.

About Gödel’s Incompleteness Theorem
• Gödel, K. (1992). On formally undecidable propositions of Principia Mathematica and related systems. Courier Corporation.

About Trustworthy Digital Repositories
• Yakel, E., Faniel, I. M., Kriesberg, A., & Yoon, A. (2013). Trust in digital repositories. International Journal of Digital Curation, 8(1), 143–156.
• Houghton, B. (2015). Trustworthiness: self-assessment of an institutional repository against ISO 16363-2012. D-Lib Magazine, 21(3/4), 1–5.
• Ambacher, B., Ashley, K., Berry, J., Brooks, C., Dale, R. L., Flecker, D., et al. (2014). Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC).
• Sierman, B., & Waterman, K. (2017). How the Dutch prepared for certification. In 14th International Conference on Digital Preservation, Kyoto, Japan.

14.5.2 Other Resources

About Audit and Certification
• ISO 16363:2012 – Audit and certification of trustworthy digital repositories (https://www.iso.org/standard/56510.html)
• ISO 16919:2014 – Requirements for bodies providing audit and certification of candidate trustworthy digital repositories (https://www.iso.org/standard/57950.html)
• CoreTrustSeal (https://www.coretrustseal.org/)

About Web Archives and Web Citation
• International Internet Preservation Consortium (IIPC): http://netpreserve.org/
• Maureen Pennock. (2013). Web-Archiving. DPC Technology Watch Report 13. March 01, 2013.
• The Internet Archive (https://archive.org/)
• WebCite (http://www.webcitation.org)

About Scientific Publications Preservation
• The Keepers Registry (https://thekeepers.org)

Chapter 15

The File MyContacts.con: On Reading Unknown Digital Resources

15.1 Episode

May 27

Robert wakes up earlier than usual. His mind is stuck on the USB stick. He decides to open his laptop again. He begins to search for folders and files with a name that refers to identity data. After a few minutes, he finds a file called MyContacts.con. Robert feels relieved, thinking that he has finally found a file containing the contacts of the mysterious girl. “This file could contain contact details, like emails and phone numbers, and if I am lucky, I could also find people from her family, if they are stored using names like Mom, Dad,” he thinks. Robert is determined to communicate with every single contact until he finds the mysterious girl. “But first let’s see the contents of that file,” he says. He double-clicks on the file, and a pop-up message shows up, informing him that the operating system cannot open the file because the file extension is unknown. “MyContacts.con,” he whispers, “I haven’t seen this file extension before.” He thinks that the file might come from one of the thousands of applications on the Internet that store data in their own proprietary format. “OK, I will use a web search engine to find the proper application, if it exists,” he thinks. He starts searching the Internet, using several web search engines, but the only relevant result he finds is a small application for creating animated images (gif) whose default extension for storing files is .con. “Strange,” he mumbles, “who would have saved their contacts as an animated image?” He has the impression that this application is not what he is looking for, and he is right. He downloads and installs the application; however, an error message appears on his screen again.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_15

Cannot open file: MyContacts.con. He entertains the idea that it might be an old file, in an obsolete format that is no longer supported by modern operating systems. He remembers reading an article the other day about a big project that was related to digital media obsolescence. The article narrated the efforts carried out for preserving the contents of the Domesday Book, a manuscript compiled in 1086 AD that describes in remarkable detail the landholdings and resources of late eleventh-century England. Nine centuries later, in 1986, an effort was made to produce a modern version of the Domesday Book, for celebrating the 900th anniversary of the original. Over 1 million people contributed to this digital snapshot of the UK with information that they thought would be of interest for another 1000 years. The result was named “The BBC Domesday Project” and made use of the cutting-edge technology of that period, using particular software and hardware solutions that raised the cost to almost 2.5 million pounds. However, 16 years later, the great advances in computing technology revealed that the contents of the BBC Domesday Project had become obsolete. The special hardware that was designed to store and reproduce the multimedia contents was obsolete, and the contents were practically unreadable. At that point Robert thought that it was a great irony that the original Domesday Book, compiled almost 1000 years ago, was still intact and readable, while the digital version was useless only 16 years later. Now he is facing a similar situation: he has a file that is unreadable.

“I have to try something else,” he thinks, and starts changing the extension of the file. He tries changing the extension to “.txt”, “.rtf”, “.xml”, “.doc”, “.xls”, “.csv”, “.odt”, and uses various applications including text editors and spreadsheet editors, but none of them opens and renders the contents of the file correctly. He tries almost every application that is installed on his computer; he even tries to open it using an image/video editor, with no luck. After these fruitless efforts, he decides to search for and try a format recognition application. Such programs analyze the content of a file and determine its format. He uses JHOVE for analyzing the file; however, the results do not reveal anything new. JHOVE reports that MyContacts.con is a file in binary format (not text) and returns information that is not very useful to him, such as the date the file was created and the date and time it was last updated. “I’ll never find her,” he murmurs.

Indeed, it was almost impossible for Robert, or anyone else, to make sense of this file. The file had been produced by a small application developed by Daphne when she was attending a course on object-oriented programming and Java. During that period Daphne had started developing a small application for managing her contacts, i.e., for storing them, editing them, grouping them according to various criteria, and searching them. The application exported all the contacts to a single file with extension “.con”. This file was then loaded by the application during start-up. The file contained a series of serialized Java objects. To maximize the security of her contacts, the file was stored encrypted using a simple encryption method that she had
devised; specifically, it was about adding a small number to every byte, in her case, 7. The application was able to decrypt it upon loading, if the user entered the key that had been used at encryption. Consequently, it is almost impossible for anyone to make sense of such a file, without the appropriate software and the key.
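
The encryption method described in the episode (adding a small number to every byte) can be sketched in a few lines. The class and method names below are assumptions made for illustration, not Daphne's actual code; the sketch assumes that values wrap around modulo 256, which is what Java's narrowing cast to byte does.

```java
final class ByteShift {
    /** Adds delta to every byte; the cast to byte wraps values around modulo 256. */
    private static byte[] shift(byte[] data, int delta) {
        byte[] result = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            result[i] = (byte) (data[i] + delta);
        }
        return result;
    }

    static byte[] encrypt(byte[] plain, int key) {
        return shift(plain, key);
    }

    static byte[] decrypt(byte[] cipher, int key) {
        return shift(cipher, -key); // the same key undoes the shift
    }

    public static void main(String[] args) {
        // The four bytes with which a Java serialization stream starts (see Sect. 15.2.3).
        byte[] plain = {(byte) 0xAC, (byte) 0xED, 0x00, 0x05};
        StringBuilder hex = new StringBuilder();
        for (byte b : encrypt(plain, 7)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // prints "B3 F4 07 0C"
    }
}
```

If the whole file is shifted in this way, its first bytes no longer match any known file signature, which is consistent with a format recognition tool reporting only generic information about it.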

15.2 Technical Background

Here we provide information about the process of format recognition (in Sect. 15.2.1), then we discuss preservation-friendly file formats (in Sect. 15.2.2), and finally object serialization and storage (in Sect. 15.2.3).

15.2.1 Format Recognition: JHOVE

In general, a file format is a standard way for encoding and storing information in a computer file, i.e., it specifies how bits are used to encode information in a digital storage medium. We could distinguish formats as proprietary or free. We could also distinguish them as being published or not. Published file formats have a published specification that provides the required details. A commonly used method to specify (or recognize) the file format of a file is through the extension of its name (recall Sect. 5.2.3). Another way is to include (look for) information about the file format inside the file. One widely adopted approach is to place particular binary strings in specific locations in the files, usually at the beginning, an area that is usually called file header or magic number (more details about this can be found in Sect. 5.2.4).

As regards tools for format recognition, JHOVE (JSTOR/Harvard Object Validation Environment) is a format-specific digital object validation API written in Java. It can analyze documents to check whether they are well-formed (i.e., consistent with the basic requirements of the format). The supported formats include AIFF, ASCII, Bytestream, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAV, and XML. JHOVE is available for downloading, licensed under the LGPLv2, and it can run on any platform that supports Java. The Open Preservation Foundation took over stewardship of JHOVE in February 2015. JHOVE and other format recognition tools are exploited by the system PreScan that was described in Sect. 5.2.6.
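
Recognition by magic number can be illustrated with a minimal sketch that compares the first bytes of a file against a few well-known signatures. The class name and the small signature table are assumptions chosen for illustration; this is not how JHOVE is implemented, and a real tool would know many more formats and would also validate the rest of the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class SignatureSniffer {
    // A few well-known signatures found at the very beginning of files.
    private static final Map<String, byte[]> SIGNATURES = new LinkedHashMap<>();
    static {
        SIGNATURES.put("PDF", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("GIF", "GIF8".getBytes(StandardCharsets.US_ASCII));
        SIGNATURES.put("PNG", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        SIGNATURES.put("JPEG", new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF});
    }

    /** Returns the name of the first format whose signature matches, or "unknown". */
    static String guessFormat(byte[] header) {
        for (Map.Entry<String, byte[]> entry : SIGNATURES.entrySet()) {
            if (startsWith(header, entry.getValue())) {
                return entry.getKey();
            }
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] header = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guessFormat(header));
    }
}
```

In practice the header bytes would be read from the beginning of the file under examination; a file whose first bytes match no entry of the table is reported as unknown regardless of its extension.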

15.2.2 Preservation-Friendly File Formats

As stated before, a file format defines a standard way for encoding information in digital objects. It specifies how symbols (i.e., bits) are used for encoding information in a storage medium and has a direct impact on our ability to read the contents of our files in the future. Apart from some particular file formats that are “digital preservation-friendly” like PDF/A (ISO 19005-1:2005), there is no clear statement about which of

Fig. 15.1 Marshalling and unmarshalling

them should be used for such purposes. Although there are no standard guidelines for selecting a durable file format for archiving purposes, there are some best practices for selecting a file format proper for digital preservation purposes. Specifically, the Library of Congress has published (and revises every year) a Recommended Formats Statement (Library of Congress, 2018) with the purpose of maximizing the chances of survival and continued accessibility of creative content well into the future.

15.2.3 Object Serialization and Storage

Object serialization is the process of translating a data structure into a format that can be stored and exchanged (e.g., a file). The file is created by serializing the data structure (or the object), using a process that is usually called marshalling. Afterwards, the object can be re-created using the reverse process that is usually called unmarshalling. The process is sketched in Fig. 15.1. Serialization can be considered as the reverse direction of parsing (as described in Sect. 6.2.4). Figure 15.2 shows two code snippets that demonstrate how such tasks can be performed using the Java programming language, i.e., the code used for saving a Java object to a file and the code needed for loading it from a file.

Fig. 15.2 Saving and loading Java objects from a file
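The same marshalling and unmarshalling round trip can be sketched in Python with the standard `pickle` module. This is an analogue of the Java snippets, not a reproduction of them; the helper names and the sample contact data are assumptions made for illustration.

```python
import pickle


def save_object(obj, path):
    """Marshal (serialize) an object into a binary file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Unmarshal (deserialize) the object stored in a binary file.

    Only do this with files you trust: unpickling can execute code.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    # Hypothetical sample data, invented for this example.
    contacts = [{"name": "Daphne", "phone": "555-0100"}]
    save_object(contacts, "contacts.bin")
    print(load_object("contacts.bin"))
```

As with Java serialization, the resulting file is tightly coupled to the language runtime that wrote it, which is the preservation risk this chapter illustrates. The warning in `load_object` also matters for a stick of unknown origin: a serialized file from an untrusted source should be inspected, not loaded.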

15.3

Pattern: Proprietary Format Recognition

Pattern ID: P12
Problem’s name: (Proprietary) Format Recognition
The problem: Robert tries to recognize the format of the file MyContacts.con. However, the extension and the format of the file are unknown and he cannot find the appropriate information/application to open it
Type of digital artifacts: Digital object with proprietary or unknown format
Task on digital artifacts: Recognize, view contents
What could have been done to avoid this problem: If the file had been stored according to a well-known format, or even better, according to an open standard, then the file could be opened easily. If, for example, the file was stored as UTF-8 encoded text (it could be a CSV or XML file), its contents could be reviewed easily with a simple text editor (like Notepad)
Lesson learnt: The creation of custom file types is not good practice for digital preservation purposes, because such files are tightly coupled with the application that created them. Furthermore, the extension of a file is just an indication about the type of that file and does not necessarily correspond to its actual type
Related patterns:
Previous:
• P3 (Text and symbol encoding)
• P4 (Provenance and context of digital photographs)
• P5 (Interpretation of data values)
• P6 (Executables: safety, dependencies)
Next: –

15.4

Questions and Exercises

1. Search for tools that can read files with extension “.con”.
2. Change the extension of the name of one of your files, e.g., change a “.doc” to a “.bmp”. Then try to open that file by a double-click. Then try to open it with MS Word.
3. How many widely known filename extensions (file formats) exist?
4. Was there a computer virus that changed the extensions of files?
5. How can you understand that a binary file contains the serialization of Java objects?
6. Try to read the contents of the file MyContacts.con (found in the USB stick). Can you identify what type of information it includes? Can you say something about the values of the objects it contains?
7. Try to read the contents of the file MyContactsNonDeflate.con (found in the USB stick) and answer the same questions as before.

15.5

Links and References

15.5.1 Readings

About Object Serialization in Java
• Java Object Serialization Specification. https://docs.oracle.com/javase/7/docs/platform/serialization/spec/serialTOC.html

About BBC Domesday project
• http://www.bbc.co.uk/history/domesday

About File Formats and Digital Preservation
• ISO 19005-1:2005 Electronic document file format for long-term preservation, Part 1: Use of PDF 1.4 (PDF/A-1). (https://www.iso.org/standard/38920.html).
• Library of Congress: Recommended Formats Statement (2017–2018). (https://www.loc.gov/preservation/resources/rfs/).

15.5.2 Tools and Systems

About Format Recognition
• JHOVE. http://jhove.sourceforge.net
• JAXB. http://jaxb.java.net
• PreScan. http://www.ics.forth.gr/isl/PreScan

About File Formats
• PRONOM. https://www.nationalarchives.gov.uk/PRONOM/

Chapter 16

The File SecretMeeting.txt: On Authenticity Checking

16.1

Episode

It’s 11 am, probably the most intense hour in Robert’s office: phones keep ringing and he has several short meetings and teleconferences. The contest is just one of Robert’s concerns. MicroConnect is currently expanding into the field of artificial intelligence, and Robert struggles to properly staff the new department of the company. The secretary informs him that a scheduled meeting has been canceled. Robert seizes the opportunity to return to the mysterious stick.

In a folder /toremember he encounters a file named SecretMeeting.txt. He opens it with a text editor. The file contains the text shown in Fig. 16.1.

Robert has no clue as to who wrote this text. Was the mysterious girl the author or someone else? He also has no idea about the credibility of the written message. Has a real meeting been scheduled for 1/1/2020, or is it just a joke or a fake message? Robert wonders how he could check the validity of the contents. Certainly, it would be difficult to contact Lady Gaga about this, and even if he did so, Lady Gaga would probably be reluctant to verify anything. After all, the file name indicates that the meeting is secret. Does the string “Lady Gaga” refer to the famous singer and songwriter Lady Gaga? It could be the nickname of a different person with whom the mysterious girl has indeed arranged a private meeting.

Robert also considers that the first sentence, “No man has ever walked on moon,” is somehow problematic. How could one ever prove this? One could, at most, prove that one particular photograph or video from the expedition to the moon is fake or retouched. It is, however, impossible to prove that no person has ever visited the moon, in the sense that we cannot exclude the existence of various secret expeditions to the moon. Robert decides that although he believes that man has visited the moon, and there is evidence of that, it is indeed very difficult to assess the authenticity of the videos that have been circulated.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_16

No man has ever walked on moon
Meeting on 1/1/2020 at Lady Gaga’s house. She will hand me evidence about that.

Fig. 16.1 The contents of the file SecretMeeting.txt

While he is thinking all this, he receives a strange email. It is from the accounting department of MicroConnect. It is untitled and its body contains only one link to a PDF file. The email is unsigned, although emails coming from the accounting department always contain a digital signature. He decides to delete it as it could be malicious. This makes Robert think about the wider problem of authenticity.

16.2

Technical Background

This section introduces the issue of authenticity and describes in brief the various related technologies, e.g., checksums, digital signatures, web authentication, cryptography (in Sect. 16.2.1), and then it discusses processes for assessing authenticity (in Sect. 16.2.2), and copyright and licensing issues (in Sect. 16.2.3).

16.2.1 Technologies Related to Authenticity

In general, authenticity refers to the quality of being genuine or not corrupted from the original, and to the truthfulness of origins, attributions, commitments, sincerity, and intentions. One approach for facilitating the assessment of the authenticity of digital artifacts is to add to them extra information, e.g., for ensuring that the information content (sequences of bits) has not been altered in an undocumented manner (e.g., checksums), and for checking the author of the encoded information. Of course, this extra information should be authentic too, otherwise it is useless. This also raises the fundamental problem of information identity, an issue that is elaborated in Sect. 18.2.9. Below we describe in brief some technologies that are in use for this purpose.

16.2.1.1

Checksums

A checksum (or hash sum) is a datum related to a digital artifact, computed by an algorithm (usually called checksum function or checksum algorithm), and is used for detecting errors that may have occurred during storage or transmission of the digital artifact. If the checksum computed for the current data matches the stored value of a previously computed checksum, then this can be considered as evidence that the data has not been altered or corrupted. There are also error-correcting codes that rely on special checksums; these, apart from detecting errors, in certain cases allow the original data to be recovered. Checksums are mainly used for checking integrity. They cannot be used for verifying authenticity. They are broadly used when data are transferred through networks.

For example, let’s assume that Robert wants to check the integrity of an ISO file that he uses for the installation of the VM Virtualizer software he wants to use (as we have described in Chap. 9), and suppose that VM Virtualizer is the open source virtualization tool VirtualBox. The official website of VirtualBox provides two checksum values (SHA256 and MD5) for each version available for downloading. Each value is based on a different hashing algorithm. As Robert visits the official website to get these values, he notices a footnote that says: “The SHA256 checksums should be favored as the MD5 algorithm must be treated as unsafe!” Nevertheless, he decides to check both values. Finally, he finds that the checksum values for his downloaded version (VBoxGuestAdditions 5.0.8) are the following:

SHA256: a6ab45a043e460c87f145ec626c1cc2a1ae74222f15aa725004cdc84bf48e9f3
MD5: 28aa52d82296604e698e281a33cdaa3d

The only thing Robert has to do to verify that the downloaded file is correct, and that it has not been corrupted or modified, is to calculate the checksum value of the downloaded file on his local machine and then to compare it with the value from the website. Of course, there are a lot of tools that can calculate the checksum of a file, but Robert uses a program that one of his friends had given him several years ago. He runs the code. The checksum values (SHA256 and MD5) reported by his program are the same as those on the website. Now he is confident in the integrity of the downloaded ISO file. This approach is adopted by many applications nowadays for checking the integrity of resources. For example, Maven (discussed in Sect. 10.2.3) uses SHA1 and MD5 checksums for ensuring that the downloaded dependencies of a Java application are valid.
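A program like the one Robert runs could look roughly like the following Python sketch, which uses the standard `hashlib` module. The function names and the chunk size are assumptions made for illustration; the story does not say how his program is written.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute the hex digest of a file, reading it in chunks
    so that large ISO images do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path, published_hex, algorithm="sha256"):
    """Compare the locally computed checksum with the published value."""
    return file_checksum(path, algorithm) == published_hex.strip().lower()
```

Robert would call `matches_published("VBoxGuestAdditions_5.0.8.iso", "<value from the website>")`, where the file name is hypothetical. A match shows that the local copy equals the file the website describes; it says nothing about who published the checksum, which is why the text stresses that checksums address integrity, not authenticity.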

16.2.1.2

Digital Signatures

Digital signatures are often employed for authenticity reasons in various digital artifacts (documents, messages, software, any bit string in general). The digital signature of an agent on a digital artifact (e.g., a message) is used as evidence that (1) the artifact was created by a known agent (authentication), (2) the agent cannot deny it (nonrepudiation), and (3) the artifact was not changed in transit (integrity). Digital signatures are also related to electronic signatures, which, in some countries, have legal significance.

30 82 01 0a 02 82 01 01 00 de bb 35 05 21 e3 df 51 a1
f7 67 9d d8 12 4d f2 46 df 89 33 50 25 69 c3 f6 9f d3
85 d0 10 00 25 b3 07 63 8d e3 9f 69 ff 79 4b e6 a6 a0
8d 10 89 5f c5 dc 63 89 f2 62 86 8c ba b2 8a b3 aa 0b
5c 13 41 3b 1e d6 65 92 bc ac 15 f5 89 0d 16 22 d4 93
27 85 b7 5d 09 bf 1e c6 1a 6c 5d a7 70 70 a9 89 76 ef
ac 1b b8 2f fc 89 c3 16 65 cc f5 d5 41 10 4a 50 d2 86
d5 e2 59 ff 10 14 df 4e 0c 4b 21 f3 9e 94 b3 7e 73 56
28 43 3e 7e 0f 6e 4e a5 68 82 2b cc 9c 52 f9 bc 50 b8
15 0b 38 f9 0b 24 d8 f9 ba 61 3c 6e f6 9f aa 43 df 1a
e1 89 26 c5 5e 3b 3f 14 51 3f 83 b5 c0 7e b1 19 99 ac
ab 2d c7 4d ae b6 7b 82 51 0e f4 07 5f b1 36 69 e4 9f
a3 08 df d0 05 e5 b5 29 92 53 ca e1 78 50 c0 f1 54 45
93 b7 c8 1f 03 13 83 36 7f 99 b9 42 00 6d 9d 55 c9 44
0f 7c 9a d4 d5 fb e1 d6 df af 3f 94 31 02 03 01 00 01

Fig. 16.2 Robert’s public key in RSA (2048 bits) format

Digital signatures employ asymmetric cryptography and they essentially rely on three algorithms: (1) an algorithm that outputs the private key and a corresponding public key (the algorithm selects a private key uniformly at random, from a set of possible private keys); (2) an algorithm that takes a given message (a digital artifact in general) and a private key as input, and produces a signature; and (3) an algorithm for verifying a signature: it takes the message, the public key, and the signature as input, and checks if the signature is correct (i.e., whether the message is authentic). The objective of all such schemes is to make it computationally very expensive to produce a valid signature for an agent without knowing the private key of that agent.

Below, continuing our example about the strange email that Robert receives, we describe how the accounting department of MicroConnect digitally signs an email in order to send it to Robert. First of all, to send and to receive a digitally signed email, you must have a digital certificate. This certificate allows the secure exchange of information using the public key infrastructure. For instance, Robert’s certificate is a file (.crt) with size 1.31 KB. It is valid for one year and was issued by MicroConnect. The signature algorithm used is SHA256, and we can see the public key of his certificate in Fig. 16.2.

Both Robert and the accounting department (AD for short) know each other’s public key. AD creates its email and also a “fingerprint” or a digest of that email, then encrypts that digest with its private key. Finally, the email, the encrypted digest, and the public key of the AD are sent to Robert. Let’s suppose that AD wants to send the following email to Robert:

Hello Robert!

The digest of that email using the MD5 (128-bit) algorithm is the following:

13 cf f3 af 7c 73 24 9a 41 4e 0d b8 58 10 f. 66

First, Robert creates the fingerprint of the email that he received. Then he takes the encrypted digest and decrypts it using AD’s public key. Now Robert has two fingerprints; if they match, he is sure that the email is from the AD and that it has not been changed.
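The sign-then-verify exchange between AD and Robert can be sketched with textbook RSA in plain Python. This is a toy under loud assumptions: the key is built from two publicly known Mersenne primes, so it offers no security at all; SHA-256 is used for the digest instead of MD5; and the padding that real signature schemes require is omitted. All names are chosen for illustration.

```python
import hashlib

# TOY KEY, for illustration only: both primes are well-known Mersenne primes,
# so anyone can factor N. Never use parameters like these in practice.
P = 2**107 - 1
Q = 2**127 - 1
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def digest_as_int(message: bytes) -> int:
    """Fingerprint of the message, reduced so that it fits below N."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """AD's side: transform the digest with the private key."""
    return pow(digest_as_int(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Robert's side: recover the digest with the public key and
    compare it with the digest he computes himself."""
    return pow(signature, E, N) == digest_as_int(message)
```

With `sig = sign(b"Hello Robert!")`, the call `verify(b"Hello Robert!", sig)` succeeds, while changing a single character of the message makes it fail. In practice one would rely on an established cryptographic library and certificate infrastructure instead of arithmetic like this.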

16.2.1.3

HTTPS

HTTPS can be considered a “secure version” of HTTP. It is essentially a protocol for secure communication over a computer network that is widely used on the web (e.g., in all financial transactions). The motivation for HTTPS is authentication of the visited website and protection of the privacy and integrity of the exchanged data, i.e., it checks the authenticity of the visited web server and provides bidirectional encryption of communications between the client (the web browser) and the server (web application). HTTPS is encrypted by Transport Layer Security (TLS) or, formerly, its predecessor Secure Sockets Layer (SSL). The protocol is therefore also often referred to as HTTP over TLS or HTTP over SSL.
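The server-side check that HTTPS performs can be observed from Python's standard `ssl` module. The sketch below is a minimal client under assumptions: the function names are invented for illustration, and the second function needs network access to a host of the reader's choosing.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Create a client context that verifies the server certificate
    and checks that it matches the requested host name."""
    return ssl.create_default_context()


def describe_https_server(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Open a TLS connection and report what was negotiated.

    The handshake raises an ssl.SSLError subclass if the server
    cannot prove its identity.
    """
    context = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            cert = tls_sock.getpeercert()
            return {
                "protocol": tls_sock.version(),
                "subject": cert.get("subject"),
                "not_after": cert.get("notAfter"),
            }
```

Note that only the server presents a certificate here; nothing in this exchange identifies the person using the client.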

16.2.1.4

Web Server-Side Authentication and Client-Side Authentication

HTTPS provides authentication for the server, not for the client. Adding client-side authentication results in what is usually called mutual authentication. It refers to two parties authenticating each other at the same time. In the context of the web, it is often referred to as website-to-user authentication, or site-to-user authentication. Technically, mutual SSL is SSL with the addition of authentication and nonrepudiation of the client authentication, using digital signatures. The server requests the client to provide a certificate in addition to the server certificate issued to the client. In addition, the client must buy and maintain a digital certificate. There are some authorities that provide free certificates, but these are for a limited period of time. Nowadays, mutual authentication is used when extra security is required, e.g., in financial transactions between organizations. Mutual authentication is not used in most web applications because of the extra cost that users would have to pay for buying digital certificates (many organizations require certificates only from trusted authorities) and the maintenance effort to keep them in their web browsers, etc.

16.2.1.5

Quantum Cryptography

One application of quantum cryptography is quantum key distribution (QKD). It uses quantum communication to establish a shared key between two parties without a third party learning anything about that key, even if the third party can eavesdrop on all communication between the two parties. With quantum cryptography it is impossible to copy data encoded in a particular quantum state, in the sense that the act of reading data encoded in a quantum state changes the state. This is used to detect eavesdropping in quantum key distribution. Quantum cryptography was first proposed in 1984, but it took decades to bring the concept to market. There are already a few companies that sell QKD products. One of these products has been used to secure elections in Geneva (see the related reference).

16.2.1.6

Bitcoin

We can also mention bitcoin as an example of a decentralized trust system that relies on cryptography. Bitcoin (Nakamoto 2008) is a decentralized virtual currency. It can be used for buying products and services. It is a peer-to-peer payment system that functions without a central repository or single administrator. The transactions are verified by the network of nodes and are recorded in a public distributed ledger. Users can offer their computing power to verify and record payments (this is called mining) and receive newly created bitcoins as a reward. This technology relies on private–public key cryptography: payers should digitally sign their transactions, using the corresponding private key, while the network nodes should verify the signature using the public key. The underlying technology is Blockchain (Swan 2015) and is currently in use for various applications (not only for virtual currencies). The importance of preservation is evident from the following story: a user once claimed that he lost bitcoins that were worth millions of dollars when he discarded a hard drive containing his private key (see the references). It is therefore clear that digital preservation is crucial for digital currencies. The inverse is also true, i.e., decentralized trust systems could offer valuable services for assessing the authenticity of digital material.
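One reason a ledger of this kind is useful as evidence is that each record is bound to the hash of the previous one, so a later change to an old record becomes detectable. The Python sketch below illustrates only that hash-chaining idea; it is not Bitcoin's data format, and it leaves out mining, digital signatures, and the peer-to-peer consensus. All names and the record layout are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, prev_hash, payload):
    """Hash the fields of a block in a canonical (sorted-key) encoding."""
    record = json.dumps(
        {"index": index, "prev": prev_hash, "payload": payload}, sort_keys=True
    )
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Add a block that commits to the hash of the block before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    index = len(chain)
    chain.append({
        "index": index,
        "prev": prev_hash,
        "payload": payload,
        "hash": block_hash(index, prev_hash, payload),
    })


def chain_is_valid(chain):
    """Recompute every hash and check every link."""
    prev_hash = GENESIS
    for index, block in enumerate(chain):
        if block["index"] != index or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(index, prev_hash, block["payload"]):
            return False
        prev_hash = block["hash"]
    return True
```

If the payload of an early block is edited after the fact, `chain_is_valid` returns `False` unless every later hash is recomputed as well, which is what a distributed network of verifiers is meant to make impractical.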

16.2.2 Processes for Authenticity Assessment

The assessment or verification of authenticity can be complex; therefore, in most cases it is not a single action but a process. This includes checking the entire provenance of the digital artifact at hand, and checking the extra information about authenticity that may be available.

Furthermore, apart from technical evidence, there is also nontechnical evidence. The latter is based on the reputation and trust of the people that are responsible for the digital artifact, as well as on the trust of the “communication channel” that has been used. Such evidence is not technical, in most cases it is not rigorous, and therefore it could be elusive. However, this is actually the main method that we use in our daily life. For example, whenever we receive an email from a friend or collaborator, we tend to trust the email and its contents (it could contain plain text and/or various attachments like documents, data, or software). The reason for trusting the email is not because we have read our emails through a web server that uses HTTPS (because HTTPS does not ensure the authenticity of the sender, e.g., one malicious person could have stolen the username and password, or any other kind of credentials, of our friend). We trust the email because we recognize the “style” of the messages that the sender has, and because in the future we will have the chance to meet the sender of the message face to face and/or through other exchanges (emails, phone calls), and these forthcoming meetings will provide extra evidence about the authenticity of the received email. If, however, we receive an email that is a bit unexpected (in style and/or contents), then that will make us perform an extra check, e.g., to send an extra email, to make a phone call, or to ask one common friend or collaborator. In general, we can say that there is a kind of implicit trust network in our social life that we actually exploit for assessing (a priori and a posteriori) the authenticity of our exchanges. We could generalize and state that a more general network is defined that comprises nodes that correspond to human and artificial agents (systems) and edges that correspond to communication channels. 
As individuals, we know a fraction of this network, and in our mind we have implicitly assigned some probabilities to the trust of the known nodes and edges. Whenever we receive information from this network, we somehow probabilistically analyze the relevant part of the network and we estimate the trust of the received message.
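The informal estimate described above can be made concrete with a small model. The sketch below is one possible reading, not a method proposed in the text: it assumes that the trust values of nodes and edges are independent probabilities, multiplies them along a path, and takes the most trustworthy path from the sender to the receiver. The names and the example values are hypothetical.

```python
def best_path_trust(edges, node_trust, source, target):
    """Return the highest product of node and edge trust values over all
    simple paths from source to target (0.0 if target is unreachable).

    edges: dict mapping a node to {neighbor: trust of that channel}
    node_trust: dict mapping a node to the trust assigned to that agent
    """
    best = 0.0

    def visit(node, trust, seen):
        nonlocal best
        trust *= node_trust.get(node, 0.0)
        if trust <= best:
            return  # values are at most 1, so this path cannot improve
        if node == target:
            best = trust
            return
        for neighbor, edge_trust in edges.get(node, {}).items():
            if neighbor not in seen:
                visit(neighbor, trust * edge_trust, seen | {neighbor})

    visit(source, 1.0, {source})
    return best


if __name__ == "__main__":
    # Hypothetical example: a friend can reach us by email or by phone.
    node_trust = {"friend": 1.0, "email": 0.5, "phone": 1.0, "me": 1.0}
    edges = {
        "friend": {"email": 1.0, "phone": 0.5},
        "email": {"me": 0.5},
        "phone": {"me": 1.0},
    }
    print(best_path_trust(edges, node_trust, "friend", "me"))
```

Other choices are equally defensible, such as combining the evidence of several paths instead of taking the best one, which would correspond to the extra phone call or email mentioned in the text.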

16.2.3 Copyright and Licensing

One key operation for digitally preserving a file is to copy it. In fact, every time we want to use a file, we have to copy it; in order to display the contents of a file, we have to copy its contents from the storage medium (local or remote) to the RAM of the computer and then visualize the contents. From a legal perspective, copying is known as “reproduction” and it’s one of the exclusive rights of the copyright owner. Copyright refers to the legal right that protects the owner of a work. Copyright law lays out a framework of rules around how a work can be used, what the rights of the owner are, and what the responsibilities of the persons that use it are. Despite its value in terms of protecting intellectual property rights, copyright poses several issues when it is required to preserve digital objects. The main reason for this is that copyright legislation in many countries was not designed with the digital environment in mind or is outdated with respect to the digital era. In addition, there are updated regulations that provide exceptions for libraries, archives, and preservation institutions, but they have limitations that prove insufficient. For example, some laws allow libraries to make one or only a few copies of a work for preservation. These issues affect digital libraries. As the Encyclopedia Britannica puts it:¹ “When libraries do not own these resources, they have less control over whether older information is saved for future use, another important cultural function of libraries. In the electronic age, questions of copyright, intellectual property rights, and the economics of information have become increasingly important to the future of library services.” In this light, the European Union started modernizing the copyright rules for the digital age. The revised rules consist of a new copyright directive that includes copyright exceptions for text and data mining, teaching activities, and preservation of cultural heritage works.

16.3

Pattern: Authenticity Assessment

Pattern ID: P13
Problem’s name: Authenticity Assessment
The problem: Robert cannot check the authenticity of the document. He has no information about the author of the document, or about the validity of the information that is contained in the document
Type of digital artifacts: Text
Task on digital artifacts: Understand the semantics and the context, assert authenticity
What could have been done to avoid this problem: If every file was digitally signed, then we would be able to verify the author of each file. But even if the file in our story was digitally signed, that would not be enough for understanding who the person that is referred to as “Lady Gaga” is. The full provenance is also important for assessing the authenticity
Lesson learnt: In general, it is hard to assess the authenticity of digital objects. However, there are several technologies (i.e., digital signatures, secure protocols, authentication mechanisms, cryptography, decentralized trust systems) that can be exploited for providing evidence about the authenticity of digital artifacts. Of course, the problem of authenticity is not only technical
Related patterns:
Previous:
• P4 (Provenance and context of digital photographs)
Next:
• P14 (Preservation planning)

1 https://www.britannica.com/topic/library

16.4

Questions and Exercises

A) Technical Exercises
1. Find a checksum service or tool (e.g., CertUtil, cksum) and compute the checksum of one of your files.
2. Search for websites that offer software for downloading and also provide a checksum value for each of the downloadable software.
3. Create your own email certificate.
4. Find how you could acquire your own digital signature and whether it has legal implications.
5. Find how in your web browser you could add a digital certificate for user authentication.
6. Find real-world cases where quantum cryptography is in use.
7. Install a Bitcoin Wallet (https://bitcoin.org/en/choose-your-wallet) and mine one or more bitcoins. Try using them to buy something.

B) Exercises Related to Nontechnical Evidence
1. Send an email to one of your friends that has a quite normal and expected content (e.g., an email asking him/her to bring you a coffee or do something that you often request from him/her) but “add” a lot of misspellings to your email. Check whether your friend responds normally.
2. Do the same (as in the previous exercise) but instead of misspellings use a different natural language (e.g., French) that your friend knows. Check whether your friend responds normally.
3. Now send an email using the backward technique. For example, instead of “Hello John” send: “nhoJ olleH”. Check whether your friend responds normally.

16.5

Links and References

16.5.1 Readings

About Authenticity
• Alliance for Permanent Access to the Records of Science Network (APARSEN). (2012). D24.1 Report on authenticity and plan for interoperable authenticity evaluation system (urn:nbn:de:101-20140516151). https://doi.org/10.5281/zenodo.1256510

About Bitcoin and Digital Preservation
• Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system.
• Man throws away 7,500 Bitcoins, now worth $7.5 million. CBS DC. 29 November 2013. Retrieved January 23, 2014.
• Swan, M. (2015). Blockchain: Blueprint for a new economy. O’Reilly Media, Inc.

About Copyright Issues
• European Commission. Directive of the European Parliament and of the Council on Copyright in the Digital Single Market – COM(2016)593
• Muir, A. (2003). Copyright and licensing issues for digital preservation and possible solutions. In ELPUB.
• Carroll, M. W. (2012). Copyright and digital preservation: The role of open licenses. In Digital preservation 2012, Arlington, VA.

16.5.2 Tools and Systems

About Checksums
• CertUtil: A pre-installed Windows utility that is able to generate and display a cryptographic hash over a file.
• Cksum: A command in Unix-like operating systems that generates a checksum value for a file or stream of data.

About HTTPS
• HTTPS: https://tools.ietf.org/html/rfc2818
• TLS: https://tools.ietf.org/html/rfc5246
• SSL: https://tools.ietf.org/html/rfc6101

About Digital Signatures
• Microsoft Outlook offers the ability to send and receive digitally signed messages.
• Thunderbird also supports the same functionality.

About Applications of Quantum Cryptography
• Stucki, D., Legre, M., Buntschu, F., Clausen, B., Felber, N., Gisin, N., Henzen, L., Junod, P., Litzistorf, G., Monbaron, P., & Monat, L. (2011). Long-term performance of the SwissQuantum quantum key distribution network in a field environment. New Journal of Physics, 13(12), 123001.

Chapter 17

The Personal Archive of Robert: On Preservation Planning

17.1

Episode

This chapter has been co-authored with Yannis Kargakis.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_17

May 28

The weekend has arrived and Robert is reflecting on the problem while relaxing at home. He is still thinking about the files on the USB stick; although they are in digital form, meaning that they could be replicated and distributed very easily, they are almost useless to him since he can neither understand them nor use them. Soon he realizes that the same issue could happen after several years with his own files. He remembers reading an article by Eric Schmidt, the CEO of Google, noting that every 2 days we create as much digital information as we did from the dawn of civilization until 2003. He starts worrying about his digital archive. Will he be able to use the applications he developed several years ago? Even if he could use them now, is it guaranteed that they will remain functional in the future, and that his children and grandchildren will still be able to run them?

He is already convinced that keeping backups of his entire file system is not a panacea. Keeping copies certainly protects his data from hardware failures and natural disasters; however, this does not ensure the runnability of his applications or the intelligibility of his documents, nor does it protect them from the possible obsolescence of various file formats. “One good thing could come out of this. I should do something better with my own files,” Robert thinks. He decides to arrange his digital heritage into categories and then to reflect on what he should do with each category.

The first category he identifies is photos and videos. “For the moment they are safe, besides I have already stored them in the cloud,” he thinks. However, he cannot
be sure that the file format of his photos and videos will not become obsolete in the future. At that time, another more stressful thought crosses his mind. “What will happen if the company that hosts my media goes bankrupt or decides to shut down the service? In such a case, I will probably lose my photos and videos. I doubt I would even receive some sort of compensation, although nothing would be good enough for such a loss.” The cloud hosting service he is using is GlobalDrive, a service provided by the company he was directing. However, in that moment, he rationalizes as an end-customer. He starts anxiously searching for the terms of use. He finds them and reads them impatiently. “It does not mention anything about compensation! I should think again about this,” he asserts. The second category that he identifies contains various documents and reports in different formats, including doc, docx, rtf, pdf, txt, tex documents, and many more. Things are becoming more intricate here. “For sure, WordEditor is widely used today, but would that persist 20 years from now? Should I also produce a PDF version for each WordEditor document?” Robert realizes that the structure of the folders is also important. They should allow him to easily locate a file now and in the future. Moreover, the structure is vital for understanding the context of these documents and thus for interpreting their content. For instance, his notes contain names, places, but without including the context they could become ambiguous in the future. He remembers that when he acquired his first cell phone he added his contacts in it by entering their name, surname, and phone number. After some years, he started encountering contacts for which he could not remember the corresponding person. Following that, he decided to keep more complete information about his contacts. 
Since then, he has never failed to record their profession and a note for aiding his memory (e.g., the name of a common friend, a related event, etc.). The same problem could occur with his digital files. In general, the content of the files is related to the contents of other files, so they all form a graph of dependencies between documents and other digital artifacts (images, emails, applications). This graph should be in some way preserved. For privacy reasons, Robert was not storing his documents, reports, and archived emails in the cloud. Even though there were several options offering encryption as well, he was not using any of them. When such services emerged, Robert was too busy or just lazy to test them. Robert is still not using them because he is not entirely sure that these services don’t have a backdoor somewhere in their software. The third category that he identifies is software and applications. The plot thickens here. Most of them have a lot of dependencies to compile and run properly. One approach would be to preserve the source code and the executables of his applications; however, in this case he should also care about the environment variables that are required (which operating system, which version, etc.) and also their proper documentation, which in many cases is missing. Moreover, many of his applications have particular hardware requirements and need specific software drivers, and this is also something that he should consider. However, he knows that many of these are no longer supported by their manufacturers. He has already come down with a headache. At this time, he remembers a visit from Peter, a classmate from high school, a couple of months ago.


They hadn’t met for years, but it took Robert only two minutes to realize that Peter had turned to activism. Peter had visited Robert to ask him to sponsor an event about the protection of the environment. He still remembers the words Peter passionately uttered: “Robert, please see what is going on. To maintain demand, products are designed to break or get obsolete quickly. This leads to pollution, waste of natural resources, and malevolent exploitation of science and technology. Can a new discovery go into production if it risks the turnover of a big market? Could we ever have an ever-burning lamp or an inexpensive and effective way to deal with a severe illness? We have to focus on quality, we have to make good and long-lasting products, not just competitive products for fast consumption. Perhaps we need a new economic model that is ecology-conscious, eliminates the fear of unemployment, and gives good incentives for creation, novelty, and quality. Robert, someone with your authority and influence could help us.” Robert agreed that this issue is important and replied to Peter: “But please, take into account that people buy new computers because the new ones are much faster than the previous ones, and with the new ones they can do things that were impossible with the older ones. It is not planned obsolescence in our sector, it is Moore’s law.” Since this answer did not seem to quite satisfy Peter, Robert started asking questions about the planned event and eventually he made a brave sponsorship. However, Peter’s words stayed with him for a long time. Since then, he has been following the issue with more interest. He has studied cases of “abandonware” and has positively acknowledged the recent EU regulations for mandatory support by the manufacturers for the first 2 years. Robert continues to scour his folders. He realizes that for various documents or applications he has kept not only the latest version but also past versions. 
“I will keep the old versions of my documents. They allow me to trace how their final version was derived,” he thinks. He decides to follow the same approach for his applications. “After all, not all applications are backwards compatible, so I have to keep the old versions,” he concludes. During this browsing, he encounters files that seem not to be very important to him, like impromptu notes from various meetings. After every meeting, he used to spend some time to keep notes in a document for further discussion, or to write down ideas that had come up. “I cannot delete them. This is a kind of diary,” he realizes. To take his mind off these persistent thoughts, he is led to gaze outside the window. He observes the slow rustle of the flowering pomegranate tree in his garden. He understands that this procedure would have been much easier and more efficient if he had considered the issue of preservation from the beginning. He realizes that, whenever possible, he should rely on open and widely used standards for increasing the probability of not losing his digital heritage. It would be much easier to annotate the important and the not-so-important files at the time of their creation. The contents of his personal computer occupy approximately 1.5 terabytes. By a rough estimate of the extra information he has to preserve, he concludes that he will need more than 2 terabytes of storage space. If he decides to keep at least two copies of them, then that means 4 terabytes in total. He does not want to use hard disks, because they are error-prone and can be easily corrupted. On the other hand, other types of magnetic storage media are rarely used, and most probably they will not be


in use in the future. Another safe solution is to use optical disks, since they are more durable and can be stored efficiently. “But I would need more than 850 optical disks! Well, this is not at all practical!” he exclaims. “Compression” comes to his mind, but then he realizes that space saving will be limited. It seems that the optimal solution is to make a selection; he should keep files that are updated frequently somewhere so he can work with them (e.g., on hard disks) and files that are not going to change somewhere else. But again, he should periodically check that all of them are usable. He is very confused. Robert makes a decision about the files he should preserve, but he is unable to decide which is the proper medium for that purpose. He opens the browser on his computer and starts typing the address of his favorite web search engine. He wants to find out about preservation strategies and examples. He stumbles onto an article describing how the annals of the Joseon dynasty survived for centuries. The article relates that the annals have been well preserved as a result of the systematic efforts of dedicated archivists who worked on improving the durability of the texts. They kept different copies in remote mountainous regions to protect them during times of crisis, and stored them in special boxes using medicinal herbs to ward off insects and absorb moisture. Furthermore, they aired them out once every 2 years as part of a continuous maintenance and preservation process. “That process worked well for handwritten archives. This is definitely what I should do for my digital heritage,” Robert thinks. Although he does not trust hard disks, he decides to use them, because they seem to be the most appropriate option in terms of cost-effectiveness, capacity, and lifetime.
Furthermore, in order to increase their lifetime, he decides to write them once and then store them, since spinning the disks up for external access and integrity checks is what increases their temperature, causes wear, and leads to errors. If, in the future, he finds another solution, then he will replicate the contents there. In order to put things in the right order, he decides to engage with one category at a time. And he starts with his applications. Hours pass by, and Robert is committed to replicating the contents of his personal archive onto external hard disks. It is already midnight when he finishes and stares at the hard disks in front of him. He has used seven hard disks for his personal archive, organized into different categories: the first hard disk holds the data from the applications he was using, the second one holds documents and reports from his work, and the third one holds his personal documents, photos, and videos. He also keeps replicas of these disks on three other hard disks. The last hard disk is intended to be used for files that are updated on a regular basis. He decides to store the first three hard disks in the closet, with the commitment to check them periodically and refresh them if needed, to ensure that their contents are still readable. For this reason, he adds a periodic reminder to his calendar that will notify him every 6 months to check them. The last disk is not going to be archived in any closet. On the contrary, Robert is determined to update its contents periodically, since it contains files that are constantly changing. For this reason, he decides to keep this hard disk on his desk so that he can use it whenever he wants.


For some peculiar reason, the sight of the hard disks that contain his personal archive makes Robert feel relieved. He has tried to preserve the contents that matter to him by replicating them and adopting a strategy for checking their validity. He has defined his own personal preservation plan, and this satisfies him, because he knows that his files are safer now. Delighted as he is, he turns off his computer and stores the hard disks in the closet.

Robert’s Blog

At that moment, Robert remembers his blog. Over the past 15 years, he has been maintaining a blog, a kind of personal diary. He wrote a post only when he wanted to express something special that he had not discussed with others or when he felt that his opinion about an issue was very different from those of others. He wrote mainly for himself and he used a pseudonym. Sometimes he would read the posts he had made the same month, the previous year, or the previous decade. This reminded him of the issues that concerned him at that time, as well as those which were the subject of public debate. Of course, in the last 15 years, the world has changed, he has changed, and his blog has been a kind of Captain’s Diary. “I should back up my blog,” he thinks. After connecting to the blog hosting platform, he notices a message alert. According to that message, in 1 month from now the blogging service will be shut down, and all bloggers are requested to copy their content somewhere else. He is shocked. He feels as if he had been told that his personal diary would be deleted in a month. He realizes that he has not kept any copy of his posts. “Let’s check what export options are provided by the platform,” he thinks. He finds an XML export option; he tries it and almost immediately an XML file is downloaded to his computer. He opens the file and sees a structured view of his posts, as well as the comments of the readers, but there is no formatting of any kind. Neither the pictures nor the embedded videos are displayed.
He notices that even the textual content of his posts is difficult to read because various special characters have intruded. Furthermore, he is not able to navigate his posts based on dates or tags. “This is not usable. I have to keep a readable copy in HTML,” he thinks. He starts saving the pages of his blog in HTML using the “Save As” function offered by his web browser. Soon he realizes that this stores only the current page, which contains only ten posts. However, he has over 500 posts. “I cannot do this 50 times,” he thinks. Robert starts searching for the platform setting that determines how many posts are shown per page. He locates that option in the menu and raises the number of posts per page to 1000, hoping that in this way all of his posts will fit into a single HTML page. He visits his blog again, but he realizes that its first page does not contain all of his posts, only 127 of them; the rest are placed on the next pages. “This is not a problem, I just have to save 5 pages for keeping all of my posts.” He does it and mutters, “OK, I’m done.” Before shutting down his computer, he thinks that he should check that the local copies are fine. To this end, he turns off the Internet connection and opens one of the stored HTML pages from the local folder. It looks fine: the text, the formatting, and the images are all rendered correctly. However, by clicking on a tag, instead of getting the list with the posts that have been


marked with that tag, the browser notifies him that the URL address is wrong. “Tag navigation does not work,” he says disappointed. He realizes that the same problem occurs when he wants to navigate his posts by date. The connecting links amongst these pages are not functional. “I can no longer do what I always did: to read posts I have made a year ago, or to see all posts grouped together by a specific topic.” Robert then tries to see if the comments of the readers of his posts are visible. He again receives an error message from his browser. “They are also not working!” Robert is disappointed. In several cases, the reader’s comments and discussions have been more interesting than the post itself. Unfortunately, comments did not exist in the HTML copy, because the comment link was not pointing to a static page; it was actually a request to the platform hosting his blog. When the platform shuts down, this service will also be cancelled and all the comments will be lost forever. “This is unacceptable,” he thinks. “I should use a tool to download all these pages automatically.” He finds a tool for downloading websites and begins testing it. He realizes that the tool does indeed download all of his posts and he does not need to deal with folder issues. Unfortunately, navigation by tags and dates is still not working. He inspects one downloaded page and he realizes that these links follow a different pattern. For this reason, he starts testing the settings offered by the website copier tool. He sees there are settings related to relative and absolute URIs and to URI transformations in general. After a few tests, he finds a set of configuration parameters that works. “Yes!” he exclaims “Navigation to posts via tags and dates works! Let’s also check whether readers’ comments are visible.” He clicks on the counter indicating the number of comments of a post but unfortunately he gets nothing. 
“No!” Spontaneously, he clicks on the title of a post, and the comments of that post appear. Luckily for him, the tool has downloaded a file for each post (since each post has its own URL) in a format where reader comments are visible. “Perfect! I now have them all! I have fully retained the navigation experience in local files. Everything will work in the future as long as browsers continue to interpret correctly the current version of HTML, CSS, and JavaScript.” To be sure, he copies the folder with the downloaded pages to one of his hard disks, ensures that the Internet connection is off, and tests the pages again from the hard disk. Everything works fine. “I preserved my blog for personal use, but no one will be able to read these posts except for me. What am I supposed to do about this?” He recalls that there are web archiving services,1 so he connects to one of them and fills in a form with the URL of his blog. The system responds with the dates of the archived copies. Roughly, it seems that a copy was taken every 6 months and the most recent one is 6 months old. Consequently, it does not contain the posts of the past few months. Moreover, Robert notices that only the first page of his blog is archived. The previous pages, those that one gets by clicking “Older Posts,” are not archived. Fortunately, when he clicks on the link “Older Posts,” the system informs him that the

1 We have seen web archiving services in Sect. 14.2.2.


page he is requesting is not archived, but since that page is online now, he could request immediate archiving. He responds positively and after a few seconds these pages are archived as well. However, he does not notice any options for requesting the refresh of the first page of his blog. Robert now starts testing the archived pages of his blog. He notices that navigation by tags and dates does not always work. Moreover, the comments were not archived. On clicking on the title of a post, the system again informs him that the URL is not archived and the system again offers the option of immediate archiving. He responds positively, and then that particular post is archived together with its comments. “It is prohibitively time-consuming to do this for each individual post of my blog, just for archiving the readers’ comments. I already have the readers’ comments in the local HTML copy, also in an XML file. I will see in the future what I will do with them,” Robert thinks. So the only unresolved issue is that the web archive has not refreshed the first page of his blog. “I need a kind of web archive on demand.” After searching for a few minutes he finds that there are some archive-on-demand services, and he chooses one of them. For each of his five pages of posts, he fills a form with the URL of the page, his email and name, and receives back a URL that points to their archived content. It takes him less than 5 min to complete the entire process. His recent posts are now archived. However, the comments (for the same reason as before) are not archived. “It does not matter; I will do something about them when I decide to create a new blog in another platform. In the new blog I am going to import all of my posts as well as their comments. I will just have to transform the information that is already in the XML file that I have.” He looks at the platforms and he realizes that there is no standard for exchanging blogs across different platforms. 
It is late and Robert feels exhausted. He turns off his computer. Shortly afterwards, his mind goes back to his blog. “I will lose the contacts!” Most of his friends in the platform were anonymous, and probably for this reason their posts were quite interesting and very genuine. After the termination of the platform, this community would probably cease to exist and, without knowing their real names, it would be difficult for Robert to reconnect with them. “What a pity. It would be nice if the platform could continue its operation.” Robert starts wondering about who is, or should be, the owner of the platform. Is it the company that has written the software 20 years ago and currently maintains it, or the people who have uploaded so much content there, or both? And even if all these people move to another platform, will it be the same for them? Robert had never imagined that he would feel a kind of digital uprooting and that it would hurt.

17.2 Technical Background

This section briefly discusses Moore’s law (in Sect. 17.2.1), storage minimization from a theoretical point of view (in Sect. 17.2.2), compression-related risks (in Sect. 17.2.3), preservation planning (in Sect. 17.2.4), the question “what to preserve”


(in Sect. 17.2.5), information value (in Sect. 17.2.6), data management plans (in Sect. 17.2.7), backup and replication (in Sect. 17.2.8), version control (in Sect. 17.2.9), and web blog preservation (in Sect. 17.2.10).

17.2.1 Moore’s Law

“Moore’s law” is an observation or projection, not a physical or natural law. It refers to an observation made by Gordon Moore (the co-founder of Fairchild Semiconductor and Intel). Moore (1965) described a doubling every year in the number of components per integrated circuit and projected that this rate of growth would continue for at least another decade. In simple words, it is commonly taken to mean that computers (digital electronics in general) double their capabilities every 1.5–2 years. A spin-off of Moore’s law is Kryder’s law. It was first introduced in 2005, when Scientific American published an article containing the observations of Mark Kryder, who had been a vice president of Seagate, one of the largest producers of data storage solutions. Kryder observed that the areal storage density of magnetic disks was increasing very quickly, at a much faster rate than the 2-year doubling period in Moore’s law. In fact, Kryder predicted that the doubling of disk density on 1 in. of magnetic storage would take place once every 13 months. In simple words, Kryder’s law is a variant of Moore’s law that applies only to the evolution of magnetic hard disks. Moore’s prediction has proved accurate until recently: in January 2017, Intel CEO Brian Krzanich declared, “I’ve heard the death of Moore’s law more times than anything else in my career . . . And I’m here today to really show you and tell you that Moore’s law is alive and well and flourishing.” This rate of progress of digital electronics has contributed to world economic growth in the late twentieth and early twenty-first centuries. Even today, Moore’s law is used in the semiconductor industry to guide long-term planning.
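To get a feeling for what these doubling periods imply, the following minimal Python sketch compounds a quantity that doubles every fixed number of months. The function name growth_factor and the 10-year horizon are illustrative choices, not something taken from Moore or Kryder; the doubling periods (24, 18, and 13 months) are the ones quoted above.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Factor by which a quantity grows if it doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


# Doubling periods mentioned in the text; the 10-year horizon is arbitrary.
for label, months in [("Moore, 24 months", 24),
                      ("Moore, 18 months", 18),
                      ("Kryder, 13 months", 13)]:
    print(f"{label}: x{growth_factor(10, months):,.0f} after 10 years")
```

Over the same period, a shorter doubling time leads to a much larger factor, which is why a 13-month doubling period outpaces a 2-year one so quickly.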

17.2.2 Storage Space and Kolmogorov Complexity

The amount of digital information constantly increases. Its preservation requires storage media, and this obviously costs money. One question that arises is how much we could reduce the required storage space. In this section, we make a brief comment on what holds in theory, so that we know the limits of the technical solutions that we can develop. Consider a digital string x, i.e., any sequence of 0 and 1 digits. The Kolmogorov complexity of the string x, denoted by K(x), is the length of its shortest description p on a universal Turing machine U that can produce x. The universal Turing machine U is essentially a mathematical description of a simple computer. Roughly, it can


take a program p as input, execute it, and produce output; let U(p) denote the output of the execution of p. Consequently, the Kolmogorov complexity of the string x, denoted by K(x), is defined as: $K(x) = \min\{\, \mathrm{size}(p) \mid U(p) = x \,\}$. It follows that if we want to reduce the storage space of the string x to the minimum, then K(x) is by definition the best that can be achieved. Note that K(x) can be much smaller than size(x), e.g., a program that produces one billion “1”s can be encoded in far less space than the space occupied by one billion “1”s. As another example, a program that produces the first 1 billion Fibonacci numbers, like the programs we have seen in Chap. 13, requires less space than storing explicitly these 1 billion numbers. It is also not hard to see that the minimal description of a string cannot be much larger than the string itself: it has been proved that there is a constant c such that $K(s) \le |s| + c$ for every string s. Another important question that arises is: can we compute K(s) for any string s? The answer is negative: it has been proved that K is not a computable function, i.e., there is no program that takes as input a string s and produces the integer K(s) as output. Therefore, in practical applications, K(s) has to be approximated. For instance, a straightforward method to compute upper bounds for K(s) is to first compress the string s with some method, implement the corresponding decompressor in the chosen language, then concatenate the decompressor to the compressed string, and finally measure the length of the resulting string. Another related question here is how the choice of the description language affects the value of K(s), i.e., the minimum storage space needed for preserving s.
A related result of Kolmogorov complexity is the following: if K1 and K2 are the complexity functions relative to description languages L1 and L2, then there is a constant c (which depends only on the languages L1 and L2) such that $|K_1(s) - K_2(s)| \le c$ for every string s. Consequently, the effect of changing the description language is bounded.
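The upper-bound method described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: zlib stands in for "a compression method", and DECOMPRESSOR_SIZE is a hypothetical placeholder for the size of a decompressor program, since the real value depends on the chosen description language.

```python
import os
import zlib

# Hypothetical constant: size in bytes of a decompressor written in the chosen
# description language. 1000 is an illustrative placeholder, not a measured value.
DECOMPRESSOR_SIZE = 1000


def kolmogorov_upper_bound(s: bytes) -> int:
    """Upper bound on K(s) in bytes: compressed size plus decompressor size."""
    return len(zlib.compress(s, 9)) + DECOMPRESSOR_SIZE


regular = b"1" * 1_000_000            # highly regular string
random_like = os.urandom(1_000_000)   # random bytes, practically incompressible

print("regular:", len(regular), "bytes, bound", kolmogorov_upper_bound(regular))
print("random: ", len(random_like), "bytes, bound", kolmogorov_upper_bound(random_like))
```

For the regular string the bound is far below the string length, whereas for random bytes it stays close to (or slightly above) the length, in line with the observation that the minimal description cannot be much larger than the string itself.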

17.2.3 Compression-Related Risks

As mentioned in Sect. 7.2.1, many compression algorithms that are used for images and videos are lossy (i.e., not lossless), meaning that they permit reconstruction only of an approximation of the original data. As a consequence, multiple successive applications of such compression algorithms could significantly reduce the quality. This occurs because the (i + 1)-th version of a digital object is derived by compressing the i-th version of the digital object. This is analogous to the effect of successive photocopying. For example, in 2009 a certain YouTube user decided to test image and sound degradation that occurs when you upload a video to YouTube, then download the video from YouTube, and upload it again. He did that 1000 times and the final version was almost unrecognizable. The result of each compression cycle is shown in the video entitled “Video Room 1000 Complete Mix—All


1000 videos seen in sequential order!”2 A brief discussion related to the digitization of analogue content and the authenticity issues that arise in audiovisual collections is given in the work by Teruggi (2010). It describes the implications of the “opposite problem”: the quality of the original audiovisual content that was created years ago can be very poor from a technical point of view, because the quality of images and sounds has continuously improved. For this reason, in the context of preservation, sometimes the audiovisual content is improved and can become better than it originally was. This, amongst other things, can make us wonder: Are we viewing the image/video as it originally was, or are we viewing a degraded image/video of the original image/video? It is not hard to realize that the preservation of audiovisual material is a complex and challenging task (see, e.g., Addis et al. 2010). PrestoCentre3 is a community of audiovisual archives that focuses on sharing knowledge about digital preservation for audiovisual heritage materials.
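The generational loss discussed at the start of this section, where the (i + 1)-th version is derived from the i-th one, can be illustrated with a toy simulation. The smoothing step below is only a stand-in for a real lossy codec (it discards a little fine detail on every pass); the signal, the number of generations, and all function names are illustrative assumptions.

```python
import math


def lossy_generation(samples):
    """One toy 'lossy' round trip: blend each sample with its neighbours,
    which discards some fine detail (circular boundary for simplicity)."""
    n = len(samples)
    return [0.25 * samples[i - 1] + 0.5 * samples[i] + 0.25 * samples[(i + 1) % n]
            for i in range(n)]


def rms_error(a, b):
    """Root-mean-square difference between two equally long sample lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


n = 64
# Illustrative "original": a slow wave plus some finer detail.
original = [math.sin(2 * math.pi * 4 * i / n) + 0.3 * math.sin(2 * math.pi * 20 * i / n)
            for i in range(n)]

errors = []
current = original
for generation in range(50):
    current = lossy_generation(current)   # next version derived from the previous one
    errors.append(rms_error(original, current))

print(f"error after 1 generation:   {errors[0]:.3f}")
print(f"error after 50 generations: {errors[-1]:.3f}")
```

Each pass loses only a little, but because every version is derived from the previous one, the distance from the original keeps growing, much like successive photocopying.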

17.2.4 Preservation Planning

As already mentioned in various places in the previous chapters (and as will be analyzed further in Chap. 18), two widely used strategies for preserving digital content are: (a) migration (i.e., copying or conversion of digital media from one format to another) and (b) emulation (imitation of the behavior of a computer or other electronic system with the help of another type of computer/system). However, even the applicability of a strategy is a huge and context-dependent problem. This is because in each case we have to inspect different parameters, tackle different requirements, and pursue appropriate solutions. For example, the migration of large collections of documents is not always a safe approach, and should be treated carefully, since as you migrate you might lose something (e.g., coloring, formatting). In addition, in the case of digital libraries, emulation is not always the best choice, since it would have to be applied to millions of digital objects of high complexity. Preservation Planning is a process by which the specific needs of the preservation of digital objects are determined, the list of available solutions is recorded, and the actions that an institution or a data manager will take are defined. The ultimate goal is to define a strategy that will ensure authentic future access for a specific set of objects and designated communities by defining the actions needed to preserve them. An indicative workflow for a digital preservation planning process is as follows:
1. Initially, a process defines a preservation scenario by choosing some sample datasets or collections and identifying the requirements and goals for that

2 https://www.youtube.com/watch?v=icruGcSsPp0
3 http://www.prestocentre.org/


scenario. Requirements are specified in a quantifiable way, starting at high-level objectives and breaking them down into measurable criteria.
2. Afterwards, another process defines and evaluates the potential alternatives. This includes the identification of the different approaches that could be followed, the technical details and configurations, and the required resources for executing them on the datasets defined in the previous step. The potential alternatives are evaluated by applying the selected tools to the defined datasets (or collections) and producing the evaluation output.
3. Finally, the results are aggregated, the important factors are set, and the alternatives are ranked. The analysis takes into account the different weighting of requirements and ends up in a well-informed recommendation for a digital preservation solution to be adopted.
The above process can be automated and there are tools to support it. One tool that is available online is the Plato Preservation Planning tool,4 which integrates the services for context characterization, preservation action, and automatic object comparison. Figure 17.1 shows two indicative screenshots of the process (selection of alternatives and inspection of the results).
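The last step of the workflow, ranking alternatives under weighted requirements, can be sketched as a simple weighted sum. This is a minimal illustration and not the algorithm used by Plato; the criteria, weights, scores, and alternative names below are all hypothetical.

```python
# Hypothetical measurable criteria with weights that sum to 1.
weights = {"format_openness": 0.4, "content_fidelity": 0.4, "cost": 0.2}

# Hypothetical evaluation results on a 0-5 scale (higher is better).
alternatives = {
    "keep original format": {"format_openness": 2, "content_fidelity": 5, "cost": 5},
    "migrate to PDF/A":     {"format_openness": 5, "content_fidelity": 3, "cost": 4},
    "emulate environment":  {"format_openness": 3, "content_fidelity": 5, "cost": 1},
}


def weighted_score(scores, weights):
    """Aggregate per-criterion scores into one value using the given weights."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


# Rank the alternatives from best to worst aggregated score.
ranking = sorted(alternatives,
                 key=lambda name: weighted_score(alternatives[name], weights),
                 reverse=True)

for name in ranking:
    print(f"{weighted_score(alternatives[name], weights):.1f}  {name}")
```

Changing the weights changes the recommendation, which is why the requirements and their relative importance have to be fixed before the alternatives are evaluated.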

17.2.5 On Selecting What to Preserve

The task of selecting what to preserve is usually referred to by the term selection and appraisal. It is a difficult task, especially if the volume of information is large. The selection of the digital material to be preserved in many cases depends on what is feasible to preserve, as well as the cost of preservation. This is evident in the case of computer games (or video games). Various approaches for the preservation of computer games are possible, e.g., through physical preservation, through the development of an emulator, or through the video image as mentioned by Nakamura et al. (2017). It is worth noting that augmented reality games pose even more challenges as described by Lee et al. (2017). For instance, it is challenging to establish the boundary of the “game object” in such games, because they include maps, photos, and real-world interactions, and this raises both technical and legal issues. Inevitably, the selection and appraisal of digital material to be preserved is related to the “value of information,” an issue that is discussed in the next subsection.

4 http://www.ifs.tuwien.ac.at/dp/plato/


Fig. 17.1 Indicative screenshots from the Plato Preservation Planning tool

17.2.6 Value of Information

Digitally encoded information is extremely valuable to everyone. Just indicatively, in 2011, McAfee released the results of a user study, revealing that consumers place an average value of approximately $37,000 on their digital assets.5 However, we also create, every day, a vast amount of information that: (a) is of no use to certain communities or people, (b) will never be used by anyone, or (c) can be reproduced easily from the existing information. Several everyday digital objects fall within these categories: log files, temporary files, and binary files produced from original sources (e.g., .class files). Some digital objects seem extremely valuable, and we should not risk losing them, whereas others are not so valuable to us. So, in order to have a quantifiable measure for the necessity of such objects, we should first define the term “value.”

5 McAfee Press Release, September 2011. URL: https://www.mcafee.com/us/about/news/2011/q3/20110927-01.aspx. Accessed: 2018-05-03. (Archived by WebCite® at http://www.webcitation.org/6z8bleb9T)


From an economics point of view, value can be conceptualized as the relationship between the consumer’s perceived benefits and the perceived costs for receiving these benefits. However, this does not always hold for preserved content: many organizations create and preserve scientific information but gain no profit from it, as it is usually open and freely accessible. Therefore, one could define the value of preserved information on the basis of the processes and activities needed over time to offer this information to the final users. In practical terms, this means that the value depends on the achievement of the organizational objectives and missions as well as on the satisfaction of the needs of the final users. Several past research projects in the field of digital preservation have focused on the value and the estimation of the cost of digital preservation. These works identify the value with respect to different parameters, including the type of data, the complexity, the volume, the preservation policies that will be followed, the level of automation, the risks, etc. The main characteristics of these models are as follows:
• They measure the value according to the willingness of decision-makers (or others who use the data) to pay. Their willingness depends on the level of uncertainty and the amount of possible loss.
• They measure the value taking into account the usability, sharability, time, accuracy, precision, risks, unicity, and integrity.
• They measure the value taking into account an approximation of the cost of acquiring, creating, archiving, and preserving information.
• They measure the present value of the expected future economic benefits.
It turns out that defining the cost of preserving digital information is a rather complex task. This happens because the total life cycle cost for preserving a digital object depends on several cost factors.
An indicative cost model is presented in the work by Strodl and Rauber (2011), which focuses on a small-scale automated digital preservation system. This cost model provides a calculation for the hardware storage demand of the archive and it also considers the growth of the size of the archive, the hardware migration, and the cost trend of the storage media. The overall cost is defined as the sum of the following costs:
• Acquisition cost, which includes the selection of policies and content. The cost reflects the effort that should be made by the user for performing the selection and is multiplied by the user’s requirements level.
• Ingestion cost, which includes the cost of creating metadata and making updates to the holdings. The creation of the metadata is labor-intensive work and can cause considerable costs.
• Bit-stream preservation cost, which covers the cost of hardware and manual work for physical backups. There is a clear distinction between storage as a service (e.g., cloud storage) and storage on hardware (e.g., hard disks, optical disks). This measure also takes into account the continuously increasing storage capacities and the decreasing storage prices. The cost includes the initial

178

17

The Personal Archive of Robert: On Preservation Planning

cost for purchasing the hardware for storage, the cost of refreshing the storage, the cost of recovering from disaster, the cost of backups, and other indirect costs that might occur (storage maintenance, storage procurement, etc.). • Content preservation cost, which includes the quality assurance of preservation actions. For example, migrations modify the actual data; therefore, validation of results is very important for guaranteeing the authenticity and trustworthiness of the archive. • Preservation system software, which includes the costs of the software and its customization. Figure 17.2 shows the detailed formulas for computing the total costs. In the work by Strodl and Rauber (2011), a case study was conducted showing the cost calculations for a small business setting. The initial collection had a size of 75 GB (with an expected rate of 5%). By using particular rates for the software, hardware, and the users, the authors have shown that the total costs per year for preservation ranges from €1500 to €3500. Another EU project that focused on the cost of curation is the 4C (Collaboration to Clarify the Cost of Curation) Project, which has analyzed ten current and emerging cost-benefit models.6
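The storage-related part of such a model can be illustrated with a minimal sketch. It assumes that the 5% rate of the case study is an annual growth rate of the archive; the function name `project_storage_cost`, the price per GB, and the annual price decline are hypothetical values chosen for illustration and are not taken from Strodl and Rauber (2011).

```python
def project_storage_cost(initial_gb, growth_rate, price_per_gb, price_decline, years):
    """Project archive size and storage price year by year.

    Returns one tuple (year, size_gb, price_per_gb, cost) per year, where
    cost is what it would take to buy storage for the whole archive at
    that year's price. All rates are fractions (0.05 means 5%).
    """
    rows = []
    size, price = float(initial_gb), float(price_per_gb)
    for year in range(years):
        rows.append((year, size, price, size * price))
        size *= 1.0 + growth_rate      # the archive grows
        price *= 1.0 - price_decline   # storage gets cheaper
    return rows


if __name__ == "__main__":
    # 75 GB comes from the case study; 0.10 EUR/GB and a 15% yearly
    # price decline are assumptions made only for this example.
    for year, size, price, cost in project_storage_cost(75, 0.05, 0.10, 0.15, 5):
        print(f"year {year}: {size:7.2f} GB at {price:.3f} EUR/GB -> {cost:6.2f} EUR")
```

With these assumed rates the price falls faster than the archive grows, so the cost of re-buying storage for the whole archive decreases over time; with a higher growth rate the trend would reverse.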

17.2.7 Data Management Plan

Data management plans (DMPs) are formal documents that are prepared in the context of research projects and outline how data are to be handled during the lifetime of the project and after the project is completed. The ultimate goal of a DMP is to encourage scientists and researchers to consider the various aspects of data management, metadata generation, and data and information preservation before the project even begins. A DMP defines the guidelines for collecting and generating data within a project; this ensures that data are well organized, are in the proper format and repository, and carry the required metadata that describe them. A DMP contains information about the kind of data that will be collected or generated, as well as the processes for collecting or generating them. Moreover, it should describe what kind of documentation and metadata will be stored with the actual data, as well as other information about sharing and distribution. Although there is no standard guideline for producing it, a typical DMP should at least contain information about the following:
• The data collection (i.e., description of the data that will be collected or generated during the lifetime of a research project).

6 http://www.4cproject.eu/summary-of-cost-models/

Fig. 17.2 Cost model (based on Strodl and Rauber 2011)
[Diagram of the cost categories and their components. Acquisition: Policy Selection, Content Selection. Ingest: Metadata Creation (*), Update Holding. Bit-stream Preservation: Storage Hardware, Refreshment, Disposal, Storage Procurement, Disaster Recovery, Storage Maintenance and Support (*), Backup Procedure, Backup. Content Preservation: QA Preservation Action. Preservation System: Preservation System Software, Customization of Software.]
Symbols used in the formulas: c: costs in €; e: human effort; n: number or amount; r: rates in %; s: size in GBs; u: number of user requirements; frc: frequency; rmd: storage capacity annual improvement rate; *: optional

• Documentation and metadata (i.e., all the supplementary information that should accompany the data collection).
• Copyright and intellectual property rights issues (i.e., description of the licenses that are applicable, restrictions, possible embargo periods, owners of the data, etc.).
• Ethical issues (i.e., if there are sensitive data, then the required processes for protecting them should be described in detail).
• Storage and Backup (i.e., description of the storage and backup solutions for both the data and the metadata).
• Data Preservation (i.e., the processes carried out for ensuring long-term access to the data).

Nowadays, there are platforms that assist users in creating a DMP through a series of guidelines. More specifically, they ask the users various questions about the data of their research project, and the answers are then compiled into a DMP. An indicative tool for preparing DMPs is DMPonline. A more detailed list of relevant tools is given at the end of this chapter.

In recent years, there has been a growing interest in machine-actionable DMPs (maDMPs), i.e., plans that can be executed rather than only read. Such plans could reduce the effort required for carrying out DP-related actions, either one-off actions or periodic ones. Realizing this objective requires associating data management plans with particular technologies, like workflow management systems, content management systems, data catalogs, and others, for tackling the preservation-related requirements related to storage, transformation, metadata, and monitoring (e.g., automatic periodic checksum control and others). (For more details, see Miksa et al. 2017.)
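To make the idea of a machine-actionable plan more concrete, the following minimal sketch represents a DMP as a plain dictionary, checks that the sections listed above are filled in, and performs one monitoring action (a checksum comparison). The section keys, the function names, and the in-memory "files" are hypothetical; real maDMP tools define their own schemas.

```python
import hashlib

# Section keys follow the list of DMP contents above; the exact key
# names are an assumption made for this sketch.
REQUIRED_SECTIONS = (
    "data_collection",
    "documentation_and_metadata",
    "copyright_and_ipr",
    "ethical_issues",
    "storage_and_backup",
    "data_preservation",
)


def missing_sections(dmp):
    """Return the required sections that are absent or empty in the plan."""
    return [name for name in REQUIRED_SECTIONS if not dmp.get(name)]


def sha256_hex(content):
    """Return the SHA-256 digest of a bytes object as a hex string."""
    return hashlib.sha256(content).hexdigest()


def fixity_report(recorded_checksums, current_files):
    """Compare recorded checksums with the current content of each file.

    recorded_checksums maps file names to hex digests stored earlier;
    current_files maps file names to their present content (bytes).
    """
    report = {}
    for name, expected in recorded_checksums.items():
        if name not in current_files:
            report[name] = "missing"
        elif sha256_hex(current_files[name]) == expected:
            report[name] = "ok"
        else:
            report[name] = "changed"
    return report


if __name__ == "__main__":
    plan = {"data_collection": "Survey responses", "storage_and_backup": "Two disks"}
    print("Sections still to be written:", missing_sections(plan))

    recorded = {"a.txt": sha256_hex(b"hello"), "b.txt": sha256_hex(b"world")}
    print(fixity_report(recorded, {"a.txt": b"hello", "b.txt": b"w0rld"}))
```

In an operational setting the checksum comparison would be scheduled periodically and would read the files from storage; it works on in-memory bytes here only to keep the example self-contained.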


17.2.8 Backup and Data Replication Against Hardware Media Failures

Backup and replication refer to the process of copying data so that it can be recovered after a possible loss. Data loss is not unusual: in a survey conducted in 2008, 66% of the respondents said that they had lost files on their home PC. The backup process is crucial for big organizations and companies, and it should be part of their disaster recovery plans. Backup raises various questions: how often to back up, what data to back up, where to store backups, etc. The backup process is also related to replication. While a backup process aims at avoiding data loss, a replication process (that keeps various snapshots of your data in various places) also aims at instant restoration of data, something that is important for business continuity.

One might think that replicating files in multiple copies keeps them protected against hardware media failures. In general this is true; however, one might ask the following question: “How many copies are considered a safe option?” Although the question seems reasonable, it cannot be safely answered for two main reasons: pricing and statistical factors. Pricing is an obvious factor that determines the number of copies we are willing to preserve. Replicating audiovisual files can be a space-consuming process since such files are usually very large. Apart from the size of the files to be replicated, another factor is the price of the hardware medium: solid state storage is 15 times more expensive than hard disk storage. As regards the statistical factors, assume that we replicate files once. This means that there are two replicas, and at some point one replica fails. When the failure is detected, we have to read the contents from the working replica and copy them to a new disk. If, during this period, something happens to the working replica that makes its contents unreadable, then we lose the files forever.
The same pattern occurs with more replicas. Consider, for example, RAID systems. RAID systems claim to be reliable; however, this is not always true. RAID assumes that disks fail randomly; however, failures are usually correlated. The disks in RAID systems are typically from the same manufacturer and from the same batch; they have the same firmware and manufacturing glitches. In addition, they are co-located: they share the same power supply, the same cooling, the same vibration environment, etc. So if one disk in a RAID system fails, there is a high probability that another will fail as well. Although we cannot safely define the number of copies that guarantees that our digital heritage is safe (with respect to hardware media failures), we can define the following basic principles:
• The more copies we preserve, the safer it is.
• The less correlated the copies are, the safer it is.


A common rule about the number and storage of backups is the 3-2-1 rule. According to this rule, you should have at least three copies of your data, two local copies on separate storage devices, and one copy off-site. For example, if a device fails with probability 1/1000 and failures are independent, then the probability of losing the data when one copy is stored on each of two such devices is 1/1,000,000, on three devices it is 1/10^9, and so on. The reason the above rule suggests keeping a third copy off-site is to be prepared in case the physical location of the primary copy and its backup is hit by a physical or other disaster.
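The arithmetic of this example, and the effect of correlated failures mentioned earlier, can be checked with a short sketch. The first function assumes fully independent failures. The second adds a deliberately simple common-cause model, which is an assumption of this sketch and not part of the 3-2-1 rule: with probability `q` a single event (fire, power surge, faulty batch) destroys all co-located copies at once.

```python
from fractions import Fraction


def all_copies_lost(p, copies):
    """Probability that every copy is lost, assuming independent failures
    and the same failure probability p for each device."""
    return p ** copies


def all_copies_lost_common_cause(p, copies, q):
    """Same as above, plus an assumed common-cause event of probability q
    that destroys all copies together (a simplified correlation model)."""
    return q + (1 - q) * p ** copies


if __name__ == "__main__":
    p = Fraction(1, 1000)
    for k in (1, 2, 3):
        print(k, "copies, independent:", all_copies_lost(p, k))
    # q = 1/10000 is a hypothetical value chosen only for illustration.
    print("3 co-located copies:", float(all_copies_lost_common_cause(p, 3, Fraction(1, 10000))))
```

Under this model the common-cause term dominates as soon as a few copies exist, which is one way to see why adding co-located copies helps less than moving one copy off-site.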

17.2.9 Version Control

Version control is the management and tracking of changes that are applied to files (documents, source code, etc.). Changes are usually labeled by a revision ID, which is an identifier (usually a number) of the changes applied to a collection of files. Revisions can be compared, restored, and merged. Version management is typically needed when a team of people work on the same files. It is common practice in software engineering, where different developers write or update code in the same files.

Two key notions in version control are the notion of branch and the notion of trunk. Branching is the duplication of a file collection under revision; it allows the parallelization of changes, in the sense that two different copies can be changed independently of each other. Trunk refers to the base/main line of changes. After changes, branches are merged into the trunk. Version control systems usually store the versions of a file in a differential manner, thereby avoiding the need to store the entire contents of each version. There are several software systems for version control, commercial as well as open source; the most common are Bazaar, Git, and Apache Subversion (SVN). What should be logged in a version control system (for digital resources in general) depends on the provenance-related requirements, as described in Sect. 7.2.4.
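The idea of differential storage can be illustrated with a minimal sketch based on Python's standard `difflib` module. It is not how Git, Bazaar, or Subversion store revisions internally; the function names `make_delta` and `apply_delta` and the delta format are hypothetical. The point is only that a new version can be rebuilt from the old version plus a small description of what changed.

```python
from difflib import SequenceMatcher


def make_delta(old_lines, new_lines):
    """Describe new_lines as copy ranges from old_lines plus added lines."""
    delta = []
    matcher = SequenceMatcher(None, old_lines, new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged lines: store only the range, not the content.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # New or changed lines: store their content.
            delta.append(("add", new_lines[j1:j2]))
        # "delete": nothing needs to be stored.
    return delta


def apply_delta(old_lines, delta):
    """Rebuild the new version from the old version and a delta."""
    result = []
    for operation in delta:
        if operation[0] == "copy":
            result.extend(old_lines[operation[1]:operation[2]])
        else:
            result.extend(operation[1])
    return result


if __name__ == "__main__":
    v1 = ["Dear diary,", "today I bought a chair.", "Bye."]
    v2 = ["Dear diary,", "today I bought a rocking chair.", "Bye.", "P.S. It was in Tallinn."]
    delta = make_delta(v1, v2)
    print(delta)
    print(apply_delta(v1, delta) == v2)
```

Only the changed and added lines are stored in full; unchanged regions are referenced by position, which is what keeps the history of a large, slowly changing file small.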

17.2.10 Blog Preservation

Typically a blog (or weblog) is a website that contains a log or diary of information of a person (the blogger). Commonly, a blog consists of posts by the author, each having a title, a body (in HTML), a date, and, optionally, a set of tags. The readers of the blog can leave comments on a post. The presentation and the navigation of the posts in a blog is offered in reverse chronological order, as well as through a chronological index and through the tags used. In addition, readers can subscribe to blogs for getting notifications.


Currently, the interoperability between blog platforms is rather limited. Although blog platforms offer import and export functions in XML, the output from one platform cannot be used straightforwardly for importing it to another platform. This happens because there are no agreed standards for exchanging blogs, although technically it is not a difficult problem. However, if the objective is the preservation of the posts of one particular blog (and not the continuation of the operations of the blog in a different platform), one could resort to website copiers. They are tools that allow a user to download and store, in local storage space, a website, i.e., any set of interlinked HTML pages. In such tools, the user provides the URL of the website to be copied and the local folder where the downloaded pages should be stored. Various options are commonly offered, including the depth of the copy, i.e., how many consecutive HTML links of the source website the copier should follow and download. Moreover, there are options for controlling how the URLs of the locally stored pages should be formed for ensuring that the links included in them are functional, i.e., they point to the correct locally stored pages, and thereby preserving the navigational structure of the website. Apart from website copiers, there are also tools that download websites and store them in PDF. An indicative list of these tools is given in the references section at the end of this chapter. In our plot, the counter-anchored link of Robert’s blog (e.g., number 6 in the upper right corner of the post shown in Fig. 17.3) did not function because on clicking it a JavaScript function (embedded in the HTML page) is called, which enriches dynamically the HTML page with a new link, i.e., the HTML code was like the following:
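Why such a link escapes a website copier can be illustrated with a minimal sketch. The page below is hypothetical and is not the code of Robert's blog: it contains one ordinary link and one link that comes into existence only when a script runs in a browser. The collector mimics a simple copier that looks for `<a href=...>` elements in the static HTML.

```python
from html.parser import HTMLParser

# Hypothetical page: "post-5.html" is a static link, while the link to
# the next post is built by a (hypothetical) script at click time.
PAGE = """
<html><body>
  <a href="post-5.html">Previous post</a>
  <span id="counter" onclick="addCounterLink()">6</span>
  <script>
    function addCounterLink() {
      var a = document.createElement('a');
      a.href = 'post-' + (5 + 1) + '.html';
      document.getElementById('counter').appendChild(a);
    }
  </script>
</body></html>
"""


class StaticLinkCollector(HTMLParser):
    """Collects the href values of <a> elements present in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def static_links(html):
    """Return the links a copier would find without executing any script."""
    collector = StaticLinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    print(static_links(PAGE))
```

The collector finds only the static link. The target of the second link is never spelled out in the page, since it is computed by the script, so even a textual search for file names would miss it.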

The way it was written, i.e., in code, did not allow the website copier tool to detect it and transform it so that it would be functional in the downloaded website. In the general case, it would be practically impossible for a program to do so, since a programming language offers several different ways to express the same command. This would require technology that fully understands what a piece of code will do (something that is not generally the case today). Issues related to the preservation of blogs (and websites in general) have been studied in the literature: for blog preservation, see the work of Kasioumis et al. (2014); for web content management systems (CMS), see the work of Banos and Manolopoulos (2015); a method for evaluating website archivability is described in another work by Banos and Manolopoulos (2016). Another aspect of the problem concerns regulations. For instance, in the European Union, as of 2018, these and other platforms are supposed to respect the related EU regulations


Fig. 17.3 An indicative blog

[EU 2016/679] on the protection of natural persons with regard to the processing of personal data and the free movement of such data. It is worth noting that even digital spaces can be associated with emotions; therefore technology (and digital preservation in particular) should be able to respect this aspect as well. A real post written by one blogger after having been informed that the blog platform she uses will cease to work follows (the name of the platform has been kept anonymous): By Ligeri Vasiliou 4/3/2018: I’m charged !!!! So emotionally charged I did not suspect it. I had not imagined that it would cost me so much . . . the demolition of the XblogPlatform. Perhaps the review of my posts (1107 in total), and most of all the 16,810 comments by visiting friends and even passers-by from the above pages, raised upset waves in my soul that literally touched the sky. In particular, the comments, which are soul deposits, expression of emotions, and dedication of time of their authors, have been kneeling me. How could I gather these things, the sanctuaries? Things can be stored cumulatively, but our heartbeats never. And how do I place them with reverence on another blog platform? Any other platform may be stable, large, provide some guarantees, but it does not inspire, because it is impersonal. It does not have the warmth, the immediacy, the intimate atmosphere of the XblogPlatform neighborhood.

17.3 Pattern: Preservation Planning

Pattern ID: P14
Problem’s name: Preservation planning
The problem: Robert is worried about his digitally encoded files. He fears that with no action, a significant percentage of his digital content could turn out to be useless some years from now. He would like to preserve his content, but he does not know which part of his archive to preserve, or how. This concerns not only digital content that is stored on his computer but also content that is stored on various web-based platforms. He also wonders about the cost of these procedures (not only the financial part).
Type of digital artifacts: Collection of digital objects
Task on digital artifacts: All tasks
What could have been done to avoid this problem: Robert could rely on a preservation planning tool (like Plato) for evaluating the different alternatives and selecting the optimal one. Such tools can automate, or just assist us in carrying out, processes that would otherwise have to be performed manually. For content that is stored in web platforms, he should have checked what export functions they offer. Moreover, he should have periodically used and tested the export facilities that they offer, as well as tools for automatically copying the material that he has uploaded.
Lesson learnt: We all agree that our digital assets should be preserved. However, many of us do nothing (or very little) to preserve them properly, because of the required time and cost, as well as for reasons of privacy. Moreover, users tend to overlook what export options are offered by the platforms they decide to use for uploading content.
Related patterns: iPrevious: P1 (Storage media: durability and access), P2 (Metadata for digital files and file systems), P13 (Authenticity assessment); iNext: –

17.4 Questions and Exercises

1. Find recent reports that estimate the amount of digital information that is produced yearly.
2. Browse the file system of your computer starting from its root directory and estimate the size of the stored information that you would not like to lose (if you take periodic backups, the sought size is the backup size).
3. Suppose that you can store (back up) only one-tenth of the amount that you estimated in the previous question. On what criteria would you base your selection?
4. While browsing your file system, identify the types of digital objects that will probably require migration or conversion in the future. Compute their size, also as a percentage of the size that you calculated in Question 2.
5. Use a Preservation Planning tool (e.g., Plato) for defining the best strategy for preserving:
   a. Your personal collection of photos; all the photos should be in JPEG format, so please consider that migration activities might be required.
   b. Your documents; assume that all your documents must be migrated to an archival-safe standard (e.g., PDF/A).
   c. Your music collection; include activities that migrate older formats to MP3 format.
6. Select one of your files and compute approximately its Kolmogorov complexity.
7. Calculate the cost for preserving all your documents (doc, docx, pdf, txt, rtf, xml) for 20 years using the formulas described in Strodl and Rauber (2011). For the calculations, you can pick the common rates for employees and the current prices for the hardware/software that will be used (see also the related exercises from Chap. 4).
8. Calculate the cost for preserving all your documents using the 3-2-1 rule. For your calculations, check for the most recent prices of storage media.
9. Find an image compression tool (there are various such tools online) and compress an image of yours using a lossy format (e.g., JPEG). Do it as many times as you can. Can you spot any differences?
10. If you have done Exercise 6 of Chap. 4, estimate the amount of digital information that is stored in obsolete storage media in your home.
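As a starting point for Exercise 6, note that Kolmogorov complexity cannot be computed exactly, but the size of a losslessly compressed copy of a file gives an upper bound on it (up to a constant that depends on the decompressor). The following minimal sketch uses Python's standard `zlib` module; the helper names and the two sample inputs are hypothetical, and for a real file one would pass its bytes instead.

```python
import hashlib
import zlib


def compressed_size(data):
    """Size in bytes of the zlib-compressed data (maximum compression);
    an upper-bound style estimate of the data's complexity."""
    return len(zlib.compress(data, 9))


def pseudo_random_bytes(n, seed=b"seed"):
    """Deterministic, irregular-looking bytes built by chained hashing;
    used here only as an example of hard-to-compress content."""
    out, block = b"", seed
    while len(out) < n:
        block = hashlib.sha256(block).digest()
        out += block
    return out[:n]


if __name__ == "__main__":
    regular = b"ab" * 5000                   # highly regular content
    irregular = pseudo_random_bytes(10000)   # content with no visible pattern
    for label, data in (("regular", regular), ("irregular", irregular)):
        print(f"{label}: {len(data)} bytes -> {compressed_size(data)} bytes compressed")
```

Both inputs have the same length, yet the regular one compresses to a tiny fraction of its size while the irregular one hardly compresses at all, which matches the intuition that it takes a longer description to reproduce it.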

17.5 Links and References

17.5.1 Readings

About Kolmogorov Complexity
• Kolmogorov, A. N. (1968). Three approaches to the quantitative definition of information. International Journal of Computer Mathematics, 2(1–4), 157–168.
• Strodl, S., & Rauber, A. (2011). A cost model for small scale automated digital preservation archives.

About Moore’s Law and Kryder’s Law
• Moore, G. (1965). Moore’s law. Electronics Magazine, 38(8), 114.
• Schaller, R. R. (1997). Moore’s law: past, present and future. IEEE Spectrum, 34(6), 52–59.
• Walter, C. (2005). Kryder’s law. Scientific American, 293(2), 32–33.

About the Preservation of Computer Games
• Lee, J. H., Keating, S., & Windleharth, T. (2017). Challenges in preserving augmented reality games: A case study of Ingress and Pokémon GO. 14th international conference on digital preservation, Kyoto, Japan.
• Nakamura, A., Hosoi, K., Fukuda, K., Inoue, A., Takahashi, M., & Uemura, M. (2017). Endeavors of digital game preservation in Japan – A case of Ritsumeikan Game archive project. 14th international conference on digital preservation, Kyoto, Japan.

About Machine-Actionable Data Management Plans
• Miksa, T., Rauber, A., Ganguly, R., & Budroni, P. (2017). Information integration for machine actionable data management plans. International Journal of Digital Curation, 12(1).

About Replicating Backup Copies
• Rosenthal, D. S. (2010). Keeping bits safe: how hard can it be? Communications of the ACM, 53(11), 47–55.

About Multimedia and Compression and Quality
• Teruggi, D. (2010). Ethics of Preservation. Cultural heritage on line, 1000–1004.
• Dar, Y., Bruckstein, A. M., & Elad, M. (2016, December). Image restoration via successive compression. In Picture coding symposium (PCS) (pp. 1–5). IEEE.
• Addis, M., Boch, L., Allasia, W., Gallo, F., Bailer, W., & Wright, R. (2010). Digital preservation and access in the PrestoPRIME project. In DPIF symposium.

About Weblog and Website Preservation
• Kasioumis, N., Banos, V., & Kalb, H. (2014). Towards building a blog preservation platform. World Wide Web, 17(4), 799–825.
• Banos, V., & Manolopoulos, Y. (2015). Web content management systems archivability. In East European Conference on advances in databases and information systems (pp. 198–212). Cham: Springer.
• Banos, V., & Manolopoulos, Y. (2016). A quantitative approach to evaluate Website Archivability using the CLEAR+ method. International Journal on Digital Libraries, 17(2), 119–141.

About Data Protection Directives and Regulations
• Official Journal of the European Union. (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council. L119, 4 May 2016, p. 1–88.
• European Commission, Data Protection. (https://ec.europa.eu/info/law/lawtopic/data-protection_e)


17.5.2 Tools and Systems

About Preservation Planning
• Plato (http://www.ifs.tuwien.ac.at/dp/plato/intro/)
• ETD+ Toolkit (https://educopia.org/publications/etdplustoolkit)

About Data Management Planning
• DMPonline (https://dmponline.dcc.ac.uk)
• DMPTool (https://dmptool.org/)
• UK Data Service (https://www.ukdataservice.ac.uk)

About Data Replication
• The LOCKSS Program (https://www.lockss.org) is an open-source, library-led digital preservation system built on the principle that “lots of copies keep stuff safe.”
• RAID (https://en.wikipedia.org/wiki/RAID)

About Image and Video Compression
• Caesium (https://saerasoft.com/caesium/)
• TinyJPG (https://tinyjpg.com/)
• Handbrake (https://handbrake.fr)

About Copying Websites
• WGet (https://www.gnu.org/software/wget/)
• WinWGet (https://sourceforge.net/projects/winwget/)
• WKTOpdf (https://wkhtmltopdf.org/)

About Version Control
• GIT (https://git-scm.com)
• Bazaar (http://bazaar.canonical.com/en/)
• Apache Subversion, SVN (https://subversion.apache.org)

17.5.3 Projects

About Value of Information
• LIFE Project (http://www.life.ac.uk/)
• Cost Model for Digital Preservation (CMDP) Project (http://www.costmodelfordigitalpreservation.dk/)
• ENSURE Project (http://cordis.europa.eu/project/rcn/98002_en.html)
• ERPANET Project (http://www.erpanet.org/)
• SCIDIP-ES (http://www.scidip-es.eu/)
• 4C Project, Collaboration to Clarify the Cost of Curation (http://www.4cproject.eu/)

Chapter 18

The Meta-Pattern: Toward a Common Umbrella

18.1 Episode

It’s Sunday evening and Robert has just finished a global backup of his files and his blog. He feels tired; he comes out of his office and sits on the wooden rocking chair he had bought about 10 years ago from an antique shop in Tallinn. “All these tasks should be easier,” he ponders. “Why isn’t digital material more easily interpretable and manageable? Is it a matter of missing standardization, bad practices, or is there something else to blame? Have we just failed to pay attention to the fact that everything in technology changes very quickly, and so we should pay more attention to the interoperability and completeness of digital material? I examined so many files from the USB stick and I failed to achieve what I wanted. I didn’t get an overview of the contents of the entire USB stick, nor were there adequate metadata. I faced problems even with the encoding of symbols. Let alone provenance, clarity of semantics, lack of context. As for software, the situation was even worse. Software execution is not at all trivial, and at the same time it could be dangerous. Furthermore, the older the digital content is, the harder these problems become. For example, how to use old software written in programming languages that are no longer used, which were even made for hardware that does not exist today? And if all these problems occur for digital material of daily use, shouldn’t scientific knowledge, which is now also digitally recorded, be in better shape, since its verification by others is the distinctive characteristic that separates it from dogmatisms, ideologies, and obsessions?”

Thinking about all these issues, Robert traces his fingers on the simple form of a flower carved on his chair’s arm. “The problems are rather too complicated to be solved using traditional ways,” he thinks. “More automation is needed. The existing technology should be further exploited; besides, the evolution of technology created all these issues. So why not use more advanced methods of knowledge representation and reasoning, as well as artificial intelligence, to help us tackle all these issues?” With these thoughts, he rises from the chair. It is already late.

This chapter has been co-authored with Yannis Kargakis.
© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_18

18.2 Technical Background

The examples of the previous chapters were based on individual files and folders. In this chapter, we abstract and generalize. In particular, we describe a general approach that can capture the previous cases. It can be considered as an agile approach based on the notion of task performability that is powered by knowledge-based reasoning services.1 We should inform the reader that this chapter is the most technically condensed chapter of the book and that it is also addressed to those who would like to deepen their knowledge on the subject.

The chapter is organized as follows: First (in Sect. 18.2.1), we relate the notion of pattern that we have used in this book with task performability. In Sect. 18.2.2, we discuss interoperability strategies and task performability. In Sect. 18.2.3, we discuss the basic tasks of migration and emulation from a dependency management point of view. In Sect. 18.2.4, we identify the basic requirements regarding automated reasoning for task performability. In Sect. 18.2.5, we show how to model tasks and their dependencies for enabling the intended reasoning. In Sect. 18.2.6, we discuss methodological issues for applying the described approach. In Sect. 18.2.7, we discuss how the scale of 5-star Linked Data (that was described in Sect. 8.2.3) is related to the “task-centric” approach that is described in this chapter. In this light, in Sect. 18.2.8, we discuss the case of blog preservation that we encountered in the previous chapter. In Sect. 18.2.9, we discuss the issue of information identity, since it is relevant to several tasks including those related to authenticity. Then, in Sect. 18.3 we draw the big picture, i.e., how the patterns presented in this book fit together, we describe the FAIR Data principles (in Sect. 18.3.1), and, finally (in Sect. 18.3.2), we discuss in brief systems for digital preservation in organizations. Figure 18.1 illustrates some of the aforementioned concepts and how they are related.
In short, each pattern is related to the execution or enactment of a task, where the latter is usually decomposable to subtasks, each of them using (or requiring) various resources that we call modules. There are various levels of interoperability; each one of them is related to one or more tasks. Much of the presented work has been done in the context of past EU projects (CASPAR, SCIDIP-ES, APARSEN). Pointers to the related publications, deliverables, and systems are given at the end of the chapter.

1 It is based mainly on the work of Tzitzikas et al. (2015).


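[Diagram with the nodes Pattern, Interoperability, Task Execution or Enactment, Task, and Module. Pattern is relatedTo Task Execution or Enactment, which is the execution of a Task; Interoperability is relatedTo Task; a Task hasSubTask other Tasks (*) and uses Modules (*).]

Fig. 18.1 Interoperability, patterns, tasks, subtasks, dependencies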

18.2.1 Patterns and Task Performability

The ultimate objective of digital preservation is to preserve the ability of using digital objects of today in the long term. This includes the ability to use them on a different platform or system. For this reason, digital preservation has been termed “interoperability with the future.” The crux of the interoperability problem is that digital objects and services have various dependencies (syntactic, semantic, etc.) and we cannot achieve interoperability when the concerned parties are not aware of the dependencies of the exchanged artifacts. A bit deeper, each interoperability objective can be conceived as a kind of demand for the performability of a particular task (or tasks). As tasks we consider actions that can be applied on a digital object (e.g., render, edit, run, compile), each having its own dependencies. From this perspective, it is evident that digital preservation is intrinsically a dependency management problem.
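The view of preservation as dependency management can be made concrete with a minimal sketch. It is a simplified illustration and not the modeling approach detailed later in the chapter: a task is treated as performable when every module it depends on, directly or transitively, is available. The task name, the module names, and the dependency table are hypothetical.

```python
def required_modules(task, dependencies):
    """Return every module that the task depends on, directly or
    transitively. `dependencies` maps a task or module to the modules
    it needs; entries without dependencies may be omitted."""
    required = set()
    stack = list(dependencies.get(task, ()))
    while stack:
        module = stack.pop()
        if module not in required:        # also guards against cycles
            required.add(module)
            stack.extend(dependencies.get(module, ()))
    return required


def missing_modules(task, dependencies, available):
    """Modules that are needed for the task but are not available."""
    return required_modules(task, dependencies) - set(available)


def is_performable(task, dependencies, available):
    """A task is performable when none of its dependencies is missing."""
    return not missing_modules(task, dependencies, available)


if __name__ == "__main__":
    # Hypothetical dependency table for rendering an HTML file.
    dependencies = {
        "render Poem.html": ["HTML renderer", "character encoding tables"],
        "HTML renderer": ["operating system"],
    }
    available = {"HTML renderer", "character encoding tables"}
    print(is_performable("render Poem.html", dependencies, available))
    print(missing_modules("render Poem.html", dependencies, available))
```

Reporting the missing modules, rather than only a yes or no answer, is what makes such a check useful in practice: it tells the curator exactly what has to be acquired or preserved for the task to remain performable.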

18.2.2 Interoperability Strategies

In general, we could identify two main strategies for interoperability:
• Strategy 1: Reliance on standards. One approach for tackling the interoperability problem is standardization. This means that one strategy for achieving a particular interoperability objective is to develop and adopt a standard appropriate for that objective (standardization was discussed briefly in Sect. 1.3 and we have encountered various standards in the previous chapters, also the lack of standards as in the case of blog preservation discussed in Sect. 17.2.10).
• Strategy 2: Toward a more agile approach. Alternatively, without standards, interoperability can be achieved only at a smaller scale and requires developing ad hoc solutions, something that is laborious and expensive. An emerging question is: Can we come up with processes that can aid solving the interoperability problem without relying on several and possibly discrepant standards, and without the effort of ad hoc solutions? What kind of models, processes, and services could support that? One way to approach this question is elaborated below.


We shall use a short running example to describe Strategy 2.

Situation: Consider a user who would like to run a BASIC program (written in 1986) on his Android smartphone (which runs a 2018 version of Android OS).

Approach: We do not necessarily need dedicated standards for accomplishing the above scenario. A series of conversions and emulations could make the execution of the 1986 software on a 2018 platform feasible. However, the process of checking whether this is feasible or not could be too complex for a human. This is where advanced modeling and automatic reasoning services could contribute.

Details: Suppose the software is written in the BASIC programming language and its source code is stored in a file named game.BAS. Some of the consequent questions are: (a) what can we do (to achieve this objective), (b) what should we (as a community) do, (c) do we have to develop a BASIC compiler for Android OS, (d) do we have to standardize programming languages, (e) do we have to standardize operating systems and virtual machines? Below, we investigate whether it is already possible to run it on Android by “combining” existing software, specifically by applying a series of transformations and emulations. To advance this example, suppose that we have at our disposal the following (as shown in Fig. 18.2(b)): a converter from BASIC source code to C++ source code (say b2c++), a C++ compiler (gcc) for Windows OS, an emulator of Windows OS executable over Android OS (say W4A), a smartphone running Android OS, and the BASIC file (game.BAS). It seems that we could run game.BAS on a mobile phone in three steps: (1) converting the BASIC code to C++ code, (2) compiling the C++ code to produce executable code, and (3) running the executable yielded by the compilation over the emulator. Indeed, the series of transformations/emulations shown in Fig. 18.2(c) could achieve our objective. As mentioned earlier, one might argue that this is very complex for humans.
Indeed this is true, and this is why it is beneficial that such reasoning be performed by computers, not humans. The work that is described below shows how we can model our information in a way that enables this kind of automated reasoning. Although the above scenario concerns software, the proposed direction and approach are not confined to software. Various interoperability objectives that concern documents and datasets can also be captured. For instance, below we characterize the patterns described in this book from a dependency point of view. Specifically, Table 18.1 provides the list of the patterns that have been described in this book so far. Each row corresponds to one pattern; the first column provides its identifier, the second contains the problem’s name, the next two columns show the corresponding artifact (in most cases a digital file) and the chapter number. The last column of the table defines the desired task in each pattern. Each task is spelled out informally and its dependencies are not shown in the table. More detailed examples regarding how the dependencies are modeled and managed are given in the subsequent sections. This perspective applies not only to the examples in this book but to any interoperability objective or challenge (like those described in APARSEN D25.1 Interoperability Objectives and Approaches), meaning that all these objectives and challenges can be construed as a kind of demand for the performability of a particular task.

Fig. 18.2 Running example. (a) The situation, (b) the available modules, (c) a series of conversions/emulations to achieve our objective
[Figure: (a) game.BAS (code in BASIC programming language) and an Android smart phone, with the way to connect them unknown; (b) game.BAS, a converter from BASIC to C++, a C++ compiler for WinOS, an emulator of WinOS executable over Android OS, and the Android smart phone; (c) step 1: conversion, step 2: compilation, step 3: emulation]
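The search for a chain of conversions and emulations in the running example of Fig. 18.2 can be sketched as a path search over the available modules. This is only an illustrative sketch, not the approach of the cited work: the tool names (b2c++, gcc, W4A) come from the example, while the state labels and the function name find_chain are assumptions introduced here.

```python
from collections import deque

# Each available module turns the artifact from one state into another.
# Tool names follow the running example; the state labels are hypothetical.
MODULES = [
    ("b2c++", "BASIC source", "C++ source"),           # step 1: conversion
    ("gcc", "C++ source", "Windows executable"),       # step 2: compilation
    ("W4A", "Windows executable", "runs on Android"),  # step 3: emulation
]

def find_chain(start, goal, modules):
    """Breadth-first search for a shortest chain of modules from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, chain = queue.popleft()
        if state == goal:
            return chain
        for tool, src, dst in modules:
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [tool]))
    return None  # no combination of the available modules reaches the goal
```

Calling find_chain("BASIC source", "runs on Android", MODULES) yields the three tools in the order of the three steps; if one module is removed from the list, the function returns None, which is the kind of answer a human would find tedious to work out by hand for larger collections of modules.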

Table 18.1 Index of the patterns

Id | Problem name | File/Artifact | Chap. | Desired task
P1 | Storage media: Durability and access | USB stick | 4 | To read the bits of a storage medium
P2 | Metadata for digital files and file systems | USB stick file system | 5 | To read an entire file system
P3 | Text and symbol encoding | Poem.html | 6 | To read and render meaningful symbols from bit strings
P4 | Provenance and context of digital photographs | MyPlace.png | 7 | To answer, for a digital object, the following questions: where (location), when (date), who (actor), why (goal)?
P5 | Interpretation of data values | Todo.csv | 8 | To read meaningful attribute–value pairs
P6 | Executables: safety, dependencies | destroyAll.ext | 9 | Is it safe to run? Give me its runtime dependencies
P7 | Software decompiling | MyMusic.class | 10 | To get the source code
P8 | External behavioral dependencies | yyy.java | 11 | To get information about the assumed ecosystem and the expected behavior
P9 | Web application execution | myFriendsBook.war | 12 | To trace and resolve deployment and runtime dependencies
P10 | Understanding and running software written in an obsolete programming language | Roulette.BAS | 13 | To run it. To get and resolve the required dependencies
P11 | Reproducibility of scientific result | MyExperiment | 14 | To answer provenance questions, to redo, to compare with what is reported, to answer trust questions like: is it real, valid, authentic?
P12 | (Proprietary) Format recognition | MyContacts.con | 15 | To find specification and software about this format
P13 | Authenticity assessment | SecretMeeting.txt | 16 | To answer provenance questions (like the tasks of P4) plus trust-related questions: is it real, valid, authentic?
P14 | Preservation planning | The personal archive of Robert | 17 | To select what should be preserved, what are the required actions and what is their cost

18.2.3 Migration, Emulation, and Dependency Management

Migration is the process of converting a digital object that runs on one platform so that it can run on another (non-obsolete) platform. Its purpose is to preserve the integrity of digital objects and to retain the ability for clients to retrieve, display, and otherwise use them in the face of constantly changing technology. Emulation is generally described as imitating a certain computer platform or program on another platform or program. It requires the creation of emulators, where an emulator is hardware or software or both that duplicates (or emulates) the functions of a first computer system (the guest) in a separate second computer system (the host), so that the emulated behavior closely resembles the behavior of the real system. Popular examples of emulators include QEMU, Dioscuri, and bwFLA. There is currently a rising interest in emulators for the needs of digital preservation.

Another related concept is that of the Universal Virtual Computer (UVC). It is a special form of emulation where a hardware- and software-independent platform has been implemented, where files are migrated to UVC internal representation format, and where the whole platform can be easily emulated on newer computer systems. It is like an intermediate language for supporting emulation.

In brief, and from a dependency perspective, we could say that the migration process changes the dependencies (e.g., the original digital object depends on an old format, while the migrated digital object now depends on a newer format). Regarding emulation, we could say that the emulation process does not change the “native” dependencies of digital objects. An emulator essentially makes the behavior of an old module available (actually by emulating its behavior). It follows that the availability of an emulator can “satisfy” the dependencies of some digital objects, but we should note that the emulator itself has its own dependencies that have to be preserved to ensure its performability. The same also holds for converters.
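The contrast between migration and emulation from a dependency perspective can be illustrated with a minimal sketch. All module names below (such as "old-format viewer") are hypothetical placeholders, and representing dependencies as plain sets is a simplifying assumption made only for this illustration.

```python
def can_use(obj_deps, available):
    """An object is usable when every module it depends on is available."""
    return obj_deps <= available

# Hypothetical example data; the module names are illustrative only.
available = {"new-format viewer", "host OS"}
original_deps = {"old-format viewer"}

# Migration: the object itself is converted, so its dependencies change.
migrated_deps = {"new-format viewer"}

# Emulation: the object is untouched; the emulator makes the old module
# available again, but only if the emulator's own dependencies are met.
emulator = {"provides": {"old-format viewer"}, "needs": {"host OS"}}

def with_emulator(available, emulator):
    """Return the modules available once the emulator is taken into account."""
    if emulator["needs"] <= available:
        return available | emulator["provides"]
    return available
```

Here can_use(original_deps, available) is false, while both can_use(migrated_deps, available) and can_use(original_deps, with_emulator(available, emulator)) are true. If "host OS" is missing, the emulator cannot help, which mirrors the remark that emulators have dependencies of their own.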

18.2.4 Requirements on Reasoning Services

For realizing an agile knowledge-based approach for interoperability, and, thus, for digital preservation, we can identify the following key requirements:

Task Performability Checking: To perform a task we have to perform other subtasks and fulfill associated requirements for carrying out these subtasks. Therefore, we need to be able to decide whether a task can be performed by examining all the necessary subtasks. For example, we might want to ensure that a file is runnable, editable, or compilable. This should also exploit the possibilities offered by the availability of converters. For example, the availability of a converter from Pascal to C++, a compiler of C++ over Windows OS, and an emulator of Windows OS over Android OS should allow inferring that a particular Pascal file is runnable over Android OS.

Consequences of a Hypothetical Loss: The loss or removal of a software module could also affect the performability of other tasks that depend on it and, thus, break a chain of task-based dependencies. Therefore, we need to be able to identify which tasks are affected by such removals.

Identification of Missing Resources to Perform a Task: When a task cannot be carried out, it is desirable to be able to identify the resources that are missing. For example, suppose that a user, say Robert, wants to compile the file HelloWorld.cc, but his system cannot perform this task because there is no C++ compiler. Robert should then be informed that he should install a C++ compiler to perform this task.

Support of Task Hierarchies: For example, if we can edit a text file, then certainly we can read it. It is therefore desirable to be able to define task–type hierarchies for gaining flexibility, supporting various levels of granularity, and reducing the number of rules that have to be defined.


Properties of Dependencies: A dependency type may have its own properties. For instance, some dependencies are transitive, some are not. Therefore, we should be able to define the properties of each kind of dependency.

Within this context, we do not focus on modeling, logging, or reasoning over composite tasks in general. We focus on the requirements for ensuring the performability of simple (even atomic) tasks, since this is more aligned with the objectives of long-term digital preservation. Nor do we focus on modeling or logging the particular workflows or derivation chains of the digital artifacts, e.g., using provenance models like OPM or CRMdig. We focus only on the dependencies for carrying out the desired tasks. Obviously, this method is less space-consuming, e.g., in our running example we do not have to record the particular compiler that was used for the derivation of an executable (nor its compilation time, nor who performed the compilation); we just care about the compiler one needs to have for future use. However, if a detailed model of the process is available, then the dependency model can be considered as a simpler and more focused view of that model.
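The requirements above can be made concrete with a small sketch that offers one function per service. It is an illustration under stated assumptions rather than the implementation of any existing system: the dictionaries TASK_NEEDS, MODULE_NEEDS, and TASK_IMPLIES, their contents, and all function names are hypothetical, and module-to-module dependencies are assumed to be transitive.

```python
# Hypothetical knowledge base: (task, object) -> modules the task needs.
TASK_NEEDS = {
    ("compile", "HelloWorld.cc"): {"C++ compiler"},
    ("run", "game.exe"): {"Windows OS"},
}
MODULE_NEEDS = {"C++ compiler": {"Windows OS"}}  # dependencies of modules themselves
TASK_IMPLIES = {"edit": {"read"}}                # task hierarchy: edit implies read

def closure(modules):
    """All modules needed, following module dependencies transitively."""
    needed, stack = set(), list(modules)
    while stack:
        m = stack.pop()
        if m not in needed:
            needed.add(m)
            stack.extend(MODULE_NEEDS.get(m, ()))
    return needed

def missing(task, profile):
    """Identification of missing resources: what the profile lacks for the task."""
    return closure(TASK_NEEDS[task]) - set(profile)

def performable(task, profile):
    """Task performability checking."""
    return not missing(task, profile)

def affected_by_loss(module, profile):
    """Consequences of a hypothetical loss: tasks that would stop being performable."""
    reduced = set(profile) - {module}
    return {t for t in TASK_NEEDS
            if performable(t, profile) and not performable(t, reduced)}

def implied_tasks(task_type):
    """Task hierarchies: task types that follow from a given one (edit -> read)."""
    result, stack = set(), [task_type]
    while stack:
        t = stack.pop()
        if t not in result:
            result.add(t)
            stack.extend(TASK_IMPLIES.get(t, ()))
    return result
```

With a profile containing only "Windows OS", missing(("compile", "HelloWorld.cc"), profile) reports the C++ compiler, which corresponds to the message that Robert should receive. A non-transitive dependency type would need a variant of closure that does not follow MODULE_NEEDS.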

18.2.5 Modeling Tasks and Their Dependencies

To assist understanding, Fig. 18.3 depicts the basic notions in the form of a rather informal concept map. In brief, for achieving an interoperability objective, we have to perform (execute or enact) one or more tasks over an object (module). In turn, to achieve performing a task over a module, we need one or more other modules. Each module has a module type, and module types can be hierarchically organized. Now conversion and emulation are special kinds of tasks, each having “source” and “destination” module types.

Fig. 18.3 Modeling tasks and their dependencies (informal map)

Fig. 18.4 Use case diagram of Epimenides
[Figure: actors End User and Curator; use cases Upload Digital Object, Analyze Uploaded Digital Objects, Select Task, Mark Resolved Dependencies, Define Profile, View/Edit/Export Profile, Define Task & Dependency Rules, Define Converter/Emulator, connected by include and extend relations and grouped as “For Plain Users” and “For Archivists”]

18.2.5.1 The Research Prototype Epimenides

Epimenides (http://www.ics.forth.gr/isl/epimenides) is a research prototype for proving the technical feasibility of the approach [presented in Tzitzikas et al. (2015)]. An overview of the supported functionality is given by the use case diagram in Fig. 18.4. The application can be used by several users, who can build and maintain their own profiles. Roughly, the profile of a user contains the list of modules that the user has. To be flexible, a gradual method for the definition of the profiles is supported. The knowledge base (KB) of this system currently contains 2225 RDF triples.

The main scenario is described here. After logging in, the user can upload a digital object (a single atomic file or a compressed bundle of files) and select the task for checking its performability. The system then checks the dependencies and identifies the tasks that can be executed, as well as the missing resources for performing certain tasks. The curator can define new tasks for the system. After a file is uploaded, its type is identified by exploiting its file extension (alternatively, other tools that analyze the contents of files like JHove or JMimeMagic can be exploited). The KB contains the dependencies for some widely used types; therefore, the appropriate task-based dependencies are shown to the user. The user can then add those that they already have, and this is actually the method for defining their profile gradually. In this way, they do not have to define their profile in one go. The system stores the modules of each user (those modules marked as “I have them”) in the RDF storage. The profiles are stored using different graph spaces and a user can export or import a profile. Moreover, a user can maintain more than one profile.

The user interface contains a menu divided into three sections, as shown in Fig. 18.4 (right). The first contains the main option of the application: “Upload Digital Object”. The “MANAGE PROFILE” section contains options available to any (ordinary) user. The user can also add modules to or delete modules from their profile, as well as export their profile or import the profile of a different user. The “MANAGE SYSTEM” section contains options for curators. Such a user can define tasks, emulators, and converters. To properly add a task, an emulator, or a converter, one has to provide extra information from which the application will produce the required rules. The user can add an emulator to their profile only if it has been properly defined by a curator (and, consequently, if the application has produced the required rules).

Figure 18.5 summarizes the interaction between a user and Epimenides for checking the runnability of game.pas (the file of our running example). To determine the dependencies that are missing, and which are required to perform the selected task, Epimenides uses the dependency rules that are stored in the KB. Figure 18.6 shows an example of the procedure described above. This example corresponds to the case where the user cannot perform the runnability task for the uploaded file f.exe. For the interested reader, a short description of the knowledge representation and reasoning approach that is followed by Epimenides follows (details and examples are available in Tzitzikas et al. 2015).
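The interaction described above (identify the type from the file extension, look up the task dependencies, compare them with the user's profile, and let the profile grow gradually) can be sketched as follows. This is not the code of Epimenides, which stores its knowledge as RDF triples and rules; the dictionary KB, its entries, and the function names are assumptions made for illustration.

```python
import os

# Hypothetical excerpt of a knowledge base: file type -> task -> required modules.
KB = {
    ".pas": {"run": {"Pascal-to-C++ converter", "C++ compiler", "Windows OS"}},
    ".exe": {"run": {"Windows OS"}},
}

def file_type(filename):
    """Identify the type from the file extension (a stand-in for content analysis)."""
    return os.path.splitext(filename)[1].lower()

def check(filename, task, profile):
    """Return (performable, missing modules) for a task over an uploaded file."""
    required = KB.get(file_type(filename), {}).get(task)
    if required is None:
        raise KeyError(f"no dependency rules for {task!r} on {filename!r}")
    missing = required - profile
    return (not missing, missing)

profile = set()                                  # the user starts with an empty profile
ok, missing = check("game.pas", "run", profile)  # every required module is reported
profile |= {"C++ compiler", "Windows OS"}        # the user marks what they already have
ok2, missing2 = check("game.pas", "run", profile)
```

After the second call only the converter remains in missing2, which is how a profile can be built step by step instead of being declared in one go.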

Fig. 18.5 Using Epimenides for game.pas


Fig. 18.6 The gradual expansion in Epimenides

The KB can be seen as a set of facts (e.g., database tuples) and (Datalog) rules. For example, the following four lines constitute a Datalog program, where the first two lines represent two facts, while the last two lines contain two rules appropriate for defining the ancestor relation (which is a transitive relation):

    ParentOf(Jim, Mary).
    ParentOf(Mary, Nick).
    AncestorOf(X,Y) :- ParentOf(X,Y).
    AncestorOf(X,Y) :- ParentOf(X,Z), AncestorOf(Z,Y).

According to the semantics of this program, it holds that AncestorOf(Jim,Nick).
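To see why AncestorOf(Jim,Nick) follows from the ancestor program, one can evaluate the two rules bottom-up until no new tuples appear. The sketch below does this in Python with naive evaluation; it is one of several possible evaluation strategies, and the function name ancestors is an assumption introduced here.

```python
parent_of = {("Jim", "Mary"), ("Mary", "Nick")}  # the two facts

def ancestors(parent_of):
    """Naive bottom-up evaluation: apply both rules until nothing new is derived."""
    ancestor_of = set(parent_of)  # rule 1: AncestorOf(X,Y) :- ParentOf(X,Y)
    while True:
        # rule 2: AncestorOf(X,Y) :- ParentOf(X,Z), AncestorOf(Z,Y)
        new = {(x, y)
               for (x, z) in parent_of
               for (z2, y) in ancestor_of if z == z2}
        if new <= ancestor_of:
            return ancestor_of
        ancestor_of |= new
```

The first pass of the loop joins ParentOf(Jim, Mary) with AncestorOf(Mary, Nick) and adds the pair (Jim, Nick); the second pass derives nothing new, so the evaluation stops.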

Returning to the problem at hand, we can use a Datalog-based approach for the modeling of the following:

• Digital objects, type hierarchies, and profiles
• Task dependencies and task hierarchies
• Converters and emulators
• Important parameters
• Exceptions and special cases

In brief, for each real-world task, we define two intensional predicates: one (which is usually unary) to denote the performability of the task and another (which is usually binary) to denote the dependencies of the task (e.g., Read and Readable, Run and Runnable). To model a converter and a corresponding conversion, we have to introduce one unary predicate for modeling the converter (as it is done for the types of digital files) and one rule for each conversion that is possible with that converter (specifically one for each supported type-to-type conversion). To model an emulator (between a pair of systems), we introduce a unary predicate for modeling the emulator and write one rule for the emulation. Regarding the latter, we can either write a rule that concerns the runnable predicate or write a rule for classifying the system that is equipped with the emulator to the type of the emulated system. In addition, and since converters and emulators are themselves modules, they have their own dependencies, and thus their performability and dependencies (actually their runnability) should be modeled too (as in ordinary tasks). Finally, special cases (e.g., exceptions, modeling of parameters) can be captured. Query answering and methods of logical inference are exploited for enabling the required inference services (performability, consequences of a hypothetical loss, etc.) that were described earlier. For instance, Fig. 18.7 shows how Runnable(game.pas, smartPhone) is derived based on the facts that model this scenario and the rules related to conversion and emulation (more details are given in Tzitzikas et al. 2015).

Fig. 18.7 Proof tree related to the current scenario
[Figure: proof tree with root Runnable(game.pas, smartPhone).
Atoms shown: ConverterPascal2C++(p2c++), AndroidOS(smartPhone), EmulatorWinAndroid(W4A), WinOS(mycomputer), PascalFile(game.pas), C++Compiler(gcc), Runnable(p2c++, mycomputer), Runnable(W4A, smartPhone), Run(p2c++), C++File(game.pas), WinExecutable(game.pas), WinOS(smartPhone).
Rules shown:
    Run(X) :- Runnable(X,Y)
    C++File(X) :- PascalFile(X), ConverterPascal2C++(Y), Run(Y)
    WinExecutable(X) :- C++File(X), C++Compiler(Y)
    WinOS(X) :- AndroidOS(X), EmulatorWinAndroid(Y), Runnable(Y,X)
    Runnable(X,Y) :- WinExecutable(X), WinOS(Y)]
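The derivation of Runnable(game.pas, smartPhone) in Fig. 18.7 can be reproduced with a very small forward-chaining evaluator. The sketch below is an assumption-laden illustration: the rules are those shown in the figure, the choice of which atoms are given as facts is our reading of the figure, and the encoding (tuples for atoms, strings starting with "?" for variables) and all function names are introduced here only for the example.

```python
# Atoms are tuples (predicate, arg, ...); treating these atoms as the given facts
# is an assumption based on the leaves of the proof tree.
FACTS = {
    ("PascalFile", "game.pas"),
    ("ConverterPascal2C++", "p2c++"),
    ("C++Compiler", "gcc"),
    ("WinOS", "mycomputer"),
    ("Runnable", "p2c++", "mycomputer"),
    ("AndroidOS", "smartPhone"),
    ("EmulatorWinAndroid", "W4A"),
    ("Runnable", "W4A", "smartPhone"),
}

# (head, body) pairs; strings starting with "?" are variables.
RULES = [
    (("Run", "?X"), [("Runnable", "?X", "?Y")]),
    (("C++File", "?X"),
     [("PascalFile", "?X"), ("ConverterPascal2C++", "?Y"), ("Run", "?Y")]),
    (("WinExecutable", "?X"), [("C++File", "?X"), ("C++Compiler", "?Y")]),
    (("WinOS", "?X"),
     [("AndroidOS", "?X"), ("EmulatorWinAndroid", "?Y"), ("Runnable", "?Y", "?X")]),
    (("Runnable", "?X", "?Y"), [("WinExecutable", "?X"), ("WinOS", "?Y")]),
]

def match(pattern, fact, env):
    """Extend env so that pattern equals fact, or return None if impossible."""
    if len(pattern) != len(fact) or pattern[0] != fact[0]:
        return None
    env = dict(env)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):
            if env.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return env

def solve(body, facts, env):
    """Yield every variable binding that satisfies all atoms of the body."""
    if not body:
        yield env
        return
    for fact in facts:
        extended = match(body[0], fact, env)
        if extended is not None:
            yield from solve(body[1:], facts, extended)

def infer(facts, rules):
    """Naive forward chaining until a fixpoint is reached."""
    facts = set(facts)
    while True:
        derived = {tuple(env.get(t, t) for t in head)
                   for head, body in rules
                   for env in solve(body, facts, {})}
        if derived <= facts:
            return facts
        facts |= derived
```

Checking whether ("Runnable", "game.pas", "smartPhone") is in infer(FACTS, RULES) answers the performability question, and running infer on the facts without ("EmulatorWinAndroid", "W4A") shows the consequence of a hypothetical loss of the emulator: the atom is no longer derivable.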

18.2.6 General Methodology for Applying the Dependency Management Approach

Here we describe in brief the methodology that one could follow for applying the approach described previously. It is an iterative process comprising six main steps (Fig. 18.8).


Step 1) Identify the desired tasks and objectives. This step strongly depends on the nature of the digital objects and the tasks that we want to perform on them. For instance, if we suppose our domain is software, we can identify the following tasks: Edit, Compile, and Run.

Step 2) Model the identified tasks and their dependency types. If tasks can be hierarchically organized, then this should be done (more in Section 18.2.6.1).

Step 3) Specialize the rule-based modeling according to the results of the previous step.

Step 4) Capture the dependencies of the digital objects of the archive. This can be done manually, automatically, or semiautomatically. Tools like PreScan (described in Section 5.2.6) can facilitate this task. In addition, this can be done at various levels of granularity: object-level (e.g., for a particular object), type-level (e.g., for all files of type html), and collection-level (e.g., for a collection of images).

Step 5) Customize, use, and exploit the dependency services according to the needs. For instance, task performability services can be articulated with monitoring and notification services.

Step 6) Evaluate the services in real tasks and accordingly curate the repository (return to Step 1).

Fig. 18.8 The main steps in the methodology

18.2.6.1 Layering Tasks

We should stress that the modeling approach presented allows modeling and organizing tasks hierarchically. This is quite natural, and we have seen that within a community and in relevant literature, a kind of layering is often provided. This increases the flexibility of the process and reduces possible redundancies. For instance, the Warwick Workshop, Digital Curation and Preservation: “Defining the research agenda for the next decade,” held in November 2005, noted that virtualization is an underlying theme, with a layering model illustrated as follows (Fig. 18.9):


Fig. 18.9 A layering model for virtualization

The common research issues that were identified at that point were as follows:

Automation and virtualization

• Develop language to describe data policy demands and processes together with associated support systems
• Develop collection-oriented description and transfer techniques
• Develop data description tools and associated generic migration applications to facilitate automation
• Develop standardized intermediate forms with sets of coder/decoder pairs to and from specific common formats
• Develop code generation tools for automatically creating software for format migration
• Develop techniques to allow data virtualization of common science objects with at least some discipline-specific extensions
• Formalize and virtualize management and policy specifications
• Further virtualize knowledge, including development of interoperable and maintainable ontologies
• Develop automatic processes for metadata extraction

The specific research topics that were identified at that time included those listed in Table 18.2. Returning to our task-based view of the problem, Table 18.3 lists some basic tasks. In some cases, the further we go down the list, the more complex the tasks become, i.e., some of these tasks rely on the ability to perform other tasks. The approach presented by Tzitzikas et al. (2015) is capable of modeling these tasks and their dependencies, as well as their hierarchical relationships.

Table 18.2 Research topics (Warwick workshop)
Categories (rows of the table): Virtualization, Automation, Support, Hardware
Topics, in the order in which they appear in the table:
• Continuing work on ways of describing information all the way from the bits upwards, in standardized ways: “virtualization.” Work is needed on each of the identified layers
• Achieving knowledge virtualization involving ontologies and other Semantic Web developments are required to enable the characterization of the applicability of a set of relationships across a set of semantic terms
• Developing of data format description languages to characterize the structures present within a digital record, independently of the original creation application
• Recording significant progress in dealing with dynamic data including databases and object behavior
• Building Representation Information tools, probably via layers of virtualization to allow appropriate normalization, including mature tools for dealing with dynamic data, including databases
• Expanding work on preservation strategies and support tools from emulation to virtualization
• Developing increasingly powerful virtualization tools and techniques with particular emphasis on knowledge technologies
• Elaborating protocols and information management exchange mechanisms, including synchronization techniques for indices, etc., to support federations
• Standardizing APIs for applications and data integration techniques
• Refining workflow systems and process definition and control
• Developing simple semantic descriptions of designated communities
• Standardizing registry/repositories for Representation Information to facilitate sharing
• Developing methodologies and services for archiving personal collections of digital materials
• Developing and standardizing interfaces to allow “pluggable” storage hardware systems
• Standardizing archive storage API, i.e., standardized storage virtualization
• Developing certification processes for storage systems
• Undertaking research to characterize types of read and transmission errors and the development of techniques that detect and potentially correct them

18.2.7 Case Study: 5-Star LOD and Task Performability

At the beginning of this chapter, it was mentioned that the approach introduced can capture the interoperability objectives not only of software but also of datasets. To make this clearer, here we focus on the case of Linked Data (described in Sect. 8.2.3). To this end, we will adopt the 5-Star Open Data (http://5stardata.info/) rating (Fig. 18.10). An ensuing question is how the above scale is related to task performability. For this reason, below we discuss which task is assumed in each rating. Table 18.4 describes this mapping. Specifically, we consider a similar example as in http://5stardata.info/, i.e., we consider “the temperature forecast for Galway, Ireland, for the next 3 days.” It is evident that a higher star rating implies performability of more tasks. It also implies that the assumed tasks can be performed with fewer, or more easily resolvable, dependencies.


Table 18.3 Some basic tasks

Task | Task description
Retrieve the bits | Ability to get a particular set of stored bits
Access | Ability to retrieve the bits starting from an identifier (e.g., a persistent identifier)
Render | Given a set of bits, ability to render them using the right symbol set (e.g., as it will be analyzed further in Sect. 18.2.9) for creating the intended sensory impression
Run | Ability to run a program in a particular computer platform
Search | Ability to find a digital object. Search ability can be refined based on the type of the object (doc, structured, composite) and its searchable part (contents, structure, metadata)
Link | Ability to place a digital object in context and exploit it. This may require combining data across different sources
Assert quality | Ability to answer questions of the sort: what is the value of this digital object; is it authentic
Get provenance | Ability to answer the corresponding questions (who, when, how)
Assert authenticity | Based on provenance and authentication
Reproduce | Ability to reproduce a scientific result. This is crucial for e-Science
Update | Ability to update and evolve a digital object
Upgrade/Convert/Transform | Ability to upgrade a digital object (e.g., to a new format) or convert its form

Fig. 18.10 5-star Open Data
[Figure: the five ratings, from one to five stars:
1 star: make your data available on the web (whatever format) under an open license
2 stars: make it available as structured data (e.g., Excel instead of image scan of a table)
3 stars: use non-proprietary formats (e.g., CSV instead of Excel)
4 stars: use URIs to denote things, so that people can point at your data
5 stars: link your data to other data to provide context (for making it clear to the rest of the world what your data mean)]

For a more concrete running example, suppose that we want to process the contents of a file containing the weather forecast (i.e., air temperature, surface wind, cloudiness, rainfall, and snowfall in millimeters) for Heraklion city for the following 3 days. The processing of the data will be done by a software agent (e.g., a specific application that produces weather statistics). We will describe the various tasks and their dependencies assuming that the original data fall into the five categories defined by 5-star Open Data (Fig. 18.10).

Table 18.4 5-star Open Data and assumed tasks

Rating | Assumed task or tasks
1 star | Assumed task: Get the forecast data in digital form and in a readable way. This is related to the tasks “Retrieve”, “Render”, and “Access” described earlier
2 stars | Assumed (extra) task: Get the structure of the data, i.e., one should be able to answer queries of the sort: how many rows and how many columns does this dataset contain, or what is the value of the cell [i,j]? Note that if the dataset was stored as a picture, to answer such queries, we would have to depend on the availability of image processing software (e.g., OCR). If, on the other hand, it is stored in “.xls”, then we depend on the availability of MS Excel. This task is also related to the task Search described earlier
3 stars | Assumed task: As in 2 stars, but here we get information about the structure without relying on commercial software like MSOffice, but on a more general and open format, e.g., XML or CSV, which requires having just a text editor
4 stars | The provision of URIs allows citing the dataset, as a whole, but also as a particular piece of that dataset. Assumed task: It is related to the task Link described earlier
5 stars | Here the “context links” allow answering questions of the sort: what, where, when, etc. This resembles provenance queries. Assumed task: It is related to the tasks Link and Get Provenance described earlier

Fig. 18.11 A jpeg image containing the weather forecast for Heraklion city

1 star: All the data are available as an image (in jpeg format and accessible through the Internet) as shown in Fig. 18.11. The first obvious task for the agent is the ability to retrieve the data. After retrieving the file (in other terms downloading), the agent should extract the data from the image. This task depends on the existence of an appropriate OCR program. The latter will extract the data; however, all these data are just characters and numbers for the agent. Therefore, an additional file (e.g., in pdf format) containing the specification of the data should be retrieved and accessed by the agent. To access the contents of the specification, an appropriate reader for this format should be provided. To sum up, the following tasks are required for manipulating the data:

Task | Task description
Retrieve(forecast.jpeg) | Download the data from the Internet
Retrieve(forecastSpecification.pdf) | Download the specification from the Internet
ExtractData(forecast.jpeg) | Extract the data using an appropriate OCR program. This means that additional tasks and dependencies are required (e.g., OCR program execution dependencies)
Read(forecastSpecification.pdf) | Read the specification using an appropriate application (e.g., PDF reader). This will add more tasks and dependencies

It is obvious that apart from the tasks described above, further tasks (and dependencies) are required for executing the above applications (OCR program, PDF viewer).

2 stars: The data are available through the Internet in xls format. After retrieving the data, the agent should read the contents using an appropriate application (Excel). Similar to the previous category, the data carry no semantics, so additional information should be provided. In this case, however, the specification can be added with the structured data, so that the downloaded file also contains the specification that is required. To sum up, the following tasks are required for manipulating the data:

Task | Task description
Retrieve(forecast.xls) | Download the data from the Internet
Read(forecast.xls) | Read the contents of the downloaded file using Excel

Apart from the above tasks, the runnability of Excel requires running some other tasks (and adding some more dependencies).

3 stars: The data are available through the Internet in XML format. The first task is to retrieve them and then read the contents. The dependencies for reading the contents are simpler in this case (compared to the 2-star case) because no particular commercial application is required for reading them. The agent can use any text editor to read them (and the specification as well), which simplifies the dependencies for this task. To sum up, the following tasks are required for manipulating the data:

Task | Task description
Retrieve(forecast.xml) | Download the data from the Internet
Read(forecast.xml) | Read the contents of the downloaded file using a text editor

Apart from the above tasks, the runnability of a text editor might add some more dependencies, which are nevertheless much simpler (compared to the runnability of Excel and OCR programs).


4 stars: This case is similar to the 3-star case. The only difference is that data are referenced using URIs, which makes no difference in terms of the tasks that are applicable over the data. However, citability is important for other applications and resources, e.g., for a researcher who, in a scientific paper, would like to cite this particular dataset, or for a service that collects weather forecasts for many places in the world.

5 stars: The data are available through the Internet in RDF/XML format, and URIs are used to refer to them. The difference compared to 4-star data is that semantic information about the data is not included in the same file (e.g., temperature is measured in degrees Celsius, rainfall in millimeters, etc.). This information is provided by linking the actual data with other data (or schemas) on the web. After retrieving the data, the agent reads the contents and manipulates the data. Although it is beneficial to link the data on the web, in terms of the tasks (and their dependencies) that are applicable on the 5-star data, this does not further simplify the dependencies. To sum up, the following tasks are required:

Task | Task description
Retrieve(forecast.rdf) | Download the data from the Internet
Read(forecast.rdf) | Read the contents of the downloaded file using an application (i.e., RDF model reader, Protégé, etc.) that also fetches the scope notes of the corresponding classes. This means that the user will be able to see the actual data, as well as their proper semantic descriptions

From the above example, it is obvious that the required tasks for 5-star data are simpler to carry out compared to the 1-star (or no-star) data, in the sense that the dependencies of these tasks tend to be less complex; the readability of the contents of a file in xml or rdf format (5-star data) requires the availability of a text editor, while the readability of the contents of a jpeg file (1-star data) requires their extraction using sophisticated programs (e.g., OCR programs) with many more dependencies.
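The task lists given for each rating can be collected into a small structure and checked against a profile of available applications. The sketch below is a simplification made for illustration: the task names and applications are taken from the text, but the dictionary REQUIRED, the function names, and the decision to ignore the dependencies of retrieval (such as network access) and of the applications themselves are assumptions. The 4-star case is omitted because the text states that its tasks do not differ from the 3-star case.

```python
# Tasks per rating and the application each task relies on, following the text.
# Retrieval is modeled with no dependencies, which is a deliberate simplification.
REQUIRED = {
    1: {"Retrieve(forecast.jpeg)": set(),
        "Retrieve(forecastSpecification.pdf)": set(),
        "ExtractData(forecast.jpeg)": {"OCR program"},
        "Read(forecastSpecification.pdf)": {"PDF reader"}},
    2: {"Retrieve(forecast.xls)": set(),
        "Read(forecast.xls)": {"Excel"}},
    3: {"Retrieve(forecast.xml)": set(),
        "Read(forecast.xml)": {"text editor"}},
    5: {"Retrieve(forecast.rdf)": set(),
        "Read(forecast.rdf)": {"RDF reader"}},
}

def missing_modules(rating, profile):
    """Applications needed by the tasks of a rating that are absent from the profile."""
    needed = set().union(*REQUIRED[rating].values())
    return needed - set(profile)

def usable_ratings(profile):
    """Ratings whose tasks can all be performed with the given profile."""
    return [r for r in sorted(REQUIRED) if not missing_modules(r, profile)]
```

For an agent that only has a text editor, usable_ratings({"text editor"}) returns [3], while missing_modules(1, {"text editor"}) lists the OCR program and the PDF reader, which reflects the observation that the 1-star data need more, and heavier, dependencies.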

18.2.8 Case: Blog Preservation

Recall the preservation issue related to Robert’s blog that was described in Chap. 17. All bloggers of the platform would like to preserve their blogs, just as Robert wanted to preserve his own. From the user’s side (which relates to a higher level of abstraction), each such blogger would like to perform the following tasks on their blog:

• HT1: Full blog backup at local space
• HT2: Full blog backup at a web archive
• HT3: Migrate to a new blog platform X1


18

The Meta-Pattern: Toward a Common Umbrella

The difference among HT1–HT3 is that in HT1 the storage should be local, in HT2 the storage and access should be provided by some external party (a web archive service), while in HT3 the storage (and operation) should be provided by a new blog-hosting platform. Also note that the main objective of the HT3 task is not only the preservation of the contents of the past blog, but also the smooth continuation of the operation of the blog (on a new platform) and, by inference, continuation of the performability of blog-related tasks, including the ability to add new posts, to browse new and past posts by date or tag, to manage readers’ comments, etc. All of the above can be modeled as tasks according to the modeling and reasoning approach described in the previous section, so as to enable answering questions such as: What tools are needed for taking a local backup? Is it possible to migrate to the new platform X1? Will such a migration preserve everything from the old blog? For instance, Table 18.5 shows tasks related to this scenario:

Table 18.5 Tasks related to blog preservation

High-level tasks (upper-left part)
HT1: Full blog backup at local space
HT2: Full blog backup at a web archive
HT3: Migrate to a new blog platform X1

Low-level tasks (upper-right part)
T1: Readability of blog posts (this includes text, links, dates)
T2: Readability of the images included in the posts, mainly of those that are stored in the platform that will be terminated
T3: Readability of HTML formatting for rendering posts properly and this includes the layout, the fonts, the colors, the background image
T4: Navigability of blog posts through dates
T5: Navigability of blog posts through tags
T6: Readability of readers’ comments (and this includes their textual part, as well as their metadata, which include date, commenter’s name, and profile image)
T7: Readability of the set of friends and subscribers
T8: Readability of blog statistics (blog usage analytics)
T9: Readability of other auxiliary services, such as links included in the main page of the blogs as well as buttons for sharing and subscriptions

Operational tasks (lower-left part)
TF1: Add new post
TF2: Browse posts by date
TF3: Browse posts by tag
TF4: Approve/disapprove (manage in general) readers’ comments
TF5: Update all (new and past) posts and comments
TF6: Continuation of the notification services to friends and subscribed users

Internal tasks (lower-right part)
I1: SaveAsHTML(URL source, URL target)
I2: ExportToXML(URL blogUrl, Format f)
I3: ImportFromXML(URL blogUrl, Format f)

18.2

Technical Background


• The upper-left part lists the high-level tasks, HT1–HT3.
• The upper-right part lists nine tasks, T1–T9, which correspond to aspects of the blog to be preserved, i.e., they actually analyze what should be preserved. The list is in descending order of importance:
  – Tasks {T1, T2, T3} are the most important for the preservation of posts.
  – Tasks {T4, T5} are related to the navigability of blog posts.
  – Tasks {T6, T7} are related to the social aspect of the blog.
  – Finally, tasks {T8, T9} are useful for historical reasons.
• The lower-left part lists six tasks, TF1–TF6, related to the operation of a blog. They are related to the migration case HT3, and therefore are not related or applicable to HT1 and HT2.
• The lower-right part lists three internal tasks, I1–I3, which will probably appear in the rules that define the performability of HT1 and HT3. Depending on the granularity of the analysis, these rules may refer to the low-level tasks T1–T9.

We should point out that blog preservation is a special case of preserving web-accessible resources. The approaches that can be followed for preserving a website that has been developed on top of a content management system (CMS), e.g., Joomla or Drupal, are almost identical. In the case of blogs, their contents (posts, themes, comments, etc.) correspond to websites, while the blog platform corresponds to the CMS. The preservation of websites that have not been developed using a CMS is similar.
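The task hierarchy of Table 18.5 can be encoded as simple rules. The sketch below is only an illustration: the concrete rules (which internal tasks each high-level task relies on, and which low-level aspects each internal task preserves) are assumptions invented for the example, since the text does not spell them out.

```python
# Hypothetical rules: a high-level task is performable if ALL internal tasks
# of at least one of its alternatives are available.
RULES = {
    "HT1": [{"I1"}, {"I2"}],  # local backup via SaveAsHTML or via ExportToXML
    "HT2": [{"I1"}],          # the web archive stores the saved HTML
    "HT3": [{"I2", "I3"}],    # migration needs both export and import
}

# Hypothetical coverage: low-level aspects (T1-T9) kept by each internal task.
PRESERVES = {
    "I1": {"T1", "T2", "T3"},
    "I2": {"T1", "T2", "T4", "T5", "T6"},
    "I3": {"T1", "T2", "T4", "T5", "T6"},
}

ALL_ASPECTS = {f"T{i}" for i in range(1, 10)}


def performable(task, available):
    """True if some alternative of `task` uses only available internal tasks."""
    return any(alternative <= available for alternative in RULES[task])


def preserved_aspects(task, available):
    """Aspects kept by `task`: within one alternative every step must keep
    the aspect; across applicable alternatives the results are combined."""
    kept = set()
    for alternative in RULES[task]:
        if alternative <= available:
            kept |= set.intersection(*(PRESERVES[step] for step in alternative))
    return kept


def lost_aspects(task, available):
    """Aspects of the blog that `task` would not preserve."""
    return ALL_ASPECTS - preserved_aspects(task, available)


tools = {"I2", "I3"}
can_migrate = performable("HT3", tools)
lost_by_migration = lost_aspects("HT3", tools)
```

With such an encoding, questions like "is it possible to migrate to platform X1?" and "what would the migration lose?" become lookups over the rules rather than manual analysis.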

18.2.9 On Information Identity

Let us now consider the task of equality testing, i.e., questions such as: is information object o equivalent to information object o′? In general, the question of the precise identification of information objects under change of carrier or migration is a difficult problem. There is a need to establish criteria for the unique identification of various kinds of information objects, independent of the kind of carrier or specific encoding. The problem of information identity is also fundamental for digital preservation. Curators and archivists would like to formally record the decisions of what has to be preserved over time and decide (or verify) whether a migration (transformation) preserves the intended information content. Information identity is also critical for reasoning about the authenticity of digital objects, as well as for reducing the cost of digital preservation. Below we summarize the approach described in Doerr and Tzitzikas (2012) about this issue.

We could call “information identity question” the problem of deciding whether two information carriers (e.g., papers, digital files) have identical content, i.e., whether they carry the same information object. Although the identity question is relatively simple to answer for material objects, the immaterial nature of information objects makes it more complex. By “immaterial” we mean that the very same (identical) object can be found on multiple material carriers, e.g., a paper, a computer disk, or a tattooed body part. But what is the substance of an information object? We could say that it is a set of features. Still, which of the practically unlimited number of features of a particular physical thing actually make it up? For digital objects, the answer seems to be trivial at first glance: there is a unique binary representation, which can be copied around without loss or alteration. But what if we create an Adobe PDF version from an MS Word document? The binary has radically changed, but in many of the practically relevant cases, such a change has not affected the intended information and our copyright on it. So, obviously the law has a concept of identity that is different from the concept of binary identity. And, finally, if we print it out, there are no bytes anymore, but we may still keep a copyright on it, but on what?

For example, suppose that we create a digital object o1 = article.doc in MS Word in order to render some information. Suppose that we print it on paper using a printer pr and let o2 be its printout (physical object). Then we scan o2 using a scanner sc and let o3 = article.gif be the resulting image. Then we apply an OCR (optical character recognition) tool, say ocr, on o3 and let o4 be the recognized text in ASCII format. We need to be able to identify whether o4 preserves the intended information object(s) originally encoded in o1. If, for instance, the intended information did not depend on fonts and page dimensions (plain language), it makes no sense to preserve more than the plain text from o1, o2, o4 in order to keep the information.

The basic thesis of Doerr and Tzitzikas (2012) is that a notion of identity of an information object that conforms to legal practice, and the objectives of Digital Preservation, must be based on an analysis of the intended “sensory impression” rather than the binary form or the material embodiment of an information object.
It is argued that the meaningful features are in most real-life cases well known to the author, and in relevant cases, such as scientific publishing, they are even formally prescribed and known to the publisher. The problem is rather to formalize and preserve this information. For this reason, an ontology is proposed providing the concepts and definitions necessary to demonstrate the feasibility of objectifying the sensory impression in order to assess the identity of information content. Figure 18.12 shows a small HTML page. We can see its binary contents, its text encoding, its sensory impression, and its representation in DOM (Document Object Model).4

The core elements of the ontology proposed by Doerr and Tzitzikas (2012) are shown in Fig. 18.13. An Information Carrier carries one or more Information Objects, and to specify what that means, we require that each Information Carrier be subject to one or more projections, i.e., processes whose output is one or more intended Sensory Impressions. In turn, a Sensory Impression is defined as a signal (single or multidimensional) that is typically analogue. Some may be received by technical sensors, some by human senses, and some by both. If the Information Object we are interested in is defined in terms of a finite, discrete arrangement of symbols, where each symbol belongs to a finite symbol set and is positioned by adequate arrangement rules, it is possible to extract from the Sensory Impression a Symbol Structure, which is the substance of the carried Information Object. This extraction process can be just reading a paper with our eyes in sunlight, but also a mechanical process such as OCR. If the extraction process has reproducible results under sufficient quality conditions, we regard that the carrier does indeed carry this information.

4 https://www.w3.org/DOM/

Fig. 18.12 An indicative HTML page in various formats (binary contents, HTML, sensory impression, DOM tree)

Fig. 18.13 The basic concepts and associations of the model
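The contrast between binary identity and identity of the carried symbol structure can be sketched in a few lines. This is a deliberately simplified illustration and not the ontology of Doerr and Tzitzikas (2012): the "projection" is reduced to decoding bytes, and the extraction rule (a sequence of whitespace-separated words, ignoring line breaks) is an assumption that only fits the case where the intended information is plain language. The sample texts are hypothetical.

```python
import hashlib


def binary_identity(carrier: bytes) -> str:
    """Fingerprint of the carrier's bits; changes with any re-encoding."""
    return hashlib.sha256(carrier).hexdigest()


def symbol_structure(carrier: bytes, encoding: str) -> tuple:
    """Decode the carrier and keep only the sequence of words,
    discarding layout such as line breaks and repeated spaces."""
    text = carrier.decode(encoding)
    return tuple(text.split())


# Two hypothetical carriers: the original text and a re-encoded, re-flowed
# version such as one obtained after a print, scan, and OCR round trip.
o1 = "Once upon a time\nthere was a USB stick.".encode("utf-8")
o4 = "Once upon a time there was a USB stick.".encode("utf-16")

same_bits = binary_identity(o1) == binary_identity(o4)
same_symbols = symbol_structure(o1, "utf-8") == symbol_structure(o4, "utf-16")
```

Here the two carriers differ at the binary level but yield the same symbol structure, so under the stated assumption they carry the same information object.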


The detailed ontology for objectifying the sensory impression is described by Doerr and Tzitzikas (2012). Their work does not aim to capture the behavior of digital objects, e.g., how a piece of software behaves or how a database is supposed to behave (e.g., its query capabilities). This exclusion of behavior also covers the “active” features in documents, such as hyperlinks (and JavaScript and CSS for web pages), as in the blog preservation case. We could mention that Shannon, in his theory (Shannon 1948), also assumes that the symbols in a communication session are fixed (and known a priori); this indicates the prominent importance of shared symbols for successful communication (in our case, successful and cost-effective preservation).

As regards the various preservation strategies, we could mention canonicalization, which aims at helping to assess whether the essential characteristics of a document have remained intact through a conversion from one format to another. Canonicalization relies on the creation of a representation of a type of digital object that conveys all its key aspects in a highly deterministic manner. Once created, this form could be used to algorithmically verify that a converted file has not lost any of its essential content. Canonicalization has been postulated as an aid to integrity testing of file migration, but it has not been implemented. According to Becker (2011), validating the actual content of objects before and after (or during) a preservation action is still one of the key challenges in digital preservation. The approach presented by Doerr and Tzitzikas (2012) can be considered as a concrete method to achieve canonicalization. The work presented by Cheney et al. (2001) also aims at understanding the principles of preservation and at modeling the information content present in physical or digital objects.
However, the notion of semantics is not analyzed, since it is modeled as a function from the objects to the contents space, but neither the domain nor the range of that function is analyzed. Essentially, their work mainly focuses on the dynamics of preservation and describes it from quite a high-level perspective. We should also mention the work of Tyan (2011), which presents an extensive review of the literature on “significant properties/characteristics.” It concludes that there is a lack of a formal, objective methodology to identify which characteristics within an information object are significant and therefore should be preserved.

Returning to the approach of Doerr and Tzitzikas (2012), we could identify two “information preservation” approaches: (a) model the information object (as proposed in that paper) and represent it in a language (e.g., in RDF/S) and (b) leave the original content as expressed in its (carrier) format and add (in its metadata) its “information format,” i.e., all the information that is sufficient for obtaining (a), that is, the intended information object. Note that EAST (described in Sect. 8.2.4) is a kind of what we call information format, i.e., a language for defining the exact (full) information format of data files.

Since Representation Information (RepInfo, or other extra information) may depend on other RepInfo (or other extra and/or external information), the approach described in Sect. 18.2.1 elaborates on deciding what to put in a package. In brief, what to put in a package depends on (1) the object, (2) the assumptions about the designated community, and (3) the tasks (over the object) that we want to perform in the future. Technically, that problem was reduced to dependency management and it was implemented using Semantic Web (RDF/S and rule) technologies. Note that this dependency-management perspective is orthogonal (or complementary) to the aspect elaborated by Doerr and Tzitzikas (2012), in the sense that an Information Object (or an Information Format) can be defined as dependencies of a digital object. Moreover, the notion of task (and task-based dependency) is quite general, enabling the projection (that is used in Doerr and Tzitzikas 2012) to be considered as one particular task. Finally, it should be clarified that the problem of deciding whether two information carriers have identical content, i.e., whether they carry the same information object, is orthogonal to the task of completing axiomatically provided assertions of equivalence; for example, Mountantonakis and Tzitzikas (2016) show how to complete the owl:sameAs relationships in the context of the entire Linked Data cloud.
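The packaging decision discussed above (object, designated community, future tasks) can be illustrated as a small dependency computation. The sketch below is an assumption-laden simplification: it considers a single task (rendering), the module names and the dependency graph are hypothetical, and plain Python sets stand in for the Semantic Web technologies mentioned in the text.

```python
# Hypothetical dependency graph for the task "render":
# module -> modules it directly depends on.
DEPENDS_ON = {
    "article.docx": {"docx-renderer"},
    "docx-renderer": {"office-suite"},
    "office-suite": {"operating-system"},
    "operating-system": set(),
}


def required_modules(obj, depends_on):
    """All modules reachable from `obj` (transitive closure of dependencies)."""
    required = set()
    stack = [obj]
    while stack:
        module = stack.pop()
        for dependency in depends_on.get(module, ()):
            if dependency not in required:
                required.add(dependency)
                stack.append(dependency)
    return required


def gap(obj, depends_on, community_knows):
    """Modules that must go into the package: everything the task needs
    minus what the designated community is assumed to have already."""
    return required_modules(obj, depends_on) - community_knows


package_for_experts = gap("article.docx", DEPENDS_ON,
                          {"office-suite", "operating-system"})
package_for_novices = gap("article.docx", DEPENDS_ON, {"operating-system"})
```

The same object thus yields different packages for different designated communities, which is the point made in the text about assumptions (2) and (3).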

18.3 The Big Picture

At the beginning of the book, we mentioned that we can analyze the problem of digital preservation according to: (a) the types of digital artifacts (e.g., documents, html pages, images, software, source code, data) and (b) the tasks that we want to curate (e.g., store, read, edit, run). Then, we introduced the notion of pattern, which is a frequently occurring problem, essentially corresponding to one or more type–task pairs. Patterns aim at describing, in a standard and concrete manner, the related activities that are necessary (or suggested) for preserving the performability of the corresponding type–task pair(s) and, thus, for avoiding, or mitigating, the corresponding digital-preservation-related risk. However, patterns are not independent, in the sense that they might depend on other patterns. We presented the relationships between patterns and we introduced a conceptual model for modeling tasks and their dependencies (the latter can be other tasks and/or digital artifacts). Finally, we have seen how we can model digital artifacts’ types and curation tasks using expressive and standard representation frameworks (like Semantic Web languages) and how we can then enable automatic reasoning services to facilitate task performability and gap identification. Figure 18.14 shows an overview of the path that we followed in this book.

Granularity and Evolution

It is worth making a couple of remarks about granularity and evolution. In the dependency management approach that we have just described, the notion of module is treated as an atom, i.e., as an undivided element. However, in many cases, a module can have an internal structure and this structure can be known and formally expressed. In such cases, we can refine the notion of gap, and instead of inferring “module x is missing,” the internal parts of x that are missing can be computed and provided. Examples of modules that fall in this category are models that formally express parts of community knowledge.
Note that community knowledge is increasingly coded in a structured way. For instance, classification schemes, taxonomies, and thesauri are expressed in SKOS,5 while ontologies are used to define the concepts of particular domains and their relationships. Methods and tools that extract and publish structured knowledge from text also exist; typical examples are: (a) DBpedia, which publishes structured knowledge extracted from Wikipedia; (b) YAGO2, which extracts knowledge from Wikipedia, WordNet, and Geonames; and (c) Freebase, which extracts data from sources such as Wikipedia, ChefMoz, NNDB, and MusicBrainz. Another type of module that falls in this category is software: RDF/S has been proposed as a data structure for software engineering, specifically for expressing software structure and dependencies. For example, there are tools that scan Java bytecode for method calls and create a description of the dependencies between classes and the package/archive encoded in RDF, while other tools transform Maven POM (Project Object Model) files into RDF. It follows that more refined gaps can be identified also for software if its structure has been expressed explicitly. In general, we can say that RDF/S is currently the lingua franca for expressing these models.

If we have explicit representations of these models, then comparison operators (also called diff or delta (Δ) operators) can be used for computing these more refined gaps. General-purpose differential functions for RDF/S knowledge bases can be used, such as those described by Zeginis et al. (2011) and those by Lantzaki et al. (2017), which exploit blank node name anonymity to further reduce the delta. According to this view, a gap comprises a set or sequence of change operations. The same machinery, i.e., comparison operators, can also be exploited for tackling the requirements stemming from evolution: as the world evolves, these models evolve too; consequently, there is a need for effective methods for identifying the changes, for enhancing the understanding of evolution, and for identifying the consequences of these changes on task performability. Finally, the same machinery can be used to test the quality of a migration, i.e., by computing the diff between the source and the target object. This also makes sense for complex objects, like the blog that was described. The diff operator could be used to reflect what is lost in a possible migration. It could also be used for quantifying the difference, e.g., the size of the diff with respect to the size of the migrated object. That would be useful when we want to evaluate various options in order to decide which one to select. Inevitably in some cases, the migration of metadata to newer versions of ontologies can result in uncertainty as regards the specificity of metadata [as discussed by Tzitzikas et al. (2012) and further analyzed in Tzitzikas et al. (2014)].

5 Simple Knowledge Organization System, http://www.w3.org/2004/02/skos/

Fig. 18.14 The big picture. Panels: The Digital Preservation Problem (types of digital artifacts × tasks on digital artifacts); Digital Preservation Patterns (P1–P14); Meta-Pattern (Modeling Tasks and their Dependencies, Interdependencies of Patterns); Provision of Advanced Preservation Services.
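The use of a diff operator for assessing a migration can be sketched with plain sets of triples. This is a naive set-difference delta under simplifying assumptions: it ignores blank nodes and RDF/S inference, which the operators cited above handle, and the triples describing the blog are hypothetical.

```python
def delta(source, target):
    """Change operations that turn `source` into `target` (sets of triples)."""
    return {
        "remove": source - target,  # triples lost by the migration
        "add": target - source,     # triples introduced by the migration
    }


def delta_ratio(source, target):
    """Size of the diff relative to the size of the migrated (target) object."""
    if not target:
        raise ValueError("target object is empty")
    changes = delta(source, target)
    return (len(changes["remove"]) + len(changes["add"])) / len(target)


# Hypothetical description of a blog before and after migration.
old_blog = {
    ("post1", "title", "Hello"),
    ("post1", "tag", "news"),
    ("blog", "visits", "1024"),  # usage statistics
}
new_blog = {
    ("post1", "title", "Hello"),
    ("post1", "tag", "news"),
}

changes = delta(old_blog, new_blog)
ratio = delta_ratio(old_blog, new_blog)
```

In this example the delta shows that only the usage statistics were lost, and the ratio gives one possible number for comparing alternative migration options.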

18.3.1 The FAIR Data Principles for Scientific Data

In the realm of scientific data, the community has already identified some basic requirements for their management, which, for the time being, are known by the term “FAIR data principles.” These principles are the result of the joint effort of different stakeholders, with representatives from academia, industry, funding agencies, and scholarly publishers, with the aim of identifying a concise and measurable set of principles that will govern scientific data and will ensure their reusability. They provide guidance for scientific data management and stewardship and are relevant to all stakeholders in the current digital ecosystem. FAIR data principles are built around four main pillars, which are used for constructing the abbreviated term FAIR: data should be Findable, Accessible, Interoperable, and Reusable. The guiding principles are described below.


• Findable
  – Data and metadata are assigned a globally unique and persistent identifier.
  – Data are described with rich metadata.
  – Metadata clearly and explicitly include the identifier of the data they describe.
  – Data and metadata are registered or indexed in searchable resources.
• Accessible
  – Data and metadata are retrievable by their identifier using a standardized communications protocol. Moreover, the protocol is open, free, and universally implementable and allows for an authentication and authorization procedure, if necessary.
  – Metadata are accessible, even when the data are no longer available.
• Interoperable
  – Data and metadata use a formal, accessible, shared, and broadly applicable language for knowledge representation.
  – Data and metadata use vocabularies that follow FAIR principles.
  – Data and metadata include qualified references to other data and metadata.
• Reusable
  – Data and metadata are richly described with a plurality of accurate and relevant attributes. To this end, data and metadata should be accompanied by a data usage license, they should contain provenance information, and they should meet domain-relevant community standards.

Making research data more FAIR will provide a range of benefits to researchers, research communities, research infrastructure facilities, and research organizations alike. The benefits will include: (a) gaining maximum potential from data assets, (b) increasing the visibility and citations of research, (c) improving the reproducibility and reliability of research, and (d) achieving maximum impact. It is not hard to see that FAIR principles can be attributed to tasks such as those that have been exhibited in this book, and they can be realized by considering the related patterns (and the corresponding technologies). The dependency reasoning services described in this chapter could aid their uninterrupted provisioning.
Since the search process was not discussed in the book (but only its prerequisite, i.e., accessibility), we could just mention that a popular and effective interaction paradigm for searching, which is capable of exploiting the available metadata (those mentioned above under the guiding principle “findable”), is faceted search (for more see the references at the end of this chapter).
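As a rough illustration of how faceted search exploits metadata, the following sketch restricts a set of metadata records by selected facet values and computes the counts shown next to the remaining choices. The records, identifiers, and facet names are hypothetical, and the code shows only the core idea rather than any particular faceted search system.

```python
from collections import Counter

# Hypothetical metadata records of three datasets.
RECORDS = [
    {"id": "dataset-a", "format": "CSV", "license": "CC-BY", "year": 2016},
    {"id": "dataset-b", "format": "RDF", "license": "CC-BY", "year": 2017},
    {"id": "dataset-c", "format": "CSV", "license": "CC0", "year": 2017},
]


def restrict(records, selections):
    """Keep the records whose metadata match every selected facet value."""
    return [
        record for record in records
        if all(record.get(facet) == value for facet, value in selections.items())
    ]


def facet_counts(records, facets):
    """For each facet, count how many records carry each value."""
    return {
        facet: Counter(record[facet] for record in records if facet in record)
        for facet in facets
    }


hits = restrict(RECORDS, {"format": "CSV"})
counts = facet_counts(hits, ["license", "year"])
```

After selecting the value CSV for the facet format, the user would see two hits and, for each remaining facet, the number of hits per value, which guides the next refinement step.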


18.3.2 Systems for Digital Preservation

Although the issue of digital preservation concerns each one of us, there are organizations that have the assured responsibility to preserve specific digital content, such as libraries, archives, research institutes, etc. For this purpose, these organizations operate repositories; we could call them systems for digital preservation. These systems deal with the storage and documentation of digital material for its long-term preservation, and any other periodic action that is required, such as changing the storage medium and storage formats, virtualization-related actions, and more. In this context, the OAIS reference model (ISO 14721:2012) provides a type of checklist as to what they should not forget to record. It comprises an information model (shown in Fig. 18.15), which essentially categorizes the extra information (metadata that are required), and a functional model (shown in Fig. 18.16) that contains high-level tasks (ingest, archive, disseminate). OAIS does not propose any specific conceptual modeling approach or implementation technique. The way that high-level functions should be realized depends on the objects and the tasks whose performability should be preserved. In essence, OAIS proposes an encapsulation preservation strategy and, thus, distinguishes submission (ingestion), archival, and dissemination information packages (SIPs, AIPs, and DIPs, respectively). But what the AIPs and DIPs should actually contain is determined by the dependencies of the tasks we want to be able to perform on these objects, and by what is assumed, now and in the future, to be known by the community to which this digital material is addressed. The analysis that was presented in the above sections offers a systematic and flexible way to see what needs to be placed in a package (if anything) and what is missing or is redundant.

Fig. 18.15 The OAIS information model

Fig. 18.16 The OAIS functional model

In any case, for the design and implementation of a system for digital preservation, we need to follow the usual process of analyzing and designing information systems, during which the objectives, the functionality, and the technological background will be specified. For instance, Moore et al. (2015) describe a policy-based data management system to apply and enforce preservation requirements, which uses the integrated Rule Oriented Data System (iRODS) data grid software (http://irods.org) as a platform to implement community-specific management policies, while Zeirau (2017) extends and refines OAIS-related notions for capturing real-world requirements (originating from Denmark) related to distribution.

Moreover, it is not hard to see that each digital artifact has its own life cycle depending on its type, context, and operations that are applied to it. Since a digital artifact quite often has to be changed for the needs of digital preservation (e.g., after a migration to a newer format, or after metadata enrichment), we could say that the life cycle of digital artifacts is affected by digital preservation-related actions. With regard to archiving, a package called an AIP (Archival Information Package) is sometimes created for a digital artifact in the OAIS, as mentioned earlier; it contains the artifact plus related material (various kinds of metadata). However, since everything changes, such packages (AIPs) need to change as well. The life cycle of such AIPs in BnF (National Library of France) is discussed by Caron et al. (2017). There are several free and commercial platforms that can be used to create a system for information preservation, including Fedora, DSpace, CKAN, iRODS, Archivematica, and others, which can be customized based on the needs and policies of the organization.
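The idea that an AIP bundles an artifact with extra information, and that the package itself changes over its life cycle, can be sketched as follows. This is a toy data structure under stated assumptions: the class, its field names, and the ingest, migrate, and is_intact functions are hypothetical, and they cover only a small part of what the OAIS information model categorizes (representation information, provenance, fixity).

```python
import hashlib
from dataclasses import dataclass, field


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass
class ArchivalPackage:
    content: bytes             # the digital artifact itself
    representation_info: dict  # what is needed to interpret the content
    provenance: list = field(default_factory=list)  # history of actions
    fixity: str = ""           # checksum recorded when the package was built


def ingest(content, representation_info):
    """Build an archival package from submitted content."""
    return ArchivalPackage(content, dict(representation_info),
                           ["ingested"], sha256(content))


def is_intact(aip):
    """Fixity check: does the content still match the recorded checksum?"""
    return sha256(aip.content) == aip.fixity


def migrate(aip, convert, new_representation_info, note):
    """Return a new package version: converted content, updated
    representation information and fixity, extended provenance."""
    new_content = convert(aip.content)
    return ArchivalPackage(new_content, dict(new_representation_info),
                           aip.provenance + [note], sha256(new_content))


aip_v1 = ingest(b"Hello, Daphne", {"format": "text/plain", "encoding": "ASCII"})
aip_v2 = migrate(
    aip_v1,
    lambda data: data.decode("ascii").encode("utf-16"),
    {"format": "text/plain", "encoding": "UTF-16"},
    "migrated ASCII -> UTF-16",
)
```

Each preservation action produces a new version of the package whose metadata record what was done, which is one simple way to picture the AIP life cycle mentioned above.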
There are various pages on the web that list tools for digital preservation.6 There are several reports related to preservation systems that are in place in organizations of various kinds, including libraries, research centers, universities, and national preservation services. For instance, Caron et al. (2017) discuss issues related to the life cycle of AIPs in the National Library of France. Tools for aiding the ingestion in the National Preservation Service in Finland are described by Lehtonen et al. (2017). The strategy of the Qatar National Library is sketched by Straube et al. (2016). The services that are offered by CERN for long-term preservation of high energy physics (HEP) data, with the Large Hadron Collider (LHC) as a key use case, are described in Berghaus et al. (2016). Digital preservation concerns are also encountered in universities, e.g., the case of the University of Melbourne is described by Weatherburn (2016), where emphasis is placed on the preservation of research data and outputs. However, we have to note that the issue of digital preservation now concerns every citizen, not only specific organizations. The question that arises is: To what extent can the digital material of one citizen or organization be preserved if it is sought after by only one citizen or organization?

6 For example, https://dpconline.org

18.4 Questions and Exercises

1. Search the Internet to see whether there are standards for archiving emails. 2. Search the Internet to see whether there are standards for the preservation of tweets (i.e., of posts on the twitter social networking service). 3. Search the Internet to see whether there are standards for exchanging and archiving scientific papers. 4. Search the Internet and find two programming languages that are no longer used. 5. Find whether there is any formal specification of the programming language BASIC offered by Amstrad 464 (1986). 6. Find whether there is any emulator for programs written in Amstrad 464 BASIC. 7. Search the Internet and find converters from BASIC to C++. 8. Search the Internet and find converters from C++ to Java. 9. Check whether you can run Android on your personal computer. 10. The “renderability” of a file with extension “.docx” depends on MS Office. If we convert such a file to a file with extension “.odt”, what are the dependencies of the new file? 11. Find a dependency type that is transitive. 12. Find the APARSEN report D25.1 (Interoperability Objectives and Approaches) and locate models related to provenance. 13. Use the system Epimenides and attempt the following: a. Create a login as a demo user, load the profile “Scenario User A” and apply the scenario of the running example that is described in Fig. 18.2. b. Navigate on the applied rules through the “Explore Dependencies” option and create a diagram of the conversions and the emulations that are required in order to run the file game.pas. How does your diagram relate to the one in Fig. 18.7? (Answer: It is a reverse diagram).


   c. Create a login as a demo user and load the profile “Demo User”. Then: (1) navigate the basic options of the system: “View Profile” and “Upload Digital Objects”; (2) try to explain why this user cannot render PDF files (Tip: use the “Load a demo zip” option); and (3) find what is required for running Java code.
14. Find tools for extracting tabular data from PDF documents.
15. Is the text font important for a poem? Is the text font important for a company logo?
16. Select some of your data that are public and rate them, i.e., how many stars (on the 5-star scale) do your data get?
17. Select one of the patterns in this book (e.g., one of P6–P13) and try to apply the first two steps of the methodology described in Sect. 18.2.6.
18. Find and pick two organizations in your country that have the responsibility to preserve digital material and find information about what systems they use and what platforms and tools have been used for building these systems.

18.5 Links and References

18.5.1 Readings

About Interoperability
• [APARSEN 2013] Alliance for Permanent Access to the Records of Science Network (APARSEN). “D25.2 Interoperability Strategies”, 2013. (urn:nbn:de:101-20140516189), https://doi.org/10.5281/zenodo.1256518

About Emulators
• Bellard, F. (2005). QEMU, a fast and portable dynamic translator. In USENIX annual technical conference, FREENIX Track (Vol. 41, p. 46).
• Van der Hoeven, J., Lohman, B., & Verdegem, R. (2008). Emulation for digital preservation in practice: The results. International Journal of Digital Curation, 2(2).

About UVC
• Lorie, R. A. (2001). Long term preservation of digital information. In Proceedings of the 1st ACM/IEEE-CS joint conference on digital libraries (pp. 346–352). ACM.

About Provenance
• Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., et al. (2011). The open provenance model core specification (v1.1). Future Generation Computer Systems, 27(6), pp. 743–756.


• Theodoridou, M., Tzitzikas, Y., Doerr, M., Marketakis, Y., & Melessanakis, V. (2010). Modeling and querying provenance by extending CIDOC CRM. Distributed and Parallel Databases, 27(2), pp. 169–210.
• Strubulis, C., Flouris, G., Tzitzikas, Y., & Doerr, M. (2014). A case study on propagating and updating provenance information using the CIDOC CRM. International Journal on Digital Libraries, 15(1), pp. 27–51.

About Automated Dependency Reasoning for Interoperability
• Tzitzikas, Y., Kargakis, Y., & Marketakis, Y. (2015). Assisting digital interoperability and preservation through advanced dependency reasoning. International Journal on Digital Libraries, 15(2–4), pp. 103–127.
• Vion-Dury, J.Y., Lagos, N., Kontopoulos, E., Riga, M., Mitzias, P., Meditskos, G., et al. (2015). Designing for inconsistency – the dependency-based PERICLES approach. In East European conference on advances in databases and information systems (pp. 458–467). Cham: Springer.

About Information Identity
• Doerr, M., & Tzitzikas, Y. (2012). Information carriers and identification of information objects: An ontological approach. arXiv preprint arXiv:1201.0385.

Works Referred to in Information Identity
• Shannon, C. E., Weaver, W., & Burks, A. W. (1951). The mathematical theory of communication.
• Becker, C., & Rauber, A. (2011). Decision criteria in digital preservation: What to measure and how. Journal of the Association for Information Science and Technology, 62(6), pp. 1009–1028.
• Cheney, J., Lagoze, C., & Botticelli, P. (2001). Towards a theory of information preservation. In International conference on theory and practice of digital libraries (pp. 340–351). Berlin, Heidelberg: Springer.
• Low, J. T. (2011). A literature review: What exactly should we preserve. How scholars address this question and where is the gap.

About Services over Linked Data
• Mountantonakis, M., & Tzitzikas, Y. (2016). On measuring the lattice of commonalities among several linked datasets. Proceedings of the VLDB Endowment, 9(12), pp. 1101–1112.

About Knowledge Evolution-Related Issues and RDF Knowledge Bases
• Zeginis, D., Tzitzikas, Y., & Christophides, V. (2011). On computing deltas of RDF/S knowledge bases. ACM Transactions on the Web (TWEB), 5(3), p. 14.
• Tzitzikas, Y., Lantzaki, C., & Zeginis, D. (2012). Blank node matching and RDF/S comparison functions. In International Semantic Web Conference (pp. 591–607). Berlin, Heidelberg: Springer.


• Lantzaki, C., Papadakos, P., Analyti, A., & Tzitzikas, Y. (2017). Radius-aware approximate blank node matching using signatures. Knowledge and Information Systems, 50(2), pp. 505–542.
• Tzitzikas, Y., Analyti, A., & Kampouraki, M. (2012). Curating the specificity of metadata while world models evolve. Preservation of Digital Objects, p. 46.
• Tzitzikas, Y., Kampouraki, M., & Analyti, A. (2014). Curating the specificity of ontological descriptions under ontology evolution. Journal on Data Semantics, 3(2), pp. 75–106.

About FAIR Data Principles
• Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., et al. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3.
• FORCE11 (The Future of Research Communications and e-Scholarship) – The FAIR principles (https://www.force11.org/group/fairgroup)

About Faceted Search
• Sacco, G. M., & Tzitzikas, Y. (Eds.). (2009). Dynamic taxonomies and faceted search: Theory, practice and experience. Springer.
• Tzitzikas, Y., Manolis, N., & Papadakos, P. (2017). Faceted exploration of RDF/S datasets: a survey. Journal of Intelligent Information Systems, 48(2), pp. 329–364.

About Preservation Systems in Various Organizations
• Caron, B., De La Houssaye, J., Ledoux, T., & Reecht, S. (2017). Life and death of an information package: Implementing the lifecycle in a multipurpose preservation system. In iPRES 2017 14th International conference on digital preservation, Kyoto, Japan.
• Lehtonen, K., Somerkoski, P., Törnroos, J., Vatanen, M., & Koivunen, K. (2017). Modular pre-ingest tool for diverse needs of producers. In 14th International conference on digital preservation, Kyoto, Japan.
• Shiers, J., Berghaus, F. O., Cancio Melia, G., Blomer, J., Dallmeier-Tiessen, S., et al. (2016). CERN services for long term data preservation (No. CERN-IT-Note-2016-004).
• Straube, A., Shaon, A., & Ouda, M. A. (2016). Digital preservation with the Islandora framework at Qatar National Library. In 13th International conference on digital preservation, Bern.
• Weatherburn, J. (2016). Establishing digital preservation at the University of Melbourne. In 13th International conference on digital preservation, Bern.
• Moore, R., Rajasekar, A., & Xu, H. (2015). DataNet Federation Consortium Preservation Policy ToolKit. In 12th International conference on digital preservation, Chapel Hill.
• Zierau, E. (2017). OAIS and distributed digital preservation in practice. In 14th International conference on digital preservation, Kyoto, Japan.


18.5.2 Tools and Systems

About Dependency Management Services for Digital Preservation
• GapManager. http://athena.ics.forth.gr:9090/Applications/GapManager/
• Epimenides. http://www.ics.forth.gr/isl/epimenides

About Format Recognition
• JHOVE http://jhove.sourceforge.net
• JMimeMagic https://sourceforge.net/projects/jmimemagic/
• JAXB http://jaxb.java.net
• PreScan http://www.ics.forth.gr/isl/PreScan

Open and Free Platforms for Setting Up a Repository
• FEDORA (http://fedorarepository.org)
• DSpace (http://dspace.org)
• Archivematica (https://www.archivematica.org/en/)
• CKAN (https://ckan.org/)
• iRODS (https://irods.org/)

Referred EU Projects
• CASPAR http://www.casparpreserves.eu/
• SCIDIP-ES http://www.scidip-es.eu/
• APARSEN http://www.alliancepermanentaccess.org/
• PERICLES (Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics) http://pericles-project.eu/

Chapter 19: How Robert Eventually Found Daphne

19.1 Episode

Although Robert has extracted as much information as he could from the contents of the USB stick, he is still unable to uncover the identity of the mysterious student. Moreover, journalists have started asking questions about the competition. The related forum on the website of the competition is already flooded with questions and rumors like:

“Why have the results not been announced yet?”
“Probably none of the candidates succeeded.”
“The exercises of the competition were too easy; there are hundreds of winners!”
“The electronic system of the competition was problematic, all data have been lost!”
“Hackers attacked the system and corrupted the recorded data.”

May 31

MicroConnect has not issued any official announcement. Informally, word circulates that everything went well and that the evaluation process is progressing as planned. Robert realizes that time is running out. MicroConnect should make an official announcement soon; otherwise the rumors will proliferate. Therefore, he starts thinking of alternative ways. After many fanciful ideas, he chooses the one that seems most effective. The idea is to involve the universities that sent students to the competition. Each university could check the accounts of its students; specifically, it could compare the contents of the USB stick with the contents of each student account in the university. The rationale is that people usually replicate files from their computers to their USB sticks, so there is a great chance that the mysterious student replicated some of the contents of their account to the USB stick.

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_19


To materialize this idea, a dedicated software application has to be developed: an application that internally contains all the files of the USB stick and is capable of comparing these files with the files of each account. However, this application has to be executed in many universities, checking many thousands of student accounts. Robert thinks that the only way to do this in “silent” mode, without trouble and without raising worries about the integrity of the competition, is to have the application executed by the administrators of each university.

Robert explains the requirements to Scott (the technical administrator) and asks him to design and develop the application as soon as possible. The output of the application can be very simple: just a degree of similarity, ranging from 0 to 1, per user account. If the application encounters a file that is identical to a file on the USB stick, then it should definitely return 1. If not, then the program should compute a degree of similarity. To do so, the types of the files should be considered; e.g., for comparing PDFs, their textual contents should be extracted and compared, while for comparing images, other features should be extracted and compared.

One week later, Scott informs Robert that the software application is ready. Robert decides to test it first on his computer. Scott explains to him how to set up the various parameters and how to run it on his own computer. After a few seconds, a screen appears informing them that 2% of the files have already been examined and that it will take approximately 45 min to complete the process. Before that time expires, the application has examined all the files on Robert’s computer, and his screen shows the results: only 2% similarity (0.02). The screen also lists the files accounting for this 2% degree of similarity. Afterwards, Robert copies one file from the USB stick somewhere on his laptop and executes the application again.
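The scoring rule that Robert asks for can be illustrated with a few lines of Python. This is only a minimal sketch under our own assumptions, not the application that Scott built: the function names are hypothetical, files are treated as plain text (a real tool would first extract the text of PDFs or the features of images, as described above), the per-file measure is the ratio offered by the standard difflib module, and the score of an account is assumed to be the best match found among its files.

```python
import hashlib
from difflib import SequenceMatcher


def digest(data: bytes) -> str:
    """Fingerprint of a file's bytes, used to detect exact copies."""
    return hashlib.sha256(data).hexdigest()


def file_similarity(a: bytes, b: bytes) -> float:
    """Return 1.0 for byte-identical files, else a text-based ratio in [0, 1]."""
    if digest(a) == digest(b):
        return 1.0
    # Assumption: contents are comparable as text; undecodable bytes are dropped.
    text_a = a.decode("utf-8", errors="ignore")
    text_b = b.decode("utf-8", errors="ignore")
    if not text_a or not text_b:
        return 0.0
    return SequenceMatcher(None, text_a, text_b).ratio()


def account_similarity(stick_files: dict, account_files: dict) -> float:
    """Assumed account score: the best similarity over all (stick, account) file pairs."""
    best = 0.0
    for stick_data in stick_files.values():
        for account_data in account_files.values():
            best = max(best, file_similarity(stick_data, account_data))
            if best == 1.0:
                return 1.0  # an identical file settles the score
    return best
```

With these assumptions, an account holding an exact copy of any file of the stick gets 1, an edited copy gets a value strictly between 0 and 1, and an empty account gets 0.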
He wants to see if the application finds it. Robert is puzzled because this time it takes less than 5 min to complete the execution. Scott explains to him that he designed and implemented the application to “remember” previous executions, to avoid comparing files that have been compared previously as long as they have not been modified.¹ He says that this feature is important for performance since he expects that a lot of accounts will contain some commonly used files, e.g., course assignments, books, widely used software libraries. Scott then starts describing a plan for exploiting this application in a product of MicroConnect, but Robert interrupts him. A smile appears on Robert’s face because he sees the result he was hoping for. One of the files on his computer has 100% similarity with one of the files on the USB stick.

¹ This functionality is supported by the tool PreScan described in Chap. 5.

Of course that is not enough to convince him; he is a well-known perfectionist. This time he deletes the previous file and copies one document file from the USB stick to the same folder on his computer. After replicating the file he opens it and starts changing the contents. He changes some words, adds a few new ones, and in some cases removes whole phrases and paragraphs. He saves it and executes the application again. “Excellent,” he shouts. The application reports 83% similarity of the updated document file with one of the files on the USB stick. Scott and his team have done excellent work.

Robert then prepares a formal letter to send to the rector of each university. The letter states clearly from the beginning that the matter is confidential and that the described procedure should be kept secret. An accompanying document provides all the details of how a system administrator can download and run the software. Before sending all these emails, Robert considers asking David, an old friend who is the rector of a university. David has always been very careful and Robert has always trusted his opinion. Robert sends him a draft of the email he intends to send to the rectors and phones him.

– David, what do you say?
– No way, Robert! No university will accept this process.
– But why?
– It’s a matter of privacy, Robert. It is not possible for a university to give a third party information about the contents of a student account. Think about it.
– Ok, but what should I do?
– Well. I have some ideas about this. You could change the process and the program so that only the university can see the name of the student account and your company only receives the percentage of similarity and the encrypted value of the corresponding username. In this way, you will be able to read the account name only if they give you the decryption key.

“It sounds perfect,” Robert says, relieved. “Thank you very much.”

Without any delay, Robert and Scott make the necessary changes and trials, and then send the email with the revised procedure and application to the universities. Each rector who receives this letter eventually agrees to carry out the suggested procedure, after consulting the university’s legal office (to check that the process does not violate any regulation related to personal data protection), and to keep the process secret. Although the process is quite strange, there is no danger: the software runs locally with read-only access and without any network connection, without jeopardizing the personal data of the users. The results of the execution, pairs of username and degree of similarity, are printed in the console and stored in a small encrypted file that the administrator has to upload to a particular secure server of MicroConnect. The username is not readable by MicroConnect because it has been encrypted using a key provided by the administrator of the local university.

June 10–17

The universities start executing the downloaded software. So does Daphne’s university. The system administrator at Harvetton University is no longer Grace; she retired only a few days earlier.
The new system administrator is Mary, a young lady who has been working as Grace’s assistant for the last 3 years. Mary installs the application, configures it appropriately, and starts its execution. After many hours of execution, the application prints on Mary’s screen the five accounts with the highest similarity values. The first one has a value of 100% and belongs to a student with account name “Daphne”. The reason for this 100% is that Daphne had copied a lot of files from her account to the USB stick. Mary is surprised by the high value of similarity. The second-most similar account has only 11% similarity. Inevitably, she wonders who the student with this account is. Without a second thought, she checks the university’s student database and finds that the owner of this account is a visiting undergraduate student in the context of an international student exchange program. “How is that possible?” she wonders. Daphne didn’t take part in the competition, and she couldn’t have done so because she is an undergraduate student. Thinking that it could be a bug in the application, Mary decides to run the application again. The results are the same.

Mary is confused; she does not understand what is going on. A possible scenario is that Daphne copied the files in her account from the USB stick of a graduate student who participated in the competition, probably for educational purposes, and that’s why the application found Daphne instead of the graduate student. Or maybe the graduate student had those files in a public area of her account, Daphne copied them, and then the graduate student for some reason removed them, so that now only Daphne appears to have those files.

Although Mary was told by the head of the department not to say anything to anyone, she decides to call William, the head of the laboratory where Daphne is working. Mary briefly explains to William what she is doing and shows him the program’s results. William is also surprised. However, he cannot suggest or do anything as the process is confidential.
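The privacy-preserving reporting suggested by David can be sketched as follows. The sketch deliberately swaps in a simpler technique than the one in the story: instead of reversible encryption, it derives a keyed pseudonym (HMAC-SHA256) for each username and keeps a lookup table at the university, so that the company receives only pseudonyms and similarity values, and the real account name can be disclosed only by the university. The function names and the second account name are hypothetical; the values 1.0 and 0.11 are the ones reported on Mary’s screen.

```python
import hashlib
import hmac
import secrets


def pseudonym(username: str, key: bytes) -> str:
    """Keyed, non-reversible token for a username (assumption: HMAC instead of encryption)."""
    return hmac.new(key, username.encode("utf-8"), hashlib.sha256).hexdigest()


def build_report(scores: dict, key: bytes):
    """Split the results into what leaves the university and what stays there."""
    report = []   # uploaded to the company: pseudonym and similarity only
    lookup = {}   # kept by the university: pseudonym -> real username
    for username, similarity in scores.items():
        token = pseudonym(username, key)
        report.append({"user": token, "similarity": similarity})
        lookup[token] = username
    # Highest similarity first, as on the administrator's screen.
    report.sort(key=lambda row: row["similarity"], reverse=True)
    return report, lookup


# The key is generated and held by the university administrator.
university_key = secrets.token_bytes(32)
report, lookup = build_report({"Daphne": 1.0, "other_account": 0.11}, university_key)
```

The uploaded report contains no readable account name. Only the holder of the lookup table (or of the key, by recomputing the tokens over the known accounts) can map the top entry back to a person, which mirrors the requirement that the company cannot identify the student without the university’s cooperation.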
Mary tells William that they cannot in any way alter the results because the application encrypts them into a file that she has to upload to a particular server of MicroConnect at the end. William leaves the room perplexed. Mary then calls the head of the department to inform him that the process is over. The head in turn informs the rector and then sends an email to Mary saying that she can send the results to MicroConnect. Mary presses the submit button and the ciphered results are sent to one of the servers of MicroConnect.

The server has been programmed to send an automatic email to Robert whenever a submission contains a user account with a similarity degree greater than 60%. Robert is seated calmly, enjoying a cup of green tea as he usually does at 10:20 every day. The monitor of his computer shows the list of subjects of the emails that he hasn’t read yet. His eye stops on a line with the subject “MYSTERY GIRL-100%”. His breath stops; he frantically puts the cup down and grabs the mouse of his computer. He opens the email and sees:


University: Harvetton
Username: x87sdf909jhkj23!34*()kjdfafi3jkfd
Similarity: 100%

“At last I found you!” he exclaims with joy. “I will go to Harvetton myself.” He immediately asks his secretary to contact the university rector and the department head, to tell them that he is planning to visit their university that week, and to ask them to confirm their availability. The meeting is planned for Thursday at 11:00 in the Head’s office. Robert then sends them an email in which he asks them to keep the visit secret and not to inform anyone about the results of the software. He stresses that this is crucial. Robert wants to keep everything secret in order to test the first reactions of the girl who owns the recovered username.

June 20

Thursday has arrived. It is 10:45 when the rector enters the office of the department’s head. Robert arrives at 11:05, accompanied by a personal security guard. Robert cordially thanks them for the collaboration and informs them of the purpose of the visit. The head of the department tells him that the “path” of the account indicates a particular laboratory in the university, and suggests they invite the head of that laboratory too. After a few minutes, William enters the office. They explain the purpose of Robert’s visit to him. William pretends to know nothing about the matter. William says that he feels really pleased that a student from his lab won the competition, and he volunteers to arrange a meeting later that day with all the students who participated in the competition.

“Please, not just them. I would prefer that all the female students of the lab who have an account be present,” says Robert.

William says, “Great! We could go for lunch if you wish, and after lunch, at around 14:00, all female students of the lab will be in the meeting room. I could arrange the meeting right now and join you for lunch in 5 min.”

“Perfect!” Robert says.

“Shall we?” the head asks, and everybody heads for the cafeteria.
William goes to his office. He opens his laptop and starts writing an email.

Message Subject: “Urgent: 2pm at the meeting room”.
Message Body:
Please be at our meeting room at 1:45. It is very important. It is related to the MicroConnect competition. Please acknowledge the receipt of this email (if you cannot make it, please call me on my mobile asap).
W


Although he promised to invite all female students with an account, he sends the email only to the female graduate students of his lab. Although Mary showed him that the matching username is Daphne’s, he cannot believe that Daphne has anything to do with the competition. He has actually persuaded himself that one of his students is the winner but that, for some strange reason or software bug, the username of an unrelated undergraduate student appeared. William leaves his office and walks quickly toward the cafeteria. He cannot stop imagining the days of glory that his lab will enjoy after the winner is officially announced. “The lab will attract the best students in the country! Our research prototypes could become products of MicroConnect. . .”

It is 14:00 and the small meeting room is almost full. The six female graduate students of the lab are there. Five of them participated in the competition, but none of them passed to the second round of evaluation. William, Robert, and the head of the department enter the room. William verifies with a quick glance that all of his graduate students are there.

“Dear members of the lab. Thank you for being on time despite the short notice. It is our honor to have here with us Robert Johnson, the CEO of MicroConnect. I will not reveal the purpose of his visit. I will leave this to Robert.”

Robert thanks William and looks at the students with a kind parental look. “I am very glad to be here with you today. But let’s first introduce ourselves. As you have already heard, I am Robert Johnson and I am the CEO of MicroConnect. Now it is your turn. Apart from your name, please tell me a few things about your home town and your current research interests.”

In less than five minutes the introductions are completed. Robert listens and looks carefully at each girl, trying to find out which one is the mysterious girl.
Unfortunately, he does not hear anything that is in any way related to the contents of the USB stick. “Damn . . . Let’s try something different,” he thinks.

“Thank you for the introductions, and I wish you good luck in your efforts. I think it’s time to tell you why I am here. Well, I am glad to tell you that the winner of the competition that we organized last month is from your lab!”

The students cannot believe their ears. “One of us,” they whisper and stare at each other. Robert observes their reactions, hoping that this announcement will make the secret girl reveal herself. Unfortunately for Robert, all students seem to have the same reaction; he cannot see any special sign.

William is eager to finish this identification task and says, “I would kindly ask the five students that participated in the competition to stand there, beside the window.” He stands beside them and says to Robert, “Dear Robert, please tell us who the winner is. Let’s not prolong their agony anymore.”

Robert smiles and says, “Let us not be impatient. Do you know how long I have waited? William, please tell me: are all female students of your lab in this room?”

William says, “Yes, they are.”

“I am asking you this because I am not sure that the winner is in this room,” Robert says.


William is bewildered. After five seconds of cold silence, William says, “What about interviewing each student separately? I am sure that in half an hour, you will know the winner.”

At that moment the door opens and a girl’s face appears, looking in curiously. It is Daphne, who is looking for her colleagues. She couldn’t find any of them in the lab, so she wondered whether they were having their lunch in the meeting room. She sees her friends in the room, but she also notices that William and the head of the department are present. At that point she starts wondering whether they have a scheduled meeting that she has forgotten, but then she notices Robert. She recognizes Robert; anybody can recognize him. If this is not a conventional visit, then it certainly has to do with the competition, she thinks. She freezes, looking at everyone in the room, while all eyes are on her.

“Not now, Daphne. Could you please come later?” William says aloud.

“Of course,” she responds and turns around to leave the room.

“Wait a moment. . .” Robert’s voice sounds like an echo in the room, “your name is Daphne?”

She turns around and answers yes, nodding her head affirmatively.

“Where are you from?” he asks her again.

“From Greece, Sir,” she answers immediately.

Robert takes the USB stick out of his pocket. He raises his hand, holding the stick so that everyone can see it, and then asks, “Daphne, is this yours?”

Daphne looks carefully. It takes her only one second to recognize her USB stick, the stick with the color faded from extensive use, which she brought from her country some years ago. It is definitely her own stick, but for a moment Daphne asks herself, “Should I admit it, or . . .”, and she cannot find any reason for lying. She looks at Robert and says: “Yes, it is mine.”

Chapter 20: Daphne’s Dream

20.1 Episode

Robert looks at Daphne with a fatherly gaze and says: “When I saw the USB stick I thought that I would get your identity in a few minutes. However, it was a nightmare for me. I felt really desperate. I was astonished by the difficulty of interpreting someone else’s data, running someone else’s software, and getting information about the context of the data. And as it turned out, I did not find out anything that way. We had to develop dedicated software based on the contents of your USB stick and then we ran it in all universities of the country to locate you! But please tell me, what happened in the competition? What did you do to participate? How did you manage to not leave any trace?”

Daphne explains the entire story. She starts by describing her thoughts when she heard the announcement of the competition and ends by saying what Grace did for her.

Robert smiles and says: “What a story! My sincere compliments for the solution you submitted in the contest. Your solution was the only correct solution! The committee was impressed by the clarity of your algorithm and the good software design. But let’s focus on the future. As you know, I have decided to retire, and the CEO of MicroConnect for the next 3 years will be the winner of the competition. You won the competition! So the question is: would you like to take over MicroConnect?”

Daphne did not expect this. She is surprised, to say the least. She feels satisfaction and joy, but also fear, since that decision would change her life entirely and she does not know if she is ready for such a radical change. Dozens of thoughts flash through her mind within a few seconds, until she exclaims with enthusiasm and embarrassment:

– Yes, why not?

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_20


She tries, rather unsuccessfully, to control her voice and to sound calm and decisive. Robert’s hearty pat on her shoulder makes her feel that Mr. Johnson has understood completely everything that crossed her mind. Robert smiles and says:

“I am really glad. Daphne, I hope you enjoy your new role. I am sure you will do great. I would like to give you some advice based on my personal experience all these years. The most important thing is that you have a vision that you really like. The vision will give you the power and the ideas to carry out your daily duties and to tackle all the difficult times that you will certainly encounter. Lead your team, be part of the team. Keep making the objectives clear; the members of your team will each find the path for reaching the goals as long as they understand your vision and its merits. Always try to find time to learn things you don’t know, especially difficult ones. If you stop learning, then soon you will become arrogant with your team and you will be unable to predict the required effort, difficulty, and time for the tasks you assign to them. Be brave, at least braver than your team, but not risk-prone. Respect everyone. But we will have enough time to discuss all this soon. Will you come with me tomorrow to our headquarters so I can introduce you to your new colleagues?”

“Yes!”

Daphne takes her bicycle to return to her apartment, while thinking about what has just happened. Accepting this new position will probably change her life. It will mean staying far away from her family, friends, and country. At that moment her gaze falls on a slogan that someone has carved on the trunk of an oak: “YOLO,” meaning “You Only Live Once.” Indeed this is true; it is almost impossible that she will have such an opportunity again. “If I decline the position I will probably regret it in the future. My decision was right. I will try it. If I don’t like it, I will return to my country.”

June 21

That night Daphne had a dream.
Robert’s adventure with her USB stick inspired her. She envisioned ways that could mitigate the problems of digital preservation. The dream was long and enjoyable, and she wakes up feeling fresh. While making her coffee, she takes some blank sheets of paper and starts writing down what she remembers from her dream.

Dream Notes

Production and Storage of Information
The production and storage of information is done with consideration of the future requirements of digital preservation. Even text editors, during typing, offer the user advanced autocompletion services, which are based not only on lexical dictionaries, but also on ontologies, domain-specific vocabularies, other datasets and services, and exploit the current context. These services operate not only at the lexical level, but also at the semantic and pragmatic levels. The same is done for data coming from sensors or input devices. In general, focus is placed on the notion of assimilation, i.e., any process that generates data is enriched with steps for assimilating this data to the existing corpus of data, knowledge, and social context.

Information and Formats
Operating systems and communication networks can provide each sequence of bits in different formats. This is no longer a responsibility of the applications. Information systems are relieved of such low-level tasks. Instead, they focus on “information formats” (graph-based structures, as discussed in Sect. 18.2.9). The formats are used only for serialization and for the physical layer. Moreover, it is the responsibility of the operating systems and the communication networks to keep all the details about how each item (where an item can have various levels of granularity) has been produced (e.g., who the initial creator was, who edited it, what processing has been applied, and so on). The exchange of information across different platforms is now easy, and input and output interoperability has been significantly upgraded.

Service Providers
Users are able to use simple terminals of limited capabilities for every operation. Users are not required to install applications themselves. The required installations or deployments happen in the background; they are not even noticeable to the user. This enables using every service even from very small devices (sensors, eyeglasses, watches). The providers offer services of storage, access, and usage, and guarantee task performability.
Moreover, all frequently used applications (which are essentially sets of composite and interdependent tasks) have been virtualized. This is beneficial not only for the sustainability of the applications; it also enables the users themselves to change the physical provider of the entire application or of parts of it. Sophisticated reasoning is employed for combining converters and emulators and offering interoperability. The various migration options are offered automatically, and each one of them is accompanied by a quantitative and qualitative summary of what is preserved and what is not. The notion of insurance becomes rudimentary. For some kinds of data, insurance is obligatory, as is the case with car insurance. A new sector for such digital insurances is expected to emerge.


Software

The theory of programming languages, as well as software engineering, has significantly evolved. Programmers can now use programming languages that offer a wide variety of properties that can be checked easily and efficiently. Applications are created quite easily not only by combining building blocks, but also by reducing or restricting the functionality of other existing applications. The end users can themselves improve an application while it is running.

Trust: Agents (Users, Sensors) and Processes

Everything active has a certificate enabling its authentication. Data and information are associated with authenticated and trusted agents and processes. Data and information that are not connected are considered untrusted.

Daphne finishes the notes and says, “Let’s get down to business!” She starts packing a small suitcase and calls a taxi to go to the airport. As soon as she arrives at the airport, she finds Robert waiting for her. He asks her: “You are going to learn a lot of new things, but remember that the company is expecting to learn a lot of things from you as well. So are you ready for your new role?”

“I am ready and I already have some proposals for handling some digital-preservation-related issues. I have some ideas about how to assemble the pieces together.”

“Hmm... That’s very interesting!”

“Yes, someone I know told me his experiences and I figured out some solutions...” says Daphne.

Robert smiles, and they walk together toward the gate. And everyone lives happily ever after.

20.2

Questions and Exercises


1. Think of what would be required to achieve Daphne’s vision from your perspective, i.e., based on your background, experience, and role.
2. For each chapter whose title starts with “File”, i.e., for Chaps. 7–17, try to specialize the “postulates” of the dream of Daphne, that is, try to describe what the technology could have offered for each particular case, so that Robert did not have the difficulties he encountered.
3. Let yourself dream!

Chapter 21

Epilogue

Fairy tales are short stories that contain imaginary characters and, sometimes, folkloric fantasy elements such as fairies, dragons, magic, or enchantments. Usually they have a happy ending, though not all fairy tales end happily. Although fairy tales are used as a source of entertainment, they also serve as a way to kindle people’s imagination and to share specific values with young children. Cinderella is one of the most loved stories of all time. It has been narrated numerous times, and it has made a lot of young girls wish they were in Cinderella’s position and would eventually become princesses when they grow up. Despite the not-so-real events and characters in the fairy tale (the Fairy Godmother, the magic that transforms a pumpkin into a golden carriage, etc.), the story is didactic. In a nutshell, the real message of the story is that justice will prevail, even if it seems that underprivileged characters cannot do anything about it. The ancient Greek tragedies taught us that there is always a Deus ex machina¹ to intervene and resolve a seemingly unsolvable situation.

When we decided to write a book on digital preservation, we did not want to write yet another book with only technical material, since digital preservation is not a very “impressive” or “attractive” subject: it mainly refers to future needs rather than current (everyday) needs. For this reason, we decided to tell our story as a fairy tale. Moreover, instead of describing only the current approaches and solutions, we decided to stress the questions that digital preservation raises, and the storytelling method helped with that. To make this more familiar to the readers, instead of writing a novel fairy tale, we decided to create a modern version of the well-known Cinderella fairy tale. Similar to the original version of the fairy tale, or at least the most popular version, written by Charles Perrault, we believe that our modern variation also has a didactic aspect.
In our case, the message we want to convey is this: we should care about the preservation of our digital heritage today, in order to make it usable in the long term. For this reason, we injected into our modern variation a few digital preservation problems, which were then described in the corresponding technical background sections. We should mention, of course, that the characters and their names are fictional, and the same applies to the companies and their products; however, most of them are based on the characters of the original Cinderella fairy tale. The most obvious relation is that of Cinderella and Daphne. Indeed, Daphne is a modern Cinderella who loses her USB stick, much as Cinderella loses her glass slipper, and Robert struggles to find her using it as a clue, like the prince in the Cinderella fairy tale. Table 21.1 shows the characters and some objects of the modern fairy tale and the corresponding characters of the Cinderella fairy tale on which they were based.

Table 21.1 Relation of the characters and objects with respect to the Cinderella fairy tale

Cinderella’s Stick        Cinderella
Daphne                    Cinderella
Competition               Royal Ball
Daphne’s colleagues       Cinderella’s stepsisters
Grace                     Fairy Godmother
Electronic Access Card    Pumpkin carriage
Robert Johnson            Prince
USB Stick                 Glass slipper
Scott                     Prince’s assistant
William                   Cinderella’s stepmother

¹ https://en.wikipedia.org/wiki/Deus_ex_machina

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9_21

21.1

Synopsis of Episodes

Daphne is a 22-year-old student. Since her first days in high school, she knew that she wanted to study computer science, and now she is just one step away from graduating from the Computer Science department. Her good grades and her diligence earned her an internship at Harvetton University in the USA, where she could work for a few months. After the excitement of the first days, melancholy set in. She soon realized that she was only assigned trivial and effortless tasks and was not allotted anything interesting. She had the feeling that she was treated as the “servitor” of the lab, because she was an intern and the only one who had not received her diploma yet.

During that period, the USA was the Mecca of computer science, with a plethora of companies having their headquarters there. Furthermore, from time to time, companies recruited new employees from universities. This computer science “kingdom” had its own “prince,” Robert Johnson, a charismatic person who led the biggest software company worldwide, MicroConnect. One day, some astonishing news spread. Robert announced his retirement from MicroConnect. The most amazing thing, however, was that he announced a rather strange process for choosing the new CEO of MicroConnect. Since he was a “prince,” he was going to organize a modern “royal ball.” He invited female candidates who had a diploma in computer science to apply for a competition. The winner of the competition would be the new CEO. The news spread all around the world, and everyone was willing to apply. So was Daphne. However, she could not apply, because the rules of the competition were clear: candidates had to hold a computer science diploma. Not only could she not apply for the competition because of the rules, but she also had to take over the tasks of her colleagues, because every single one of them was preparing for the competition.

The day of the competition arrived and Daphne was very sad, feeling utterly deprived. She was very disappointed by the behavior of her colleagues and supervisors. She wished she could have a chance to prove her worth. And then her “Fairy Godmother,” Grace, appeared. Grace was responsible for the competition and decided to give Daphne a chance. She registered her in the competition. The “golden carriage” that would allow her to join the competition was an electronic card that would give her access to one of the examination rooms. Before leaving, Grace warned her to leave the examination room before 01:00, because the card would be deactivated and she would not be able to leave the building.

Daphne arrived at the competition venue some minutes before the start of the examination, and plugged her USB stick into the computer she was assigned. It was close to midnight when the examination ended, and Daphne was surprised to see that she had passed the first evaluation round. However, she remembered Grace’s words. She had to rush because the electronic card would “magically” deactivate. She ran to leave the building; however, she forgot her “glass slipper,” the USB stick, plugged into the computer she had used in the examination room.
The day after, Robert was amazed to see that only one of the thousand participants had reached the maximum score, far ahead of the second one. He decided that she should be the next CEO. However, there was a problem: the organizers of the competition could not find the name of the participant. The only clue they had was a USB stick that was plugged into the computer from which the winning solution was submitted. Robert was so eager to find the mysterious girl that he decided to inspect the contents of the USB stick in order to find its owner. It proved very difficult to do so, and he faced a lot of problems. At last, he decided to follow a trial-and-error method for finding the owner of the USB stick. Similar to the prince in the fairy tale of Cinderella, who tried the lost glass slipper on all the women of the country, Robert compared the contents of the USB stick with the data of all female students around the world, until he found the “Cinderella” in his fairy tale. And everyone lived happily ever after.

Index

A Abandonware, 167 Accessibility, 2 Access permissions, 34 Accountability, 54 Activities, 15, 54 Actors, 54 Adapters, 24 Administrative, 32 AIPs, 217, 219 Alan Turing, 87, 91 Alfred Tarski, 141 Algorithm, 86, 87, 89, 233 Alliance for Permanent Access to the Records of Science Network (APARSEN), 190, 193 Amstrad 464, 130 Analogue computer, 3 Ancient, 3, 13, 239 Anonymous, 133, 142, 171 Ant, 100 Antikythera mechanism, 3 Anti-malware, 88 Antivirus, 84, 88–89, 91 APP, 119 Appraisal, 175 Archive file format, 99 Archive on demand, 171 Archives, ix, 54 Aristotle, 141 Associations, 54, 64 Asymmetric cryptography, 158 Audiovisual, 174 Audit, 12, 139, 144 Augmented Reality Games, 175, 186 Authentication, 2, 54, 156, 158, 159, 194, 236

© Springer Nature Switzerland AG 2018 Y. Tzitzikas, Y. Marketakis, Cinderella’s Stick, https://doi.org/10.1007/978-3-319-98488-9

B Backup, 179 Backup process, 180 Basic, 123 Bibliographic data, 70 Bit organization, 71 Blockchain, 160 Blog, 172, 181, 182 BMP, 51 Branches, 181 Build automation, 99 Bus, 24 Bytecode, 86, 98

C C++, 124 Cables, 24 Canonicalization, 212 Carrier, 209, 212 Cascading Style Sheets (CSS), 44, 114, 120, 170, 212 Cassette tape, 125 Central processing unit (CPU), 86 Certificate, 159, 236 Certification, 12, 139, 144, 203 CGM, 51 Chain of custody, 53 Checksum, 156–157, 163, 164, 179 Chrome, 44, 142 CIDOC CRM, 54 Cinderella, x, xi, 13, 15 CKAN, 218 .class, 95, 96, 102 Client, 98, 119, 120, 159


Client–server, 98 Cloud, 22, 24–25, 27, 49, 57, 116–117, 120 Comma separated values (CSV), 61, 78, 148, 151 Communication channel, 161 Communication session, 212 Communications protocols, 24 Community, 1, 90, 117, 119, 138 Competitive, 167 Compiler, 97, 98, 102, 124, 130 Complexity theory, 87 Compress, 173 Compression algorithm, 51 Computability theory, 87 Computable function, 173 Computer games, 175 Computer network, 108 Computer viruses, 88, 91 Concepts, 54 Conceptual model, 54, 55 Conferences, 138 Conjecture, 134 Connectors, 24 Conseil Européen pour la Recherche Nucléaire (CERN), 25 Consultative Committee for Space Data Systems (CCSDS), 74 Consumption, 167 Contacts, 39, 147, 148, 166, 171 Content type, 33 Context, 7, 9, 33, 35, 36, 51, 56, 77, 102, 162, 194, 205, 233, 235 Converter, 124, 125, 212, 235 Copyright, 156, 161, 162, 164, 179, 210 Corrupted, 22, 156, 157 Creative Commons, 117 Credentials, 16, 106, 109, 119, 161 Credibility, 155 CRMdig, 54 Cryptography, 160, 163 Cultural, Artistic and Scientific knowledge Preservation, for Access and Retrieval (CASPAR), 35, 38 Cultural heritage, 139, 162 Curation, ix, 143 Currency, 123, 160

D 3D, 51 Data analysis, 141 Data description record (DDR), 71, 72 Data Entity Dictionary Specification Language (DEDSL), 72, 81

Datalog, 199 Data Management Plans (DMPs), 178, 179 Data mining, 89 Data protection, 186 Data science, 141 DBpedia, 70, 214 Decompiler, 96, 98, 102, 103 Decompressor, 173 Dependencies, 7, 9, 84, 90, 91, 96, 98, 99, 194, 203, 204, 206, 207 Dependency management, 130, 191, 212 Description language, 53, 173 Descriptive, 32 Detached metadata, 32 Detectability of viruses, 92 Deus ex machina, 239 Devices, 24, 33, 54, 108, 109, 116, 117, 235 Diary, 167, 181 Didactic, 239 Differential, 181 Digital cameras, 24, 51 Digital disasters, 88 Digital electronics, 172 Digital image processing, 55 Digital images, 51 Digital objects, 54, 55, 96 Digital rights, 35 Digital signatures, 157–159 Digital uprooting, 171 Dioscuri, 89, 194 DIPs, 217 Disk drives, 24 Document Object Model (DOM), 210 Dogmatism, 134 Domain Name System (DNS), 109 Domain names, 109 Drivers, 26, 166 DSpace, 218 Dublin Core, 54, 78

E Eavesdropping, 160 Ecology, 167 Economic growth, 172 Elections, 160 Electronic lab notebook (ELN), 141 Electronic signatures, 157 Embargo periods, 179 Embedded metadata, 53 Emotions, 183 Emulation, 89, 91, 124, 125, 193–195 Emulator, 91, 235

Encapsulation, 217 Encoding (charset), 40 Encryption, 27, 96, 159 Endian, 71 Enhanced Ada SubseT (EAST), 71, 81 Entity Relationship Diagram, 63 Epimenides, 93, 130, 131, 197–200, 223 Error-correcting codes, 157 Errors, 156 e-Science, 204 Essential characteristics, 212 Ethical, 179 European Space Agency (ESA), 74, 140 Event driven programming, 118 Events, 54 Evidence, 161, 163 Exabytes, 25 Exchangeable image file format, 51 Exchange of images, 53 Executable (EXE), 83, 84, 86, 88, 91, 119 Exif, 51–53 Experiment, 96, 134 Export, 80, 87, 169, 182, 184, 198 Extending, Mapping and Focusing (EMF), 51

F Fabricated data, 142 Faceted search, 216 FAIR, 215 Fake, 155 Faked experiments, 138 Features, 31, 53, 89, 113, 210, 212, 226 Fedora, 218 Fibonacci, 126, 128 File format, 149 File header, 149 File name, 33, 49 File name extension, 33 File system, 9, 32, 34, 36, 194 Firefox, 44 Fixed-format electronic documents, 53 Flow of control, 53, 118 Folder, 30, 32, 36, 85, 133 Fonts, 53, 210 Formal grammar, 45 Format, 9, 35, 44, 51, 53, 57, 65, 70, 74, 89, 119, 147–151, 194, 204–207, 210, 212, 235 Fotini, 15–17, 22, 26, 78, 133, 225, 233–236 Freshness, 35

G GDFR, 35, 38 Geneva, 160 Genuine, 156 Geographical coordinates, 49 GIS, 63 Global Ozone Monitoring Experiment, 140 Gödel, K., 135, 141 Gottlob Frege, 141 GPS, 53 Graphics Interchange Format (GIF), 51

H Halting problem, 87 Hash, 156 Heraclitus, 1 Heuristic, 88 Hierarchical naming system, 109 Horn rules, 65 HTTP, 65 HTTPS, 159, 161 HyperText, 44

I Identical content, 209 Identification, 31, 35, 57, 88, 209 Identity of information content, 210 Identity of information objects, 209 Image file (iso), 84 Image file formats, 51 Import, 87, 171, 182, 198 Incompleteness theorems, 142 Information format, 212 Information identity, 156, 209 Information objects, 54 Ingestion, 217, 219 Insight, 140 Instance matching, 67 Integration of data, 70 Integrity, 2, 157 Integrity testing, 212 Intel, 172 Intellectual property, 141, 161, 162 Intellectual Property Rights, 179 Intelligibility, 2, 98 Intermediate representation, 97 Internationalized Resource Identifier (IRI), 64 Internet, 108, 117

Internet Assigned Numbers Authority (IANA), 33 Internet Protocol (IP) address, 108, 110 Interoperability, 193, 235 Interoperability objective, 193 Interpreter, 53, 86, 97, 124, 126, 130 Intractable, 87, 90 IPv4, 109 IPv6, 109 ISO standard, 54, 139

J JAR, 99, 119 Java, xi, 47, 95, 96, 102, 103 Java class, 105 JavaScript, 44, 120 Java virtual machine (JVM), 98 JMimeMagic, 197 Joint Photographic Expert Group (JPEG), 51 Journals, 138 JSON-LD, 63 JSTOR/Harvard Object Validation Environment (JHOVE), 57, 148, 149, 152, 197, 223 Jupyter, 141 Justice, 239

K Key, 22, 53, 65, 87, 117, 158, 160 Keyboards, 24 Knowledge, ix, xi, 1, 32, 34, 38, 63, 64, 130, 140, 189, 190, 195, 202, 203, 235 Kolmogorov complexity, 172

L Laboratory notebooks, 141 Lady Gaga, 155 Latitude, 70, 72, 76 Layering, 201 Ledger, 160 Legal, 157, 163, 210, 227 Library of Congress, 150, 152 License, 117, 179 Lifecycle, 113, 116, 177, 218, 219 Lineage, 53 Linked Data, 62, 65–67 Linked Open Data (LOD), 65–70, 77 Linux, 33 Live-code, 141 Location information, 53 Logical inference, 200 Longitude, 70, 72, 76

Loss, ix, 1, 29, 166, 177, 180, 195, 200, 210 Lossless compression, 51

M Machine actionable DMP, 179 Machine code instructions, 86 Machine language, 97 Machine learning, 89 Mac OS, 91, 119 Magic number, 149 Malicious, 83, 84, 88, 89 Malware Museum, 88, 92 Manufacturers, 166 Marketplace, 138 Markup language, 44 Maven, 96, 99–101 Media durability, 22, 26 Media usage and handling, 22, 26 Metadata, 33, 36, 49, 51–53, 56, 57, 90 Metafile formats, 51 Meteorology, 63 Methodology, 201, 212 Microcomputers, 126 Migration, 193–195 Minimal description, 173 Missing resources, 195, 197 Module type, 196 Moon, 155 Moore, G., 172 Moore’s law, 167, 171, 172 Multimedia degradation, 51 Multipurpose Internet Mail Extensions (MIME) type, 33 Music, 1, 22, 34, 95–97, 185 Mutual authentication, 159

N N3, 64 Namespaces, 66, 69 Natural resources, 167 Navigation, 170, 171, 181 Navigational structure, 182 Network, 24, 33, 109 Network Common Data Form (NetCDF), 63, 70 Nickname, 155 Nonrepudiation, 157 N-Triples, 63

O Object code, 97 Object-oriented, 64, 98

Obsolescence, 22, 26 Obsolete digital media, 25 Open Provenance Model (ProvDM), 54 Optical Character Recognition (OCR), 205–207, 210 Ontology, 78, 235 Ontology matching, 67 Open Access, 138 Open Preservation Foundation, 149 Open Provenance Model (OPM), 54 Open Science, 138 Operating system, 49, 57, 95, 123, 147

P Packaging, 74 Page description language, 51 Painter, 57 Painting, 57 Parallel ports, 24 Parsing, 45 Password, 161 Pattern, x, 6, 7, 9, 26, 36, 46, 56–57, 77–78, 90, 102, 109–110, 119–120, 129–130, 142, 151, 162, 184, 192, 194 Pay as you go, 117 Payment, 160 PCLa, 51 Pedigree, 53 Peer-to-peer, 160 Peripherals, 24 Personal data, 183, 227 Phone calls, 161 Photograph, 49, 56 Pixel, 51 Planned obsolescence, 167 Pointing devices, 24 Pollution, 167 Portable Document Format (PDF), 51, 53, 206 Portable media players, 24 Portable Network Graphics (PNG), 51 PostScript, 51, 53 Power supply, 24 Pragmatics, 235 PreScan, 35, 38, 149 Preservation strategies, 212 Preserve malware, 87 Printers, 24 Privacy, 27, 159 Private key, 158 Processing levels, 140 Processor, 126 Profiles, 197–199 Project Object Model (POM), 100, 101, 215 Pronom, 35, 38

Proof, 87, 134 Proprietary, 149 Provenance, 2, 7, 9, 32, 35, 36, 49, 54–56, 62, 71, 74, 77, 78, 98, 99, 101, 117, 124, 135, 136, 140, 142, 156, 194, 205 Proxy, 109, 120 Public key, 158

Q QEMU, 89, 194 Quality, 23, 137, 156, 167, 178, 211 Quantum, 160, 163 Quantum cryptography, 160 Query federation, 67 Query languages, 63, 64

R RAM, 125 Raster data, 51 RDF/S, 63 RDF/XML, 63, 64, 66, 68, 74 Recommendations, 5 Recovery, 180 Registries, 35 Regulation, 4, 167, 182, 186 Reliable, 23, 139, 180 Remix, 117 Rendering, 2, 40, 46 Reproducibility, 54, 138, 142 Reproduction, 53, 161 Reputation, 161 Requests for Comments, 5 Resource Description Framework (RDF), 63–65, 207 Retracted scientific studies, 142 Reverse operation, 98 Reviewers, 138 Reviews, 137 Revision, 181 Reward, 160 3-2-1 rule, 181 Rule languages, 63, 64 Rules, 33, 65, 195, 198–200

S Sandbox, 89 Scanner, 210 Scanning, 35 Schema.org, 66 Scholarly peer review, 138 SCIDIP-ES, 190, 223 Scientific computing, 141

Scientific data, 63, 70, 74, 138 Scientific publishing, 137–138 Scratch, 117–119 Scripting, 86 Scripting languages, 44 Secure, 159 Semantic network, 64 Semantics, 64, 65, 72, 77, 206, 207, 235 Semantic Web Rule Language (SWRL), 64 Sensor, 140 Sensory impression, 204, 210, 211 Sepia filter, 55 Serial, 24 Serialization, 45, 149, 150, 152, 235 Server, 83, 106, 110, 120, 159, 161 Service-oriented Architecture (SOA), 117 Shannon, 212 Signatures, 88 Significant properties, 212 Similarity, 226–228 Sincerity, 156 SIPs, 217 Size reduction, 51 Smartphones, 24 Social life, 161 Software, 1, 7, 9, 21, 22, 24, 32, 35, 51, 53, 63, 84, 86, 88–90, 95, 96, 99, 102, 108–110, 117, 120, 123, 126, 129, 130, 194, 204, 205, 233 Software virtualization, 117 SPARQL Protocol and RDF Query Language (SPARQL), 63, 64 Specificity, 215 Standard, 24, 43, 44, 51, 53, 65, 151 Standard Archive Format for Europe (SAFE), 74, 81 Standardization, 4, 191 Storage devices, 32 Storage media, 22, 23, 25, 26, 32, 172 Streams, 33 Structural, 32 SVG, 51 Symbols, 9, 40, 43, 44, 46, 47, 77, 194, 212 Symbol set, 204 Syntax, 71 System for Digital Preservation, 218 Systems for Information Preservation, 217–219

T Tags, 39, 40, 52, 53, 181 Task hierarchies, 195 Task performability, 195, 235

Temperature, 70, 72, 76, 143, 203, 204, 207 Termination analysis, 87 Theorems, 141–142 Touched, 155 Tragedies, 239 Transactions, 159 Translate, 46, 97 Transmission, 156 Transparency, 142 TriG, 64 Triplets, 64 Trunk, 181, 234 Trust, 9, 65, 194, 236 Trust network, 161 Trustworthy digital repositories, 12, 139, 144 Turing machine, 87 Turnover, 167 Turtle, 63

U Undecidable, 87, 90 UNICODE, 63 Unified Modeling Language (UML), 55, 66 Uniform Resource Identifiers (URIs), 63, 77 Universal Serial Bus (USB), 9, 15, 17–19, 21, 22, 24, 26, 27, 30, 31, 36, 39, 41, 50, 84, 120, 143, 194, 233 Universal Virtual Computer (UVC), 89, 92, 194, 220 Username, 161

V Vector formats, 51 Version, 44, 53, 84, 95, 96, 105, 110, 118, 120, 126 Video games, 24, 175 Virtualization, 89, 201–203 Virtual environment, 89 Virtual machine, 84, 89, 95, 96, 98 Visual Basic, 126 Visual programming, 117 Vocabularies, 235

W Warwick Workshop, 201, 203 Web archiving, 109, 135, 137, 170 Weblog, 181 Web of Trust, 65 Web Ontology Language (OWL), 63, 64 Website, xi, 119, 121, 135, 159

Website copiers, 182 Well-formed, 149 Wikidata, 35, 38 Wikipedia, 13, 70, 214 Windows OS, 33 Wisdom, 140–141 WMF, 51 Write once, run anywhere (WORA), 98 Wrong data value, 143

X XML, 64, 78, 100, 151, 206, 207 XML Formatted Data Unit (XFDU), 74, 81 XPath, 64 XQuery, 64

Z ZIP, 24, 99