Practical Ontologies for Information Professionals 9781783301522, 9781783300624


English Pages 192 Year 2016


Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page i

Practical Ontologies for Information Professionals


Every purchase of a Facet book helps to fund CILIP’s advocacy, awareness and accreditation programmes for information professionals.


Practical Ontologies for Information Professionals

David Stuart


© David Stuart 2016

Published by Facet Publishing, 7 Ridgmount Street, London WC1E 7AE
www.facetpublishing.co.uk

Facet Publishing is wholly owned by CILIP: the Chartered Institute of Library and Information Professionals.

David Stuart has asserted his right under the Copyright, Designs and Patents Act 1988 to be identified as author of this work.

Except as otherwise permitted under the Copyright, Designs and Patents Act 1988 this publication may only be reproduced, stored or transmitted in any form or by any means, with the prior permission of the publisher, or, in the case of reprographic reproduction, in accordance with the terms of a licence issued by The Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to Facet Publishing, 7 Ridgmount Street, London WC1E 7AE.

Every effort has been made to contact the holders of copyright material reproduced in this text, and thanks are due to them for permission to reproduce the material indicated. If there are any queries please contact the publisher.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library.

ISBN 978-1-78330-062-4 (paperback)
ISBN 978-1-78330-104-1 (hardback)
ISBN 978-1-78330-152-2 (e-book)

First published 2016

Text printed on FSC accredited material.

Typeset from author’s files in 10/13 pt Minion Pro and Myriad Pro by Facet Publishing Production. Printed and made in Great Britain by CPI Group (UK) Ltd, Croydon, CR0 4YY.


Contents

List of figures and tables

1 What is an ontology?
    Introduction
    The data deluge and information overload
    Defining terms
    Knowledge organization systems and ontologies
    Ontologies, metadata and linked data
    What can an ontology do?
    Ontologies and information professionals
    Alternatives to ontologies
    The aims of this book
    The structure of this book

2 Ontologies and the semantic web
    Introduction
    The semantic web and linked data
    Resource Description Framework (RDF)
    Classes, subclasses and properties
    The semantic web stack
    Embedded RDF
    Alternative semantic visions
    Libraries and the semantic web
    Other cultural heritage institutions and the semantic web
    Other organizations and the semantic web
    Conclusion

3 Existing ontologies
    Introduction
    Ontology documentation
    Ontologies for representing ontologies
    Ontologies for libraries
    Upper ontologies
    Cultural heritage data models
    Ontologies for the web
    Conclusion

4 Adopting ontologies
    Introduction
    Reusing ontologies: application profiles and data models
    Identifying ontologies
    The ideal ontology discovery tool
    Selection criteria
    Conclusion

5 Building ontologies
    Introduction
    Approaches to building an ontology
    The twelve steps
    Ontology development example: Bibliometric Metrics Ontology element set
    Conclusion

6 Interrogating ontologies
    Introduction
    Interrogating ontologies for reuse
    Interrogating a knowledge base
    Understanding ontology use
    Conclusion

7 The future of ontologies and the information professional
    Introduction
    The future of ontologies for knowledge discovery
    The future role of library and information professionals
    The practical development of ontologies
    Conclusion

Bibliography
Index


List of figures and tables

Figures
1.1 Section of the British National Bibliography graph visualized using RDF Gravity
1.2 A graph of Jesus and his twelve apostles
2.1 David hates Apple graph
2.2 David hates Apple, but knows Bob who loves Apple
2.3 The semantic web stack
2.4 An example of an RDF graph
3.1 A simple person and place ontology using RDF and RDFS
3.2 nature.com data categories as SKOS Play tree visualization
3.3 FRBR entities and relationships representing the intellectual content
3.4 Structuring intellectual content in FaBiO
4.1 Linking between Schema.org and other vocabularies as shown on Linked Open Vocabularies
4.2 Word cloud of subject headings of ontologies in BARTOC
4.3 A search for ‘person’ within the Falcons Ontology Search
5.1 WebVOWL visualization of FOAF
5.2 First draft of the Bibliometric Metrics Ontology, with two classes and provisional relationships
5.3 Second draft of the renamed Bibliometric Indicators Ontology
5.4 Screenshot of Protégé 5.0 with the Entities tab selected
5.5 Properties associated with the Bibliometric Indicators Ontology
5.6 Bibliometric Indicators Ontology (BInO) – v. 0.1
6.1 Number of reusing vocabularies in rank order

Tables
3.1 Dublin Core Terms properties
3.2 Comparison of schema:Person with foaf:Person
5.1 Overview of steps in different ontology development methodologies
5.2 Different entities and concepts identified with different spotter algorithms
6.1 The most common properties associated with schema:Book


CHAPTER 1

What is an ontology?

Introduction

Today more data and information are being produced and shared than ever before; data is streaming forth from new online social behaviours as well as high-specification digital tools and instruments. If we are to extract the maximum value from this data then we need to make use of the most appropriate tools and technologies. Ontologies, formal representations of knowledge with rich semantic relationships, are one such tool, and the focus of this book.

This chapter provides an introduction to ontologies, and considers their increasing importance to information professionals. Following a brief overview of the growing information overload and data deluge, the chapter considers the various definitions that have been applied to the term ‘ontology’ and how ontologies differ from associated and overlapping information concepts such as controlled vocabularies, taxonomies, metadata and knowledge bases. Finally, the chapter considers the potential of ontologies for information retrieval and discovering ‘undiscovered public knowledge’, and the role of the librarian in the development, maintenance and curation of ontologies.

The data deluge and information overload

It is important to start with an understanding of the changing information landscape, reminding ourselves of why we need new tools and technologies, and why it is no longer acceptable to continue with the way things have always been done. We are awash with a wide variety of information and data, but due to the tools that we are currently using the value of much of the data is going to waste. As John Naisbitt (1984, 17) put it, ‘We are drowning in information, but starved for knowledge’.

Information is coming from a wide variety of sources. There has been an explosion in the publishing and sharing of text across the whole of the communication spectrum, from the informal to the formal. Traditional formal publications, such as books and journals, have been joined by e-books and e-journals, with new publishing models based on combinations of self-publishing and open access: the number of self-published titles in the USA rose from 85,468 in 2008 to 458,564 in 2013 (Bowker, 2014); whilst Chen (2014) estimated that the proportion of articles published in the previous year available as open access had either passed or was very close to 50%. In the middle of the formal–informal spectrum of publishing is the grey literature: white papers, reports, technical papers and other, more informal, publications. Whereas once this grey literature could be costly to create and had limited circulation, desktop publishing software and electronic publishing on the web have put it within reach of a wide range of individuals and organizations.

But the growth in these numbers has been dwarfed by the growth of social media and other informal publishing, where the associated numbers are often in the hundreds of millions if not billions: there are 1.49 billion active Facebook users each month (Facebook, 2015); and over 500 million updates are sent on Twitter on a typical day (Twitter Engineering Blog, 2013). No one can hope to read anything but the smallest fraction of this information, even within the smallest of fields. There is a need for new tools to help with information retrieval, increasing precision without excessively impacting recall.

The narrative text has also been joined by increasing quantities of other text, such as computer code and data sets, as well as rich media (i.e., images and video). Although the lack of data sharing within the academic community has been labelled as the ‘dirty little secret’ of open science data promotion (Borgman, 2012, 1059), the potential of open data and open code to transform the rate of scientific progress (Hey, Tansley and Tolle, 2009) and to encourage more open and accountable governments and citizens’ participation (Raman, 2012) has led to numerous open programs and policies.
Governments have signed up to open data charters promising data to be open by default (Cabinet Office, 2013) and funding agencies and journals are increasingly stipulating the need for open data and open code (e.g., Nature, 2014). It is not enough, however, that data and code are open; they need to be findable and reusable by those who want to make use of them too.

Whilst the growth of open data may have been slower than some would like, growth in the number of images and videos shared has exploded: since its launch in 2010, over 30 billion images have been shared on Instagram (Instagram, 2015); in May 2014 Snapchat reported 700 million photos sent per day (Techcrunch, 2014); and YouTube counts billions of views every day as people watch hundreds of millions of hours of video (YouTube, 2015). This media is also increasingly of higher quality, part of the trend towards increasingly high specification digital tools and instruments. By 2007 83% of mobile phones had digital cameras, and over the years the specification of these cameras has increased dramatically. By 2012 there were mobile phones with 41 megapixel cameras available, many times more powerful than the first camera phones with 0.1–1 megapixels.

The rise of increasingly high specification mobile phone cameras reflects an increase in digital data collection at increasingly high-level specifications across a wide range of disciplines and professions. Data per 360 degree scan in computed tomography has gone from 57.6 kB in 1972 to 0.1–1 GB by 2010 (Kalender, 2011), whilst the rise in quality and fall in price has increased the number of scans made and the areas outside medicine where computed tomography may be used (e.g., archaeology and paleontology). When the first human genome was declared complete in 2003 it had been a mammoth project taking over ten years and costing US$3 billion; now we have entered the US$1000 genome era, where the cost of sequencing the human genome has fallen to a price where it may play a role in predictive and personalized medicine (Hayden, 2014). Projects such as the 100,000 Genomes Project are now sequencing thousands of genomes to identify genetic causes for a wide range of human diseases (www.genomicsengland.co.uk/the-100000genomes-project). The content in any single human genome, however, is dwarfed by the amount of data produced by big science projects such as the Large Hadron Collider, where 19 gigabytes of data were created in the first minute and thirteen petabytes (10^15 bytes) in the first year (Brumfiel, 2011). With so much data available, and in increasingly large chunks, it becomes increasingly important that we are accessing and downloading only the most relevant data for analysis.

As well as the data people are making a conscious decision to share, there are also the vast digital trails we all increasingly leave as an increasing proportion of our lives is lived online, and processes are digitized. Mobile phones can not only capture pictures, but have built-in GPS and accelerometers to track location and movement. Phone (or VOIP) calls can now simply be captured in their entirety, to index or play back in full at a later date if necessary.
With the internet as the first port of call for our information needs we are leaving trails of information about the searches we are carrying out, the pages we are visiting and the links we are following. This information is not restricted to the log files of a single site, but may be aggregated by advertising companies and content providers across multiple sites, enabling the building of increasingly complex profiles on individuals for the tailoring of increasingly personalized advertising and services. As data storage and processing prices have fallen it is no longer necessary to be selective in what we capture: increasingly we capture everything and then search the captured information for what we need later, a process that is epitomized by note-taking software designed for capturing ‘everything’ and ideas such as life streaming. Wearable technology, such as Google Glass, streamlines the process, as it is no longer necessary to even go to the trouble of taking a smartphone from a pocket.

Data inevitably produces more data. The data that is captured is often indexed, analysed, or combined to spawn more data. A file may be indexed, the contents analysed according to different criteria (e.g., searching for patterns or antecedents), and be accompanied by an ever growing quantity of descriptive, access, and preservation metadata. As new questions are asked, and new methods of data analysis developed, the same data set can continue to produce ever increasing quantities of data.

We have entered the era of Big Data. There are vast amounts of structured and unstructured data available, and there are new challenges to ensure that we make use of this data. Neither the exponential growth of science nor the problem of information overload is particularly new. The growth and communication of science began to be explored scientifically in the 1950s and 1960s, and its exponential growth was one of the subjects of Derek J. de Solla Price’s (1963) seminal Little Science, Big Science. The history of scientific publishing can be seen as one of trying to help researchers overcome the problem of information overload, first with publication of specialist journals, then with specialist abstract and indexing services. However, the web has provided a step-change in the publishing of information. When Ziman (1969) wrote of the problem of having to wade through ‘tomes of irresponsible nonsense’ without peer review, he would have had no idea how large these tomes of irresponsible nonsense would become.

The web requires new tools and methods to help users engage with the information that is available, and its brief history has already been one of rapid innovation: from directories to search engines, from information searching to information discovery. We no longer expect always to have to search for the information that we require, but are instead alerted to information we may require, either through the filter of social network sites or algorithmic suggestions (e.g., Google Scholar). Those who successfully find ways of managing the information overload, and of making use of the increasing quantities of data available, will have the competitive advantage.
This is true whether it is the company gathering competitive intelligence on its rivals, the researcher looking for new ways to encode and analyse data, or the international non-governmental organization looking for efficiencies in sharing information.

Ontologies are one way of helping to tame some of the problems identified above, providing a structure for this information in such a manner that it can be read automatically and unambiguously, and shared more widely.

Defining terms

Whenever writing on a specialist subject it is generally advisable to start by defining your terms, as all too often we follow the example of Humpty Dumpty when he says in Lewis Carroll’s Through the Looking-Glass: ‘When I use a word, it means just what I choose it to mean – neither more nor less’. Even within the smallest of fields the same term may have multiple meanings, some of which may be conflicting, a feature that is true for both ‘ontology’ and concepts such as data, information and knowledge, which the ontology is trying to encode.


Defining data, information, knowledge and wisdom

Most topics in information science can’t be discussed for long without running into the terms data, information, or knowledge. Unfortunately the terms are notoriously hard to define, and attempts at capturing knowledge within the library and information science community (e.g., through knowledge management) have sometimes been controversial for seemingly being little more than rebranding exercises.

Data, information, knowledge and wisdom are often conceptualized as a four-step pyramid, from data at the bottom, through information and knowledge, to wisdom at the top. This model was popularized by Ackoff (1989), but analysis of how the terms are used (Rowley, 2007; Zins, 2007) finds them to be the subject of wide-ranging and often overlapping definitions. Rather than thinking of them as distinct terms, it is more useful to think of them as overlapping areas on a continuum from highly structured and codified information at one end (data) to highly personal tacit understanding at the other (wisdom).

Data is the ‘building blocks’ of information and knowledge (Kitchin, 2014), although much of the information and knowledge that we have can seem quite detached from the underlying data. Whereas the route from data to knowledge may seem quite direct in the hard sciences, within the arts and the humanities the relationships between abstract ideas and concepts that form information and knowledge are less readily structured. Ontologies emerged as a way of capturing knowledge, and codifying it in a highly structured manner as data, and this may be applied to knowledge in any discipline.

. . . knowledge is inherently complex and the task of capturing it is correspondingly complex. Thus, we cannot afford to waste whatever knowledge we do succeed in acquiring.
Neches et al., 1991, 54

Knowledge organization systems and ontologies

Ontologies are one of a number of different knowledge organization systems that have been developed within the information profession to improve information discovery. These knowledge organization systems are also variously known as ‘taxonomies’ or ‘controlled vocabularies’, depending on the sector within which they are used. Whereas cultural heritage institutions lean more towards ‘controlled vocabularies’, the commercial sector tends to use the term ‘taxonomies’. Harpring (2013, 13) defines a controlled vocabulary as: ‘an organized arrangement of words and phrases used to index content and/or to retrieve content through browsing or searching’, very similar to Hedden’s broad definition of a taxonomy in her introduction to The Accidental Taxonomist:


. . . any knowledge organization system (controlled vocabulary, synonym ring, thesaurus, hierarchical term tree, or ontology) used to support information/content findability, discovery, and access.
Hedden, 2010, xxii

There is also a narrower use of the term taxonomy, in which it refers to a hierarchical set of terms (Hedden, 2010; Harpring, 2013), such as the Linnaean taxonomy of biological classification, most people’s first introduction to the term. Within this work the term controlled vocabulary is preferred to taxonomy, partly due to the potential for confusion caused by the dual meaning, but also due to the author’s own background within library and information science.

Controlled vocabularies have both advantages and disadvantages. Advantages of a controlled vocabulary include improved recall and greater precision through reducing polysemy (van Hooland and Verborgh, 2014). Recall, the proportion of relevant documents that are retrieved out of all the relevant documents in a collection, is increased by the reduction of the number of terms associated with a particular concept. For example, the Dublin Core Metadata Initiative Type Vocabulary is a controlled vocabulary of 12 terms: collection, dataset, event, image (still image and moving image), interactive resource, physical object, service, software, sound, and text. Without a controlled vocabulary, a wide range of resources that adhere to each of these types could have been referred to differently. The ‘text’ resource type includes letters, books, theses, reports, newspapers, and poems, as well as a host of other texts primarily designed for reading. To ensure the recall of all the associated text resources would require entering all the possible terms.

Polysemy refers to multiple meanings for the same term. A controlled vocabulary enables distinctions to be made between the different meanings. For example, ‘Apple’ may refer to the fruit, the technology company, a computer created by the technology company, or the record label founded by the Beatles.
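The two effects described above, higher recall from collapsing variant terms onto one heading and higher precision from separating the senses of a polysemous term, can be illustrated with a toy collection. The sketch below is not taken from the book: the five records, their keywords and the sense labels such as "apple (fruit)" are all invented for illustration, and the search functions are deliberately naive exact-match lookups.

```python
# Toy collection: every record and label here is invented for illustration.
# "keywords" are uncontrolled free text; "subject" is one controlled heading.
RECORDS = [
    {"id": 1, "keywords": {"apple", "orchard"}, "subject": "apple (fruit)"},
    {"id": 2, "keywords": {"apples", "cider"}, "subject": "apple (fruit)"},
    {"id": 3, "keywords": {"apple", "computer"}, "subject": "apple (company)"},
    {"id": 4, "keywords": {"apple", "beatles"}, "subject": "apple (record label)"},
    {"id": 5, "keywords": {"pomology"}, "subject": "apple (fruit)"},
]

# Records that a user interested in the fruit would judge relevant.
RELEVANT = {1, 2, 5}


def keyword_search(word):
    """Return ids of records whose free-text keywords contain the exact word."""
    return {r["id"] for r in RECORDS if word in r["keywords"]}


def subject_search(heading):
    """Return ids of records indexed under the given controlled heading."""
    return {r["id"] for r in RECORDS if r["subject"] == heading}


def recall(retrieved, relevant):
    """Share of all relevant records that were retrieved."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Share of retrieved records that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


free_text_hits = keyword_search("apple")
controlled_hits = subject_search("apple (fruit)")

for label, hits in [("free text", free_text_hits), ("controlled", controlled_hits)]:
    print(f"{label}: recall={recall(hits, RELEVANT):.2f} "
          f"precision={precision(hits, RELEVANT):.2f}")
```

In this toy collection the free-text search for ‘apple’ retrieves records 1, 3 and 4: it misses the records described as ‘apples’ and ‘pomology’, and it picks up the company and the record label. Searching on the single controlled heading retrieves exactly the three relevant records. Real collections are rarely this clean, so the figures should be read as an illustration of the mechanism rather than as typical values.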
Within the Library of Congress Subject Headings the fruit has the term ‘Apples’ and the computer is ‘Apple computer’, whilst in the Library of Congress Name Authority File the technology company is ‘Apple Computer, Inc.’ and the record label is ‘Apple Records’.

There are also a number of disadvantages to controlled vocabularies: the cost, the complexity, the slow evolution, and their subjectivity (van Hooland and Verborgh, 2014). Controlled vocabularies are not only expensive to create in the first place, but also to maintain as new names and terminology enter a field. In some situations the slow speed of change may be simply due to limitations in resources; in other situations there may be conflict between the terminology of conservative and progressive perspectives. For example, a comparison of the style guides of left- and right-wing newspapers can be particularly enlightening regarding their associated politics. Controlled vocabularies are inevitably subjective, and reflect the world view of the creators at a particular time, and different people in more enlightened times inevitably baulk at previous decisions, especially when there are prohibitively large legacy costs to rectifying previous decisions. For example, the Dewey Decimal Classification system is infamous for class 200 – religion, where seven out of the ten divisions relate to the Bible or Christianity:

• 200 Religion
• 210 Philosophy & theory of religion
• 220 The Bible
• 230 Christianity
• 240 Christian practice & observance
• 250 Christian pastoral practice & religious orders
• 260 Christian organization, social work, & worship
• 270 History of Christianity
• 280 Christian denominations
• 290 Other religions.

Although there have been attempts to extend the coverage of many of the other religions in DDC in recent years, particularly Islam (Idrees, 2012), the Dewey legacy nonetheless supports the perception of it being Christian-centric.

Some of the most widely used forms of controlled vocabularies within the information profession are subject headings, authority files and thesauri. It is worth considering each of these types of controlled vocabulary, and their limited nature, for comparison with the more expressive nature of ontologies:

Subject headings are a controlled set of terms designed to describe the subject or topic of a resource, whether it is a book, article, or data set. Popular examples include the Library of Congress Subject Headings (http://id.loc.gov/authorities/subjects.html) and the Medical Subject Headings (MeSH) (www.nlm.nih.gov/mesh/meshhome.html). Subject heading lists ensure that the same term is used to describe a work, rather than multiple similar terms.

Authority files are sets of preferred headings. As well as preferred subject headings, there may be preferred organization names, person names, and place names. History is replete with people, places, and organizations that have different names at different times, and successful information retrieval requires the consistent use of terms and relationships between the alternatives: those looking for information on Mark Twain may also want to retrieve information on Samuel Clemens, whilst those researching Constantinople may also wish to retrieve information on Istanbul. Well known examples include the authority files of the major national libraries (e.g., Library of Congress, British Library and Bibliothèque Nationale de France). VIAF (Virtual International Authority File) (http://viaf.org) is a project from several national libraries designed to link together the separate authority files of the libraries into one virtual authority file.

A thesaurus, like a taxonomy (in the narrower sense of the term), provides hierarchical relationships between concepts (i.e., broader and narrower terms), as well as equivalence and associative relationships. A typical entry in a thesaurus might include all three types of relationship, as in the example below for information science:

Information Science
    Broader terms:    Sciences
    Narrower terms:   Computer Science
                      Library Science
    Use instead of:   Informatics
    Related terms:    Information Industry
                      Information Processing
                      Information Skills
                      Knowledge Management
                      Knowledge Representation
                      Library Education

The above example is based on ‘Information Science’ in the ERIC (Education Resources Information Center) thesaurus (http://eric.ed.gov). The relationships within a thesaurus enable a reader to traverse from one concept to another more easily, helping to find related content. Other well known examples of thesauri include the Getty Thesaurus of Geographic Names (www.getty.edu/research/tools/vocabularies/tgn), the Art & Architecture Thesaurus (www.getty.edu/research/tools/vocabularies/aat), and the Thesaurus for Graphic Materials (www.loc.gov/pictures/collection/tgm) from the Library of Congress.

Today controlled vocabularies should also be compared with tagging, which came to prominence with the rise of social media and social networking sites. The vast size and diversity of the web, and its users, drove the need for an approach to classification that was equally global and diverse in outlook, and could be applied by members of the public as well as information professionals. Tagging, the application of uncontrolled terms to online resources, has been incorporated into a large number of services with varying degrees of success. Whilst many of the sites for bookmarking web resources (e.g., del.icio.us) have fallen out of favour, tagging nonetheless continues to have an important role within sites that are focused around user-generated content: for example, the tagging of images in Flickr and Instagram, and the use of hashtags in Twitter (so called because of the ‘#’ used to denote the tag). In comparison to a controlled vocabulary, tagging is likely to have reduced recall and lack precision, but where the scale of the web is concerned there may be few alternative options.

An ontology is like a thesaurus, in that there are multiple types of relationship between terms, but it can be non-hierarchical, with a far richer set of relationships, and typically holds a far greater variety of information. The richness of the relationships and information means that it is not only suitable for indexing resources, but may be a knowledge base for knowledge discovery in its own right.
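The three kinds of thesaurus relationship described earlier (hierarchical, equivalence and associative) can be sketched as a small data structure. This is only an illustration: the class name `ThesaurusEntry`, the function `expand_query` and the assignment of terms to each relationship are assumptions made for the example, not part of ERIC or of any thesaurus software.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One thesaurus term with its three kinds of relationship (hypothetical model)."""
    term: str
    broader: list[str] = field(default_factory=list)         # hierarchical, upwards
    narrower: list[str] = field(default_factory=list)        # hierarchical, downwards
    use_instead_of: list[str] = field(default_factory=list)  # equivalence
    related: list[str] = field(default_factory=list)         # associative


# Values assumed from the 'Information Science' entry discussed in the text.
entry = ThesaurusEntry(
    term="Information Science",
    broader=["Sciences"],
    narrower=["Computer Science", "Library Science"],
    use_instead_of=["Informatics"],
    related=["Information Industry", "Information Processing",
             "Information Skills", "Knowledge Management",
             "Knowledge Representation", "Library Education"],
)


def expand_query(e: ThesaurusEntry, include_related: bool = False) -> set[str]:
    """Return the search terms implied by an entry: the preferred term,
    its non-preferred equivalents and its narrower terms."""
    terms = {e.term, *e.use_instead_of, *e.narrower}
    if include_related:
        terms.update(e.related)
    return terms


print(sorted(expand_query(entry)))
```

Expanding a search in this way is one reason the relationships help a reader move between concepts; whether related terms should be included is a design choice that trades precision for recall.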

Defining an ontology

Ontologies first emerged in the Artificial Intelligence (AI) community, borrowing the term ‘ontology’ from philosophy, where ontology is concerned with the study of being or existence. The term was adopted by the AI community in the 1980s for computational models that can enable automated reasoning (Gruber, 2009), having recognized that ‘capturing knowledge is the key to building large and powerful AI systems’ (Neches et al., 1991, 37). Today the most widely used definition of ontology is Gruber’s (1993, 199) definition: ‘an explicit specification of a conceptualization’. This has been criticized for its broadness, incorporating both simple glossaries and ‘logical theories couched in predicate calculus’ (Gruber, 2009, 1964), and also for its focus on subjective concepts rather than entities as they exist in reality (Smith, 2004). Under such a broad definition, an ontology might be considered a near-synonym for knowledge organization system or taxonomy (in the broad sense). This continuum from informal vocabularies to formal ontologies has been reiterated by the World Wide Web Consortium (W3C) in their introduction to ontologies: ‘There is no clear division between what is referred to as “vocabularies” and “ontologies”’ (W3C, 2013).

The broadness of the definition is an important part of the inclusiveness of ontologies for information professionals. It is not just a subject for the AI community, but rather for all those involved in the codifying of knowledge, including librarians, archivists, museum workers and domain experts. Nonetheless, a more specific definition is useful for distinguishing between those ontologies that are the primary focus of this book and other examples of controlled vocabularies.

Within most definitions of ontologies the distinctive feature is the richness of the relationships between terms. For Hedden (2010, 12), an ontology ‘can be considered a type of taxonomy with even more complex relationships between terms than in a thesaurus . . . it aims to describe a domain of knowledge, a subject area, by both its terms . . . and their relationships’. Within an ontology a person does not have to just be related to an event: they may be present at an event, organize an event, take part in an event, be an authority on an event, or possibly instigate an event.

An example of the richness of the information associated with a particular entity in an ontology is provided below with an author record:

Ranganathan, S.R. (Shiyali Ramamrita), 1892-1972
event: 1892; 1972
family name: Ranganathan
given name: S.R.
has created: Colon classification / S.R. Ranganathan; The five laws of library science / S.R. Ranganathan
name: S.R. Ranganathan
type: Agent; Person
has contributed to: An essay in personal bibliography / A.K. Das Gupta
same as: 49268668

The above record is based on the British National Bibliography record for S.R. Ranganathan. It expresses two types of relationship between the author and his associated works: has created, and has contributed to. With the exception of the name, family name, and given name values, each of the properties on this record links to another record for the particular instance, for example, The five laws of library science:

The five laws of library science / S.R. Ranganathan
bnb: GB6417211
description: 2nd ed originally published (B58-927) Madras Library Association; Blunt 1958.
edition statement: 2nd ed. reprinted (with minor amendments)
is part of: Ranganathan series in library science; no 12
type: BibliographicResource
creator: Ranganathan, S.R. (Shiyali Ramamrita), 1892-1972
language: eng
publication event: Asia Publishing House, 1964
same as: GB6417211
subject: 020

Again, many of the properties have their own associated records, creating a huge graph of related resources, joining previously disparate authority lists and classification systems. Figure 1.1 shows the graph produced by just the author and instance records mentioned above.

Figure 1.1 Section of the British National Bibliography graph visualized using RDF Gravity
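The way that separate records join into a single graph can be sketched with plain subject, property, value statements. The following is a minimal illustration only: the property labels are taken from the records above, but the `triples` list and the `instance_record` function are hypothetical and do not reflect how the British National Bibliography is actually stored or served.

```python
from collections import defaultdict

AUTHOR = "Ranganathan, S.R. (Shiyali Ramamrita), 1892-1972"
LAWS = "The five laws of library science / S.R. Ranganathan"
COLON = "Colon classification / S.R. Ranganathan"

# Each statement is (subject, property, value); a value that is itself a
# subject elsewhere is what links one record to another.
triples = [
    (AUTHOR, "family name", "Ranganathan"),
    (AUTHOR, "given name", "S.R."),
    (AUTHOR, "has created", LAWS),
    (AUTHOR, "has created", COLON),
    (LAWS, "creator", AUTHOR),
    (LAWS, "language", "eng"),
    (LAWS, "is part of", "Ranganathan series in library science; no 12"),
]


def instance_record(subject):
    """Gather every property and value stated about one subject."""
    record = defaultdict(list)
    for s, p, o in triples:
        if s == subject:
            record[p].append(o)
    return dict(record)


# Follow a link from the author record to one of the works he created.
first_work = instance_record(AUTHOR)["has created"][0]
print(instance_record(first_work))
```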

Explicit specifications of conceptualizations are important if computers are to successfully communicate with one another without ambiguity, and there is less ambiguity and more scope for drawing inferences if the explicit specifications build upon one another in a more formal manner. ‘Formal’ rather than ‘explicit’ is used in a number of definitions of ontologies: ‘An ontology is a formal specification of a shared conceptualization’ (Borst, 1997, 11); ‘Ontologies are formalized vocabularies of terms, often covering a specific domain and shared by a community of users. They specify the definitions of terms by describing their relationships with other terms in the ontology’ (W3C, 2012). Others, however, have preferred to combine the two terms: ‘An ontology is a formal and explicit specification of a shared conceptualization’ (Jakus et al., 2013, 29). Whilst a formal ontology would seem to necessitate an ontology being explicit, an explicit ontology does not necessarily need to be particularly formal. The use of relationships in defining terms is a particularly important part of the semantic web due to its distributed nature, with organizations likely to be adhering to different vocabularies.

As well as the richness of the relationships and their explicitness, there is another distinctive feature of ontologies that is widely acknowledged: that they should be a representation of the structure of knowledge, not just a set of indexing terms. Willer and Dunsire (2013, 112) define an ontology as ‘a formal representation of the structure of knowledge and information’, and Allemang and Hendler (2011, 1) point out that semantic models are sometimes called ontologies. Although Harpring (2013) acknowledges certain similarities between thesauri and taxonomies and ontologies, she considers them to have fundamentally different goals:

…ontologies use strict semantic relationships among terms and attributes with the goal of knowledge representation in machine-readable form, whereas thesauri provide tools for cataloguing and retrieval.
Harpring, 2013, 26

The goals of knowledge representation and information retrieval do not have to be mutually exclusive, however, and the same ontology may be used for both. In fact the richness of the relationships may allow for far richer querying and information retrieval. Within this book a fairly broad definition of ontology, albeit not quite as broad as that of Gruber (1993), is taken:

An ontology is a formal representation of knowledge with rich semantic relationships between terms.

Such ontologies may be more or less formal, depending on the extent to which they define terms with relation to one another and incorporate axioms, and no distinction is made as to whether an ontology is designed either for information retrieval or as a knowledge base. Such a simple definition, however, glosses over the parts that comprise an ontology.

The parts of an ontology

The definition of an ontology provided above is designed to be inclusive, although it is sometimes necessary to distinguish between different ontologies that fall within this definition. As with Willer and Dunsire’s (2013) definition, the term is sometimes used to distinguish the structure of the ontology from the instances. For example, a book ontology might not be expected to include any information about particular books, but rather provide the necessary structure for describing books and the relationships between them and associated types of objects. In other situations an ontology might refer to both the structure and the instances, in much the same way as a thesaurus of place names includes the names of places, not just the possible relationships between them.

Whether an ontology developer is interested primarily in classes or instances may be expected to differ considerably depending on the discipline. For example, Arp, Smith and Spear, who are primarily interested in the representation of scientific research, believe an ontology is ‘concerned with representing universals’ (2015, 17). However, within the arts and humanities it may be the particular facts that are important rather than the general theories, and the general theories do not necessarily have widespread agreement.

The W3C Library Linked Data Incubator Group (2011) makes a distinction between metadata element sets and value vocabularies within data sets, with the metadata element set providing the structure for holding the information (e.g., Dublin Core element set) and the value vocabularies providing the values for these elements (e.g., an authority list of author names or place names). This book also distinguishes between the structure and the values of ontologies, although it uses slightly different terminology:

• ontology element set
• ontology instances.

The ontology element set and ontology instances combine to form an ontology data set or knowledge base. The term metadata is one that is already overburdened within the information profession, and may cause confusion when distinguishing between more traditional approaches to cataloguing and the rich semantic nature of ontologies. Metadata is also strongly associated with a particular type of record within the information profession (e.g., a bibliographic record describing a book), and it is important that ontologies are more inclusive than this.
‘Instances’ is a more inclusive term than ‘value vocabulary’, which seems primarily appropriate for existing controlled vocabularies, whereas an instance may be used to refer to any concept or thing within an ontology. A concept is generally an abstract idea that is then given a label, some of which are more concrete than others (e.g., ‘Paris’ may be considered a more concrete concept than ‘Love’), but which are nonetheless abstract. Concepts form the basis of most traditional knowledge organization systems, but ontologies can also deal with more concrete things. As well as the abstract idea of Paris, the one that each of us holds in our minds, with associations of romantic getaways, literary salons or fashion shows, there is the actual physical city with specific boundaries, activities and population at any particular moment. Concepts and things are often blurred within ontologies, but there is nonetheless a wide range of information associated with any particular concept or thing that is not part of many controlled vocabularies. Following Hedden’s (2010, 69) use of ‘term record’, the term ‘instance record’ is used here to describe all the pieces of information associated with a particular concept or thing, or ‘resource record’ within the context of the semantic web.

For ease of reading, once an ontology element set or an ontology data set has been introduced as such in this book, the subsequent text may simply refer to it as an ‘ontology’ or either an ‘element set’ or a ‘data set’. Ontologies differ greatly, but all are formal representations of knowledge with rich semantic relationships between terms.

Types of ontology

Just as we reach a point where the reader is likely to believe they have an understanding of what an ontology consists of, it is necessary to introduce a range of additional terminology that has been adopted to describe types of ontologies. Here we briefly describe four of them: lightweight ontologies; upper ontologies; application profiles; and ontology languages.

Usability is an important consideration when it comes to the creation of ontologies, but as Murdock, Buckner and Allen (2012) ask: ‘. . . usability by whom or by what?’ Some have argued that ontologies are ‘unsuited to the rough-and-tumble of real-world applications once they get beyond a certain level of complexity’ (Brewster and O’Hara, 2007, 565). Lightweight ontologies are ontologies that are designed for ease of use, processable by machines but also accessible to humans, focusing on core classes (i.e., types of entities) and properties rather than constraints and axioms (Rocha da Silva et al., 2014). These may be particularly important in the humanities, where concepts are far less concrete or widely agreed upon. It is lightweight ontologies that have the widest use, especially on the semantic web, and are the type of many of the ontologies within this book.

An upper ontology (also known as a foundation ontology) is a general all-inclusive ontology that can theoretically connect all others. Such an ontology can aid ontology interoperability and alignment, and provide a starting point for developing more specific domain ontologies (Opalički and Lovrenčić, 2012). Examples of upper ontologies include Suggested Upper Merged Ontology (SUMO) (www.adampease.org/OP), OpenCyc (www.cyc.com/platform/opencyc) and the Basic Formal Ontology (http://ifomis.uni-saarland.de/bfo).
Whether a single, universal ontology is feasible or desirable for representing the myriad of views and perspectives from different domains is open to debate, and is often ignored in the linked data approach to a semantic web. In this work the focus is less on upper ontologies, and more on what may be referred to as middle-level ontologies, those that are not designed to be universal but are nonetheless designed to accommodate data from a large number of domains. These include Europeana Data Model and CIDOC-CRM, both of which are returned to in Chapter 3, along with one upper ontology, the Basic Formal Ontology.

Application profiles have been defined as: ‘. . . schemas which consist of data elements drawn from one or more namespaces, combined together by implementors, and optimized for a particular local application’ (Heery and Patel, 2000). They reflect the practical application of ontologies to meet real-world needs that may differ considerably from strict standards described in the original documentation. Increasingly, however, attempts have been made to accommodate the differences in the requirements of the standard makers and the implementers. Dublin Core Terms were developed with application profiles and the semantic web in mind (Baker, 2012), whilst Resource Description and Access (RDA) has both constrained and unconstrained properties, with the unconstrained properties being independent of the overarching Functional Requirements for Bibliographic Records (FRBR) model and having no explicit range or domain. Dublin Core Terms, RDA, and the FRBR model are all returned to in Chapter 3.

There are also a range of ontology languages, or meta-ontologies (Stewart, 2011, 126), ‘formal languages used to construct ontologies’ (Kalibatiene and Vasilecas, 2011). Each of these languages may allow for different levels of expressiveness and comprehensiveness, and there have been a number of comparisons of the different languages over the years (e.g., Gómez-Pérez and Corcho, 2002; Kalibatiene and Vasilecas, 2011).
Whilst there are a number of traditional ontology languages and web-based ontology languages, and there will undoubtedly be new entrants into the market in the future, the ontology languages focused on in this book are primarily the W3C recommendations for the semantic web: Resource Description Framework (RDF), RDF Schema (RDFS), and Web Ontology Language (OWL). In Warren et al.’s (2014) survey of ontology use, of the 65 respondents answering the question of which language they used, 58 stated OWL, 56 RDF and 45 RDFS. There are well known ontologies that have been published in other languages, e.g., SUMO was written in SUO-KIF, itself a variation of the Knowledge Interchange Format (KIF) (Niles and Pease, 2001), and OpenCyc makes use of CycL (Matuszek et al., 2006), but the potential of the semantic web for bringing together distributed data means that there is often a semantic web version of the ontologies too. The structuring of the semantic web is returned to in more detail in Chapter 2.

Ontologies, metadata and linked data e definition of an ontology provided above overlaps with both metadata and linked

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 16

16

pRACTICAL OnTOLOgIES

data, and it is important to recognize the similarities and the differences between the different concepts, and how they overlap. Metadata is generally defined as ‘data about data’, and information professionals within cultural heritage institutions have traditionally focused heavily on the creation of metadata to describe the objects within their respective collections. Extensive standards and methodologies have independently been created for cataloguing and classifying objects within each type of institution, whether archive, museum, or library, with the metadata elements reflecting those aspects considered most important within the community’s culture. is may be the importance of the fonds to the archival community, reflected in the ability of Encoded Archival Descriptions (EADs) to not only describe an archive collection but also increasingly smaller parts of the collection in a hierarchical fashion, or through the extensive history of a specific object that is possible through the Categories for the Description of Works of Art (CDWA). e traditional distinction between metadata and data breaks down, however, as we move from real-world objects to digital objects and many (oen computer scientists) will say there’s no point in distinguishing between the two, it’s all just data. As van Hooland and Verborgh (2014, 3) put it: ‘Just as you can always add an extra Lego piece on top of another, you can always add another layer of metadata to describe metadata.’ Within this work the term metadata is limited to its traditional sense, a set of elements used to describe a distinct resource, not a part of the resource itself. Where the resource that is being published is a dataset, and if the dataset has been published as linked data and the metadata has been published as linked data, then it may be meaningless to distinguish between the two. 
Linked data is the best practice for publishing structured data on the web (van Hooland and Verborgh, 2014), which is generally agreed to be in accordance with the four linked data principles set out by Tim Berners-Lee:

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
4. Include links to other URIs so that they can discover more things
Berners-Lee, 2006

Linked data is an approach to data interoperability which offers an alternative to having an upper ontology (Murdock, Buckner and Allen, 2012). It cuts through the complexity of understanding the relationships between different terms for types of object and attributes used within different data sets by allowing the direct linking between the terms and instances rather than understanding the relationship via an upper ontology. It is not necessary to know that ‘watercolourist’ and ‘oil painter’ are linked via the concepts ‘painter’ or ‘artist’ – instead the person J. M. W. Turner in one data set may be linked to the person J. M. W. Turner in the other directly.

It is important to recognize, however, that not all ontologies are encoded as linked data, and not all linked data is an ontology. An ontology does not have to be published on the web or necessarily follow the graph data model of the semantic web’s Resource Description Framework (RDF); instead it may only be used on a private network (or even a single computer) and follow a proprietary format. Alternatively, a wide variety of data may be published as linked data without being an ontology, although when linked data may be considered an ontology and when it isn’t is open to debate.

Two factors that may be used in distinguishing between linked data that is an encoded ontology and linked data that isn’t an ontology are dynamism and exhaustiveness. An ontology is a formal representation of knowledge – it is not the same as a dynamic database of information; whereas the library catalogue may be considered an ontology data set or knowledge base, with rich relationships between authors and their works, the circulation aspect of an integrated library system would not be. ‘Formal’ also suggests that an ontology is not an ad hoc piece of data marked up as linked data; marking up the relationships between all the members of the Pre-Raphaelite Brotherhood in accordance with a particular element set might be considered an ontology, whereas someone marking up the contact details on their website would not be (although the element set used to mark up the contact details might be).
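The direct linking of instances across data sets can be sketched as follows. This is a simplified illustration under stated assumptions: the two data sets, the example.org and example.net identifiers, and the `same_as` list are all hypothetical, and a real linked data implementation would use RDF and dereferenceable HTTP URIs rather than Python dictionaries.

```python
# Two independently produced data sets that describe the same person using
# different identifiers and different occupation terms (all hypothetical).
dataset_a = {
    "http://example.org/a/turner": {
        "name": "J. M. W. Turner",
        "occupation": "watercolourist",
    },
}
dataset_b = {
    "http://example.net/b/person42": {
        "name": "J. M. W. Turner",
        "occupation": "oil painter",
    },
}

# A direct instance-to-instance link; no shared 'painter' or 'artist'
# concept from an upper ontology is needed to connect the two records.
same_as = [("http://example.org/a/turner", "http://example.net/b/person42")]


def merged_view(uri):
    """Combine everything known about a URI and any URI linked to it."""
    uris = {uri}
    for left, right in same_as:
        if uri in (left, right):
            uris.update((left, right))
    view = {}
    for source in (dataset_a, dataset_b):
        for u in uris:
            for prop, value in source.get(u, {}).items():
                view.setdefault(prop, set()).add(value)
    return view


print(merged_view("http://example.org/a/turner"))
```

The sketch only follows a single link; following chains of links across many data sets is where dedicated linked data tooling becomes necessary.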

What can an ontology do?

Hedden (2010, 15) identifies three principal purposes for taxonomies, each of which equally applies to ontologies: indexing support, retrieval support, and organization and navigation support. In addition, an ontology can also act as a knowledge base.

Indexing support

Despite advances in automatic indexing, human cataloguing and indexing continues to be an important part of the information profession, and controlled vocabularies can ensure consistency in the terms that are applied. An ontology enables an indexer to think more broadly about the terms that are applied, with a wider range of associated terms applicable.

Retrieval support

Information retrieval is the other side of indexing and cataloguing; the same terms that are used to index a document can then be used to retrieve it. The ontology, however, has a couple of advantages over other controlled vocabularies: less ambiguity and the potential of complex queries and inference.

All controlled vocabularies are designed to be as unambiguous as possible, distinguishing between potentially confusing terms through the use of subdivisions, attributes, and scope notes. They are, nonetheless, subject to human error, both in their design and their implementation, and a richer set of relationships with other terms offers less room for ambiguity.

The rich set of relationships within ontologies also allows for more complex queries to be created for information retrieval. Whereas traditional search is built upon Boolean operators and faceted search, ontologies allow for increasingly complex graph matching. Ontologies can be represented by a graph consisting of concepts and the relationships between them; for example, Figure 1.2 shows the twelve apostles of Jesus and the relationships between them as a graph.

Figure 1.2 A graph of Jesus and his twelve apostles
[Graph: Jesus is linked by ‘has Apostle’ to Philip, Bartholomew, John, Thomas, Matthew, James, James, Thaddaeus, Andrew, Simon, Simon and Judas Iscariot; ‘has Brother’ links join two pairs of apostles.]

For the sake of ease, within Figure 1.2 each of the people is represented by his name rather than a unique identifier which has a name as an attribute, and the fact that Simon (brother of Andrew) was subsequently called Peter is overlooked. This simple graph only includes two types of relationship, ‘has Apostle’ and ‘has Brother’, and yet already graph matching enables the retrieval of results for more complex queries. If such an ontology had been used in the cataloguing of a set of religious texts that have been variously ascribed to Jesus and his apostles it would now be possible to retrieve results as well as information from the ontology, by matching query graphs against the knowledge graph and including variables for the unknown data that the query should retrieve. Matching the following graph against the graph of Jesus and his twelve apostles would retrieve all the apostles for the unknown VARIABLE:

[Jesus] has Apostle [VARIABLE]

Simon, Andrew, James and John would be found to match the unknown VARIABLE 1 (and VARIABLE 2) in the following graph:

[Jesus] has Apostle [VARIABLE 1]
[VARIABLE 2] has Brother [VARIABLE 1]
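The matching of a query graph against a knowledge graph can be sketched in a few lines. This is an illustrative sketch rather than how any particular semantic web query engine works: the `match` function is hypothetical, variables are marked with a leading ‘?’ by convention, the labels used to tell the two Simons and the two Jameses apart are assumptions, and ‘has Brother’ is assumed to be recorded in both directions.

```python
APOSTLES = [
    "Philip", "Bartholomew", "John", "Thomas", "Matthew",
    "James (brother of John)", "James (other)", "Thaddaeus",
    "Andrew", "Simon (brother of Andrew)", "Simon (other)",
    "Judas Iscariot",
]
BROTHER_PAIRS = [
    ("Simon (brother of Andrew)", "Andrew"),
    ("James (brother of John)", "John"),
]

# The knowledge graph as (subject, relationship, object) statements.
graph = [("Jesus", "has Apostle", name) for name in APOSTLES]
for a, b in BROTHER_PAIRS:
    graph.append((a, "has Brother", b))
    graph.append((b, "has Brother", a))  # assumed symmetric


def match(patterns, triples, binding=None):
    """Yield every assignment of variables that satisfies all patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for part, value in zip(first, triple):
            if part.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(part, value) != value:
                    break
            elif part != value:
                break
        else:
            yield from match(rest, triples, candidate)


# Query 1: all apostles.
apostles = {b["?x"] for b in match([("Jesus", "has Apostle", "?x")], graph)}

# Query 2: apostles who have a brother.
with_brother = {
    b["?v1"]
    for b in match(
        [("Jesus", "has Apostle", "?v1"), ("?v2", "has Brother", "?v1")],
        graph,
    )
}
print(len(apostles), sorted(with_brother))
```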

The graph matching doesn’t have to be built on explicit relationships alone, but may also be built on inferred relationships. ‘Inference’ refers to the drawing of new relationships from a data set based on existing relationships and a set of rules. At its simplest it may be an understanding of what type of thing an entity is, based on its relationship with something else. For example, if in a bibliographic ontology the information that Charles Dickens ‘has written’ Hard Times is encoded, and the relationship ‘has written’ is only being used to express the relationship between an author and a work, then the fact that Charles Dickens is an author and Hard Times is a work can be inferred from the information. More extensive rules can allow for greater inference. For example, a genealogy ontology may encode two facts: that a person has the son Cain, and that the same person has the son Abel. If, as is normally the case, the ‘has son’ relationship is only used where the target is a male, it may be inferred that both Cain and Abel are male. An additional rule stating that sons of the same parent are brothers would also allow the fact that Cain and Abel are brothers to be inferred.
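The rules described above can be sketched as a simple function that derives new statements from existing ones. This is a minimal sketch under stated assumptions: the identifier `PARENT` is a hypothetical placeholder for the shared parent, the relationship labels ‘is a’ and ‘has brother’ are invented for the example, and real ontology languages express such rules declaratively rather than in procedural code.

```python
PARENT = "parent-1"  # hypothetical identifier for the shared parent

facts = {
    ("Charles Dickens", "has written", "Hard Times"),
    (PARENT, "has son", "Cain"),
    (PARENT, "has son", "Abel"),
}


def infer(known):
    """Apply three rules and return only the newly inferred statements."""
    inferred = set()
    for s, p, o in known:
        if p == "has written":
            # Rule 1: 'has written' only ever links an author to a work.
            inferred.add((s, "is a", "Author"))
            inferred.add((o, "is a", "Work"))
        elif p == "has son":
            # Rule 2: the target of 'has son' is always male.
            inferred.add((o, "is a", "Male"))
    # Rule 3: sons of the same parent are brothers.
    sons = [(s, o) for s, p, o in known if p == "has son"]
    for parent_a, son_a in sons:
        for parent_b, son_b in sons:
            if parent_a == parent_b and son_a != son_b:
                inferred.add((son_a, "has brother", son_b))
    return inferred - known


for statement in sorted(infer(facts)):
    print(statement)
```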

Organization and navigation support

‘Organization and navigation support’ is about the ability to find information through browsing rather than searching, following the relationships between terms to find related concepts. For example, the online store Amazon.com has an extensive taxonomy through which a shopper may browse all the way from the general to the highly specific:

Books
  Politics & Social Sciences
    Social Sciences
      Library & Information Science
        Library Management

Without much experience of a particular taxonomy it may be difficult to find the desired subject in an extensive taxonomy. Different people will inevitably make different decisions about the structure of a taxonomy for similar materials, and users of the taxonomy will have to learn the taxonomists’ idiosyncrasies. For example on the Amazon.co.uk site ‘Library & Information Sciences’ is found under ‘Reference’ rather than ‘Social Sciences’ (as it is on Amazon.com) and has no further subdivisions:

Books
  Reference
    Library & Information Sciences

– whilst Amazon.ca contains 18 narrower terms under ‘Library & Information Science’ in comparison to Amazon.com’s five. Although a taxonomy may be wrong in many different ways, there is no single correct taxonomy.

An ontology has a more complex set of relationships than a thesaurus, which creates additional challenges for enabling the browsing of resources. Whereas a thesaurus may be kept separate from the content, e.g., running down the side of the page, an ontology may be incorporated throughout the structure of the page. For example, the BBC has developed the Programmes Ontology element set (www.bbc.co.uk/ontologies/po) to facilitate access to the vast data set about the corporation’s programme output and associated individuals; this information is encapsulated within a whole web page rather than one small part of it. Visiting the regular URI for the long running radio soap opera The Archers (www.bbc.co.uk/programmes/b006qpgr) will provide a typical HTML page of information about the series; adding .rdf to the end (www.bbc.co.uk/programmes/b006qpgr.rdf) will provide the underlying information in a machine-readable format.

The ontology as a knowledge base

An additional purpose, unique to ontologies amongst controlled vocabularies, is the ontology as a knowledge base. The rich web of knowledge within an ontology and the ability of inferences to be drawn on existing relationships mean that ontologies can be a rich store of knowledge, not just a means to retrieve knowledge from resources indexed with a particular ontology. Certain general ontologies, such as DBpedia, draw together a wide range of information into one data set, and may be queried to produce results in a form that has not been compiled previously.

Current approaches to information retrieval are limited in their ability to discover new information (Stock et al., 2012). The use of an ontology as a knowledge base, as well as increasingly sophisticated information retrieval, is also likely to help with the discovery of undiscovered public knowledge. Undiscovered public knowledge is the idea that the discovery of new knowledge does not have to be based on the investigation of the real world of physical objects and events, but may also come through the interrogation of objective knowledge. Swanson (1986) identifies three forms this undiscovered public knowledge may take:

1) A hidden refutation: the hypothesis and its refutation may not both be known to any one person.
2) A missing link in the logic of discovery: if no one person knows that A causes B, and B causes C, then the inference that A causes C cannot be known.
3) Combination of multiple tests: a meta-analysis of multiple weak tests may nonetheless provide a strong result.
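The second of these forms, the missing link, lends itself to a small sketch. The following is illustrative only: the claims, the source labels and the `missing_links` function are hypothetical, and treating ‘causes’ as freely chainable is a simplifying assumption that real causal claims would rarely support without further scrutiny.

```python
# Each claim records who states it, so that links known only across
# different sources can be identified (all values are hypothetical).
claims = [
    ("A", "causes", "B", "paper-1"),
    ("B", "causes", "C", "paper-2"),
]


def missing_links(statements):
    """Suggest 'A causes C' where A->B and B->C come from different
    sources and A->C has not been stated by anyone."""
    stated = {(s, o) for s, _, o, _ in statements}
    suggestions = set()
    for s1, _, o1, source1 in statements:
        for s2, _, o2, source2 in statements:
            chained = o1 == s2 and source1 != source2
            if chained and s1 != o2 and (s1, o2) not in stated:
                suggestions.add((s1, o2))
    return suggestions


print(missing_links(claims))
```

Such a function produces candidates for investigation, not conclusions: each suggested link would still need to be assessed by someone with knowledge of the domain.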

Each of these is fundamentally an information retrieval problem: ensuring that both hypothesis and refutation are found by a search; ensuring subsequent statements are found; ensuring that all available tests of sufficient quality are identified. Ontologies can undoubtedly improve information and knowledge retrieval, and help with the mining of undiscovered public knowledge, in an increasingly automated fashion.
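The ‘missing link’ case can be illustrated with a small sketch. The Python below is only a toy, assuming a handful of hypothetical causal claims (A, B, C, D and the source names are placeholders, not real findings): once claims held by separate sources are pooled in one store, the links that no single source states can be derived mechanically.

```python
# Toy "literature": each causal claim is known to a different source.
# The claims and source names are hypothetical placeholders.
claims = [
    ("A", "causes", "B", "paper-1"),
    ("B", "causes", "C", "paper-2"),
    ("C", "causes", "D", "paper-3"),
]


def infer_missing_links(claims):
    """Return causal links implied by chaining claims but stated by no source."""
    direct = {(s, o) for s, p, o, _source in claims if p == "causes"}
    known = set(direct)
    changed = True
    while changed:  # keep chaining until nothing new can be derived
        changed = False
        for a, b in list(known):
            for c, d in list(known):
                if b == c and a != d and (a, d) not in known:
                    known.add((a, d))
                    changed = True
    return sorted(known - direct)


inferred = infer_missing_links(claims)
print(inferred)
```

The sketch treats ‘causes’ as transitive, which is itself an assumption; in a real knowledge base such an inference would be a candidate for investigation rather than an established fact.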

Ontologies and information professionals is book is being published at a pivotal point in the history of ontologies. On the one hand the web and the development of semantic web technologies have provided the opportunity for ontologies to be adopted by more people in more places than ever before – bringing together data from around the world into one huge data set that can be queried by anyone. On the other hand the ideals of a semantic web have had to adapt to the practicalities of human abilities, recognizing the importance of publishing data even if it is not accompanied by robust formal ontologies. is book will not only emphasize the importance and potential of ontologies, but also the importance of the community of information professionals contributing to

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 22

22

pRACTICAL OnTOLOgIES

the development of new, and increasingly useful, ontologies. Murdock, Buckner and Allen (2012) point out that one of the problems with ontology development is the need for ‘double experts’, those with knowledge of ontology design and subject domains. e community of information professionals have a long tradition of being ‘double experts’, oen coupling a postgraduate information science degree with a subject specialism, and are ideally placed for a role in facilitating access to the web of data and the development of ontologies. e role is particularly important if we are to avoid the risk that an ontologist’s imposition of a domain ontology masks how practitioners construct meaning (Pike and Gahegan, 2007). Knowledge and experience of using knowledge organization systems is a prerequisite for many jobs within the information profession, and the need for knowledge of ontologies more specifically, is only likely to increase in the future. As well as taxonomists and ontologists, for whom the development and maintenance of controlled vocabularies may be a full-time role, knowledge and experience of ontologies is also necessary as part of a wider skill set in cataloguing, metadata and curation roles. For those working as a taxonomist for a global information service, a metadata librarian in a university library, a digital asset cataloguer in a commercial company or a records manager in a non-profit organization, it is increasingly difficult to overlook the importance of ontologies. e focus of the ontologies in this book is on those that are being used on the semantic web. ere are, of course, many bespoke and proprietary ontologies used within commercial organizations, attempting to bring together the disparate information created by departments and units, but those that are of greatest interest are those that provide the opportunity to share more data than ever before and develop new insights from across the world.

Alternatives to ontologies

It is important to recognize that ontologies have limitations, and that there are alternative ways of capturing and analysing data. Some of the limitations can be traced to the fundamental assumptions that are made when encoding knowledge within ontologies. Brewster and O’Hara (2007) note two such assumptions: first, the monolithic nature of knowledge that is continually added to; and second, that concepts are the fundamental units of ontologies, and these are manipulated with language. Although there may be few, if any, Kuhnian paradigm shifts (Kuhn, 1970) that invalidate the whole of an ontology, there will nonetheless be changing perspectives on the meanings and relationships of individual concepts. This is especially true outside the sciences, where the meaning of concepts and the relationships with associated concepts can be open to vigorous debate. There is also much that is difficult or impossible to put adequately into words – so-called tacit knowledge (Polanyi, 1966) – although Shadbolt and Smart (2015) suggest that rather than tacit knowledge being seen as something that is impossible to articulate, it should be seen as something that is more easily articulated in some situations than others.

Approaches to knowledge representation can be broadly categorized as either top-down or bottom-up (Pike and Gahegan, 2007). Whereas ontologies can often be considered top-down models of the world, especially when considering the creation of universal ontologies such as OpenCyc, the development of linked data and the semantic web allow for a more bottom-up approach with competing ontologies and potentially conflicting perspectives. However, even bottom-up approaches to capturing knowledge from the data that is available have limitations. The sheer quantity of information available on the web provides, and necessitates, alternative ways of capturing data, through automatic reading and natural language processing (NLP). NLP can be used both to extract terms for an ontology or thesaurus, and to apply terms from an ontology or thesaurus during indexing; the difference between structured and unstructured data is becoming increasingly blurred (van Hooland and Verborgh, 2014). NLP has its limitations, however, and depending on the content and purpose of the NLP it is better categorized as a semi-automatic rather than an automatic process. NLP is not the principal subject of this book, but it is likely to play an increasing role in the development of ontologies in the future, and the subject is returned to in Chapter 5.

Neither the limitations of ontologies, nor the alternatives, diminish the need for, or importance of, ontologies. Rather, they help us understand where and when ontologies are appropriate. It may be that in some situations a simpler form of controlled vocabulary is more appropriate, either a thesaurus or an authority list. Data may be better stored in a list, a spreadsheet or a relational database than as a graph, whilst certain types of tacit knowledge may be better captured through video than by trying to put it into words. Brewster and O’Hara (2007) note that criticisms have been made that ontologies demand too much work and are too rigid, but such criticisms have been made about many core information activities, such as cataloguing and classification in the age of the web, and what we find is that most often new technologies complement rather than replace existing technologies. Rather than search engines replacing the library catalogue, the library catalogue is increasingly integrating its own information services with the web and, increasingly, the semantic web. Rather than ontologies replacing earlier forms of controlled vocabularies, they complement them, providing an increasingly powerful tool for information retrieval and knowledge representation.


The aims of this book ere are three main aims for this book. e first is to demonstrate to the information professional the importance of ontologies for knowledge discovery. e second is to demonstrate the important contribution information professionals can make to the development of ontologies. Finally, the book aims to provide a practical introduction to the development of ontologies for information professionals. is introductory chapter will, hopefully, already have gone some way to demonstrating the importance of the development of robust and widely used ontologies in the fight against information overload, and the role of the information professional in the process. ese ideas will continue to be developed and reinforced throughout the rest of the book. In addition to demonstrating the importance of ontologies and the role of the information professional, the book is also designed to be a practical introduction. It will introduce some of the existing dominant ontologies that are likely to be of interest to the information profession, as well as the methods and tools necessary for building new ontologies and interrogating existing ontologies. LaPolla’s (2013) survey found the implementation of semantic web compliant catalogues was hindered by a lack of funding, best practice and awareness of the associated concepts. Whilst the book can do little about the lack of funding, it will contribute to both discussion on best practice and increase familiarity with many of the basic concepts. Although a majority responding to LaPolla’s survey had some familiarity with semantic web concepts, this fact is clouded by the fact that it was a self-selecting survey and it seems likely that those with little interest in the semantic web didn’t bother with the survey. 
Even amongst those who completed the survey, whereas the vast majority were either very familiar or somewhat familiar with the concept of the semantic web (90.16%) and linked data (95.52%), familiarity with more specific technologies necessary for implementation were far lower: Web Ontology Language (OWL), 53.21%; Simple Knowledge Organization Systems (SKOS), 43.59%. No single book could provide an exhaustive introduction to the practicalities of ontology use and development. Whole books have been written on technologies that have been covered here in one or two pages; there is a huge variety of soware available for ontology development; new ontologies are being developed (as well as old ones falling into disuse); and old standards are changing while new ones are introduced. Nonetheless, the underlying methods of ontology development change more slowly than the specifications, and by focusing on the underlying theory the skills related to one set of technologies can be applied to others.


The structure of this book e rest of this book consists of six chapters, from introducing the semantic web and some existing ontologies, through adopting, building and interrogating ontologies, to the future of ontologies:

Chapter 2 – Ontologies and the semantic web Ontologies have gained added significance in recent years through the adoption of an increasingly semantic web. Chapter 2 provides an introduction to the semantic web and the role of ontologies, and how ontologies have been increasingly adopted in a wide variety of libraries as well as other cultural heritage institutions and commercial organizations.

Chapter 3 – Existing ontologies ere is a wide variety of ontologies that have been developed, and knowledge of the dominant ontologies, their applications and their differences is increasingly essential to the information professional. Chapter 3 considers some of the main ontologies, including those ontologies used for representing ontologies, those widely adopted by libraries and those widely used on the web.

Chapter 4 – Adopting ontologies e reuse of existing ontologies is important for both the integration of data across different systems and to avoid the repetition of work. Chapter 4 considers the tools that are available for identifying existing ontologies, how the ontologies (or elements thereof) can be combined in the creation of application profiles, and some of the criteria that should be considered when selecting ontologies.

Chapter 5 – Building ontologies It is increasingly important that information professionals are not only users of existing ontologies, but that they build their own ontology for particular applications. Chapter 5 provides both a methodology for building an ontology and an overview of some of the tools that are available, before leading the reader through the development of a simple ontology with Protégé, the most popular (and free) soware for ontology development.


Chapter 6 – Interrogating ontologies

Ontologies are not only of interest for the structure they provide, but also for the data that they contain. Chapter 6 provides an overview of tools available for interrogating semantic web ontologies, both through Simple Protocol and RDF Query Language (SPARQL) and web crawlers, to gain new insights.

Chapter 7 – The future of ontologies and the information professional

The final chapter looks to the future of ontologies and the role of the information professional in their development and use. The future of ontologies will undoubtedly be a mixture of lightweight and more formal ontologies, and their development is likely to be integrated with other technologies such as natural language processing and potentially crowdsourcing workflows. The contribution of the library and information professional to ontology development also has the potential to change, expanding from the bibliographic ontologies that will undoubtedly occupy them in the short term to the development of niche subject-specific ontologies in the long term.


CHAPTER 2

Ontologies and the semantic web

Introduction

Interest in ontologies has grown rapidly in recent years due to the adoption of an increasingly semantic web. The web is by no means the only place where ontologies may be implemented, but it is the use of ontologies on the semantic web that is the primary focus of this book, as they have the greatest potential, and as such are likely to be of greatest interest to the modern library and information professional. The chapter starts with an introduction to the semantic web and its most recent incarnation as linked data, before considering more closely the standards that have been adopted for structuring the semantic web. Finally, the last part of the chapter looks at how ontologies have been increasingly adopted in a wide variety of libraries as well as other cultural heritage institutions and commercial organizations.

The semantic web and linked data e semantic web is about moving from a web of documents to a web of data, from one that is primarily designed to be read by humans to one that can be read by machines. It first started gaining widespread attention in 2001 with publications in Nature (Berners-Lee and Hendler, 2001) and Scientific American (Berners-Lee, Hendler and Lassila, 2001). e web has put vast quantities of information at our fingertips, but much of this information is unstructured and it requires a lot of effort to gather and analyse the information resources that we need. For a simple informational query we pay little attention to the effort required. If we want to know what time a show starts it is generally simple enough to enter the name of the theatre and browse the pages for show times. But as queries require collecting data from multiple sites, the task can quickly become arduous. Wanting to know which shows are playing in a five-mile radius of where I am and which start aer 8p.m. would require aggregating information of multiple sites, or at least visiting a site that had aggregated that information on my behalf. Some types of information have many aggregating sites (e.g., hotel and holiday information), but there is a vast amount of

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 28

28

pRACTICAL OnTOLOgIES

information that may not be commercially viable for aggregation. Also the aggregators are not necessarily aggregating the information that you want aggregated; you may want to know the length of the show, the suitability for a particular age group, or the accessibility of the venue. If, however, each site makes its data available in an appropriately structured format, it becomes simple for this data to be gathered and queried automatically by a wide range of web agents and services, each of which can query for the information that they require. e original vision of the semantic web promised a future where an increasing number of online activities could be accomplished automatically as automated agents carry out tasks on people’s behalf, not only retrieving information, but potentially carrying out simple transactions, albeit non-financial ones in the short term. Despite recognition of the potential of a semantic web its initial adoption may be considered to have been quite slow: most of our online activities still require a significant amount of human involvement. Nonetheless, progress on the semantic web has been made, not only with the establishment of new specifications, but also with the establishment of a new paradigm for the publishing of data: linked data. Linked data prioritizes the publishing of data in a machine-readable format rather than the underlying concepts (van Hooland and Verborgh, 2014), and the simplicity of the approach has encouraged the publishing of data on the semantic web by a wide range of individuals and organizations. ere is also a lot of interaction by people with semantic web technologies that is hidden from view. Most people’s experience of a semantic web is not through automatic agents but through the knowledge bases that have been created by the major search engines and incorporated into their search results. 
For example, Google has been incorporating its Knowledge Graph (www.google.com/intl/bn/insidesearch/ features/search/knowledge.html) knowledge base into its search results since 2012. More recently it has been working on a Knowledge Vault, where the facts are extracted from the web automatically and it offers unprecedented collection of facts (Hodson, 2014). Nonetheless, even with such a vast collection of facts, it is organized according to the same manner as the semantic web – in RDF triples.

Resource Description Framework (RDF) e Resource Description Framework (RDF) is a conceptual model for making statements about resources through RDF triples. RDF triples are a way of expressing and relating information as three-part statements or ‘facts’ structured in a simple subject-predicate-object format. For example, in plain English, a triple could be:

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 29

OnTOLOgIES AnD ThE SEMAnTIC WEB

29

David is the subject, hates is the predicate, and Apple is the object. e idea that such simple facts may be used to encode all human knowledge may be hard to believe but, as has been observed by Novak, ‘if the structure and function of all organisms that live or have lived on earth can simply be coded by triplet sequences of four nitrogenous base pairs A, G, T and C, there is no reason for a knowledge record to be more complex’ (cited in Jakus et al., 2013). Modelling the data in such a way makes it suitable for the distributed nature of the semantic web, as ‘anyone can make statements about any resource’ (W3C, 2004). If someone else disagrees with this statement believing , or wants to add an additional associated statement such as , then there is nothing to stop them – a process that would be much more difficult if the initial data was structured in a table or a relational database on the web. Of course, having plain text can lead to ambiguity. Aer all, most people reading the triple will know many people called David (both fictional and real) and may associate multiple different objects and organizations with Apple. ere are aer all, apples the fruit, Apple the music label established by the Beatles and Apple the technology company. People reading the triple may presume that the David referred to, is the author of the book, and based on the subject of the book may presume David is more likely to have strong feelings on a technology company than on a type of fruit or a music label. But none of this is explicit, and for computers to understand the information, and for multiple graphs to be joined together, it needs to be made explicit. To make the triple explicit, URIs may be used to represent each of the ‘resources’; on the semantic web anything that can be represented is referred to as a resource. e graph in Figure 2.1 shows a graph for . Resources are represented by ovals, and literals are represented by rectangles. 
In this case the ‘hates’ relationship is expressed between URIs representing David and Apple on a fictitious social network, with label from the RDFS vocabulary used to provide human-readable labels to those URIs. e relationship also takes the form of a URI, allowing a relationship to be used from an existing ontology (albeit in this case also a fictitious one). Once subjects, predicates, and objects are unambiguous, multiple facts can be combined into a single graph that can then be queried by a computer. For example, David

Apple

rdfs:label

rdfs:label

www.socialnetwork.com/David

www.relationships.com/hates

www.socialnetwork.com/AppleTech

Figure 2.1 David hates Apple graph

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 30

30

pRACTICAL OnTOLOgIES

two additional pieces of information have been added to the graph: David knows Bob; and Bob loves Apple (Figure 2.2). David

Apple

rdfs:label www.socialnetwork.com/David

rdfs:label www.relationships.com/hates

www.socialnetwork.com/AppleTech

www.relationships.com/loves

www.socialnetwork.com/AppleMusic

foaf:knows

www.socialnetwork.com/Bob rdfs:label

rdfs:label

Bob

Apple

Figure 2.2 David hates Apple, but knows Bob who loves Apple

Luckily the use of URIs means that it is possible to distinguish between Apple the music label and Apple the technology company, and we do not expect it to be a source of friction between Bob and David. In this graph the ‘knows’ relationship has been taken from the existing FOAF (Friend of a Friend) ontology. FOAF is one of the most established ontologies on the web, with many of the properties regularly being reused across the web. As popular ontologies are reused it is possible for additional tools and services to make use of the data, as people will already know what the property means. FOAF is returned to in Chapter 3.
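The way URIs allow separate sets of triples to be joined and queried can be sketched in a few lines of Python. This is a minimal illustration using plain tuples rather than a dedicated RDF library; the social network and relationship URIs are the fictitious ones from the figures (the http:// scheme is added here as an assumption), while the FOAF and RDFS URIs are the published ones.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Fictitious namespaces from the figures; the http:// scheme is an assumption.
SN = "http://www.socialnetwork.com/"
REL = "http://www.relationships.com/"
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Two graphs that could have been published independently.
graph_one = {
    Triple(SN + "David", REL + "hates", SN + "AppleTech"),
    Triple(SN + "David", RDFS_LABEL, "David"),
    Triple(SN + "AppleTech", RDFS_LABEL, "Apple"),
}
graph_two = {
    Triple(SN + "David", FOAF_KNOWS, SN + "Bob"),
    Triple(SN + "Bob", REL + "loves", SN + "AppleMusic"),
    Triple(SN + "Bob", RDFS_LABEL, "Bob"),
    Triple(SN + "AppleMusic", RDFS_LABEL, "Apple"),
}

# Because resources are identified by URI, merging is a simple set union.
merged = graph_one | graph_two


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples matching the given parts; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.object == obj)
    )


# Two different resources share the human-readable label 'Apple'.
labelled_apple = [t.subject for t in match(merged, predicate=RDFS_LABEL, obj="Apple")]

# Does David hate anything that Bob loves? The URIs show that he does not.
hated_by_david = {t.object for t in match(merged, SN + "David", REL + "hates")}
loved_by_bob = {t.object for t in match(merged, SN + "Bob", REL + "loves")}
print(labelled_apple)
print(hated_by_david & loved_by_bob)
```

Matching on labels alone would have conflated the two Apples; matching on URIs keeps them apart, which is the point the figures make.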

Classes, subclasses and properties

Whilst the semantic web is designed to let anyone make statements about any resource, ontologies are needed to constrain what can be said. This is achieved through classes, subclasses and properties. A class is a set of things with properties in common. For example, an ontology may have a ‘person’ class, an ‘event’ class, or a ‘place’ class. Properties are the attributes associated with particular classes. For example, a person is likely to have a name, and depending on the type of ontology may also have a sex, date of birth, place of birth, e-mail address, job title, or any other type of attribute that could be associated with a person. Ontologies may also state the cardinality of a property (i.e., the number of times it may be associated with a particular entity) and state the type of objects a property can have as a target. For example, place of birth may either be restricted to a literal (i.e., string of text) or a link to another resource (e.g., an entity of the place class).

Following typical semantic web style the property and class names used throughout the rest of the book make use of CamelCase (spaces between words are replaced by capital letters at the beginning of each word), class names are capitalized, and a CURIE (compact URI) style is adopted to show where a resource comes from a common data set or ontology. For example, foaf:Person refers to a class called Person from the foaf ontology, whereas foaf:familyName and foaf:age refer to properties from the same ontology. Where an example makes use of a single fictitious vocabulary or data set, and there is no advantage in coining a fake prefix, it is simply omitted, e.g., :colourOfHair would refer to a property from such a fictitious ontology.

The variety of properties that could be associated with a broad class such as Person is endless, even if many of the potential properties are highly unlikely to be of much use for most situations or apply to most people: numberOfTeeth; hasCircusSkill; hasPoliticalAffiliation; hasMurdered. Subclasses enable distinctions to be made between the properties that can be associated with different subsets of a class. For example, whilst every Person may have a name and a date of birth, hasCircusSkill may be associated with the Person subclass, Clown. An entity of the type Clown would inherit properties associated with both Clown and Person. Similarly, properties may have subproperties. For example, there may be a ‘knows’ property associated with a Person class, enabling a relationship to be expressed between one person and one or more other people. But there may also be associated subproperties of knows, e.g., hasSon, hasEmployee, hasMentor. This allows ontologies to be queried at different levels of granularity.

Decisions about what is, or is not, a class, subclass or property are not absolute and different people may make different decisions for different ontologies. If an agent only ever has one address in an ontology, then it may be that address properties can be incorporated into the agent class. If, however, an agent has multiple addresses of different types it makes more sense to group the associated properties together in a class of their own. A number of technologies are necessary in going from an abstract idea about representing a fact as a triple, collecting these triples in classes, and encoding these triples in such a manner that they are widely understood and services can be built upon them. The necessary steps are generally illustrated with a semantic web stack.
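The inheritance of properties by subclasses, and the querying of subproperties at different levels of granularity, can be sketched as follows. The mini-ontology below is hypothetical, built only from the Person, Clown and knows examples in the text, and plain dictionaries stand in for what would normally be declared in RDFS or OWL; the individuals Alice, Bob and Carol are invented for illustration.

```python
# Hypothetical mini-ontology based on the Person/Clown example in the text.
SUBCLASS_OF = {"Clown": "Person"}
CLASS_PROPERTIES = {
    "Person": {"name", "dateOfBirth", "knows"},
    "Clown": {"hasCircusSkill"},
}
SUBPROPERTY_OF = {"hasSon": "knows", "hasEmployee": "knows", "hasMentor": "knows"}


def properties_for(cls):
    """Collect the properties of a class and of all its superclasses."""
    props = set()
    while cls is not None:
        props |= CLASS_PROPERTIES.get(cls, set())
        cls = SUBCLASS_OF.get(cls)  # climb to the parent class, if any
    return props


def is_subproperty(prop, ancestor):
    """True if prop is the ancestor property itself or a descendant of it."""
    while prop is not None:
        if prop == ancestor:
            return True
        prop = SUBPROPERTY_OF.get(prop)
    return False


# Invented facts about invented people.
facts = [
    ("Alice", "hasSon", "Bob"),
    ("Alice", "hasMentor", "Carol"),
    ("Alice", "name", "Alice"),
]


def query(facts, subject, prop):
    """Return objects linked to subject by prop or any of its subproperties."""
    return sorted(o for s, p, o in facts if s == subject and is_subproperty(p, prop))


print(properties_for("Clown"))
print(query(facts, "Alice", "knows"))   # coarse query: everyone Alice knows
print(query(facts, "Alice", "hasSon"))  # fine-grained query: sons only
```

A query on knows returns people recorded under any of its subproperties, whereas a query on hasSon returns only the narrower relationship.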

The semantic web stack e semantic web stack (also known as the semantic web layer cake) is used to represent the architecture of the semantic web. It has been visualized in a number of

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 32

32

pRACTICAL OnTOLOgIES

ways since it was first proposed, and specific information about the necessary technologies for the different stages has been included as specifications have been defined and recommended. Figure 2.3 is based on the Wikipedia version of the semantic web stack as of April 2015.

User Interfaces and Applications Trust proof Unifying Logic Ontologies: OWL

Rules: RIF/SWRL

Taxonomies: RDFS Data interchange: RDF

Cryptography

Querying: SPARQL

Syntax: XML Identifiers: URI

Character set: UNICODE

Figure 2.3 The semantic web stack (based on http://en.wikipedia.org /wiki/Semantic_Web_Stack)

Identifiers and character sets: URIs and Unicode

At the bottom of the semantic web stack are a number of already widely adopted technologies for encoding the characters, identifiers and syntax of the semantic web: Unicode, URIs and XML. For the information professional, Unicode simply means that RDF is encoded in text, the sort that can be opened in Notepad on Windows or TextEdit on a Mac. The URIs (uniform resource identifiers) are globally unique identifiers, the most common of which are the URLs (uniform resource locators) that web users type in the address bar of their browser to get the web page they are interested in, e.g., http://www.bbc.co.uk. URIs also include URNs (uniform resource names) that are location-independent identifiers. For example, the URN urn:isbn:9781783300624 may refer to the book Practical Ontologies for Information Professionals, without it providing a location for information about that book. Increasingly we talk about IRIs (internationalized resource identifiers) rather than URIs, as it is no longer necessary for resource identifiers to be restricted to the characters of the Latin alphabet. Due to issues regarding the potential lack of support of non-English character sets by semantic web tools, as well as a global recognition of the Latin alphabet, URIs are nonetheless currently preferred to IRIs in the development of ontologies and the term URI is used throughout.

Although any URI may be used as part of the semantic web, and it is not necessary for anything to be returned for a particular URL, it is recommended that URIs are nonetheless dereferenceable when used for linked data (Sauermann and Cyganiak, 2008). That means that when a URI is entered for a particular resource, information associated with that resource is returned. For example, the URI http://www.davidstuart.co.uk/resource/123 may be used to refer to a resource in a data set, but trying to retrieve data from the page would return an HTTP 404 ‘file not found’ error message. Obviously it is far more useful if the page returns the associated resource record. There are, however, issues that arise from creating dereferenceable URIs that need to be considered: most notably dealing with content negotiation, and distinguishing between a web page and the resources that the page describes.

Content negotiation is the HTTP process by which an HTTP client retrieves its preferred content. For example, a web browser will generally indicate that it prefers HTML and entering a URI in a web browser retrieves an HTML version of the page (if one is available). URIs are often also used to indicate to a server that a different version of a resource is required. For example, the BBC Programmes Ontology generally provides an HTML version of a resource (e.g., www.bbc.co.uk/programmes/b006qpgr) when viewed through a web browser; however, adding .rdf to the end (www.bbc.co.uk/programmes/b006qpgr.rdf) provides the underlying information in an RDF/XML serialization (discussed in more detail below).
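How a server might decide which version of a resource to return can be sketched as below. This is a simplified illustration rather than a description of how the BBC site is implemented: the function names and the paths in the usage note are hypothetical, wildcards such as */* are ignored, and only the two media types discussed here are offered (application/rdf+xml is the registered media type for RDF/XML).

```python
# Representations this hypothetical server can offer for a resource.
AVAILABLE = {"text/html", "application/rdf+xml"}


def parse_accept(header):
    """Return the media types in an Accept header, most preferred first."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0  # q defaults to 1.0 when not given
        for field in fields[1:]:
            if field.startswith("q="):
                q = float(field[2:])
        prefs.append((-q, position, media_type))  # higher q first, then order
    return [media_type for _, _, media_type in sorted(prefs)]


def choose_representation(path, accept_header):
    """Pick a media type from the URI suffix or, failing that, the Accept header."""
    if path.endswith(".rdf"):  # an explicit suffix in the URI wins
        return "application/rdf+xml"
    for media_type in parse_accept(accept_header):
        if media_type in AVAILABLE:
            return media_type
    return "text/html"  # fall back to the human-readable page


print(choose_representation("/programmes/example", "text/html,application/xhtml+xml;q=0.9"))
print(choose_representation("/programmes/example", "application/rdf+xml"))
print(choose_representation("/programmes/example.rdf", "text/html"))
```

A browser-style Accept header yields the HTML page, a client asking for application/rdf+xml receives the data, and the .rdf suffix overrides the header altogether.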
URIs can also have an important role in distinguishing between the page and the resource that the page describes. For example, when we dereference a URI representing Queen Elizabeth II, we do not expect to retrieve the Queen herself from the internet, but rather a page of information about the Queen, and it is important to be able to distinguish between the two. There are two approaches to doing this, hash URIs and 303 URIs (Heath and Bizer, 2011). With 303 URIs (also known as slash URIs), when resource URIs are requested the server responds with a 303 See Other status code, redirecting the client to the URI of the page associated with the requested resource. This approach has been taken with the publishing of the DBpedia data set. Entering http://dbpedia.org/resource/Elizabeth_II in a browser will redirect to the page http://dbpedia.org/page/Elizabeth_II. Hash URIs make use of the fragment identifier, which may be used in URIs to distinguish between parts of a document, to distinguish between the page and the resource. For example, if DBpedia had adopted hash URIs then a resource identifier could be distinguished from a page identifier (http://dbpedia.org/Elizabeth_II) through the use of a fragment identifier, typically #this (http://dbpedia.org/Elizabeth_II#this). Although best practice should enable the distinction between pages and resources, and the use of fragment identifiers is simple to implement, often such distinctions are overlooked.

Whether URIs are dereferenceable or not, one additional decision needs to be made in the coining of URIs: whether they should be descriptive or opaque. The difference between a descriptive URI and an opaque URI is the difference between http://dbpedia.org/resource/Elizabeth_II and http://dbpedia.org/resource/0128422. Without retrieving the associated resource record the first URI is far more self-explanatory to a person reading the URIs, which may aid in both comprehending and creating RDF triples, although it should be remembered that for a computer the two URIs are equally meaningful. Whilst the comprehensibility of descriptive URIs might make it seem as though they are the natural choice, there are a number of reasons why opaque URIs might be more appropriate, and they have been incorporated in a wide range of ontologies, e.g., CHEBI (www.ebi.ac.uk/chebi) and RDA (www.rdaregistry.info). It may be that opaque URIs are adopted to allow for the evolution of terms (van Hooland and Verborgh, 2014), to prevent the promotion of any single language, or simply because it is the simplest way to ingest the data in a system.
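The two approaches to separating a resource from the page that describes it can be sketched in Python. The function names are hypothetical, and the second function only imitates a server’s answer as a (status code, location) pair instead of making a real HTTP request; the /resource/ to /page/ mapping follows the DBpedia example in the text.

```python
from urllib.parse import urldefrag


def document_for_hash_uri(resource_uri):
    """Hash URI: dropping the fragment leaves the URI of the describing page."""
    url, _fragment = urldefrag(resource_uri)
    return url


def respond_to_slash_uri(resource_uri):
    """303 URI: imitate a server reply as a (status code, location) pair."""
    if "/resource/" in resource_uri:
        # The resource itself cannot be sent, so redirect to a page about it.
        return 303, resource_uri.replace("/resource/", "/page/", 1)
    return 200, resource_uri  # already a page, so it can be served directly


print(document_for_hash_uri("http://dbpedia.org/Elizabeth_II#this"))
print(respond_to_slash_uri("http://dbpedia.org/resource/Elizabeth_II"))
```

With hash URIs the separation happens in the client, which never sends the fragment to the server; with 303 URIs it costs an extra round trip but allows each resource to have its own page.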

Syntax: XML

XML (Extensible Markup Language) is a markup language that is both machine and human readable. Although it is by no means the only format used for sharing semantic web data, and is increasingly being challenged by alternative formats, RDF/XML is the longest established of the semantic web serialization formats and most semantic web data is available in an XML format. A simple XML file to describe a book may be structured as below:

Practical Ontologies for Information Professionals David Stuart 9781783300624

Structuring the file as above, however, does not provide unique references for the element names. The term author is applied in widely different ways in academia; whereas in the humanities author might be expected to imply the person has had an active role in composing a journal article, in the sciences it can often be applied to dozens of people who have had a role in carrying out an experiment, most of whom will not have had a role in writing up the results. If the data is to be shared between multiple organizations across the web, then it is important that it is clear which vocabulary the element names are taken from. Importantly, XML allows the use of CURIEs with namespaces, which allows the referencing of common elements, so that concepts that have already been developed may be reused. For example, in the example below, dc:title combines the namespace represented by dc (i.e., http://purl.org/dc/elements/1.1/) with the identifier title so that a program parsing the file will recognize that it refers to the URI http://purl.org/dc/elements/1.1/title:

Practical Ontologies for Information

Professionals

David Stuart

9781783300624
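How a parser resolves a prefixed element name can be seen with Python's built-in XML library. The following is a sketch under stated assumptions: the root element name book and the choice of dc:creator and dc:identifier for the author and ISBN are placeholders picked for the illustration, and full_uri is a helper written for this example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# The root element name "book" is a placeholder chosen for this sketch.
record = f"""<book xmlns:dc="{DC}">
  <dc:title>Practical Ontologies for Information Professionals</dc:title>
  <dc:creator>David Stuart</dc:creator>
  <dc:identifier>9781783300624</dc:identifier>
</book>"""

root = ET.fromstring(record)

def full_uri(tag: str) -> str:
    """Turn ElementTree's '{namespace}local' form into a plain URI."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace + local
    return tag

# Map each fully expanded element name to the text it contains.
expanded = {full_uri(child.tag): child.text for child in root}
for uri, value in expanded.items():
    print(uri, "->", value)
```

The parser never sees dc:title as such; it works with the namespace URI joined to the local name, which is why two files that choose different prefixes for the same namespace are treated identically.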

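XML may also be used to encode RDF triples, in a format known as RDF/XML:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624">
    <rdf:type rdf:resource="http://purl.org/dc/elements/1.1/BibliographicResource"/>
    <dc:title>Practical Ontologies for Information Professionals</dc:title>
    <dc:creator>David Stuart</dc:creator>
    <dc:identifier rdf:resource="urn:ISBN:9781783300624"/>
  </rdf:Description>
</rdf:RDF>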

In the above example the book itself is represented by a URI: http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624. The rdf:Description element is used in conjunction with an rdf:about attribute to set the subject of the triples. The names of the child elements provide the predicates (e.g., dc:title, dc:creator, dc:identifier). The object of each triple is provided by the value in the element for literals, and by the rdf:resource attribute for resources.
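The mapping from an rdf:Description element to triples can be sketched with the standard library alone. The code below is an illustration rather than an RDF/XML parser: it handles only the single pattern just described (an element with rdf:about whose children are either literals or rdf:resource references), and the sample document and the function name simple_triples are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

rdf_xml = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624">
    <rdf:type rdf:resource="http://purl.org/dc/elements/1.1/BibliographicResource"/>
    <dc:title>Practical Ontologies for Information Professionals</dc:title>
    <dc:creator>David Stuart</dc:creator>
    <dc:identifier rdf:resource="urn:ISBN:9781783300624"/>
  </rdf:Description>
</rdf:RDF>"""

def simple_triples(document: str):
    """Read (subject, predicate, object) tuples from the simplest RDF/XML pattern."""
    triples = []
    for node in ET.fromstring(document):
        subject = node.get(f"{{{RDF}}}about")              # rdf:about sets the subject
        for child in node:
            predicate = child.tag[1:].replace("}", "", 1)  # '{ns}local' -> 'nslocal'
            resource = child.get(f"{{{RDF}}}resource")     # a URI object, if present
            obj = resource if resource is not None else (child.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples

triples = simple_triples(rdf_xml)
for triple in triples:
    print(triple)
```

For real data a dedicated RDF library or validator is the safer choice, since RDF/XML permits many abbreviations that this sketch ignores.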


In the above case the book is represented by a URI from the RDF Book Mashup (http://wifo5-03.informatik.uni-mannheim.de/bizer/bookmashup/). The RDF Book Mashup returns an RDF file for any ISBN by querying the Amazon API, thus providing a dereferenceable URI for most books. The same resource may in fact be represented by multiple URIs, and most widely distributed publications are. For example, dereferenceable URIs for books can be found at:

• British National Bibliography – http://bnb.data.bl.uk
• Open Library – http://openlibrary.org
• WorldCat – www.worldcat.org.

The same information has been encoded below, although this time the creator has been replaced by a URI representing the author within the Library of Congress, and the file has been shortened by replacing the rdf:Description element with the type of the resource, i.e., dc:BibliographicResource:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:BibliographicResource rdf:about="http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624">
    <dc:title>Practical Ontologies for Information Professionals</dc:title>
    <dc:creator rdf:resource="http://id.loc.gov/authorities/names/nb2011021410"/>
    <dc:identifier rdf:resource="urn:ISBN:9781783300624"/>
  </dc:BibliographicResource>
</rdf:RDF>

Subjects oen have rdf:type predicates, and RDF/XML allows them to be included more concisely (see www.w3.org/TR/REC-rdf-syntax/#section-Syntax-typed-nodes). Putting the last two examples within W3C’s RDF validator (www.w3.org/RDF/ Validator) will nonetheless produce the same four triples. It is important to distinguish between the data that is being encoded, and how it is being encoded – its serialization. RDF/XML is not the most user-friendly serialization of RDF, but it is default RDF serialization and it is therefore important to have an idea of what it is showing, and not to be thrown when an rdf:Description is replaced by the type of the resource. e triples within the RDF/XML serialization are not necessarily immediately obvious to the untrained eye. e simplest way to determine whether or not an RDF

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 37

OnTOLOgIES AnD ThE SEMAnTIC WEB

37

file has been correctly formed is to paste into the W3C RDF Validator. If it is a particularly large RDF file, however, it may be necessary to validate the file on the desktop. One way to do this is through Apache-Jena (https://jena.apache.org). e Apache-Jena Java framework RIOT package enables a file to be validated from the command line, whilst the ARQ package can be used to query an RDF file (SPARQL, the RDF query language, is discussed in more detail later in the chapter, and returned to in more detail in Chapter 6).

Alternative serializations

Although RDF/XML was the recommended serialization for the semantic web, because web infrastructures were used to representing information in HTML and XML (Allemang and Hendler, 2011), alternative, more user-friendly serializations are increasingly being made available. These include Turtle, N-Triples, N-Quads, and JSON-LD. Some have argued that Turtle is now the ‘de facto’ serialization (van Hooland and Verborgh, 2014, 210), whilst the popularity of JSON would seem to bode well for the future of the relatively new JSON-LD serialization. Often the same data is made available in multiple serializations. For example, the Library of Congress currently makes its linked data available as RDF/XML, N-Triples, and JSON-LD. RDF may also be changed from one serialization to another through the online RDF-Translator (http://rdf-translator.appspot.com) or through command line tools such as RDF2RDF (www.l3s.de/~minack/rdf2rdf/) and the RDFcat package of the Apache Jena Java framework. Each of the transformation tools can transform to and from different serializations, so whilst the RDF-Translator online form is the easiest tool to use, if you want to transform data into the Turtle serialization it is not suitable.

The different serializations have different advantages and disadvantages. The simple Practical Ontologies for Information Professionals bibliographic record is provided below in the N-Triples format:

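<http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/elements/1.1/BibliographicResource> .
<http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624> <http://purl.org/dc/elements/1.1/title> "Practical Ontologies for Information Professionals" .
<http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624> <http://purl.org/dc/elements/1.1/creator> <http://id.loc.gov/authorities/names/nb2011021410> .
<http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624> <http://purl.org/dc/elements/1.1/identifier> <urn:ISBN:9781783300624> .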

N-Triples provides an extremely simple serialization to understand. Each line represents one triple, formed of URIs or literals for the subject, predicate, and object, and finishes with a full stop. However, as it does not allow for CURIEs or the nesting of resources, it can be quite verbose. This is exemplified in the example above by the length of each triple, which has to spell out every URI in full. The popularity of N-Triples has nevertheless been built upon in the form of N-Quads, which is like N-Triples but has a fourth part indicating the context of the triple. Context may be important where data is collected from different graphs. For example, two different organizations may make differing, or even contradictory, claims about the same resource, and it is important to be able to distinguish where the data is from.

Turtle is important not only because it is a more user-friendly way of incorporating CURIEs and nested references than RDF/XML, but also because the Turtle syntax forms the basis of SPARQL, the RDF query language:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

<http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624> a dc:BibliographicResource ;
    dc:title "Practical Ontologies for Information Professionals" ;
    dc:creator <http://id.loc.gov/authorities/names/nb2011021410> ;
    dc:identifier <urn:ISBN:9781783300624> .
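The contrast between the verbosity of N-Triples and the prefixed names of Turtle can be made concrete with a short sketch. It reads two N-Triples lines with a regular expression and then shortens each URI to a CURIE wherever a known prefix applies. The regular expression covers only plain URIs and simple literals (no language tags, datatypes, or blank nodes), and the names PREFIXES, LINE, and compact are inventions of this example.

```python
import re

PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

BOOK = "http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624"
n_triples = f"""<{BOOK}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/elements/1.1/BibliographicResource> .
<{BOOK}> <http://purl.org/dc/elements/1.1/title> "Practical Ontologies for Information Professionals" .
"""

# One term is either <uri> or "literal"; a line is three terms and a full stop.
TERM = r'(<[^>]*>|"[^"]*")'
LINE = re.compile(rf"^{TERM}\s+{TERM}\s+{TERM}\s*\.$")

def compact(term: str) -> str:
    """Shorten <uri> to prefix:local when the URI starts with a known namespace."""
    if term.startswith("<"):
        uri = term[1:-1]
        for prefix, namespace in PREFIXES.items():
            if uri.startswith(namespace):
                return f"{prefix}:{uri[len(namespace):]}"
    return term

parsed = [LINE.match(line.strip()).groups()
          for line in n_triples.splitlines() if line.strip()]
compacted = [tuple(compact(term) for term in triple) for triple in parsed]

for triple in compacted:
    print(" ".join(triple), ".")
```

The data is unchanged by the shortening; only the way it is written down differs, which is the sense in which these are alternative serializations of the same triples.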

In the Turtle example above it is unnecessary to express the full URI for each of the Dublin Core relationships, and the repetition of the subject is indicated by the fact that the line ends with a semi-colon rather than a full stop.

The final RDF serialization discussed here is the most recent, JSON-LD. In this case it has been created with RDF-Translator:

{
  "@context": {
    "dc": "http://purl.org/dc/elements/1.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624",
  "@type": "dc:BibliographicResource",
  "dc:creator": {
    "@id": "http://id.loc.gov/authorities/names/nb2011021410"
  },
  "dc:identifier": {
    "@id": "urn:ISBN:9781783300624"
  },
  "dc:title": "Practical Ontologies for Information Professionals"
}

JSON is an increasingly popular format for the sharing of data, which is far faster to process than XML and is directly supported by JavaScript whilst remaining human-readable (Nurseitov et al., 2009). JSON-LD is one of two serializations for publishing RDF as JSON, although unlike RDF/JSON (Alexander, 2008) JSON-LD is a W3C recommendation (W3C, 2014). As JSON-LD is fully compatible with JSON, it enables data publishers to continue using existing tools, and may help with the wider adoption of semantic web technologies (Lanthaler and Gütl, 2012). Importantly, as JSON is supported by JavaScript, it can be included within HTML web pages, rather than requiring a different format when embedding RDF (see section on embedded RDF below).
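Because JSON-LD is ordinary JSON, an existing JSON tool can read it without any semantic web software. The sketch below loads a shortened version of the record with Python's json module and expands the prefixed keys by hand using the @context. It is not a JSON-LD processor: it assumes a flat document with a simple prefix-only context, and expand is a helper written for this illustration.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

document = json.loads("""{
  "@context": {"dc": "http://purl.org/dc/elements/1.1/"},
  "@id": "http://www4.wiwiss.fu-berlin.de/bookmashup/doc/books/9781783300624",
  "@type": "dc:BibliographicResource",
  "dc:creator": {"@id": "http://id.loc.gov/authorities/names/nb2011021410"},
  "dc:identifier": {"@id": "urn:ISBN:9781783300624"},
  "dc:title": "Practical Ontologies for Information Professionals"
}""")

def expand(value: str, context: dict) -> str:
    """Expand 'prefix:local' using the context; leave anything else unchanged."""
    prefix, separator, local = value.partition(":")
    if separator and prefix in context:
        return context[prefix] + local
    return value

context = document["@context"]
subject = document["@id"]
triples = []
for key, value in document.items():
    if key in ("@context", "@id"):
        continue
    if key == "@type":
        triples.append((subject, RDF_TYPE, expand(value, context)))
    elif isinstance(value, dict):          # {"@id": ...} marks a resource object
        triples.append((subject, expand(key, context), value["@id"]))
    else:                                  # a plain string is a literal object
        triples.append((subject, expand(key, context), value))

for triple in triples:
    print(triple)
```

The same four statements emerge as from the other serializations of the record, which is the practical meaning of JSON-LD being compatible with existing JSON tooling.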

Data interchange and taxonomies: RDF and RDFS

The RDF vocabulary provides the basic foundations for constructing the semantic web, enabling the making of basic triple statements and the layering of them on top of one another. For the semantic web to be useful, however, it is not enough to have a vast unconstrained web of triples with no structure; the triples need to have some constraints. RDF Schema (RDFS) provides the vocabulary necessary for the first level of constraints, providing a common language for the description of classes, subclasses and the properties within and between them: ‘Just as Semantic Web modeling in RDF is about graphs, Semantic Web modeling in the RDF Schema Language is about sets . . . RDFS provides some guidelines about how to use the graph structure in a disciplined way’ (Allemang and Hendler, 2011, 125). RDFS also introduces a simple level of inference to the semantic web by providing a common vocabulary for stating the domains and ranges of properties. However, there are limits to how much can be stated with RDFS. For example, whilst you can assert that something is a member of a certain class, you can’t state that it’s not a member of another class (Allemang and Hendler, 2011). For that, and many other constraints, you need OWL.
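The kind of inference that RDFS domains, ranges, and subclasses make possible can be sketched in plain Python. Everything prefixed ex: below is a hypothetical vocabulary invented for the example, and the infer function is a naive loop written for illustration, not a description of how any particular reasoner works.

```python
RDF_TYPE = "rdf:type"

# Hypothetical schema: who can be the subject and object of ex:wrote.
schema = [
    ("ex:wrote", "rdfs:domain", "ex:Author"),
    ("ex:wrote", "rdfs:range", "ex:Book"),
    ("ex:Author", "rdfs:subClassOf", "ex:Person"),
]

# A single asserted triple; nothing is said directly about classes.
data = {("ex:stuart", "ex:wrote", "ex:practicalOntologies")}

def infer(schema, data):
    """Apply domain, range and subclass rules until no new triples appear."""
    facts = set(data)
    while True:
        new = set()
        for s, p, o in facts:
            for subject, relation, target in schema:
                if relation == "rdfs:domain" and subject == p:
                    new.add((s, RDF_TYPE, target))      # subject belongs to the domain
                elif relation == "rdfs:range" and subject == p:
                    new.add((o, RDF_TYPE, target))      # object belongs to the range
                elif relation == "rdfs:subClassOf" and p == RDF_TYPE and subject == o:
                    new.add((s, RDF_TYPE, target))      # members of a class join its superclass
        if new <= facts:
            return facts
        facts |= new

inferred = infer(schema, data) - data
for triple in sorted(inferred):
    print(triple)
```

From one asserted triple the schema yields three class memberships. Note that nothing here can express that ex:stuart is not a member of some class, which is the limitation the text attributes to RDFS.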

Ontologies: OWL

OWL (Web Ontology Language) provides an expressive ontology language for the semantic web, and it became a W3C recommendation in 2004. As it became widely adopted by different groups and organizations, a number of limitations were identified and addressed in OWL 2, which was announced in 2009. Whereas RDFS may be used for representing simple triples, the domains in which properties may be used, and their ranges, OWL allows more formal reasoning rules, enabling the production of more knowledge by inference (van Hooland and Verborgh, 2014). Inference doesn’t have to be based on multiple relationships, but may be built upon a simple predicate. For example, if a triple has a particular predicate, and that predicate is only used to relate one particular class to another particular class, then the classes of the subject and object may be inferred. Such simple inference is possible with RDFS, but OWL allows more complex inference built upon the more expressive relationships that can be stated. Amongst other things, OWL enables statements about one class being disjoint with another (e.g., on a restaurant menu meat-based dishes and vegetarian dishes would be disjoint classes), about a class being a union of two other classes (e.g., non-vegetarian options would be a union class of meat-based dishes and fish-based dishes), about intersections of two classes (e.g., the intersection of vegetarian dishes and pizza dishes would provide a vegetarian pizza class), the expression of inverse properties (e.g., hasIngredient is the inverse of isIngredientOf), and the distinguishing of distinct individuals (e.g., expressing that Robbie Williams and Robin Williams refer to two different people and are not alternative names for the same person). There are three profiles for OWL 2: OWL 2 EL, OWL 2 QL, and OWL 2 RL. Each of these exchanges some of the expressiveness of OWL 2 for more efficient reasoning (Grau et al., 2008). From the perspective of an introduction to ontologies for information professionals, where the focus is on lightweight ontologies, it is not necessary to get bogged down in the differences between these profiles. Readers should, however, be aware of their existence and the reason for their existence. The OWL 2 vocabulary is returned to in more detail in Chapter 3.
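The class relationships that OWL can state correspond closely to familiar set operations, which a few lines of Python can illustrate. The dishes and ingredients are invented for the example, and the comparison is only an analogy: Python sets list every member explicitly, whereas an OWL reasoner works with whatever happens to have been asserted and does not assume the lists are complete.

```python
# Hypothetical menu, with each class written out as a set of its members.
meat_dishes = {"steak", "pepperoni pizza"}
fish_dishes = {"grilled trout"}
vegetarian_dishes = {"margherita pizza", "mushroom risotto"}
pizza_dishes = {"pepperoni pizza", "margherita pizza"}

# Disjoint classes (cf. owl:disjointWith): no dish may belong to both.
are_disjoint = meat_dishes.isdisjoint(vegetarian_dishes)

# Union class (cf. owl:unionOf): everything in either class.
non_vegetarian_dishes = meat_dishes | fish_dishes

# Intersection class (cf. owl:intersectionOf): only what is in both.
vegetarian_pizzas = vegetarian_dishes & pizza_dishes

# Inverse properties (cf. owl:inverseOf): swap subject and object.
has_ingredient = {("margherita pizza", "mozzarella"), ("steak", "beef")}
is_ingredient_of = {(ingredient, dish) for dish, ingredient in has_ingredient}

print(are_disjoint)
print(sorted(non_vegetarian_dishes))
print(sorted(vegetarian_pizzas))
print(sorted(is_ingredient_of))
```

In an ontology these relationships are declared once, at the level of the classes and properties, and a reasoner applies them to whatever individuals are later described.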

Querying: SPARQL

SPARQL (SPARQL Protocol and RDF Query Language), pronounced ‘sparkle’, is an RDF query language designed for querying the semantic web. A SPARQL query represents a graph pattern of known values and unknown variables that is matched against a queried graph (or graphs) by a SPARQL query engine, which then returns the requested parts of the queried graph that match the initial query. Figure 2.4 illustrates a simple graph against which a query may be matched.

Figure 2.4 An example of an RDF graph (nodes: Jane, John, cheese, wine, crisps, beer; edge labels: likes, knows)

For example, SPARQL may be used to retrieve those UNKNOWN resources from the graph in Figure 2.4 where:

Graph query                 Resources retrieved for ?UNKNOWN
?UNKNOWN likes crisps       John
John likes ?UNKNOWN         crisps, beer

Triplestores will oen have a SPARQL endpoint where queries may be sent automatically; some also have a query form for manual entry. Creating SPARQL queries is more complicated than entering a few keywords into Google, and relies on an understanding of the associated ontology element set; however, these complex queries can collate answers that could not have been created otherwise. e potential of SPARQL, and a number of examples, are explored in more detail in Chapter 6. As is oen the case with web standards, a number of different variations have emerged for specific situations (Malik, Goel and Maniktala, 2010), so unless a triplestore states that it is SPARQL 1.0- or SPARQL 1.1-compliant, it may be necessary to investigate a particular triplestore’s documentation.

Rules: RIF, SWRL, SPIN

Rule languages can be used to enhance the ontology languages of the semantic web, enabling the description of rules and relationships that cannot be encoded in OWL or another ontology language (Rattanasawad et al., 2013). However, whereas many of the other technologies in the semantic web stack discussed up to this point have been widely adopted, there is no single dominant rule language. Rather than developing a one-size-fits-all rule language for the semantic web, W3C created the Rule Interchange Format (RIF) to facilitate the exchange of rules between different rule languages. As with the more intricate aspects of OWL, the information professional is unlikely to require a deep understanding of RIF or some of the more popular rule languages (e.g., SWRL, SPIN), but it is useful to have a basic understanding of the role they can have in the semantic web.

Semantic Web Rule Language (SWRL) was submitted to W3C in 2004, and extends OWL axioms to allow rules to be incorporated into an OWL knowledge base (Rattanasawad et al., 2013). Importantly, from the perspective of an information professional starting to investigate the potential of rules, SWRL rules are supported by a number of ontology editors, including Protégé, currently the most popular ontology editing software (Warren et al., 2014), which is discussed in more detail in Chapter 5. SPARQL Inferencing Notation (SPIN), http://spinrdf.org, is a simple SPARQL-based rules language, designed to make use of the fact that SPARQL has been widely adopted across the semantic web. It was a submission to W3C in 2011 (Knublauch, Hendler and Idehen, 2011), although five years later it was still not a recommendation. Both SPIN and SWRL have been tested in a number of different environments. For example, SPIN has been used for formalizing accounting regulations in connection with XBRL (eXtensible Business Reporting Language), an open data format for exchanging business information (O’Riain et al., 2015), and for identifying data quality problems on the semantic web (Fürber and Hepp, 2010). SWRL has been used for reasoning on anti-diabetic drugs selection (Chen et al., 2012) and pear diseases (Sun and Liang, 2014).

Inferencing from rules, like inferencing from OWL or RDFS, requires the use of an inference engine, and different rule languages are supported by different inference engines. Rattanasawad et al. (2013) make a comparison of some of the available languages and inference engines, although it is likely that the decision to use one particular rule language rather than another may be based more on the existing software that is being used than the theoretical underpinnings of the language. Inference engines are discussed in more detail in Chapter 6.
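The general shape of a rule, a set of conditions that together license a new triple, can be sketched without committing to any one rule language. The example below is written in plain Python rather than in SWRL, SPIN, or RIF syntax, and the individuals, the property names, and the uncle rule itself are all hypothetical choices made for the illustration.

```python
# Asserted facts as (subject, predicate, object) triples; all names are invented.
facts = {
    ("Mary", "hasParent", "Susan"),
    ("Susan", "hasBrother", "Tom"),
    ("Mary", "hasParent", "Peter"),
}

def apply_uncle_rule(facts):
    """IF ?child hasParent ?parent AND ?parent hasBrother ?brother
    THEN ?child hasUncle ?brother."""
    inferred = set()
    for child, first_predicate, parent in facts:
        if first_predicate != "hasParent":
            continue
        for person, second_predicate, brother in facts:
            # The two conditions must share the same ?parent individual.
            if second_predicate == "hasBrother" and person == parent:
                inferred.add((child, "hasUncle", brother))
    return inferred

new_facts = apply_uncle_rule(facts)
print(new_facts)
```

An inference engine does essentially this for every rule it has been given, repeating the process until no further triples can be added.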

Embedded RDF e brief introduction to RDF may have instilled the notion that the semantic web is somehow separate from the web that people use regularly. However, whilst RDF triples may take the form of an RDF/XML document on the web, returned aer a process of a content negotiation with a content management system, or may be retrieved from the endpoint of a triplestore, they may also be embedded within the HTML of a web page.

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 43

OnTOLOgIES AnD ThE SEMAnTIC WEB

43

ere has been a long history of people trying to incorporate semantic information into web pages, including the SHOE (Simple HTML Ontology Extensions) project and the (KA)2 initiatives in the 90s (Corcho, 2006). Such projects have now been superseded by a number of different approaches to embedding RDF triples within HTML over the years, including N3 in HTML, eRDF (embedded HTML), RDFa, and JSON-LD. N3 in HTML and eRDF are now only of historic interest, but serve to emphasize the time and effort that has been spent trying to integrate RDF into HTML web pages. In 2003, a lifetime ago in terms of RDF and the semantic web, Powers (2003, 12) suggested that whilst the issue of whether RDF should be contained within the HTML or within separate files arose again and again, ‘the consensus seems to be heading towards defining the RDF in a separate file and then linking it within the HTML’. Over a decade later things are still no clearer in certain situations, although the embedding of RDF is likely to have been given additional impetus with the development of JSON-LD. First, however, it is necessary to turn our attention to RDFa. RDFa (RDF in HTML Attributes) is the best established of the attempts to embed RDF triples into HTML. Originally RDFa version 1.0 was only compatible with XHTML, however RDFa version 1.1 is now compatible with HTML 5 (www.w3.org/TR/rdfa-inhtml). A simple HTML document, incorporating RDFa markup is shown below.

About David Stuart





About David Stuart

David Stuart works at King’s College London.



His first book was called Facilitating Access to the Web of Data.






The RDFa example includes five triples, shown below in the Turtle format:

@prefix dc: .

@prefix foaf: .

@prefix rdau: .

dc:creator “David Stuart”@en; foaf:primaryTopic .

dc:created ; rdau:P60095 “King’s College London”@en .

dc:title “Facilitating Access to the Web of

Data”@en.

Although the RDFa example above includes a prefix declaration for rdau, declarations for the Dublin Core and FOAF prefixes are not necessary. rdau references the unconstrained properties of the RDA element set, and P60095 specifically refers to a property for having an affiliation with an organization. RDA (Resource Description and Access) is the suggested replacement for the Anglo-American Cataloguing Rules (AACR2), designed for a semantic web, and is discussed in more detail in Chapter 3. Predefined prefixes have been made available for some of the most widely used vocabularies on the semantic web as well as for many of the W3C vocabularies, and these do not need to be included within the prefixes of an HTML document.

Although RDFa can be incorporated within HTML, it is by no means a simple process, and even a simple piece of code like the one above is likely to necessitate a trip to an RDFa validator before it can be published with any confidence. The W3C RDFa validator is available at www.w3.org/2012/pyRdfa/Validator.html. RDFa Lite (www.w3.org/TR/rdfa-lite/) is a minimal subset of RDFa, suitable for most situations, that is designed to ease the process of understanding RDFa. In comparison to RDFa, the advantages of JSON-LD are immediately obvious:

{



About David Stuart



About David Stuart

David Stuart works at King’s College London.



His first book was called Facilitating Access to the Web of

Data.





Not only does JSON-LD provide a simple format for structuring the graph, but most importantly the graph is kept separate from the web page content designed for human readers, so that it is simple for existing RDF content to be transformed and included within an HTML page. Both creating and updating the JSON-LD semantic data are undoubtedly far simpler than creating and updating the RDFa semantic data, and although RDFa has the advantage of being the first mover, the greater simplicity of JSON-LD seems likely to quickly overcome that slight advantage. However, the integrated nature of RDFa, whilst complicating the creation process, may contain a hidden advantage: helping to make sure the machine-readable and the human-readable data are both aligned and up to date. This is an idea that gained credence within microformats, one of the alternative approaches to including semantic information on the web.
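The separation of the JSON-LD graph from the human-readable content also makes the data easy for a program to pull out of a page. The sketch below uses only Python's standard library to find a script element of type application/ld+json and load its contents. The page, the example.org identifier, and the class name JsonLdExtractor are all made up for the illustration.

```python
import json
from html.parser import HTMLParser

# A hypothetical page: the graph sits in the head, the prose in the body.
page = """<html>
<head>
<title>About David Stuart</title>
<script type="application/ld+json">
{
  "@context": {"dc": "http://purl.org/dc/elements/1.1/"},
  "@id": "http://example.org/books/facilitating-access",
  "dc:title": "Facilitating Access to the Web of Data",
  "dc:creator": "David Stuart"
}
</script>
</head>
<body>
<h1>About David Stuart</h1>
<p>His first book was called Facilitating Access to the Web of Data.</p>
</body>
</html>"""

class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.graphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.graphs.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False

extractor = JsonLdExtractor()
extractor.feed(page)
print(extractor.graphs[0]["dc:title"])
```

The sketch also shows the risk noted above: nothing ties the title in the script element to the title in the paragraph, so the two can drift apart if only one is updated.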

Alternative semantic visions e RDF vision of the semantic web is by no means the only approach that has been taken to an increasingly semantic web. e two other most notable methods are microformats and microdata. Microformats have been one of the easiest ways of incorporating semantic information into a web document, and were described by Singer (2009, 36) as the ‘lower lying fruit’ of a semantic web. ey were designed to make use of existing HTML 4 attributes to include semantic information within web pages for a limited number of widely used types of information (e.g., contact details and calendar events). e simplicity of the approach and its first principle of ‘human-readable first and machine-readable second’ (Suda, 2006, 2) to build trust and reliability into the content, undoubtedly helped with the adoption. However, the limited number of vocabularies developed, and the centralized nature of the vocabulary development (due to the lack of a namespace equivalent), means that it is coming under increasing pressure from fuller frameworks such as RDF and Microdata. Although Microformats2 was launched in 2012, making use of HTML 5 elements, it has been suggested that it is ‘too late to turn the tide’ against RDFa and Microdata (van Hooland and Verborgh, 2014, 207). Using a snapshot of the web in 2012, however, Bizer et al. (2013) found microformats to be used on 2.5 times as many sites as RDFa and microdata combined. Unlike microformats, microdata is not limited in the vocabularies it can use, and has the capability of incorporating much of the RDF model. For example, the microdata example below incorporates the five triples incorporated in the RDFa and JSON-LD examples above (transformed using http://rdf-translator.appspot.com):













It is difficult to see what extra microdata provides to a semantic web, as it is limited to HTML 5 and, as van Hooland and Verborgh (2014) note, whilst any vocabulary could be used, only schema.org is used in practice. The RDF vision of a semantic web has the most scope for widespread adoption, especially as developments mean that it can be applied both to large data sets, in RDF/XML or one of the other widely adopted serializations, and to small pieces of data on single web pages through the use of RDFa. For any vision of the semantic web to be achieved, however, there needs to be a lot of work on the development of appropriate vocabularies. The development of controlled vocabularies is something that the community of library and information professionals has a lot of experience of, and they are increasingly interested in engaging with the semantic web.

Libraries and the semantic web ere has been a lot of interest within the library community in publishing data according to linked data principles, and there are now a number of regular groups and events dedicated to the topic, for example: • LODLAM – Linked Open Data in Libraries Archives and Museums (http://lodlam.net) • Semantic Web in Libraries (http://swib.org) • IFLA 2014 Satellite Meeting: Linked data for libraries (http://ifla2014satdata.bnf.fr/program.html). ere are also a growing number of books:

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 48

48

pRACTICAL OnTOLOgIES

• Knitting the Semantic Web (Greenberg and Méndez, 2008, reprinted 2012) • Linked Data for Libraries, Archives and Museums (van Hooland and Verborgh, 2014). In a survey of the perceptions of librarians regarding the semantic web and linked data LaPolla (2013) found that an overwhelming majority of the surveyed librarians felt the semantic web and linked data were important to the future of library catalogues (85% and 91% respectively). e primary focus of the library community in the publication of linked data is the publishing of bibliographic-related resources. Much of the data that has been published has taken the form of catalogue records and other forms of bibliographic metadata. e British National Bibliography (BNB) (www.bl.uk/bibliographic/datafree.html#lod) has been made available online by the British Library as linked data. e Online Computer Library Center (OCLC) have not only made the WorldCat union catalog available as linked data (www.worldcat.org) but also the top three levels of the Dewey Decimal Classification system (http://dewey.info) and the Virtual International Authority File (VIAF) (http://viaf.org). e library catalogue is a valuable resource for information retrieval, but has traditionally suffered from being an information silo; the information was oen neither visible on the web, nor connected to other external resources. ere are many advantages to making bibliographic metadata available as linked data, especially when it is published with an appropriate open licence enabling reuse. For example, the BNB has been published under a Creative Commons zero licence (http://creativecommons.org/publicdomain/zero/1.0/), waiving all their rights associated with the work, whilst OCLC have made the catalogue data available under an Open Data Commons Attribution Licence (http://opendatacommons. org/licenses/by/). 
This means that new sites and services can be built on top of a library’s data, providing new means of information discovery, either based on data from a single institution or aggregating data from multiple institutions. The sharing of data also has a role to play in reducing the replication of work between institutions.

The role of librarians and ontologies should go beyond bibliographic records, however. As has already been mentioned, the ‘double expert’ requirement of ontology development (Murdock, Buckner and Allen, 2012) is already typical of the library and information professional and, as will be recognized in Chapter 5, many of the stages in building ontologies require distinct skills already possessed by the library and information professional. The interpersonal skills necessary for establishing an ontology’s scope and for knowledge elicitation are similar in essence to those required of the librarian in the reference interview. Searching for resources has always been an important part of a librarian’s work, and building ontologies requires the identification of suitable existing ontologies. The formatting of terms within an ontology has a precedent for the library and information professional in their experience of the Anglo-American Cataloguing Rules (AACR2) and its replacement, Resource Description and Access (RDA).

Equally important, librarians are experienced in making sure that resources are not just available, but are used: ‘Books are for use’, as Ranganathan (1931) put it for his first law of librarianship; or ‘data is for use’, as an updated version of the law has been suggested elsewhere (Stuart, 2011). There is little point in data being available unless researchers are able to use it. This means ensuring that the data (or more specifically the ontology) is appropriately formatted and has the necessary metadata so that it can be found, that it reuses existing vocabularies and has sufficient documentation so that it can be used, and that it is published openly with an appropriate licence so that it can be reused.

It is important to remember, as Kitchin (2014) notes, that databases and infrastructures are not neutral. The decisions that are made about the structuring of an ontology and the properties that are selected have important ramifications in the way people both view and access resources. One of the best examples of the way technical decisions can impact people’s perspective comes from social media, and Facebook’s ubiquitous ‘Like’ button. Before the introduction of Facebook’s reaction buttons, users could only express a ‘Like’ for something that was shared, as opposed to a disliking or, as is now possible, reacting with a ‘Love’, ‘Haha’, ‘Wow’, ‘Sad’, or ‘Angry’ as well. Users of such a platform are likely to get a more rose-tinted view of the world than they would if both liking and disliking were available, as it nudges people into sharing those items that people ‘Like’, and in turn these things will be shared and ranked more highly. Importantly, librarians have experience of dealing not only with the technical aspects but also with the users of resources, and are ideally placed to see it from both perspectives.

Other cultural heritage institutions and the semantic web

Libraries are not the only type of cultural heritage institution contributing to the development of ontologies and the semantic web and, as the title of LODLAM (Linked Open Data in Libraries Archives and Museums) suggests, there has been interest and developments throughout the sector.

Archives

One of the long-standing principles of archival organization has been ‘respect des fonds’, where fonds refers to a group of records from the same source. Whereas the library community places particular value on the individual documents, within the archival community it is recognized that there is valuable additional information from the contextualization of resources with other resources from and about the originating source; the EAC-CPF (Encoded Archival Context – Corporate Bodies, Persons, and Families) specification is specifically designed for capturing information about these originators of archival materials. The importance of contextualization makes archival data particularly suitable for publication as linked data, and versions of the main metadata standards of the archival community have been published as semantic web ontologies. For example, ReLoad, the repository for linked open and archival data, has published EAC-CPF and OAD (based on EAD) as RDF ontologies (http://labs.regesta.com/progettoReload/en/oad-ontology/). A guidebook of linked data for the archival community has also been published online: E.L. Morgan (2014) Linked Archival Metadata: a guidebook, http://infomotions.com/sandbox/liam/tmp/guidebook.pdf.

Museums and galleries ere have also been some notable excursions into the development of ontologies and the publication of data from within the museums and galleries sector, including data from the British Museum (http://collection.britishmuseum.org/) and the Smithsonian (http://americanart.si.edu/collections/search/lod/about/) (Szekely et al., 2013) – although questions have been raised about the availability of suitable ontology element sets for publishing much of the data that is available (Nisheva-Pavlova, Spyratos and Stanchev, 2014).

Other organizations and the semantic web

Further afield, the potential of publishing data (and linked data in particular) has been recognized by a wide range of organizations, especially publicly funded institutions that have been under increased pressure to make data available. Many governments have started publishing data, often in a linked data format (e.g., UK Government, http://data.gov.uk), both in response to calls for greater transparency and as a way of realizing the economic value of the data. Research data from publicly funded research is increasingly expected to be made available, so that the maximum value may be realized, whilst publishers are also interested in the opportunity for the verifiability of the results. Although commercial organizations may be less likely to publish data sets online, for fear of giving away something of value to them, data sets may be released with very distinct purposes in mind, whilst some will have developed enterprise ontologies. The releasing of data sets by commercial organizations is often associated with crowdsourcing solutions for particular problems, tapping into the diverse range of perspectives that are available from outside an organization. In comparison, enterprise ontologies are for internal use, forming an internal knowledge base and a shared vocabulary across multiple offices and departments. Different departments inevitably use different terms and vocabularies from one another, which will have ramifications for the finding of resources. Internal enterprise ontologies are not the primary focus of this book, but it is important to remember that whilst ontologies are primarily being discussed throughout this book in terms of external applications and the semantic web, they also have internal applications.

Conclusion e semantic web is a key factor in the rise in interest in ontologies, and the ontologies that are publicly available and widely used as part of the semantic web are the primary focus of much of this book. It is, of course, not necessary that ontologies designed for the semantic web have to be used publicly on the semantic web (unless of course this is a condition of the licensing agreement), nor is it necessary that they adhere to the Resource Description Framework. But much of the potential of ontologies is based around bringing together distributed resources, and the web provides the largest potential number of contributors to such a system. Many of the ideas and concepts touched upon in this section are returned to later in the book: the RDF, RDFS, and OWL vocabularies are discussed in more detail in Chapter 3; SPARQL is returned to in Chapter 6; and the role of the library and information professional with the semantic web and ontologies in particular is discussed more in Chapter 7.


CHAPTER 3

Existing ontologies

Introduction ere are a large number of ontologies already available, and more are being developed all the time. is chapter primarily focuses on some of the dominant ontologies being used online. ere are many additional ontologies that are associated with specific research projects but have not been more widely adopted, and ontologies that are highly subject-specific. ese are generally avoided in this chapter, as there is little advantage in being aware of these ontologies unless it is your field. e one exception that has been made is the inclusion of the Bible Ontology, as the transformation of a narrative work to an ontology provides an interesting comparison with the other ontologies in this chapter, all of which, with the exception of DBpedia, are ontology element sets. e list in this chapter is the best-seller list of ontologies, ignoring the long tail of Bradfordian distribution where most ontologies have limited reuse. Aer a brief note on the topic of ontology documentation, the chapter is split into four parts. e first part covers the most important ontologies that are used to construct ontologies themselves: RDF, RDFS, SKOS, and OWL2. e next part considers some of the ontologies that are associated with more traditional library and information roles, representing books and other types of resources. is is followed by a look at the upper ontology Basic Formal Ontology and two of the dominant cultural heritage data models: the Europeana Data Model, and CIDOC-Conceptual Reference Model. Finally, some of the most widely adopted ontologies on the web are considered.

Ontology documentation

When making use of existing ontologies, the information professional is likely to be met with one of two problems. Either there is very little documentation, if any, associated with an ontology, or there is a vast amount of documentation available for an ostensibly simple ontology. The lack of sufficient documentation is the bane of much of the open web, whether open data, open source software, or open ontologies, whereas the second problem can be caused by the need for ontologies to be unambiguous. Those who are publishing ontologies on the web generally wish to see them used, and adopted more widely, and are therefore likely to be responsive to requests for clarification of how to use an ontology. Queries about more widely adopted ontologies, or the semantic web more broadly, may also be answered through question and answer services, such as Stack Overflow (http://stackoverflow.com/questions/tagged/semantic-web).

Where ontology diagrams are necessary to improve clarity or understanding, these broadly make use of the basic building blocks of VOWL (Visual Notation for OWL Ontologies) (http://vowl.visualdataweb.org/v2/). Circles (or more often ellipses due to the size of the labels) are used to represent classes. Lines represent property relationships, with arrows pointing to the object of a triple. Rectangles are used for data types and literals. The VOWL notation is in fact far more extensive, with recommended colours for distinguishing between different classes and properties. In what is an introductory book on the topic, these have been avoided.

Ontologies for representing ontologies

There is not only a need for widely adopted ontologies within a specific field; there is also a need for ontologies that are capable of expressing ontologies themselves.

RDF and RDFS

As has already been discussed, RDF and RDFS form the basis of the semantic web, and of the principal approach to adding a level of semantic information to the web. Some of the terms have already been introduced: rdf:Description, rdf:type, rdf:resource, and rdf:about. In RDF/XML, rdf:about and rdf:resource are used to distinguish between subjects and objects: rdf:about identifies the subject of a description, and rdf:resource identifies an object. Of the core RDF properties, rdf:type is the one that the information professional is likely to come across most often. rdf:type is a property that is used to state the class to which a resource belongs. For example:

    http://www.davidstuart.co.uk/#me    rdf:type    foaf:Person

The above triple states that the author of this book (the #me being used to distinguish between the author and the web page) has the class type Person defined in the FOAF vocabulary (which will be returned to below). Unsurprisingly the rdf:type property is one of the most heavily used on the semantic web, and it has short-cuts in both the RDF/XML and the Turtle serializations. Within Turtle rdf:type may be replaced by a simple a:

    http://www.davidstuart.co.uk/#me    a    foaf:Person

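In RDF/XML, rdf:Description may be replaced by the class of the resource, so the following two examples include the same triple:

    <rdf:Description rdf:about="http://www.davidstuart.co.uk/#me">
      <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
    </rdf:Description>

    <foaf:Person rdf:about="http://www.davidstuart.co.uk/#me"/>
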
RDF also provides the terms for lists and triple statements. Triple statements are particularly useful when it is necessary to refer to a triple statement as a whole, rather than as individual parts:

    :davidstatement    rdf:subject      :david
    :davidstatement    rdf:predicate    :hates
    :davidstatement    rdf:object       :Apple
    :davidstatement    dc:creator       :david

The above set of triples not only includes the information that :david :hates :Apple, but also the fact that this statement was created by :david. Of course this information is only as reliable as the site stating it, but nonetheless it enables the statement to be made. Increasingly, much of this reification may be achieved through the use of N-Quad stores or named graphs, where a graph is given a URI.

Whereas RDF terms provide the vocabulary for expressing triples, RDFS provides the terms for expressing the classes and the relationships between the properties. With the exception of the RDF and RDFS terms, the other terms in the example are from a single fictitious vocabulary, and there is no advantage in coining a fake prefix, so it is simply omitted. Figure 3.1 shows the design of a simple data model describing people and their addresses.

Figure 3.1 A simple Person and Place ontology using RDF and RDFS

The data model above has five classes, three from the RDF/RDFS vocabularies (i.e., Class, Property, and Literal), and two new classes: Person and Address. This is a fairly simple data model, and the Person class has been restricted to three properties: name, age, and hasAddress. Address also has three properties: street, city, and postcode. By convention, classes start with a capital letter, and properties with a lower case letter. The two classes are expressed by stating that the resources have the rdf:type rdfs:Class:

    :Person     rdf:type    rdfs:Class
    :Address    rdf:type    rdfs:Class


Similarly, rdf:type and the rdf:Property class can be used to express the fact that the properties are properties:

    :name          rdf:type    rdf:Property
    :age           rdf:type    rdf:Property
    :hasAddress    rdf:type    rdf:Property
    :street        rdf:type    rdf:Property
    :city          rdf:type    rdf:Property
    :postcode      rdf:type    rdf:Property

The relationship between the properties and the classes with which they can be associated may be expressed through rdfs:domain:

    :name          rdfs:domain    :Person
    :age           rdfs:domain    :Person
    :hasAddress    rdfs:domain    :Person
    :street        rdfs:domain    :Address
    :city          rdfs:domain    :Address
    :postcode      rdfs:domain    :Address

RDFS not only enables the stating of the domain, but also the types of values of a property that are allowed, through rdfs:range:

    :name          rdfs:range    rdfs:Literal
    :age           rdfs:range    rdfs:Literal
    :hasAddress    rdfs:range    :Address
    :street        rdfs:range    rdfs:Literal
    :city          rdfs:range    rdfs:Literal
    :postcode      rdfs:range    rdfs:Literal

In this particular data model, most of the properties have the range rdfs:Literal. A literal is a value, such as a string of text or a number, as opposed to a relationship to another resource. :hasAddress differs in that it is used to relate to resources of class type :Address. As :hasAddress only has the range :Address, a resource that is the object of a triple with the :hasAddress predicate may be inferred to be a resource of the :Address class.

Other RDFS terms that are regularly used within ontologies are rdfs:subPropertyOf and rdfs:subClassOf, and rdfs:label and rdfs:comment. Each of these is fairly self-explanatory: rdfs:subPropertyOf and rdfs:subClassOf enable a hierarchy of properties and classes (see Chapter 2), whilst rdfs:label and rdfs:comment enable resources to be given human-readable names and descriptions.

RDF and RDFS are the foundations of the semantic web, even if these foundations are not always explicitly stated. For most widely used vocabularies, however, the relationships with RDF will be stated; in fact the explicit statement is key to an ontology being widely adopted.
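The inference described earlier in this section, where rdfs:domain and rdfs:range allow the class of a resource to be inferred, can be sketched in a few lines of Python. This is a minimal illustration rather than a full RDFS reasoner: triples are held as plain tuples, the :david and :home resources are hypothetical instance data that do not appear in the data model itself, and a real application would more likely rely on a dedicated RDF library.

```python
# A minimal sketch of RDFS domain/range inference using plain Python tuples.
# The schema mirrors part of the fictitious Person/Address vocabulary in the
# text; :david and :home are hypothetical instance data added for illustration.

RDF_TYPE = "rdf:type"

schema = [
    (":name", "rdfs:domain", ":Person"),
    (":name", "rdfs:range", "rdfs:Literal"),
    (":hasAddress", "rdfs:domain", ":Person"),
    (":hasAddress", "rdfs:range", ":Address"),
    (":street", "rdfs:domain", ":Address"),
    (":street", "rdfs:range", "rdfs:Literal"),
]

data = [
    (":david", ":hasAddress", ":home"),
    (":home", ":street", "1 Example Street"),
]


def infer_types(schema, data):
    """Return the rdf:type triples implied by rdfs:domain and rdfs:range."""
    domains, ranges = {}, {}
    for prop, predicate, cls in schema:
        if predicate == "rdfs:domain":
            domains.setdefault(prop, set()).add(cls)
        elif predicate == "rdfs:range":
            ranges.setdefault(prop, set()).add(cls)

    inferred = set()
    for subject, prop, obj in data:
        # The subject of a triple belongs to every domain class of the property.
        for cls in domains.get(prop, ()):
            inferred.add((subject, RDF_TYPE, cls))
        # The object belongs to every range class; literal ranges are skipped
        # because literals are values rather than resources.
        for cls in ranges.get(prop, ()):
            if cls != "rdfs:Literal":
                inferred.add((obj, RDF_TYPE, cls))
    return inferred


for triple in sorted(infer_types(schema, data)):
    print(*triple)
```

Although neither class is stated explicitly in the data, the sketch infers that :david is a :Person and that :home is an :Address, purely from the properties with which they are used.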

SKOS

SKOS (Simple Knowledge Organization System) (www.w3.org/2004/02/skos/) is of particular interest for the structuring of many of the knowledge organization systems that already exist within the cultural heritage sector, such as subject headings, classification schemes, thesauri and other taxonomies. SKOS provides a simple vocabulary to represent the collections of concepts found in more traditional knowledge organization systems and allows for thesaurus-like relationships to be expressed between them (typically skos:broader, skos:narrower, or skos:related). A typical example of SKOS usage is provided below, with a selection of the triples associated with the ‘Libraries’ subject heading (http://id.loc.gov/authorities/subjects/sh85076502.html) from the Library of Congress Subject Headings:

    :sh85076502    rdf:type           skos:Concept
    :sh85076502    skos:prefLabel     “Libraries”@en
    :sh85076502    skos:broader       :sh85038731
    :sh85076502    skos:broader       :sh85108685
    :sh85076502    skos:narrower      :sh2002001003
    :sh85076502    skos:narrower      :sh85076573
    :sh85076502    skos:related       :sh85076491
    :sh85076502    skos:exactMatch    <...>
    :sh85076502    skos:closeMatch    <...>
    :sh85076502    skos:inScheme      <...>
The first triple simply states that the URI represents a skos:Concept, and the second triple states that this concept has the preferred label “Libraries” in English. As the term ‘preferred label’ would seem to suggest, a resource should only have one prefLabel in each language; however, SKOS also allows for alternative labels (skos:altLabel) and hidden labels (skos:hiddenLabel) (which are designed for labels that the creator might wish to be indexed but not shown to users, e.g., common spelling errors).

skos:broader and skos:narrower express hierarchical relationships with other concepts in the scheme. In this case ‘Documentation’ (:sh85038731) and ‘Public institutions’ (:sh85108685) are considered broader concepts than ‘Libraries’, and ‘Bulletin boards in libraries’ (:sh2002001003) and ‘Special libraries’ (:sh85076573) are considered narrower concepts (along with 37 other concepts in the original record).

The example also includes three non-hierarchical relationships. skos:related enables non-hierarchical associative relationships to be expressed between concepts; in this case ‘Libraries’ is associated with the term ‘Librarians’ (:sh85076491). skos:exactMatch and skos:closeMatch have been used to build relationships with similar concepts in other concept schemes. skos:exactMatch is used to link the Library of Congress Subject Heading ‘Libraries’ with the concept ‘Libraries’ in the National Agricultural Library Thesaurus, whilst skos:closeMatch has been used to link ‘Libraries’ with the term ‘Bibliothek’ in the German National Library. Finally skos:inScheme is used to state the fact that the concept is part of the Library of Congress Subject Headings scheme.

There are additional terms available in SKOS which are used less often. Two that are of particular interest, more for what they demonstrate about SKOS than for their use, are skos:broaderTransitive and skos:narrowerTransitive. Their necessity demonstrates the fundamental difference between the traditional knowledge organization systems of librarians and the potential of ontologies for the discovery of new connections and knowledge through inference. Often thesauri that have been extended over time do not keep to a strict broader and narrower hierarchy, and some terms would be better expressed as related terms.
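The effect of treating a hierarchy as transitive can be sketched as follows. In SKOS, skos:broader is not itself transitive; skos:broaderTransitive is the property that follows a chain of broader links to its end. The sketch below is an illustrative assumption rather than part of any SKOS tool: it stores the broader links from the ‘Libraries’ example in a Python dictionary (deriving the links for the two narrower concepts from the skos:narrower triples) and walks them with a hypothetical helper function.

```python
# skos:broader links between the Library of Congress Subject Headings named
# in the text. Each key is a concept; each value is the set of concepts stated
# to be broader than it.
broader = {
    ":sh2002001003": {":sh85076502"},  # Bulletin boards in libraries -> Libraries
    ":sh85076573": {":sh85076502"},    # Special libraries -> Libraries
    ":sh85076502": {":sh85038731", ":sh85108685"},  # Libraries -> Documentation, Public institutions
}


def broader_transitive(concept, broader):
    """Return every concept reachable by following skos:broader links."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in broader.get(current, ()):
            # Recording visited concepts also protects against cycles, which
            # can occur in loosely maintained thesauri.
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


for ancestor in sorted(broader_transitive(":sh85076573", broader)):
    print(ancestor)
```

Here ‘Special libraries’ is inferred to fall under ‘Documentation’ and ‘Public institutions’ as well as ‘Libraries’, an inference that is only sensible if every broader link in the chain is strictly hierarchical.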
Nevertheless parts of SKOS have been incorporated in the publishing of a large number of controlled vocabularies, including:

• STW Thesaurus for Economics (http://zbw.eu/stw/versions/latest/about)
• UK Archival Thesaurus (www.ukat.org.uk/downloads/data.php)
• The Getty Vocabularies as Linked Open Data (http://vocab.getty.edu).

The popularity of SKOS has meant that there are a number of tools available specifically for working with SKOS vocabularies. SKOS Play (http://labs.sparna.fr/skos-play) is a simple web-based tool for printing and visualizing SKOS ontologies. Figure 3.2 shows the Nature.com data categories as a tree visualization. The service has also been incorporated in InPho (https://inpho.cogs.indiana.edu/taxonomy), which is a semi-automated dynamic ontology of philosophy. Protégé (http://protege.stanford.edu), a free open-source ontology editor, has also had SKOS extensions created for it, for example, SKOSed (https://code.google.com/p/skoseditor/). Protégé is discussed in more detail in Chapter 5, where it is used in the development of an ontology.

Figure 3.2 Nature.com data categories as SKOS Play tree visualization

Although SKOS is a widely adopted vocabulary, its simplicity means it has a number of limitations. For example, it was found that SKOS cannot accommodate all the information held within authority records in the MARC standard that has been widely adopted within libraries. The MADS (Metadata Authority Description Schema), developed by the Library of Congress, addresses some of the limitations of SKOS, but it has not been as widely adopted, and has been criticized for not adopting elements from existing vocabularies (van Hooland and Verborgh, 2014). The SKOS vocabulary has also been extended in the form of SKOS-XL (SKOS eXtension for Labels), a W3C specification (www.w3.org/TR/skos-reference/skos-xl.html) that enables a label to be a reference rather than a string, which means additional information may be stated about that label.

OWL 2

OWL, the Web Ontology Language, became a W3C recommendation in 2004, superseding the earlier DAML+OIL ontology language, and was itself superseded when OWL 2 became a recommendation in 2009. Although the full name of OWL, the Web Ontology Language, might lead the information professional to believe that it is the key element set in the development of ontologies, much of the functionality that it includes is greater than is necessary for the development of simple informal ontologies. There is also little need for someone first starting out with ontologies to understand the differences between the three different profiles of OWL 2 (EL, QL, RL), and the restrictions they place on the use of different terms. Nonetheless, even if the full power of OWL 2 is not used, there are some elements that the information professional will undoubtedly find useful in the creation of simpler ontologies, and subsets of OWL have been recommended elsewhere for supplementing RDFS, for example, RDFS+ (Allemang and Hendler, 2011) and RDFS++ (Franz Inc., 2015).

Before going on to discuss some of the OWL properties most likely to be of interest to the information professional, it is worth noting that there are a number of OWL-specific syntaxes. Syntaxes for describing ontologies may be separated into two groups: those that describe the graph which describes the ontology, and those that describe the ontology directly (Vrandečić, 2009). A number of syntaxes have been developed for expressing OWL ontologies directly (e.g., OWL 2 Functional Syntax, OWL 2 XML Syntax, and Manchester Syntax), and a reader exploring OWL ontologies online may come across any of these syntaxes. OWL may also be expressed as RDF, and therefore in any of the RDF serializations, and that is how OWL terms are covered in this section.

OWL introduces three properties for expressing equivalence: owl:equivalentClass, owl:equivalentProperty and owl:sameAs. The first two should be self-explanatory, whilst the last is for stating that two URIs refer to the same resource. The semantic web is designed to link resources from across the web, each of which may have made use of different classes, properties, and identifiers.
There are many properties used on the semantic web where it is useful to be able to state the associated inverse property: hasParent, hasChild; hasAuthor, hasWritten; hasPart, isPartOf. owl:inverseOf allows these relationships to be expressed. If data is being combined from two genealogical sources, and one uses ‘hasParent’ whilst the other uses ‘hasChild’, it is not necessary to create a whole new set of triples; it is sufficient to add a single triple (hasParent owl:inverseOf hasChild) and infer the rest.

owl:InverseFunctionalProperty and owl:FunctionalProperty are particularly useful for joining separate data sets, as they can be used to state that a value cannot be shared by different entities, so if two entities share the same value then they must necessarily be the same entity, even if some of the other associated properties differ. Whereas owl:FunctionalProperty can be used to state that if two subjects associated with a particular predicate are the same, then the two objects must necessarily be the same, owl:InverseFunctionalProperty can be used to state that if two objects associated with a particular predicate are the same then the associated subjects must necessarily be the same. These are most easily shown by example.

A typical owl:FunctionalProperty might be that of having a mother. If we simplify the notion of motherhood, and discount stepmothers, adoptive mothers and three-parent babies, we can broadly say that every person has one, and only one, mother. Therefore it may be stated that:

    rel:hasMother    rdf:type    owl:FunctionalProperty

If two different sources state that one person has two different mothers (as represented by names or URIs), then it may be inferred that the two mothers are in fact the same person:

    :Jesus         rel:hasMother    :Saint_Mary
    :Jesus         rel:hasMother    :Virgin_Mary
    :Saint_Mary    owl:sameAs       :Virgin_Mary    [inferred]
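The inference above can be sketched in Python as a simple grouping exercise. This is an illustration of the rule only, not how an OWL reasoner is implemented, and the function name is a hypothetical one chosen for this example: objects that share both a subject and a functional property are paired up as owl:sameAs statements.

```python
from itertools import combinations


def same_as_from_functional(triples, functional_properties):
    """Infer owl:sameAs links between objects that share a subject and a
    functional property."""
    objects_by_key = {}
    for subject, predicate, obj in triples:
        if predicate in functional_properties:
            objects_by_key.setdefault((subject, predicate), set()).add(obj)

    inferred = set()
    for objects in objects_by_key.values():
        # A functional property allows only one value per subject, so any two
        # distinct objects recorded for the same subject must be the same thing.
        for first, second in combinations(sorted(objects), 2):
            inferred.add((first, "owl:sameAs", second))
    return inferred


triples = [
    (":Jesus", "rel:hasMother", ":Saint_Mary"),
    (":Jesus", "rel:hasMother", ":Virgin_Mary"),
]

for triple in same_as_from_functional(triples, {"rel:hasMother"}):
    print(*triple)
```

If rel:hasMother is left out of the set of functional properties, nothing is inferred, which reflects the fact that the conclusion depends entirely on the property having been declared functional.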

Equally, within different library catalogues it is quite often the case that authors’ names or book titles will be entered differently, and union catalogues often show multiple editions when in fact only one edition has been printed:

    catalogA:123    dc:title         “Lord of the Rings”
    catalogA:123    dc:identifier    “978-0261103252”
    catalogB:99     dc:title         “The Lord of the Rings”
    catalogB:99     dc:identifier    “978-0261103252”

The ISBN, however, is an identifying number that may be considered unique to each edition, and stating this with owl:InverseFunctionalProperty allows the fact that the two records describe the same edition to be inferred:

    dc:identifier    rdf:type      owl:InverseFunctionalProperty
    catalogA:123     owl:sameAs    catalogB:99    [inferred]
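The catalogue example can be sketched in the same spirit. The following is a minimal illustration, assuming the four catalogue triples above are held as Python tuples and using a hypothetical function name: subjects that share a value for an inverse functional property are linked with owl:sameAs, whatever their other properties say.

```python
from collections import defaultdict
from itertools import combinations


def same_as_from_inverse_functional(triples, inverse_functional_properties):
    """Infer owl:sameAs links between subjects that share an object for an
    inverse functional property."""
    subjects_by_key = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate in inverse_functional_properties:
            subjects_by_key[(predicate, obj)].add(subject)

    inferred = set()
    for subjects in subjects_by_key.values():
        # Two records with the same identifying value describe the same thing.
        for first, second in combinations(sorted(subjects), 2):
            inferred.add((first, "owl:sameAs", second))
    return inferred


catalogue = [
    ("catalogA:123", "dc:title", "Lord of the Rings"),
    ("catalogA:123", "dc:identifier", "978-0261103252"),
    ("catalogB:99", "dc:title", "The Lord of the Rings"),
    ("catalogB:99", "dc:identifier", "978-0261103252"),
]

for triple in same_as_from_inverse_functional(catalogue, {"dc:identifier"}):
    print(*triple)
```

The differing titles play no part in the match; only the shared identifier does. Whether an identifier property is reliable enough to be declared inverse functional is a modelling decision that depends on the data being combined.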

OWL also enables a distinction to be made between owl:DatatypeProperty and owl:ObjectProperty: owl:DatatypeProperty refers to the subset of properties that link to literals and other XML Schema datatypes; owl:ObjectProperty refers to the subset of properties that link to instances of other classes. For example, if the name property should always be a literal then :name rdf:type owl:DatatypeProperty. If, however, the address property requires a unique identifier for an object of the address class, then :address rdf:type owl:ObjectProperty.

OWL has far more extensive functionality than can be expressed here: for example, enabling unions, intersections, and cardinality. A fuller exploration of some of the functionality can be found in Allemang and Hendler’s (2011) Semantic Web for the Working Ontologist: effective modeling in RDFS and OWL, as well as the current OWL standards: www.w3.org/standards/techs/owl#w3c_all.
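The distinction between datatype and object properties described in this section can be sketched as a simple check on the kind of value a property is given. This is only an illustration: the Literal wrapper, the violations function, and the :david and :home resources are all assumptions made for this example, and an OWL reasoner would treat such a mismatch as a logical inconsistency rather than as a validation error.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A hypothetical wrapper marking a value as a literal rather than a URI."""
    value: str


# Declarations corresponding to the two triples given in the text.
property_kinds = {
    ":name": "owl:DatatypeProperty",
    ":address": "owl:ObjectProperty",
}


def violations(triples, property_kinds):
    """Return the triples whose object does not match the kind of property."""
    problems = []
    for subject, predicate, obj in triples:
        kind = property_kinds.get(predicate)
        if kind == "owl:DatatypeProperty" and not isinstance(obj, Literal):
            problems.append((subject, predicate, obj))
        elif kind == "owl:ObjectProperty" and isinstance(obj, Literal):
            problems.append((subject, predicate, obj))
    return problems


triples = [
    (":david", ":name", Literal("David")),                # literal, as declared
    (":david", ":address", ":home"),                      # resource, as declared
    (":david", ":address", Literal("1 Example Street")),  # literal where a resource is expected
]

for triple in violations(triples, property_kinds):
    print(triple)
```

Only the third triple is reported, because it supplies a string where the declaration of :address as an object property calls for the identifier of another resource.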

Ontologies for libraries

Dublin Core

Of all the ontology element sets discussed in this chapter, Dublin Core probably requires the least introduction to the information community, and has been described as the best known metadata brand on the web (Weibel, 2009). The original Dublin Core Metadata Element Set consists of 15 generic properties to describe online resources; however, as the web and web technologies have evolved so has Dublin Core. Dublin Core Terms currently consists of 55 terms defined as RDF properties (Table 3.1). This was part of a move towards supporting application profiles in 2000.

Table 3.1 Dublin Core Terms properties

abstract               coverage         hasFormat            isVersionOf  requires
accessRights           created          hasPart              language     rights
accrualMethod          creator          hasVersion           license      rightsHolder
accrualPeriodicity     date             identifier           mediator     source
accrualPolicy          dateAccepted     instructionalMethod  medium       spatial
alternative            dateCopyrighted  isFormatOf           modified     subject
audience               dateSubmitted    isPartOf             provenance   tableOfContents
available              description      isReferencedBy       publisher    temporal
bibliographicCitation  educationLevel   isReplacedBy         references   title
conformsTo             extent           isRequiredBy         relation     type
contributor            format           issued               replaces     valid

Although Dublin Core may be criticized for its simplicity, and for encoding to the lowest common denominator, this feature has led to it being widely adopted amongst other ontologies and application profiles. For example, the SIOC Core Ontology for representing data from the social web identifies six Dublin Core properties suitable for reuse with SIOC (dct:subject, dct:title, dct:created, dct:hasPart, dct:isPartOf, dct:modified), and OAI-ORE (Open Archives Initiative – Object Reuse and Exchange) incorporates five elements from Dublin Core Elements (dc:description, dc:format, dc:language, dc:rights, dc:title) and another 11 from Dublin Core Terms. Other ontologies have defined elements with reference to Dublin Core Terms. For example, the FOAF ontology (discussed in more detail below) defines the foaf:Agent class as an owl:equivalentClass to dc:Agent and the foaf:maker property as an owl:equivalentProperty to dc:creator, whilst the LODE (Linking Open Descriptions of Events) ontology (http://linkedevents.org) defines its Event class as an rdfs:subClassOf dc:Event. In other cases Dublin Core is used to include metadata about the ontology itself, and it is the default metadata standard for many publishing platforms.

For more professional bibliographic records, however, a specialized ontology may be required. There is a range of vocabularies available for encoding bibliographic information. Some have been created specifically for the semantic web, whilst other projects have focused on transforming existing standards to RDF.

Bibliographic Ontology e Bibliographic Ontology (http://bibliontology.com) was specifically designed for describing citations and bibliographic resources on the semantic web, based on a number of existing metadata standards. In addition to many classes and properties from existing element sets (e.g., dc:Agent, foaf:based_near), it also defines a number of new elements. For example, defining more specific types of documents (e.g., manual, manuscript, patent, thesis) and relationships between documents and agents (e.g., director, editor, interviewee). Like much of the semantic web, the Bibliographic Ontology can be seen as trying to strike a balance between the practical and the ideal. It is designed to be extensive enough to be of practical use, but not necessarily so exhaustive in its coverage of every potential outcome that it is too complex to be easily adopted. Its practical approach may be evidenced by its influence on Schema.org (which is discussed in more detail below), whilst its lack of exhaustiveness may be seen in comparison with CiTO (Citation Typing Ontology) (http://vocab.ox.ac.uk/cito). Although the Bibliographic Ontology can be used to represent the citations between resources, it doesn’t incorporate properties for distinguishing between different types of citation; a resource either cites another resource (bibo:cites), is cited by another resource (bibo:citedBy), or a citation relationship doesn’t exist. In comparison, CiTO incorporates properties to distinguish between different types of citation, for example, cito:citesAsAuthority, cito:citesAsDataSource, cito:disagreesWith. In most situations the distinction between citing and not citing would be sufficient, where a more subtle encoding is necessary; for example, in a bibliometric study of citation practices, a more extensive element set should be used.


FRBR, FaBiO, and FRBRoo

An important development in the representation of bibliographic objects has been the development of the FRBR (Functional Requirements for Bibliographic Records) model, representing bibliographic objects and their relationship with the associated responsible entity. An FRBR-ized library catalogue would not simply describe the bibliographic object that the library held, but its relationship with other things in the ‘bibliographic universe’ (Welsh and Batley, 2012, 8). FRBR provides a practical model for understanding the relationships between the different types of object that might be related in different types of bibliographic database. It distinguishes between three groups of entities: those representing the intellectual content, the agents associated with the content, and the subject of the content. Figure 3.3 shows an entity-relationship diagram of the representation of intellectual content, where distinctions are made between a work, an expression, a manifestation, and an item.

[Diagram: Work is realized through Expression; Expression is embodied in Manifestation; Manifestation is exemplified by Item.]

Figure 3.3 FRBR entities and relationships representing the intellectual content

A work is a distinct intellectual creation, for example, Charles Dickens’ Oliver Twist. This work may be realized through different expressions, for example, the original English-language version, a translation or an abridgement. These different expressions are then embodied within physical manifestations, for example, a particular edition published in a certain year or an audio recording of the expression. Finally, item refers to a single exemplar of the manifestation, for example, a specific copy sitting on a specific shelf in a particular library.

A publisher, a printer, a bookseller, and a library may all refer to and need to represent related bibliographic objects, but nonetheless need to represent them differently. The author and the publisher will primarily engage with the works and expressions; publishers, printers and booksellers are interested in the manifestations; whilst a library will need to represent individual items within its bibliographic catalogue. Importantly, FRBR is separate from any specific vocabulary for representing the bibliographic objects, so different vocabularies may be used for different user-communities or to enable different levels of bibliographic detail, whilst still adhering to the FRBR model. FaBiO (http://purl.org/spar/fabio) provides a vocabulary for representing bibliographic records on the semantic web according to the FRBR principles, and FRBRoo (www.cidoc-crm.org/frbr_inro.html) expresses the FRBR relationships as part of the CIDOC-CRM model.
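The chain of FRBR entities described above (work, expression, manifestation, item) can be sketched as a simple nested data structure. The sketch below is an informal illustration using Python dataclasses rather than an implementation of FRBR, FaBiO, or FRBRoo; the class fields, the items_of helper, and the edition and shelf details are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single exemplar, e.g. one copy on one shelf."""
    location: str


@dataclass
class Manifestation:
    """A physical embodiment of an expression, e.g. a particular edition."""
    description: str
    items: List[Item] = field(default_factory=list)  # 'is exemplified by'


@dataclass
class Expression:
    """A realization of a work, e.g. a translation or an abridgement."""
    description: str
    manifestations: List[Manifestation] = field(default_factory=list)  # 'is embodied in'


@dataclass
class Work:
    """A distinct intellectual creation."""
    title: str
    creator: str
    expressions: List[Expression] = field(default_factory=list)  # 'is realized through'


def items_of(work: Work) -> List[Item]:
    """Collect every item that ultimately exemplifies the given work."""
    return [
        item
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# Hypothetical catalogue data: the editions and shelf locations are invented.
oliver_twist = Work(
    title="Oliver Twist",
    creator="Charles Dickens",
    expressions=[
        Expression(
            "Original English-language text",
            [Manifestation("Example paperback edition",
                           [Item("Example Library, shelf A1"),
                            Item("Example Library, shelf A2")])],
        ),
        Expression(
            "Abridged audio recording",
            [Manifestation("Example audio CD",
                           [Item("Example Library, audio desk")])],
        ),
    ],
)

for item in items_of(oliver_twist):
    print(item.location)
```

A library catalogue would mostly work at the level of the Item objects, whereas an author or publisher would be more concerned with the Work and Expression levels of the same structure.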


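[Figure 3.4 Structuring intellectual content in FaBiO. Classes: :Work, :Expression, :Manifestation, :Item. Relationships: :Work :realization / :realizationOf :Expression; :Expression :embodiment / :embodimentOf :Manifestation; :Manifestation :exemplar / :exemplarOf :Item; :Work :hasManifestation / :isManifestationOf :Manifestation; :Expression :hasRepresentation / :isRepresentedBy :Item; :Work :hasPortrayal / :isPortrayedBy :Item]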

Figure 3.4 shows the relationships and classes for intellectual content in FaBiO. Each of the classes shown in the diagram also has multiple sub-classes and associated properties. For example, fabio:Expression has sub-classes defining both widely recognized types of expression (e.g., article, book, personal communication) and less obvious types of expression (e.g., dust jacket, nanopublication), with many of the sub-classes having further sub-classes. The big difference from the original FRBR model, however, is the addition of new relationships between item and expression, work and manifestation, and work and item, creating greater flexibility in the representation of data.

FRBRoo integrates the FRBR data model with the CIDOC-Conceptual Reference Model (CIDOC-CRM), a more extensive (and extensible) ontology for sharing information in the cultural heritage sector, which is returned to below. The International Working Group on FRBR and CIDOC-CRM Harmonisation (2015) identifies a number of differences between FRBRoo and the original FRBR model; the most noticeable in terms of the structuring of intellectual content are the inclusion of temporal aspects and a greater refinement of the four associated classes, for example distinguishing between different classes in the context of the author and the publisher.

Another element set that adheres to the FRBR data model is that of Resource Description and Access (RDA), the replacement for the second edition of the Anglo-American Cataloguing Rules (AACR2).
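One way to picture the additional FaBiO relationships between work and manifestation, work and item, and expression and item is as shortcuts across the core FRBR chain. The following is a minimal Python sketch under that assumption: the property names follow the labels in Figure 3.4, the "ex:" identifiers are hypothetical, and deriving the shortcuts by walking the chain is an illustration rather than something FaBiO itself prescribes.

```python
# Illustrative only: the "ex:" identifiers are hypothetical, and deriving the
# direct links from the chain is an assumption made for this sketch.

# Core FRBR chain, stored as (subject, property, object) triples.
triples = {
    ("ex:work-1", "realization", "ex:expression-1"),
    ("ex:expression-1", "embodiment", "ex:manifestation-1"),
    ("ex:manifestation-1", "exemplar", "ex:item-1"),
}


def objects(store, subject, prop):
    """Return every object linked to `subject` by `prop`."""
    return {o for s, p, o in store if s == subject and p == prop}


def add_shortcuts(store):
    """Add direct work/manifestation, work/item and expression/item links."""
    derived = set(store)
    for work, prop, expression in store:
        if prop != "realization":
            continue
        for manifestation in objects(store, expression, "embodiment"):
            derived.add((work, "hasManifestation", manifestation))
            for item in objects(store, manifestation, "exemplar"):
                derived.add((work, "hasPortrayal", item))
                derived.add((expression, "hasRepresentation", item))
    return derived


enriched = add_shortcuts(triples)
for triple in sorted(enriched - triples):
    print(triple)
```

With the direct links in place, a record that only knows about a work and a physical copy can still be expressed without inventing intermediate expression and manifestation records.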


RDA

RDA updates AACR2 for the 21st century, providing a more international and more refined set of data elements for an increasingly inclusive digital world. Although its primary application is within the library community, it was also designed with other communities in mind and may be used as an element set in other domains. Representations of the RDA elements are available at the RDA registry: www.rdaregistry.info. These are associated with eight classes. Four of them align with the FRBR model (Work, Expression, Manifestation, and Item). The other four classes are for representing an associated agent, with an Agent class and three subclasses (Person, Family and Corporate Body). As would be expected from a set of elements designed specifically for the cataloguing of bibliographic objects for the library community, the list of elements is quite extensive. As of February 2016, there were 296 properties associated with Work, 268 with Expression, 254 with Manifestation, 60 with Item and 269 with Agent.

Although the list of associated properties is more extensive than many organizations or individuals would need to use regularly for expressing bibliographic information, many of the elements have uses beyond bibliographic material and the FRBR model. This is true not only of many of the Agent properties, but also of certain properties associated with Works, Expressions, Manifestations, and Items. This is recognized within RDA, and unconstrained URIs are provided which are independent of the FRBR model and have no specified domain or range. For example, whereas the property for the preferred title of a resource of the Work class can be represented by the URI http://rdaregistry.info/Elements/w/preferredTitleForTheWork.en, there is also an unconstrained version of the property with a distinct URI (http://rdaregistry.info/Elements/u/preferredTitleForTheResource.en) designed for resources that are not of the class Work.

That the RDA elements have not been as widely adopted as those from some other element sets may be due to their relative youth. The library community is still in the process of transitioning from AACR2 to RDA, and as RDA becomes increasingly well known it seems likely that the elements will become more widely adopted. RDA elements are already adopted in the Europeana Data Model (see below).
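The choice between a constrained and an unconstrained RDA property can be sketched in a few lines of Python. The two URIs are those quoted above; the lookup tables, the function name and the "Painting" class are hypothetical and are not part of the RDA Registry.

```python
# The two URIs are quoted in the text; the dictionaries and the function are
# hypothetical helpers written for this sketch.
CONSTRAINED = {
    # (property label, class the property is constrained to) -> URI
    ("preferred title", "Work"):
        "http://rdaregistry.info/Elements/w/preferredTitleForTheWork.en",
}
UNCONSTRAINED = {
    "preferred title":
        "http://rdaregistry.info/Elements/u/preferredTitleForTheResource.en",
}


def property_uri(label, subject_class):
    """Use the constrained URI when the class matches, else the unconstrained one."""
    return CONSTRAINED.get((label, subject_class), UNCONSTRAINED[label])


print(property_uri("preferred title", "Work"))
print(property_uri("preferred title", "Painting"))
```

The point of the fallback is that a description of something that is not an FRBR Work does not wrongly imply that it is one, which is what using a property with a specified domain would do.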

Bible Ontology

It is important to remember that ontologies are not restricted to describing resources; it is also possible to describe the content of works. One such example is the Bible Ontology, included here not for its widespread adoption but as an example of a narrative work transformed into an ontology. The Bible Ontology (http://bibleontology.com) has encoded some of the contents, people, and events in the Bible. This is designed to enable a new way to engage with both the Bible and resources that are linked to and from the Bible Ontology. For example, it is possible to retrieve a list of people associated with particular events, retrieve information about social networks, or find people born in particular areas. At the moment the ontology is quite limited, both in terms of the proportion of the Bible that has been encoded and the extent to which it has been encoded. Nonetheless it demonstrates the potential for a far more extensive ontology, potentially incorporating FRBR-type aspects, distinguishing not only between works, expressions and manifestations of different versions of the Bible, but also between the interpretations that have been added to the texts by theologians and biblical scholars. The Bible is not the only narrative that has been transformed in such a manner; the fact that the Quran is out of copyright, and is the subject of a lot of scholarship, has also seen it transformed (Rizwan, Aida and Zulkifli, 2013).

Upper ontologies

An upper ontology provides a non-domain-specific framework within which to consider entities and relationships. Constraining the structure of ontologies within different domains facilitates cross-domain ontology integration and knowledge sharing; however, such constraints may be considered the antithesis of the linked data approach to ontology design, which privileges the freedom of individual perspective and a low entry level to ontology creation. Nevertheless one upper ontology is considered here, Basic Formal Ontology (BFO) (https://github.com/bfo-ontology/BFO). This is a small upper ontology that has been adopted in a number of scientific domains, and is used here to illustrate how an upper ontology can be extended by a domain ontology. The domain ontology considered in this case is the Information Artifact Ontology (IAO) (https://github.com/information-artifact-ontology/IAO/), which has been designed to distinguish between different types of information entity (Ceusters, 2011).

Basic Formal Ontology

BFO broadly distinguishes between two types of entity: continuants and occurrents. Continuants, as the name suggests, are entities that continue to exist through time, such as places and people, whereas occurrents are entities that happen, such as events or processes. Below these two entities exist two hierarchies of subtypes, a hierarchy of 24 subtypes below bfo:continuant and a hierarchy of eight subtypes below bfo:occurrent. Directly below bfo:continuant in the hierarchy are three subtypes: ‘independent continuant’; ‘specifically dependent continuant’; and ‘generically dependent continuant’. The definitions for these three subtypes provided below have been taken from the IAO OWL file:

• Independent continuant: ‘A continuant that is a bearer of quality and realizable entity entities, in which other entities inhere and which itself cannot inhere in anything.’
• Specifically dependent continuant: ‘A continuant that inheres in or is borne by other entities. Every instance of A requires some specific instance of B which must always be the same.’
• Generically dependent continuant: ‘A continuant that is dependent on one or other independent continuant bearers. For every instance of A requires some instance of (an independent continuant type) B but which instance of B serves can change from time to time.’

Each of the three types of continuant has been extended within IAO, and the extensions illustrate more clearly the differences between the types of continuant.

‘Generically dependent continuant’ doesn’t have any further subtypes within BFO, although within IAO it has been extended with the subtype ‘information content entity’, which is then further divided into the types of information that may exist. For example, an information content entity may be a patent, a footnote, an abstract or a single piece of data. Information content entities are generically dependent continuants because they are not realizable on their own: they must exist on a particular medium, although the nature of this holder of information can vary considerably, in the same way as an expression of a work may be embodied in multiple different manifestations within the FRBR model.

‘Independent continuant’ has the subtypes ‘immaterial entity’ and ‘material entity’ in BFO. As the names would suggest, a material entity is an independent continuant that consists of physical material, for example a person, whereas an immaterial entity is an independent continuant that has no material parts, for example the boundary between two countries. IAO extends BFO with the concept of ‘material information bearer’, which concretizes an information content entity, one type of which is a photographic print. A photographic print is an independent continuant because it is a realizable entity, not dependent on other entities. The generically dependent continuant of a photograph inheres in the photographic print.

Whereas a generically dependent continuant may inhere in different entities (for example, the same photograph may appear as a photographic print or as a digital image), a ‘specifically dependent continuant’ depends on specific entities, for example the height of a particular person. BFO has two subtypes of specifically dependent continuant: quality and realizable entity. A quality, such as the height of a person, is always realized within the person, whereas a realizable entity is only realized at particular times or in particular processes: for example, the role of being a doctor is only realized when a person is treating a patient, not all the time. IAO has extended BFO through the addition of the role ‘author role’, which inheres in a person or organization who works on the creation of a specific document.

The problem with a universal upper ontology is that there is a high entry point for making use of the ontology. The requirement that an upper ontology is universally applicable means that it forces people to think within the confines of a formal structure that does not reflect the way most people usually think. More intuitive are the data models that are designed for a broad range of domains rather than all domains.
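The way IAO hangs its own types beneath the BFO hierarchy can be sketched as a simple child-to-parent table. This is a minimal Python illustration, assuming plain-text labels rather than the real BFO or IAO identifiers, and including only the types named in this section; placing ‘material information bearer’ under ‘material entity’ and ‘author role’ under a ‘role’ subtype of realizable entity is an assumption of the sketch.

```python
# Child -> parent links for the types named in this section. The labels are
# plain strings, not the real BFO or IAO identifiers.
PARENT = {
    "continuant": "entity",
    "occurrent": "entity",
    "independent continuant": "continuant",
    "specifically dependent continuant": "continuant",
    "generically dependent continuant": "continuant",
    "material entity": "independent continuant",
    "immaterial entity": "independent continuant",
    "quality": "specifically dependent continuant",
    "realizable entity": "specifically dependent continuant",
    "role": "realizable entity",
    # IAO extensions (placement is an assumption of this sketch)
    "information content entity": "generically dependent continuant",
    "material information bearer": "material entity",
    "photographic print": "material information bearer",
    "author role": "role",
}


def ancestors(label):
    """Walk up the hierarchy, nearest parent first."""
    chain = []
    while label in PARENT:
        label = PARENT[label]
        chain.append(label)
    return chain


def is_a(label, candidate):
    """True if `label` is `candidate` or one of its subtypes."""
    return label == candidate or candidate in ancestors(label)


print(ancestors("photographic print"))
print(is_a("author role", "continuant"))
print(is_a("author role", "occurrent"))
```

Because every domain type resolves to one of a small number of upper-level types, data described with different domain ontologies built on BFO can still be compared at the upper level.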

Cultural heritage data models

Whereas ontologies are often designed to deal with one particular domain, more extensive models have been developed that are designed to encode resources in a wide range of domains. CIDOC-CRM and the Europeana Data Model are two such models, both of which are designed to be extensible, although they differ considerably in their complexity.

Europeana Data Model e Europeana Data Model (EDM) (http://pro.europeana.eu/page/edm-documentation) provides generalized metadata properties for resources and their contextual classes, so that data from a wide range of cultural heritage institutions (including libraries, archives and museums) can be accommodated within a single model. e model accommodates aggregations of cultural heritage objects, and their digital representations, as well as five main contextual classes: edm:Agent, edm:Place, edm:TimeSpan, skos:Concept, edm:Event. It is designed to be extensible through the use of subClassOf and subPropertyOf properties from the RDFS vocabulary to provide greater levels of granularity. For example, the Agent class may have the subclasses of people, organization, and group, whilst the isRelatedTo property which is used to express relationships between agents may have subproperties such as knows, parentOf, or subOrganizationOf. EDM has formed the basis of a number of additional extensions, including Digitised Manuscripts to Europeana (http://dm2e.eu) and the CENDARI EDM Extension (www.cendari.eu/about-us/project-deliverables). EDM is designed to be a simple data model suitable for data from a wide range of

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 71

ExISTIng OnTOLOgIES

71

cultural heritage institutions, and therefore a more specialized or extensive model may be appropriate where the primary purpose is the sharing of data within a more homogenous community.
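The effect of extending EDM with subClassOf and subPropertyOf can be sketched as a small inference step: statements made with the more specific terms also hold for the broader EDM terms. The Python below is a minimal sketch; the "ex:" classes, properties and individuals are invented for illustration, and only a single level of specialization is handled.

```python
# Hypothetical EDM extension: every "ex:" term is invented for this sketch.
SUB_PROPERTY_OF = {
    "ex:knows": "edm:isRelatedTo",
    "ex:parentOf": "edm:isRelatedTo",
    "ex:subOrganizationOf": "edm:isRelatedTo",
}
SUB_CLASS_OF = {
    "ex:Person": "edm:Agent",
    "ex:Organization": "edm:Agent",
    "ex:Group": "edm:Agent",
}

data = [
    ("ex:agent-1", "rdf:type", "ex:Person"),
    ("ex:agent-1", "ex:parentOf", "ex:agent-2"),
]


def generalize(triples):
    """Add the broader statements implied by the declarations (one level only)."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in SUB_PROPERTY_OF:
            inferred.add((s, SUB_PROPERTY_OF[p], o))
        if p == "rdf:type" and o in SUB_CLASS_OF:
            inferred.add((s, "rdf:type", SUB_CLASS_OF[o]))
    return inferred


for triple in sorted(generalize(data)):
    print(triple)
```

A consumer that only understands the core EDM terms can therefore still make use of data published with a more granular extension.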

CIDOC-Conceptual Reference Model

The CIDOC-Conceptual Reference Model (CIDOC-CRM) (www.cidoc-crm.org) provides a more extensive data model than EDM, one that all types of cultural heritage institution can map their information to, and is the culmination of over a decade of work. The additional level of granularity contained within CIDOC-CRM may be shown through a comparison of a single EDM property, edm:isRelatedTo, and some of the CIDOC-CRM properties that it might be considered to subsume. edm:isRelatedTo provides a very broad relationship for linking a resource with an associated concept or resource. Amongst the many potential sub-properties of edm:isRelatedTo identified in a harmonization with CIDOC-CRM (Doerr, 2011) were: was made for; was intended for; was influenced by; used specific object; was motivated by; continued; was based on; used constituent. There are also many other properties for which there is no equivalent within EDM. CIDOC-CRM undoubtedly allows for greater expressiveness than EDM, but that does not mean that it is necessarily most suitable for the semantic web, or will become more widely adopted. Despite the relative complexity of CIDOC-CRM, many of the properties have been included within application profiles and used for the publishing of data sets. For example, it is used by the British Museum Semantic Web Collection Online (http://collection.britishmuseum.org) and by CLAROS for sharing scholarly data sets on classical art (www.clarosnet.org).

Ontologies for the web

Interest in sharing high-quality data is not confined to the cultural heritage sector, and an increasingly semantic web is not limited to traditional knowledge organizations. As has been seen with the adoption of social media and web 2.0 technology, there is a lot of value in the information that can be captured by a dispersed community, and some of the most popular ontologies have been designed for encoding and sharing data on the web.

DBpedia ontology

Wikipedia needs little introduction. Although news headlines can still be made by Wikipedia’s occasional libellous or deliberate error, it nonetheless provides an encyclopedia of unprecedented size and scope, and is often a person’s first port of call when searching for introductory materials on virtually any topic. The breadth of its content, and the fact that some of it has structure, has also made it of interest to the semantic web community. DBpedia has been a collaborative effort to extract the structured data from Wikipedia and make it available on the web as RDF (Auer et al., 2007). The breadth of the content on Wikipedia has meant that DBpedia is the central node in the Linking Open Data cloud diagram (http://lod-cloud.net/), a diagram that has been published regularly since 2007 to show those significant data sets that have been published in a linked data format and linked to another data set.

During the process of publishing the structured data DBpedia has built an extensive ontology element set, which takes two forms: the official ontology and the Wikipedia structure. Although DBpedia is primarily of use for the information that it contains, rather than the ontology element set, the element set can nevertheless provide the creator of a new ontology with a wide range of terms that will already have been used in at least one other instance. It may be simply browsed online (http://mappings.dbpedia.org/server/ontology/classes), although, as may be expected from a crowdsourced encyclopedia, there are significant differences in how often the properties have been applied. A simple way to see how the properties are actually used in practice is to view individual instance records, and the DBpedia knowledge base can be searched through a simple interface (http://rubenverborgh.github.io/dbpedia-lookup-page). Of course, as all the information in DBpedia is in a machine-readable format, it is possible to automatically query the frequency with which properties are associated with a particular class of resource, and this is covered in the section on SPARQL in Chapter 6.
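Ahead of the SPARQL treatment in Chapter 6, the idea of counting how often properties are applied to instances of a class can be sketched in plain Python. The triples below are a hypothetical sample written for this sketch, not data taken from DBpedia, and the function name is an assumption.

```python
from collections import Counter

# Hypothetical sample of (subject, property, object) triples; not real data.
triples = [
    ("ex:Book1", "rdf:type", "dbo:Book"),
    ("ex:Book1", "dbo:author", "ex:Author1"),
    ("ex:Book1", "dbo:publisher", "ex:Publisher1"),
    ("ex:Book2", "rdf:type", "dbo:Book"),
    ("ex:Book2", "dbo:author", "ex:Author2"),
    ("ex:Book2", "dbo:author", "ex:Author3"),
    ("ex:Film1", "rdf:type", "dbo:Film"),
    ("ex:Film1", "dbo:director", "ex:Director1"),
]


def property_usage(store, class_name):
    """Count how many instances of `class_name` use each property."""
    instances = {s for s, p, o in store if p == "rdf:type" and o == class_name}
    # De-duplicate so a property repeated on one instance is counted once.
    used = {(s, p) for s, p, o in store if s in instances and p != "rdf:type"}
    return Counter(p for _, p in used)


usage = property_usage(triples, "dbo:Book")
for prop, count in usage.most_common():
    print(prop, count)
```

Counting instances rather than raw statements gives a fairer picture of how widely a property has been adopted, since a single heavily described resource would otherwise dominate the totals.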

FOAF e FOAF (Friend of a Friend) ontology is one of the most established semantic web ontologies. Initially designed for describing people and organizations on the web and linking them together, many of the terms have now been adopted in many other ontologies. A typical FOAF page for a person may include their name, their interests, projects they are working, or have worked, on, and relationships to people they know. e example below provides an RDF/XML example of a FOAF page, created with the FOAF generator FOAF-a-Matic: www.ldodds.com/foaf/foaf-a-matic.html:







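    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <foaf:Person rdf:ID="me">
        <foaf:name>David Stuart</foaf:name>
        <foaf:title>Dr</foaf:title>
        <foaf:givenname>David</foaf:givenname>
        <foaf:family_name>Stuart</foaf:family_name>
        <foaf:mbox_sha1sum>3643f4a5ef82fdeb1797ed5dbe20426ce56552f4</foaf:mbox_sha1sum>
        <foaf:knows>
          <foaf:Person>
            <foaf:name>John Smith</foaf:name>
            <foaf:mbox_sha1sum>e2ac3658f1846accac478b74162411b15bdeea98</foaf:mbox_sha1sum>
          </foaf:Person>
        </foaf:knows>
        <foaf:knows>
          <foaf:Person>
            <foaf:name>Joe Bloggs</foaf:name>
            <foaf:mbox_sha1sum>d5dffaead6701ed41f0977d811fec1da5eabb708</foaf:mbox_sha1sum>
          </foaf:Person>
        </foaf:knows>
      </foaf:Person>
    </rdf:RDF>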

The use of foaf:mbox_sha1sum allows e-mail addresses to be used as unique IDs without sharing the e-mail address.

Establishing widely used ontologies that provide structure for social networking-type functionality offers the potential to challenge the excessive power of dominant social network sites such as Facebook and Twitter, allowing for greater freedom of expression and control over one’s own data. As Tim Berners-Lee (2007) put it, ‘I express my network in a FOAF file, and that is a start of the revolution’, although the revolution seems to be fairly slow-burning so far, and other social media ontologies such as SIOC (Semantically-Interlinked Online Communities) have not been widely adopted.

The huge potential of social networking ontologies, albeit more in theory than in practice, and the longevity of FOAF in particular, mean that the few quantitative studies that have focused on the semantic web have looked at FOAF specifically. Ding et al. (2005) investigated over 1.5 million FOAF documents, the vast majority of which came from two large blog sites, LiveJournal (www.livejournal.com) and DeadJournal (www.deadjournal.com), while Golbeck and Rothstein (2008) looked solely at the FOAF pages created by social network sites. Stuart (2012) focused on the use of FOAF by the academic community within the UK.

Although there has been sporadic use of the FOAF ontology for FOAF pages, its principal use has been in the adoption of parts of its vocabulary, such as foaf:name, foaf:title, and foaf:knows, within other ontologies. Such reuse is discussed in more detail in Chapter 4. As well as social network terminology, FOAF also contains a number of demonstrator and technical terms designed to support data linking efforts. Of particular interest is foaf:focus, a term that is designed to build relationships between concepts and things. For example, the British Library Data Model (Book) (www.bl.uk/bibliographic/pdfs/bldatamodelbook.pdf) uses foaf:focus to relate Person-as-Concept with Person-as-Agent, Organization-as-Concept with Organization-as-Agent, and Place-as-Concept with Place-as-Thing.
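Returning to the foaf:mbox_sha1sum values in the FOAF example earlier in this section, the following minimal Python sketch shows how such a value might be produced. It assumes, following the FOAF specification as understood here, that the hash is taken over the full mailto: URI rather than the bare address; the function name and the example address are hypothetical.

```python
import hashlib


def mbox_sha1sum(email):
    """Return the SHA-1 hex digest of the mailto: URI for an address.

    Assumption: the hash covers the whole "mailto:" URI, not the bare address.
    """
    mailto_uri = "mailto:" + email.strip()
    return hashlib.sha1(mailto_uri.encode("utf-8")).hexdigest()


# Hypothetical address used purely for illustration.
digest = mbox_sha1sum("someone@example.org")
print(digest)
```

Two FOAF files that describe the same person will produce the same digest, so the descriptions can be linked without either file revealing the address itself.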

Schema.org

Schema.org is probably the most important collection of element sets on the web today; not only is it designed specifically for the web, but it is designed specifically for use by search engines, the principal starting point for most people searching for information. Schema.org is an initiative to create a common vocabulary for marking up web data in microdata, although it has also been incorporated into other ontologies. It was launched in June 2011 by Google, Yahoo and Bing (Google, 2011), with Yandex, the Russian search engine, joining in November 2011 (Schema.org, 2011). It is designed to create vocabularies for the sorts of information that may be found on millions of sites across the web, especially where there is value in having the information structured and aggregated. For example, being able to distinguish a company’s name and contact details from the surrounding text could allow for the automatic development of business directories.

Schema.org is not the first initiative to attempt to provide such vocabularies, or even the first attempt by some of the major search engines. One of the longest-established initiatives is microformats (http://microformats.org); established in 2005, it made use of the available functionality in HTML 4 to encode a limited set of popular vocabularies. As is noted on the microformats site, their simple approach, and designing for humans first and machines second, has seen them outlast some previous initiatives from Yahoo (e.g., CommonTag.org) and Google (e.g., Google Base and Google Data) (Microformats, 2014). Schema.org is, however, already more successful than previous initiatives by the search engines. This is attributable not only to its being a combined initiative, but also to developments in web standards that facilitate more extensive vocabularies (e.g., microdata and RDFa), and to the increased recognition of the potential of marking up data by content creators (see Chapter 2 for a more detailed discussion of these technologies).

Some of the classes of thing for which schema.org has defined properties include: creative works, events, person, place, and product. Many of these classes have already been the subject of ontology work elsewhere: a schema:Person is not dissimilar to a foaf:Person, a schema:Book is not dissimilar to a dc:BibliographicResource, and a schema:Product is not dissimilar to a goodrelations:ProductOrService (www.heppnetz.de/ontologies/goodrelations/v1.html#ProductOrService). However, different organizations, with different aims and objectives for their ontology, may have different requirements, and require classes of objects to be modelled differently. Requirements can also change over time. For example, it seems likely that if the development of FOAF had started in 2010, rather than 2000, there would be more social media properties and fewer instant messenger properties, although if it had started in 2015 the rise in mobile phone messaging apps would mean properties would have to be developed for both.

Table 3.2 provides a comparison of the properties for schema:Person and foaf:Person. Both classes inherit properties from related super-classes: schema:Person inherits properties from schema:Thing, whilst foaf:Person inherits properties from the foaf:Agent super-class and owl:Thing. Table 3.2 lists the schema:Person properties (including those inherited from schema:Thing) and an equivalent (or near equivalent) from owl:Thing, foaf:Agent or foaf:Person where one exists. Whereas foaf:Person is designed for describing a person’s online presence, with multiple properties for describing online accounts, schema.org has a greater focus on real-world relationships and the provision of services.
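To make the idea of marking up web data with schema.org more concrete, the following Python sketch generates a fragment of HTML microdata for a schema:Person. The itemscope, itemtype and itemprop attributes are standard microdata; the function name, the choice of three properties and the example values are assumptions made for illustration.

```python
from html import escape


def person_microdata(name, job_title, url):
    """Return an HTML fragment describing a person with schema.org microdata."""
    return (
        '<div itemscope itemtype="http://schema.org/Person">\n'
        f'  <span itemprop="name">{escape(name)}</span>\n'
        f'  <span itemprop="jobTitle">{escape(job_title)}</span>\n'
        f'  <a itemprop="url" href="{escape(url)}">Homepage</a>\n'
        '</div>'
    )


# Hypothetical person and address, used only as sample input.
fragment = person_microdata("Jane Doe", "Librarian", "http://example.org/jane")
print(fragment)
```

The text a human visitor sees is unchanged; the extra attributes simply tell a search engine which piece of text is the name and which is the job title.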
No single organization can hope to successfully model all types of things in a manner suitable for everyone, not even Google. It is therefore important that ontology engineers learn from existing ontologies, and gather as wide an input as possible. Schema.org does both of these things. Part of the reason for the number of similarities between schema:Product and goodrelations:ProductOrService is that schema.org has made use of the GoodRelations vocabulary, as the acknowledgement on the schema:Product page states:

    This class contains derivatives of properties from the GoodRelations Vocabulary for E-Commerce, created by Martin Hepp.
    http://schema.org/Product


Table 3.2 Comparison of schema:Person with foaf:Person (a dash indicates that no equivalent is listed)

schema:Thing              foaf
additionalType            -
alternateName             nick
description               -
image                     img
mainEntityOfPage          homepage
name                      name
potentialAction           -
sameAs                    -
url                       -

schema:Person             foaf
additionalName            -
address                   -
affiliation               workplaceHomepage
alumniOf                  schoolHomepage
award                     -
birthDate                 -
birthPlace                -
brand                     -
children                  knows
colleague                 knows
contactPoint              -
deathDate                 -
deathPlace                -
duns                      -
email                     mbox
familyName                familyName
faxNumber                 -
follows                   -
gender                    -
givenName                 givenName
globalLocationNumber      -
hasPOS                    -
height                    -
homeLocation              -
honorificPrefix           -
honorificSuffix           -
isicV4                    -
jobTitle                  title
knows                     knows
makesOffer                made
memberOf                  -
naics                     -
nationality               -
netWorth                  -
owns                      -
parent                    knows
performerIn               -
relatedTo                 knows
seeks                     -
sibling                   knows
spouse                    knows
taxID                     -
telephone                 phone
vatID                     -
weight                    -
workLocation              -
worksFor                  workplaceHomepage
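A mapping such as Table 3.2 can be put to work when converting records between vocabularies. The Python sketch below uses a subset of the pairs from the table; the function name and the sample record are hypothetical, and since the table lists near equivalents as well as equivalents, a real conversion would need to check that the values themselves are compatible.

```python
# A subset of the schema.org -> FOAF pairs listed in Table 3.2.
SCHEMA_TO_FOAF = {
    "name": "name",
    "alternateName": "nick",
    "image": "img",
    "jobTitle": "title",
    "email": "mbox",
    "telephone": "phone",
    "familyName": "familyName",
    "givenName": "givenName",
    "knows": "knows",
}


def to_foaf(schema_record):
    """Split a schema:Person record into FOAF properties and leftovers."""
    mapped, unmapped = {}, {}
    for prop, value in schema_record.items():
        if prop in SCHEMA_TO_FOAF:
            mapped["foaf:" + SCHEMA_TO_FOAF[prop]] = value
        else:
            unmapped["schema:" + prop] = value
    return mapped, unmapped


# Hypothetical record used only as sample input.
record = {"name": "Jane Doe", "jobTitle": "Librarian", "birthDate": "1980-01-01"}
mapped, unmapped = to_foaf(record)
print(mapped)
print(unmapped)
```

Keeping the unmapped properties visible, rather than silently dropping them, makes it clear how much of a record one vocabulary can express that the other cannot.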

Some of the subclasses and properties of the schema:CreativeWork class have also benefited from existing ontologies, e.g., schema:Periodical (http://schema.org/Periodical) was ‘inspired in places’ by the Bibliographic Ontology mentioned above. Improving schema.org for bibliographic information is the focus of the Schema Bib Extend Community Group (www.w3.org/community/schemabibex), which includes many librarians as well as other information professionals.

Even if you are not part of a formal group with a mission to improve schema.org, it is nonetheless possible to extend schema.org for personal use. For example, someone marking up details of people on a website may wish to distinguish those people who are information professionals from those who are computer scientists, and specialized classes and properties can in turn become more specialized, for example, to distinguish between distinct types of information professional: librarian, archivist, ontologist. The search engines won’t be able to understand these finer levels of semantics, and so won’t be able to distinguish between a librarian, an archivist and an ontologist, but they will nonetheless identify that they are all types of people. If the extensions become widely adopted, then they may be included within the core of schema.org.

Schema.org is primarily designed for the web and for encoding information within web pages, and whilst the element sets may be used for linked data, that is not their primary purpose. For example, the URIs for particular classes or properties do not provide dereferenceable RDF, nor is there a way of distinguishing between the page associated with a property and the page with information about the property. Nonetheless, for all its faults, the fact that it is supported by some of the world’s major search engines means that it is important to be aware of. There are also ambitions for it to have a more central role as a hub for more specialized vocabularies. Peter Mika, a Director of Research at Yahoo Labs, sees the main challenge for schema.org as being the ability to scale for the increasing number of extensions and alignments (Mika, 2015). Whether it would be good for a rich, diverse semantic web to have a single dominant provider of ontologies is a perennial question. Open issues and discussions surrounding schema.org are available on the associated GitHub page: https://github.com/schemaorg/schemaorg/issues.

Facebook Open Graph Protocol

In 2007 Tim Berners-Lee blogged ‘I express my network in a FOAF file, and that is a start of the revolution’ (Berners-Lee, 2007), but as has already been mentioned, the revolution has been quite slow in coming. Despite open element sets and technologies holding the promise of data portability, and of stopping users becoming locked into proprietary systems, people nonetheless continue to flock to the dominant social network sites, with Facebook claiming 1.49 billion active users each month (Facebook, 2015). There is, however, some crossover between Facebook and the web of data in the form of the Open Graph Protocol (OGP) (http://ogp.me). As Mika points out, whilst there are many more extensive vocabularies available for specialized domains, there are ‘only two major sets of schemas in usage – schema.org and Facebook’s much smaller Open Graph Protocol’ (Mika, 2015, 53–4).

The OGP is designed to enable the representation of web content within the social graph through the addition of metadata for popular types of content, e.g., video, audio and images, as well as web pages themselves. The importance of Facebook on the web means that OGP has become one of the most successful vocabularies online. In an analysis of 3 billion HTML web pages, from over 40 million web sites, Bizer et al. (2013) found that the OGP accounted for six of the most frequently used classes, and indeed it can be found on many of the most popular content providers (e.g., www.bbc.co.uk, www.theguardian.com). Browser extensions are available to help web users view this data, e.g., the META SEO Inspector for the Chrome browser includes OGP data.

Many of these concepts could have been accommodated by existing vocabularies. For example, the four required properties for every page are og:title, og:type, og:image, and og:url, and the widely adopted Dublin Core could have provided near-equivalents with the elements dct:title, dct:type, and dct:identifier. Facebook user tests, however, found that content managers were unhappy with having to remember names from multiple sources (Allemang and Hendler, 2011), and as Allemang and Hendler go on to point out, the semantic web is designed to support multiple solutions.
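The four required OGP properties named above are added to a page as meta elements in its head. The Python sketch below generates those elements and refuses to do so when a required property is missing; the function name and the example page details are hypothetical.

```python
from html import escape

# The four properties the text identifies as required for every page.
REQUIRED = ("og:title", "og:type", "og:image", "og:url")


def ogp_meta_tags(properties):
    """Return one <meta> element per property, checking the required four."""
    missing = [name for name in REQUIRED if name not in properties]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))
    return "\n".join(
        f'<meta property="{escape(name)}" content="{escape(value)}" />'
        for name, value in properties.items()
    )


# Hypothetical page details used only as sample input.
tags = ogp_meta_tags({
    "og:title": "Example page",
    "og:type": "website",
    "og:image": "http://example.org/image.png",
    "og:url": "http://example.org/",
})
print(tags)
```

Viewing the source of a page from a large content provider and searching for "og:" is a quick way to see the same pattern in use.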

Conclusion ere really are too many publicly available ontologies to mention in detail, with more emerging all the time as projects are funded that incorporate the development of increasingly niche ontologies for specific domains or from novel perspectives. Each of these ontologies can be seen to compete for attention, at least on the semantic web, where data is generally published with the aim of being reused, and reuse is highly contingent on the accessibility of the ontology (unless you happen to have a dataset the size of Google or Twitter). Most ontologies are, inevitably, destined to die before they are widely adopted, and even those that are widely adopted today will not necessarily be widely adopted tomorrow. Nevertheless, this chapter has introduced some of the most important ontologies for the semantic web, for librarians and for use on the web. What quickly becomes clear is the interrelated nature of many of these ontologies and application profiles, oen reusing elements and being designed to interact with one another. e next chapter discusses how to find other ontologies, and some of the criteria that should be considered when deciding which ontologies to reuse in whole or in part.


CHAPTER 4

Adopting ontologies

Introduction

With a large number of ontologies already available, there will be many occasions when an existing ontology fulfils some or all of an information professional’s ontology requirements. The reuse of existing ontologies can not only help with the integration of data across different systems, but may, importantly, also reduce the repetition of work. This chapter starts by considering the importance of reusing existing ontologies in more detail, before going on to explore some of the tools that are available for identifying existing ontologies, and some of the issues that may contribute to deciding whether or not to incorporate a particular ontology.

Reusing ontologies: application profiles and data models

The reuse of existing knowledge organization systems and processes is well established within the library community. Most libraries find it far more appropriate to adopt or extend an existing classification or cataloguing system than to create a bespoke in-house system, with the advantages of a custom-made system far outweighed by the costs of maintaining the system and making use of other institutions’ data. The principal reasons for reusing an existing ontology are the same as those for reusing other existing knowledge organization systems: cost-effectiveness and facilitating the interoperability of data. Although cost-effectiveness may be considered particularly pertinent to the adoption of an ontology with a large number of instances, and interoperability particularly applicable to the adoption of existing element sets, both reasons apply (albeit not equally) to both parts of an ontology: the development of a logically consistent element set with clearly defined classes and properties that can accommodate all expected types of instance is not necessarily a simple process, whilst the adoption of the same instances across multiple platforms may facilitate information retrieval from multiple sites and sources. For the sake of simplicity, however, here we primarily consider cost-effectiveness in terms of ontologies with a large number of instances, and the advantage of adopting existing element sets for data interoperability.

The number of instances that may be created on average per hour will vary considerably in ontology development (as Hedden, 2010, notes for taxonomical terms), but the creation of a high-quality ontology can nonetheless be expected to be extremely labour-intensive, especially for an extremely large ontology or a broad subject area. For example, Getty’s Art & Architecture Thesaurus (AAT) (www.getty.edu/research/tools/vocabularies/aat) contains 353,285 different terms, and a similarly sized ontology would not only take many person-years to develop, but even then might not be of as high a quality as AAT, since AAT has been in development since the 1970s and has been tested against the rigours of multiple users in multiple different settings. The amount of work necessary in the development of a single instance in an ontology will depend heavily on the subject, purpose, extensiveness and necessary quality of an ontology. The consequences of missing or incorrect information in a pharmaceutical knowledge base are likely to differ significantly from similar deficiencies in an ontology of World War 1 events. In the former case, if the information is not readily available, multiple sources may need to be consulted until it is found; in the latter, the ontologist may quickly move on to the next term.

The creation of a consistent element set may be equally time-consuming, especially where a large number of classes or properties need to be incorporated. For example, Gene Ontology (http://purl.bioontology.org/ontology/GO) consists of 43,531 classes. In a fast-moving field there are obvious difficulties in trying to keep up with precise definitions for such a large number of concepts. Although Arp, Smith and Spear (2015) emphasize that an ontology should include fixed knowledge, there is inevitably greater understanding of established terms than of new terms.
The real importance of reusing an existing ontology, however, especially with less formal ontologies, is to facilitate data interoperability. Adopt a widely used element set and the encoded information will be more widely understood by external services (if the information is made public), and is more likely to be portable to alternative software. For example, the semantics of information encoded in Dublin Core will be understood by a wide range of programs and automatic agents across the web, and additional services may be built on top of this data; whilst the bespoke property :writer may be used to express the same information as dc:creator in a bibliographic dataset, and the term may confer a level of understanding to human readers, the property would have little meaning to most programs and automatic agents.

Element sets do not have to be adopted in their entirety, or on their own; rather, they may be treated as a buffet from which individual classes and properties are selected or referred to. An application profile is a schema that combines terms from one or more element sets for a particular application (Heery and Patel, 2000). It is a practical approach to implementing ontologies: making use of what is available rather than fixating on fulfilling all the criteria of a schema. After all, in most instances an ontology that exactly matches all of an implementer’s requirements will not be available. Even if there is an ontology designed for a particular topic, it will not necessarily include the same properties or be suitable at a particular level of granularity. For example, both a science museum and a scientific instrument maker may wish to represent a set of scientific instruments. For the scientific instrument maker it is likely to be particularly important that the technical specifications, and associated legal requirements and intellectual property rights, are represented in the ontology. For the science museum, much of this information is likely to be irrelevant, with particular importance being attributed to the provenance of an object and its historical context. How a scientific instrument should be represented will differ again for those who employ the instruments in their labs, or sell them from their warehouses.

The N8 Equipment Ontology (http://n8eo-browser.appspot.com) was developed by the University of Manchester for describing scientific equipment shared by the N8 consortium of UK universities in the north of England. This ontology describes scientific equipment according to the type of equipment, function, technique and location. Whilst the function and technique information would be likely to be too detailed for the science museum, the technical specifications are probably not extensive enough for the instrument maker. It also lacks the legal requirements and intellectual property rights required by the instrument maker, and the provenance and historical context required by the science museum. This does not make it a bad ontology, but rather emphasizes how different organizations can have different ontology needs, whilst some parts of an ontology may still be useful.
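The notion of an application profile as a selection of terms from several element sets can be sketched in a few lines of Python. This is an illustrative sketch only: the museum profile, its choice of terms, the `ex:` namespace and the helper functions are assumptions made for the example, not a published profile. The Dublin Core Terms and FOAF namespace URIs are the standard ones.

```python
# Namespace prefixes used by the hypothetical profile.
NAMESPACES = {
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ex": "http://example.org/museum#",  # hypothetical local namespace
}

# A hypothetical application profile for a science museum: most terms are
# picked from existing element sets, and one local term fills the gap.
MUSEUM_PROFILE = [
    "dct:title",
    "dct:creator",
    "dct:provenance",
    "ex:historicalContext",
]


def expand(term: str) -> str:
    """Turn a prefixed name such as 'dct:title' into a full URI."""
    prefix, local_name = term.split(":", 1)
    return NAMESPACES[prefix] + local_name


def reused_terms(profile: list[str]) -> list[str]:
    """Return the terms borrowed from existing element sets (not 'ex:')."""
    return [term for term in profile if not term.startswith("ex:")]


if __name__ == "__main__":
    for term in MUSEUM_PROFILE:
        print(term, "->", expand(term))
```

An instrument maker could define a different list over the same namespaces, which is the point of the buffet analogy: the element sets stay shared while the selection varies.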
Even in the information profession, where a large number of vocabularies have already been created for the most popular document types, and there has been a long history of co-operation in the development of standards, there may be cases where a particular special collection has not been accommodated or the ontology doesn’t include all the properties a person wishes. The bibliographic landscape is also changing. As new formats and infrastructures emerge it may be necessary for these to be incorporated into an ontology: authors are not only represented by names, but increasingly represented by an ORCID (Open Researcher and Contributor ID); a ‘book’ or a ‘journal article’ no longer necessarily refers to a single primarily textual resource, but may bring together new types of content (e.g., video, audio, datasets) with a narrative text to form an enhanced publication. Some ontologies and data models have been specifically designed with extensions in mind. For example, schema.org enables both reviewed and external extensions (https://schema.org/docs/extension.html) and the Europeana Data Model is designed to be extensible through the use of rdfs:subClassOf and rdfs:subPropertyOf.
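Extension through rdfs:subPropertyOf means that a statement made with a local property also implies the same statement with the more general property, so software that only knows the general property can still use the data. The following is a minimal sketch of that inference in plain Python, assuming triples are simple tuples and the sub-property declarations contain no cycles; the `ex:` properties and the sample data are hypothetical, and a real system would normally rely on an RDF library or reasoner.

```python
# Triples are plain (subject, predicate, object) tuples; no RDF library is used.
Triple = tuple[str, str, str]

# Hypothetical extension: local properties declared as sub-properties of a
# widely understood one. Assumed to be free of cycles.
SUB_PROPERTY_OF = {
    "ex:writer": "dct:creator",
    "ex:ghostWriter": "ex:writer",
}


def super_properties(prop: str) -> list[str]:
    """Follow rdfs:subPropertyOf links upwards from a property."""
    found = []
    while prop in SUB_PROPERTY_OF:
        prop = SUB_PROPERTY_OF[prop]
        found.append(prop)
    return found


def infer(triples: set[Triple]) -> set[Triple]:
    """Return the triples plus those implied by the sub-property links."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        for parent in super_properties(predicate):
            inferred.add((subject, parent, obj))
    return inferred


if __name__ == "__main__":
    data = {("ex:book1", "ex:ghostWriter", "ex:alice")}
    for triple in sorted(infer(data)):
        print(triple)
```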

Other elements have been specifically designed for reuse, away from their original setting. Two principal examples of this are Dublin Core Terms and RDA. The extensive interlinking between many of the vocabularies that form part of linked data can be seen in the visualizations on the Linked Open Vocabularies (http://lov.okfn.org/dataset/lov) website, with the connections between Schema.org and other vocabularies shown in Figure 4.1.

Figure 4.1 Linking between Schema.org and other vocabularies as shown on Linked Open Vocabularies

This does not mean that it is always appropriate to make use of an existing ontology. Sometimes, even if one is already available for reuse, a researcher may decide to develop an independent ontology. In developing ONTOSSN (a Scientific Social Network ONTOlogy) Ahmed et al. (2014) decided to build an ontology from scratch, despite the fact that many of the concepts could have been taken from other ontologies that have broadly attempted to do the same thing, e.g., FOAF and CERIF. As has already been mentioned, in the development of the Open Graph Protocol it was decided not to reuse existing ontologies, as user tests found that content managers were unhappy with having to remember names from multiple sources (Allemang and Hendler, 2011).

Whilst there are, as Diane Hillman is often quoted as saying, ‘no metadata police’ (Heery and Patel, 2000), committing to an ontology nonetheless requires statements to be ‘logically consistent’ with an ontology’s definitions if they are to be meaningful statements (Gruber, 2009). It is also important to recognize, as Davis, Shrobe and Szolovits (1993) point out, that an ontology is necessarily an imperfect partial representation of a thing rather than reflecting the true complexity of the real world, and choosing one particular ontology reflects a decision to focus on certain aspects of the world. As perceptions change about which aspects of a particular thing should be represented, there may be limitations in reusing existing ontologies, especially where their dominance has been established by being an early mover.

Identifying ontologies

It is important to be aware of the ontologies that are available, although even if an appropriate ontology exists it is not always easy to find. The search has been likened to dowsing rather than a scientific process (W3C, 2010). Current online ontology identification may make use of ontology libraries, an ontology search engine, or even a traditional search engine.

An ontology library has been defined as ‘a Web-based system that provides access to an extensible collection of ontologies with the primary purpose of enabling users to find and use one or several ontologies from this collection’ (Noy and d’Aquin, 2011). The term ‘collection’ is included to suggest an element of curation in comparison to an ontology or semantic search engine; whereas ontology libraries provide a curated collection of ontologies, ontology and semantic search engines are based on data received from automatic crawls of the semantic web. In reality the distinction between the two is not always clear-cut: a library may use a certain amount of automatic indexing and a search engine may use some curation (e.g., in deciding to ignore certain domains or languages), but the distinction is important in helping users understand the value of the different services. Most importantly, there is no single source that includes all ontologies, and a comprehensive search should make use of multiple services.

Ontology Libraries ere are a number of ontology libraries available online, covering different areas and having different levels of functionality. Four of these are discussed below, with the various features and associated metrics correct at the time of writing (August 2015), although these features invariably change over time. Whilst they are discussed here as ontology libraries, they oen include a far wider range of knowledge organization systems, and as is oen the case, the terms they use are not necessarily used consistently. Ontology libraries may be as difficult to find as ontologies themselves,

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 84

84

pRACTICAL OnTOLOgIES

.

although the Wikipedia ontology page (http://en.wikipedia.org/wiki/Ontology_ (information_science)) contains a limited list of ontology libraries.

Linked Open Vocabularies

Linked Open Vocabularies (LOV), http://lov.okfn.org, is an ontology library hosted by the Open Knowledge Foundation, designed to provide access to high-quality element sets for linked data. Curators include ontologies that follow best practices, and augment the metadata where information cannot be extracted automatically. As well as enabling a user to search through the 515 ontologies collected in its library, LOV provides a number of ways to browse through the vocabularies and, as was noted above, provides visualizations showing the interlinking between the different ontologies. The data within LOV may also be queried via a SPARQL endpoint, downloaded as triples, or accessed via an extensive API (enabling additional services to be built on top of LOV). The high quality of the metadata and the numerous ways of interacting with the data combine to make LOV one of the most user-friendly ontology libraries, even though it is not the largest.
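Querying a library through a SPARQL endpoint generally amounts to sending a query string as the `query` parameter of an HTTP request. The sketch below only builds such a request and does not send it. The endpoint address is a placeholder, and the assumption that the library's data describes classes with owl:Class and rdfs:label is mine rather than something stated above; both would need checking against the library's own documentation.

```python
from urllib.parse import urlencode

# Placeholder endpoint: substitute the address the library documents.
ENDPOINT = "http://example.org/sparql"


def build_label_query(keyword: str, limit: int = 10) -> str:
    """Build a SPARQL query for classes whose rdfs:label contains a keyword."""
    # Escape backslashes and double quotes so the keyword is a safe literal.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
        "SELECT DISTINCT ?class ?label WHERE {\n"
        "  ?class a owl:Class ;\n"
        "         rdfs:label ?label .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?label)), "{escaped.lower()}"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(keyword: str) -> str:
    """Encode the query as a GET request URL using the 'query' parameter."""
    return ENDPOINT + "?" + urlencode({"query": build_label_query(keyword)})


if __name__ == "__main__":
    print(build_label_query("person"))
    print(build_request_url("person"))
```

Keeping query construction separate from the network call makes it easy to inspect the query before pointing it at a real service.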

BARTOC.org

BARTOC (BAsel Register of Thesauri, Ontologies & Classifications) (www.bartoc.org) is the largest of the libraries discussed here, and provides a register of ‘controlled and structured vocabularies’, enriched with Dewey Decimal Classification and subject headings from Eurovoc (http://eurovoc.europa.eu). Of the 1,518 vocabularies indexed, however, only 113 are listed as ontologies, the topics of which are shown in the word cloud in Figure 4.2. Not all vocabularies of use in building an ontology are necessarily classified within BARTOC as ontologies, however. For example, Dublin Core metadata terms and Bibframe are both classified as controlled vocabularies. Although BARTOC does not provide the extensive visualizations of ontology interlinking that LOV provides, it nonetheless incorporates a wide range of ways of accessing its data. As well as a basic search, it provides a faceted search by Dewey Decimal Classification, topic, language, and type (e.g., ontology, thesaurus, glossary), and allows the vocabularies to be browsed by place, rating or recency. It also allows the data to be downloaded in a wide range of formats (e.g., XML, CSV, DOC) and provides a SPARQL endpoint.

Figure 4.2 Word cloud of subject headings of ontologies in BARTOC

Taxonomy Warehouse e Taxonomy Warehouse (www.taxonomywarehouse.com) is a more traditional vocabulary library with very basic functionality for discovering the 666 vocabularies in its library: they may be browsed in alphabetical order, or searched for through a basic search interface. Nonetheless, it contains a wide range of controlled vocabularies on a wide range of subjects, and detailed metadata records.

BioPortal

BioPortal (http://bioportal.bioontology.org) is an example of a specialized ontology library, providing a repository of biomedical ontologies. At the time of writing BioPortal claims to include 446 ontologies that may be browsed or searched. Whilst most of these are primarily of interest to the biomedical community, some of the scientific ontologies indexed will have a wider application: for example, ontologies on scientific units and scientific workflows. BioPortal also includes a number of metrics which may help in the evaluation of an ontology. On the library homepage it includes information about the ontologies that have been viewed most often, whilst the summary page for an ontology provides metrics about the structure of the ontology, as well as some information about the projects that are making use of the ontology. Although the number of views of an ontology may be interesting, such a list is likely to quickly become self-fulfilling, with those viewed most often being the ones at the top of the list.

Ontology and semantic search engines

Ontology and semantic search engines have both advantages and disadvantages in comparison to ontology libraries, in much the same way as there are advantages and disadvantages of search engines in comparison to directories for finding information on the traditional web. The automatic indexing of content enables a service to have a more up-to-date index and ensures maximum coverage; however, it comes at the cost of including lower-quality ontologies and limiting metadata to that which can be automatically extracted. For the traditional web of documents, users have found that the advantages of search engines far outweigh those of a directory, as directories have not scaled well to the size of the web; however, there aren’t the same problems of scale for the manual cataloguing of ontologies. There may be hundreds or thousands of publicly available ontologies published on the web, but such a number is dwarfed by the billions of web pages that are available.

Understanding how ontologies are used on the web, however, requires the indexing of the whole of the web of data, and that requires an automated process. The ontology library LOV incorporates information about the number of data sets within which an ontology is used by incorporating data from LODstats (http://stats.lod2.eu/), which contains information on 3,426 data sets, but even this may be thought of as a small section of the semantic web. For example, whereas Bizer et al. (2013) found that Facebook’s OGP accounted for six of the most frequently used classes, it is not identified as being used by any linked open data sets on LOV.

There have been a number of different semantic search engines with different levels of functionality, but many of them suffer from the problem that they start as academic or research projects with a limited period of support. For example, Sindice.com, one of the most extensive semantic indexes, contained sufficient machine-readable data to be used in other research projects (e.g., Stuart, 2012), but as of 2015 was no longer available.

Swoogle

Swoogle (http://swoogle.umbc.edu) is an ontology search engine that was developed by the eBiquity research group at the University of Maryland, Baltimore County (UMBC) over ten years ago (Ding et al., 2004). Despite its age the site continues to provide a useful service for finding ontologies used on the semantic web, through a simple search interface as well as a RESTful interface for automatic querying. Over the years its data has been used in a wide range of research papers: it has been used to show the growth in the semantic web (Auer et al., 2009); to find common terms used across different research communities (Jayadianti et al., 2013); and to identify appropriate vocabularies in a proposed linked data modelling methodology (Schaible et al., 2013).

Watson

Watson (http://watson.kmi.open.ac.uk/WatsonWUI/) provides another simple search interface and RESTful interface for searching ontologies. It also includes options for selecting the type of entities that are matched (i.e., classes, properties, and individuals) and the part of the entity that should match the keyword that is searched for (e.g., name, label or comment).
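The two kinds of search option described above, restricting the type of entity and restricting the part of the entity that must match, can be illustrated with a small in-memory search. This is a sketch of the general idea only and does not use or reproduce Watson's interface; the `Entity` structure, the `search` function and the sample records (including the wording of their comments) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    uri: str
    kind: str          # "class", "property" or "individual"
    name: str
    label: str = ""
    comment: str = ""


# Illustrative records; the comments are paraphrases, not quoted definitions.
SAMPLE = [
    Entity("foaf:Person", "class", "Person", "Person", "A person."),
    Entity("dct:Agent", "class", "Agent", "Agent",
           "Examples include a person, an organization or a software agent."),
    Entity("foaf:knows", "property", "knows", "knows",
           "A person known by this person."),
]


def search(entities, keyword,
           kinds=("class", "property", "individual"),
           parts=("name", "label", "comment")):
    """Return entities of the chosen kinds whose chosen parts contain keyword."""
    needle = keyword.lower()
    return [
        entity for entity in entities
        if entity.kind in kinds
        and any(needle in getattr(entity, part).lower() for part in parts)
    ]


if __name__ == "__main__":
    # Classes only, matching on name or label but not on comments.
    for hit in search(SAMPLE, "person", kinds=("class",), parts=("name", "label")):
        print(hit.uri)
```

Narrowing the matched parts trades recall for precision: searching comments as well would also return entities that merely mention the keyword in passing.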

Falcons

Falcons (http://ws.nju.edu.cn/falcons/) is a more recent ontology search engine. It is similar to Swoogle and Watson, but provides greater functionality for browsing through the results and finding related results, and provides a visual representation of the results from an ontology search. For example, Figure 4.3 shows the top two results for a search for ‘person’ within the Ontology query form. The top two results are from two of the most popular vocabularies that may be used to describe a person: FOAF and DC terms (details on the ranking algorithm can be found in Cheng et al., 2011). The first result shows that the FOAF ontology defines a Person class, which is linked by owl:equivalentClass to two other ontologies with a Person class, and has the label Person. The second result shows that DC terms also has an Agent class, the description of which includes the word ‘person’. Additional context to the class is provided by four properties that have Agent as a range.

Figure 4.3 A search for ‘person’ within the Falcons Ontology Search

A web search engine e most popular search engines need little introduction; Google is synonymous with the web for millions of users, and it is a natural destination for people searching for anything, including an ontology. ere are, however, many disadvantages in using a general search engine to search for an ontology, and few advantages. Amongst the disadvantages are: inconsistent use of the term ‘ontology’; unsuitability of ontologies for search engine indexing; and unsuitability of search engine indexes for discovering ontologies. On the plus side, however, a general search engine has the potential to discover ontologies that are not incorporated into the semantic web, and despite its shortcomings Google has previously been used as the basis for data collection in a study of FOAF (e.g., Ding et al., 2005). As was discussed in the introduction to terminology in Chapter 1, and can be seen in the terminology used by the ontology libraries above, there is inconsistent use in terminology surrounding ontologies. Not only is ‘ontology’ used both with information science and philosophy, but relevant ontology element sets and instances may be referred to variously as thesauri, taxonomies, glossaries, controlled vocabularies, metadata or dictionaries. Finding a suitable ontology may depend on the selection of the appropriate knowledge organization system term, although there is no guarantee that any such term will be associated with a publicly available ontology. ose who are publishing ontologies on the web will probably not be giving particular attention to search engine optimization, or documentation. Ontologies that have been designed by a small team for one particular application may have been made freely available online for reuse in a suitable format (e.g., RDF or OWL), but if the primary aim is not the widespread adoption of an ontology then time is unlikely to be spent on it. 
Unfortunately, for resources to be returned by a search engine, and to be ranked sufficiently highly so that they can be seen, they need to include the relevant search terms and be sufficiently linked to by ranked documents with the same terms.

General search engines are also not designed for the searching of ontologies. Where ontologies have been appropriately indexed by a search engine, and are sufficiently highly ranked to be found, they nonetheless often lack the structured data that enables the quick evaluation of ontologies and comparisons to be made between resources. The sort of information that may have already been systematically captured in an ontology library, such as the subject, format and creator, along with metrics about an ontology’s size and use, may need to be pieced together from multiple different sources, if it is available at all.

One potential advantage, however, is that a search engine might uncover ontologies that are not actually part of the semantic web. Within this book the focus is on ontologies that are being used on the semantic web, as those provide the opportunity to share more data than ever before and develop new insights from across the world. There are, of course, many other bespoke and proprietary ontologies used within commercial organizations, or vocabularies that predate the semantic web, that may be of interest to someone building an ontology. Anyone who has spent time searching the web for existing ontologies, however, will undoubtedly quickly recognize the validity of the comparison between dowsing and ontology searching.

The ideal ontology discovery tool ere is undoubtedly a need for more extensive ontology discovery tools to encourage the searching, browsing and reuse of existing ontologies, and it is worth considering what types of functionality such tools would ideally have to facilitate with the assessment of new tools as they emerge.

1. Ontologies AND other controlled vocabularies

Ontologies are only one of a number of different types of controlled vocabulary, and it may be that an appropriate vocabulary has already been developed but not been structured in an RDF format, or that a simpler vocabulary may be adopted for a particular task. Unless ontologies and other types of controlled vocabularies can be browsed and searched together, there is likely to be a lot of unnecessary repetition of work. As can be seen from the examples above, many ontology libraries already include a wider range of other controlled vocabularies.

2. Metrics about use

Metrics are often limited within an ontology library, especially regarding the use of ontologies. Although information about the number of classes, properties, and instances within an ontology may be simple to compile, and is relatively stable, information about how an ontology is being used is far more difficult to capture and likely to be more dynamic. For example, the number of classes and properties that comprise the FOAF vocabulary can be simply identified by visiting the FOAF page for a particular specification. The FOAF vocabulary specification 0.99, the Paddington edition, can be seen to comprise 13 classes (including ‘archaic’ classes and properties). Although this gives some insight into the extensiveness of the ontology, it is metrics about the use of the ontology that are likely to be of most use when trying to decide whether to adopt the whole, or part, of an ontology. How widely is an ontology being used? Is it being used many times on a few sites, or a few times on many sites? Is it being used evenly, or are some parts being used more often than others? Although FOAF is one of the most widely adopted element sets, it seems likely that its popularity rests more on properties such as foaf:familyName and foaf:firstName than on foaf:myersBriggs and foaf:dnaChecksum.

Information about how an ontology is used is particularly difficult to track, requiring an index of all the pages that have been published using a particular ontology. Although the semantic web is not as large as the traditional web of documents, it would nonetheless be a significant undertaking to index all of it, especially if the index was to be kept up to date, just to discover that one particular ontology was more favoured than another similar one. Metric data would be particularly useful if it was associated with the date on which the data was published, so that people could identify those ontologies that, even if they weren’t widely adopted, could nonetheless be seen as up-and-coming.

Even if information about the use of an ontology is captured, it is nonetheless only a rough guide to the importance of an ontology. That data is published in accordance with a particular ontology doesn’t mean that anyone is making use of it.
Although an ontology library may capture information about the number of times an ontology is viewed or explored on its site, this gives little indication of its wider adoption. It is also important to recognize that some sites may be considered more important than others. If only one site is using a particular ontology, but that site happens to be the British Library, then other libraries may pay the ontology particular attention.
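The questions posed in this section (how often a term is used, and across how many sites) can be made concrete with a small calculation. The sketch below assumes a crawl has already produced a list of (site, property) observations; the observations, site names and function name are invented for illustration and are not real usage figures for FOAF.

```python
from collections import Counter

# Hypothetical crawl output: one (site, property) pair per observed use.
OBSERVATIONS = [
    ("a.example", "foaf:firstName"),
    ("a.example", "foaf:firstName"),
    ("a.example", "foaf:familyName"),
    ("b.example", "foaf:firstName"),
    ("c.example", "foaf:myersBriggs"),
]


def usage_metrics(observations):
    """Summarize how often, and on how many sites, each property is used."""
    uses = Counter(prop for _, prop in observations)
    sites = {}
    for site, prop in observations:
        sites.setdefault(prop, set()).add(site)
    return {
        prop: {"uses": uses[prop], "sites": len(sites[prop])}
        for prop in uses
    }


if __name__ == "__main__":
    for prop, metrics in usage_metrics(OBSERVATIONS).items():
        print(prop, metrics)
```

Reporting the two counts separately distinguishes a property used many times on a few sites from one used a few times on many sites; weighting sites by importance, as suggested above, would require an additional score per site.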

3. Ontology ranking

Metrics may not only provide insights into the structure and suitability of an ontology element set, but may also be used for ranking ontologies within an ontology library, and a number of different ranking algorithms have been proposed. Li, Shi and Cheng (2014) have proposed an ontology ranking method based on formal concept analysis, a tool for representing, analysing and extracting rules about the relationships between objects and attributes. Sridevi and Umarani (2014) have proposed a ranking based on a semantic closeness measure using keywords’ and class labels’ synonym rings (semantically equivalent terms) taken from the lexical database WordNet (https://wordnet.princeton.edu). Ontologies are, by their nature, semantically rich resources, providing the opportunity for the development of a wide range of new and innovative ranking mechanisms. The ideal ontology library would not only provide a range of ways of ranking the data, but also be explicit about the nature of the ranking algorithms.
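To give a flavour of ranking by synonym rings, the sketch below scores each ontology by the overlap between the synonym ring of a search keyword and the rings of its class labels. It is a simplified stand-in, not the measure proposed by Sridevi and Umarani: the overlap is a plain Jaccard coefficient, the synonym rings are hand-written rather than taken from WordNet, and the ontology names and labels are hypothetical.

```python
# Hypothetical synonym rings; a real system might draw these from WordNet.
SYNONYM_RINGS = {
    "person": {"person", "individual", "human"},
    "individual": {"person", "individual", "human"},
    "agent": {"agent", "actor"},
    "event": {"event", "happening", "occurrence"},
}


def ring(term):
    """Return the synonym ring for a term, or just the term itself."""
    term = term.lower()
    return SYNONYM_RINGS.get(term, {term})


def closeness(keyword, label):
    """Jaccard overlap between the rings of a keyword and a class label."""
    a, b = ring(keyword), ring(label)
    return len(a & b) / len(a | b)


def rank(ontologies, keyword):
    """Order ontology names by their best-matching class label."""
    scores = {
        name: max((closeness(keyword, label) for label in labels), default=0.0)
        for name, labels in ontologies.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    candidates = {
        "people": ["Individual", "Group"],
        "events": ["Event"],
        "agents": ["Agent", "Human"],
    }
    for name, score in rank(candidates, "person"):
        print(f"{name}: {score:.2f}")
```

Being explicit about the scoring function in this way is what the final sentence of the section asks of an ideal library: a user can see why one ontology is ranked above another.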

4. Description and provenance information

An ontology library should also provide descriptions of the ontologies and their provenance. As well as descriptive text providing indexable content to help with the retrieval of an ontology from an ontology library, a description of the ontology and its provenance can help with the selection of the ontology. Is the organization behind the ontology reputable? Does it have similar aims and objectives to those of the prospective user? Is the ontology likely to continue to be developed in the future? How often has it been updated in the past? Is the development of the ontology and its elements documented?

5. Subject headings and classification

Controlled vocabularies are not only the subject of an ontology library; subject headings and classifications may also be used to improve the recall of ontologies from the ontology library. Depending on the number of vocabularies within a library, such classifications may be at quite a coarse level of granularity. For example, even as coarse a level of classification as the ten top-level classes of the Dewey Decimal Classification (e.g., Science, History and Geography, Arts and Recreation) may be a useful tool in helping users browse to the ontologies that they want in a general ontology library. As the number of ontologies increases, however, or in a specialized ontology library, there would obviously be a need for more specialized subject headings and classification terms.

6. Ontology viewing tools

An ontology library should ideally also make it easy to browse the contents of an ontology. The documentation that accompanies published ontologies varies considerably, meaning that the person looking for an ontology to fill particular needs must expend a lot of cognitive effort in assessing the suitability of any particular ontology. Providing a consistent interface for browsing the ontologies would make the process far simpler.

7. Long-term viability

Like much of the semantic web, ontology libraries are often short-term and ad hoc affairs. Often they are little more than a list of links. The development of a robust semantic web and the wider adoption of ontologies, however, requires a far more professional set of libraries that can provide insights into the changing nature of ontology use, provide comprehensive coverage and build a reputation for reliability. Unless the ontologies that are already available are simple to find, reuse and build upon, people will continue to create their own, and few ontologies will get widespread dissemination.

8. Crowdsourcing content

Ontology libraries can undoubtedly be enhanced by the automatic crawling of the semantic web, but it is nonetheless important to recognize that not all ontologies are on the semantic web, and the ideal ontology discovery tool should also contain those controlled vocabularies that are hidden in non-machine-readable formats or behind closed doors. Identifying such hidden vocabularies, and encouraging a more vibrant ontology marketplace, are only possible through the contributions of the widest possible number of users.

Selection criteria

If an ontology is found, it is necessary to assess whether it meets the requirements of the potential user. Although it is often difficult to find a suitable ontology for a particular situation, for some of the more popular types of entity there will be multiple ontologies to choose from. For example, an event may be captured by the LODE (Linking Open Descriptions of Events) ontology (http://linkedevents.org), the Event Ontology (http://motools.sourceforge.net/event/event.html), or Schema.org’s Event class (http://schema.org/Event). Even if only one ontology can be found there is always the option of creating a new ontology, so it is necessary to have some criteria by which to judge an ontology.

Gruber (1995) provides a list of five criteria to which an ontology should aspire, each of which may be considered within any evaluation: clarity; coherence; extendibility; minimal encoding bias; and minimal ontological commitment. Clarity refers to the ability of an ontology to convey the meaning of its associated terms. Unfortunately ontologies often necessitate the definition of highly technical terms or multiple terms with subtle differences between them, and as few creators of ontologies are experienced lexicographers the definitions are rarely as clear as a user may wish. Coherence refers to the consistency of the ontology, ensuring that both inferences and definitions are logical. Extendibility refers to the ability of users to extend an ontology to meet their own specific needs. An encoding bias refers to a restriction being implemented for the convenience of a particular implementation rather than reflecting the knowledge that is to be encoded: for example, restricting dates to the format dd/mm/yyyy, despite many of the dates not being known at that level of granularity. Minimal ontological commitment is about the adoption and consistent use of ontologies, requiring that no more claims should be made about a world that is being modelled than is necessary. This issue is returned to below.

Some additional criteria that may be considered include: goodness of fit; format; currency; adoption; documentation; and licensing. Goodness of fit refers to how well an existing ontology (or element) meets the requirements of the required ontology, and is the most important factor. It is, unfortunately, often simpler to say when an ontology element is definitely not suitable for a particular task than when one is. For example, despite there being similarities between a book and a gravestone, in that they both involve written information on a variety of materials, it’s highly unlikely that an ontologist would try to build a bibliographic data set making use of the Graves Ontology (http://rdf.muninnproject.org/ontologies/graves.html), but determining the suitability of some of the closer-fitting ontologies may be more difficult.

It may be that a potential ontology element set can accommodate the proposed data, but that the terms are too broad. Dublin Core is the classic example of an element set that can accommodate a wide range of data, but which is likely to be too broad for a specialist collection. Rather than having the associated contents rattling around in the broad elements, one solution is to create new sub-properties and sub-classes for existing properties and classes.
For example, in an archive of classical music recordings, rather than using the broad Dublin Core term ‘creator’ it is probably more useful to distinguish between composers, conductors and musicians. Creating each of these as sub-properties of ‘creator’ can enable greater specificity within a dataset, whilst facilitating the reuse of the data by services built on top of Dublin Core.

The alternative problem is that an ontology, or some of its elements, may be too restrictive, which relates to Gruber’s (1995) idea of minimal ontological commitment. For example, a horse breeder may wish to publish a data set about the lineage of the horses in his/her stables. Although there are a number of relationship ontologies that enable the expression of parent–child relationships, generally these are associated with people rather than horses. Someone (or more likely, something) reading the triple :RedRum schema:parent :Quorum might, following Schema.org’s documentation, reasonably expect both :RedRum and :Quorum to refer to people rather than horses. Although, at least on the semantic web, ‘anyone can say anything about anything’ (W3C, 2002) and there are ‘no metadata police’ (Heery and Patel, 2000), such practices should generally be avoided, to encourage both reuse of data and the semantic web more generally. The inclusion of unconstrained properties alongside constrained properties in the publishing of some ontologies, e.g., RDA, provides a nice solution to the problem. If, for example, Schema.org incorporated both constrained and unconstrained properties, the triple :RedRum schema:parent :Quorum would suggest that :RedRum and :Quorum were people, whereas :RedRum schemaunconstrained:parent :Quorum would suggest no such thing. schemaunconstrained:parent could be used in a far wider range of situations, not only in relation to the child–parent relationship of other types of animal, but also abstract concepts, with one idea being the ‘parent’ of another idea.

The biggest problem in terms of ‘goodness of fit’ is that often an ontology will cover only part of the requirements. An ontology may be required to describe people, resources and events. Whilst each of these classes may be adequately covered with a selection of properties from existing ontologies (e.g., FOAF, DC and LODE), it may be that a large number of namespaces is considered overly complicated. As with Facebook’s Open Graph Protocol, it may be deemed more appropriate to create a consistent element set, especially where data is expected to be created by a large number of external users.

Format refers to the way the ontology has been encoded. Whilst RDF, SKOS and OWL are increasingly important for sharing data on the semantic web, they are by no means the only formats that are available. It is still not unusual for a small taxonomy to be created and shared in an Excel spreadsheet. There are many ontology data sets that precede the development of OWL and are only available in DAML, whilst in some cases the vocabulary will not have been made available in a machine-readable format at all.
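The two goodness-of-fit remedies discussed above, sub-properties of a broad term and unconstrained counterparts of a constrained property, can be illustrated with a small sketch. It is only an illustration under stated assumptions: triples are modelled as plain Python tuples rather than with an RDF library, every ex: name is hypothetical, and schemaunconstrained: is the imagined namespace from the text, not something Schema.org actually publishes.

```python
# Triples are plain (subject, predicate, object) tuples.
# Every "ex:" name, and the "schemaunconstrained:" prefix, is hypothetical.

# Each specialised property points at the broader property it refines.
SUB_PROPERTY_OF = {
    "ex:composer": "dc:creator",
    "ex:conductor": "dc:creator",
    "ex:musician": "dc:creator",
}

# Constrained properties only: the type a consumer may assume for both ends.
EXPECTED_TYPE = {
    "schema:parent": "schema:Person",
}


def with_super_properties(triples):
    """Return the triples plus one inferred triple per broader property."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        broader = SUB_PROPERTY_OF.get(predicate)
        while broader is not None:
            inferred.add((subject, broader, obj))
            broader = SUB_PROPERTY_OF.get(broader)
    return inferred


def implied_types(triples):
    """Return the (resource, type) pairs suggested by constrained properties."""
    implied = set()
    for subject, predicate, obj in triples:
        expected = EXPECTED_TYPE.get(predicate)
        if expected is not None:
            implied.add((subject, expected))
            implied.add((obj, expected))
    return implied


recording = {("ex:recording1", "ex:conductor", "ex:personA")}
constrained = {(":RedRum", "schema:parent", ":Quorum")}
unconstrained = {(":RedRum", "schemaunconstrained:parent", ":Quorum")}
```

A service that only understands Dublin Core would still find a ‘creator’ in with_super_properties(recording), whilst implied_types(constrained) marks both horses as people and implied_types(unconstrained) implies nothing about them.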
An extensive set of terms matching an ontology’s requirements may only exist in the pages of a book sitting on the shelves of half a dozen libraries around the world, or as a specification separate from any particular encoding: for example, ISAD(G), the General International Standard Archival Description. It may be that the advantage of using one ontology over another is outweighed by the work involved in transforming the ontology.

Currency is particularly important, although what is meant by an ontology being current and up-to-date will depend heavily on the field and the audience. As Arbesman (2012) points out, books, journals and even facts all have half-lives, and these can differ significantly between domain and source; whereas the half-life of the hard sciences in journal articles is relatively short, they have the longest half-life when it comes to books. Although research has not been carried out into the half-life of ontologies and other vocabularies, it seems likely that some ontologies in the arts or the humanities may be expected to provide value for longer than an ontology in a STEM subject, with some thesauri and authority lists having value hundreds or even thousands of years after they were first created, but that is not necessarily the case for all such ontologies. Unless steps have been taken to ensure that an ontology is maintained and updated, an ontology will nonetheless quickly miss whole emerging fields and essential methodologies and fail to recognize paradigm shifts in the way objects are classified. It is important to understand what a user can expect from the publisher of the ontology in terms of maintenance and sustainability. Will the ontology even be online in six months’ time?

Many of the decisions regarding whether or not to make use of an ontology rely on information in the documentation. Unfortunately in many cases the documentation is either limited or non-existent. It may be that the lack of accompanying documentation is sufficient reason in its own right not to reuse an existing ontology. Unless the accompanying documentation is sufficient to establish that a set of terms has been compiled by a robust process, and so build confidence in the ontology, or to ensure that it can be applied consistently, the ontology may not be appropriate to reuse. Whilst some controlled vocabularies will have an established reputation, e.g., MeSH (Medical Subject Headings) or the Getty Thesauri, in other cases it is necessary for the documentation to establish their credibility.

In most cases, ontology element sets and data sets published on the web are designed to be reused. How they can be reused, however, depends heavily on the accompanying licence, and this is an increasingly complex landscape. There is now a wide range of different licences that can be used for publishing data (e.g., Creative Commons, Open Database License), each of which can place different obligations on those making use of the data in terms of attribution and reuse. Where data is being combined from multiple sources, there may be extensive compliance requirements, and whilst the new licences are supposed to simplify the process, they are only as good as the implementation, and sometimes this can be confusing.
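Of Gruber’s criteria discussed earlier in this section, minimal encoding bias is perhaps the easiest to make concrete. The sketch below shows one possible way of avoiding the dd/mm/yyyy restriction mentioned above, by recording a date only to the precision that is actually known. The class name PartialDate and its fields are assumptions introduced for illustration, and the output simply follows the familiar year, year-month and year-month-day patterns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PartialDate:
    """A date recorded only to the precision that is actually known."""

    year: int
    month: Optional[int] = None
    day: Optional[int] = None

    def __post_init__(self):
        # A day without a month would claim more than can be expressed.
        if self.day is not None and self.month is None:
            raise ValueError("a day cannot be given without a month")

    def isoformat(self) -> str:
        """Render as YYYY, YYYY-MM or YYYY-MM-DD, depending on what is known."""
        parts = [f"{self.year:04d}"]
        if self.month is not None:
            parts.append(f"{self.month:02d}")
        if self.day is not None:
            parts.append(f"{self.day:02d}")
        return "-".join(parts)
```

A record known only to the year is stored as PartialDate(1916) rather than being padded out with an invented day and month, so the encoding makes no claim that the underlying knowledge cannot support.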

Conclusion

Many different ontologies have been built over the years, although it is not always simple to find those that are most suitable for the publishing of a particular dataset. There is a need for higher-quality ontology libraries and search engines if we are to move beyond the stage where many of the ontologies that are created fall into disuse before they have the chance to be widely adopted. It is equally important not to fall into the trap of thinking that everything that is available is available online. Many useful vocabularies will be available only offline, existing as a paper copy in a library or book appendix; if they are online they are not necessarily in a usable format, or may exist only behind the scenes driving a content management system. Personal contacts can still be some of the most important sources of information about ontologies.

However, even if all those ontologies and vocabularies that have been created were more accessible, there would undoubtedly still be a need to create new ontologies for situations and perspectives that have not yet been considered. Even if you decide to reuse elements from existing ontologies, there may be a need to create a representation of the application profile. Methods and tools for building ontologies are the subject of the next chapter.

CHAPTER 5

Building ontologies

Introduction

It is important that information professionals are not only users of existing ontologies, but that they build their own. This is not only important because of the large number of areas that still require the development of ontologies, but also because it is only through the development of ontologies that we come to fully appreciate the associated tools and technologies. This chapter starts with a broad methodology for building an ontology and an overview of some of the different tools that are available for building an ontology. The chapter finishes with an ontology development example that focuses on the development of an ontology element set, designed for capturing information about bibliometric metrics and indicators.

Approaches to building an ontology

There have been a number of different approaches to ontology development, although as Noy and McGuinness (2001) state, ‘there is no single correct ontology-design methodology’. The appropriateness of a particular methodology will depend heavily on the purpose of the ontology; the approach taken in the development of an enterprise ontology to aid with the retrieval of internal documents in a multinational corporation will differ significantly from the development of an element set by a postdoctoral researcher within an academic institution wanting to publish a particular set of research data. Tonkin, Pfeiffer and Hewson (2010) distinguish between three types of ontology development methodology: self-reflective methodologies; collaborative approaches; and empirical methodologies. A self-reflective methodology is the simplest: a single person is responsible for the development of an ontology based on the knowledge he or she already possesses. In the case of the postdoctoral researcher mentioned above, a self-reflective methodology may prove sufficient, especially where it is a highly specialized area with a limited tradition of data sharing.

Where there is a more established history of data sharing, or a domain is too large for a single person to comprehend or encompasses differing perspectives, then a collaborative approach may be more appropriate, with a consensus viewpoint being based on multiple individuals’ expert knowledge. Such a collaborative approach could be important in the development of an enterprise ontology in a multinational corporation, where no single individual would be sufficiently knowledgeable on the organization as a whole or the different perspectives involved. Such an ontology may also be developed with an empirical methodology, a data-driven approach making use of the data in documents rather than the knowledge in people’s heads. Data-driven approaches have the advantage of more closely matching the terms that are actually used, rather than those which should be used, and may therefore be advantageous in the development of an ontology that is used to classify documents (Tonkin and Pfeiffer, 2010).

These types of methodology are actually merely different approaches to knowledge acquisition, and may be accommodated in a more general ontology methodology in which knowledge acquisition is one of a dozen steps. The twelve-step methodology detailed below is based on the eleven-step model developed and applied as part of the CENDARI project (CENDARI, 2011), a research infrastructure project designed to integrate archives for researchers of the medieval and World War 1 eras. The methodology is based on four earlier methodologies with clearly defined steps, each of which can be seen to emphasize different aspects of the ontology development process: Uschold and King’s (1995) methodology; Grüninger and Fox’s (1995) TOVE methodology; Fernández-López, Gómez-Pérez and Juristo’s (1997) Methontology; and Noy and McGuinness’s (2001) simple knowledge-engineering methodology.
There are additional descriptions of ontology development available, although some are specific to a particular ontology development environment (e.g., Kozaki et al., 2002) or for a particular scenario, for example, extending sub-ontologies on the cloud (Flahive, Taniar and Rahayu, 2015). The four ontology methodologies discussed, however, are sufficient to form a general framework for ontology building that is more comprehensive than any single ontology building methodology. An overview of the steps described in each of the methodologies is provided in Table 5.1.

Uschold and King’s (1995) methodology for building ontologies was one of the first general methodologies, and consisted of four main stages: identify the purpose of the ontology; build the ontology; evaluate the ontology; and document the ontology. The steps in the other ontology methodologies may be seen to be additional details and perspectives on these stages. The importance of starting by identifying the purpose of an ontology is recognized in each of the methodologies, albeit under different names: competency (Grüninger and Fox, 1995), specification (Fernández-López, Gómez-Pérez and Juristo, 1997), and domain and scope (Noy and McGuinness, 2001). Documentation is only an explicit step in two of the methodologies (i.e., Uschold and King, 1995; Fernández-López, Gómez-Pérez and Juristo, 1997), and evaluation in only three of the methodologies (i.e., Uschold and King, 1995; Fernández-López, Gómez-Pérez and Juristo, 1997; Grüninger and Fox, 1995).

Unsurprisingly, much of the focus within the methodologies is on the building of the ontologies stage. Grüninger and Fox (1995) break it down into two stages: defining the terminology, and specifying definitions and constraints. Uschold and King (1995) distinguish between three stages: ontology capture, ontology coding and integration with existing ontologies. Fernández-López, Gómez-Pérez and Juristo (1997) have four stages that fall within the scope of building the ontology: knowledge acquisition, conceptualization, integration, and implementation. Noy and McGuinness (2001), meanwhile, break it down into six stages.

Table 5.1 Overview of steps in different ontology development methodologies

Uschold and King’s (1995) methodology
1. Identify purpose
2. Building the ontology
   a. Ontology capture
   b. Ontology coding
   c. Integrate with existing ontologies
3. Evaluation
4. Documentation

Grüninger and Fox’s (1995) TOVE methodology
1. The competency of the ontology
2. Define the terminology of the ontology
3. Specify the definitions and constraints of the terminology
4. Test the competency of the ontology by proving completeness theories

Fernández-López, Gómez-Pérez and Juristo’s (1997) METHONTOLOGY
1. Specification
2. Knowledge acquisition
3. Conceptualisation
4. Integration
5. Implementation
6. Evaluation
7. Documentation

Noy and McGuinness’s (2001) simple knowledge-engineering methodology
1. Determine the domain and scope of the ontology
2. Consider reusing existing ontologies
3. Enumerate important terms in the ontology
4. Define the classes and the class hierarchy
5. Define the properties of classes-slots
6. Define the facets of the slots
7. Create instances

Combining the four methodologies can provide a ten-stage methodology, to which two additional stages, identify appropriate software and sustain the ontology, have also been added:

1 Ontology scope
2 Ontology reuse
3 Identify appropriate software
4 Knowledge acquisition
5 Identify important terms
6 Identify additional terms, attributes and relationships
7 Specify definitions
8 Integrate with existing ontologies
9 Implementation
10 Evaluation
11 Documentation
12 Sustainability.

Each of these steps is explored in more detail below. Since Mizoguchi’s (2004) criticism that most ontology-building methodologies are concerned with the top layer of ontology building rather than the more important (at least for novices) finer-level guidelines, there has been a wide range of ontological research into the different steps of ontology building, and the description of the twelve steps below attempts to strike a balance between providing an overview of the ontology-building process and having sufficient detail to be practically useful. It should also be noted that whilst the steps are given in a particular sequence, as Noy and McGuinness (2001) have stated: ‘Ontology development is necessarily an iterative process.’ Uschold and King’s (1995) three stages within ‘building the ontology’ do not have to be carried out in order, but may be reordered or merged into a single iterative stage. At each of the steps there is likely to be an appraisal of earlier steps; in the words of Tonkin, Pfeiffer and Hewson (2010), ontology development should be ‘agile’ and ‘test-driven’.

The twelve steps

1. Ontology scope

It is essential to start with a clear idea of the bounds and the audience of the ontology. No object or idea exists in isolation, but rather is connected to a host of other objects and things by a multitude of different relationships. In traditional librarianship the focus was on access points, the routes into a publication’s bibliographic record (Welsh and Batley, 2012), but an ontology may be far more inclusive, depending on its purpose. The books at the centre of a bibliographic database not only have authors, but also a host of other associated organizations and individuals: publishers, printers, editors, translators, illustrators and proof-readers. The author will have literary influences, familial relationships, mentors. A work will have been written in a particular location, with a particular set of tools and with a particular working pattern. Although such a detailed knowledge base would be likely to provide numerous interesting insights into writing practices, it would undoubtedly be overkill for a local public library which is unlikely to be regularly asked questions about which books were written on a Remington typewriter, by the coast, or by a single mother influenced by Tolstoy. Hedden (2010, 292) identifies five aspects that should be established at the start of a project when developing a taxonomy:

1 What is the purpose of the taxonomy?
2 Who will be using the taxonomy?
3 What content will the taxonomy be covering?
4 What is the scope of the taxonomy?
5 What resources are available for developing the taxonomy?

As was mentioned in Chapter 1, ontologies can have many different purposes, and this will inevitably affect the nature of the ontology that is created. An ontology designed for indexing resources will inevitably differ from one in the same domain designed to support information retrieval or navigation, or act as a knowledge base. Where an ontology is to act as a knowledge base its purpose may be reflected in a set of competency questions, the sort of questions an ontology is supposed to answer (Allemang and Hendler, 2011). Depending on the nature of the questions an ontology may be more or less formal or informal, lightweight or heavyweight.

The same ontology may be used by a diverse set of users in a wide range of situations, and understanding who the users are is an important part of selecting the correct terms as well as an ontology’s level of complexity and constraints. For example, myocardial infarction is undoubtedly an appropriate term for use in MeSH, from where it will be applied by information professionals with experience of the health sciences; however, heart attack is likely to be a more appropriate term for users wishing to navigate a public health website. On the semantic web it may be impossible to have a clear idea of who all the potential users of an ontology will be, but nonetheless some ontologies have been published with steps taken to ensure that they are usable by as wide a range of users as possible. For instance, RDA has published both constrained and unconstrained versions of its vocabulary, to enable the use of terms beyond RDA applications.

If an ontology is being used for information retrieval purposes, it is important to have a clear idea of the range of content that the ontology will be used to access. What type of content is it? Does it include different types of media? Is the content multilingual? Such features are likely to have an impact on the building of an ontology.
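The competency questions mentioned above can be made concrete by writing each one as a small executable query against the knowledge base. The following is a minimal sketch rather than an established tool: the triples, the ex: names and the two questions (borrowed from the public library example earlier in this step) are all hypothetical, and a real project would more likely express such questions in a query language such as SPARQL.

```python
# A toy knowledge base of (subject, predicate, object) triples.
# All "ex:" names are hypothetical and exist only for this illustration.
TRIPLES = {
    ("ex:book1", "ex:author", "ex:authorA"),
    ("ex:book2", "ex:author", "ex:authorB"),
    ("ex:authorA", "ex:influencedBy", "ex:Tolstoy"),
}


def subjects(predicate, obj):
    """Return every subject linked to `obj` by `predicate`."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def books_by_authors_influenced_by(influence):
    """Which books were written by an author influenced by `influence`?"""
    authors = subjects("ex:influencedBy", influence)
    return {s for s, p, o in TRIPLES if p == "ex:author" and o in authors}


# Each competency question is paired with the query that should answer it.
COMPETENCY_QUESTIONS = {
    "Which books were written by an author influenced by Tolstoy?":
        lambda: books_by_authors_influenced_by("ex:Tolstoy"),
    "Which books were written on a Remington typewriter?":
        lambda: subjects("ex:writtenWith", "ex:RemingtonTypewriter"),
}


def unanswered_questions():
    """Return the questions for which the knowledge base yields no answer."""
    return [question for question, ask in COMPETENCY_QUESTIONS.items() if not ask()]
```

Running unanswered_questions() against this toy data reports the typewriter question, which signals either that the ontology needs a property for writing tools or that the question lies outside the agreed scope.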
It is also important to have a clear idea of the scope of the ontology: to know what types of entities should be included and what shouldn’t. For example, in the creation of a domain ontology for researchers of World War 1, there is a seemingly endless set of entities that might be of interest to the researchers. Take people, for instance: the war affected hundreds of millions of people’s lives, and while the ontology element set might be expected to be able to accommodate anyone, it would not be expected to include everyone, so a decision needs to be taken about which people to include: politicians? military personnel? scientists? business owners? Without a clear idea of the scope of the ontology it can very quickly become a case of trying to model the whole world, beyond the resources of the largest research group. Failure to understand what is realistic to build can be a key failure in the development of knowledge-based systems (Shadbolt and Smart, 2015).

Resources are inevitably restricted, particularly within information services, and failure to understand the resources that are available for developing an ontology will lead to an ontology that is either uneven or not extensive enough. The resources that are available for developing an ontology not only include time and person hours, but also existing ontologies and vocabularies. The resources necessary to include a comprehensive list of World War 1 battles differ considerably depending on whether such a list is thought to exist in some format already, or if it is expected to be created from scratch through a detailed analysis of archival records or secondary literature. As additional resources emerge during ontology development, or aspects of ontology development take longer than expected, it may be necessary to reappraise the resources available for ontology development.

For Stewart (2011), coming from the perspective of building enterprise taxonomies, it is important that information about an ontology’s scope is clearly articulated in a governance document. This will both aid decisions about what should and shouldn’t be included within the ontology, and provide a mechanism for determining whether the ontology has achieved what it was meant to achieve.

2. Ontology reuse e principal reasons for ontology reuse were discussed extensively in Chapter 3. In summary, the reuse of existing ontologies can be more cost-effective, and facilitates the interoperability of data. Even where an existing ontology doesn’t meet all of a developer’s ontological requirements, there may be properties that do, and even if no properties are suitable for reuse existing ontologies can nonetheless help with the process of knowledge discovery. As will be elaborated on in step 4, knowledge discovery is the extraction of relevant information from existing resources, and ontologies and controlled vocabularies are resources for knowledge discovery much like any other resource. For example, someone wanting an ontology to facilitate access to the archives of the Royal Society is unlikely to consider DBpedia of a sufficiently high standard to form the basis of such an ontology: the DBpedia ontology element set may not include the type of information the ontology builder wishes to capture; it is applied inconsistently across different entities; and it includes a large amount of information that is not particularly pertinent to an ontology for the Royal Society (e.g., instance records for every episode of Star Trek!). Nonetheless, there will be some information about the Royal Society, and Royal Society members, that may

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 103

BUILDIng OnTOLOgIES

103

nonetheless form the starting point for a more authoritative ontology. Reuse of existing ontologies is not restricted to those within a particular domain, but is also likely to include non-domain-specific vocabularies such as those for structuring ontologies and including provenance information. e use of RDF, RDFS, and OWL for the structuring of ontologies has already been discussed in detail (see Chapter 4), but it may be advantageous to include additional information about the development and application of an ontology’s terms. For example, the Information Artifact Ontology also makes use of 57 annotation properties (www.ontobee. org/ontology/IAO), enabling the description of people involved in the development of an ontology (e.g., ‘term editor’ or ‘contributor’), the status of a term (e.g., ‘retired from use as of ’, ‘has obsolescence reason’), and the provision of examples of use (e.g., ‘example of usage’). Such annotation properties are likely to be of use irrespective of an ontology’s domain. Many of the criteria that may be adopted in deciding whether or not to reuse an existing ontology are the same as those that are used in reflecting on the development of your own ontology and are returned to in step 10, ontology evaluation, below.
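To show why annotation properties of the kind described above are useful whatever the domain, the sketch below attaches a few of them to a term record. It is an assumption-laden illustration: the field names merely echo the labels quoted above (‘term editor’, ‘example of usage’, ‘has obsolescence reason’), and the class TermRecord is hypothetical rather than a rendering of how the Information Artifact Ontology itself is structured.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TermRecord:
    """A term together with domain-independent annotations about it."""

    label: str
    definition: str
    term_editors: List[str] = field(default_factory=list)
    examples_of_usage: List[str] = field(default_factory=list)
    obsolescence_reason: Optional[str] = None  # None while the term is in use

    @property
    def is_obsolete(self) -> bool:
        return self.obsolescence_reason is not None


def active_labels(records):
    """Return the labels of the terms that have not been retired."""
    return [record.label for record in records if not record.is_obsolete]


# Hypothetical records for an imagined bibliographic element set.
records = [
    TermRecord("composer", "Person who wrote the music.", ["editor A"]),
    TermRecord("author (old)", "Superseded term.", obsolescence_reason="merged"),
]
```

Because the annotations describe the terms rather than the domain, the same structure would serve equally well for a vocabulary of medical conditions or of World War 1 battles.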

3. Identify appropriate software for ontology development

None of the four existing methodologies discussed above had an explicit step of identifying appropriate software for ontology development, although the choice of software has ramifications for the ontology that is created: money spent on software may mean less money for ontology development; a development environment that focuses on a natural language and graphical user interface may place restrictions on what an ontology can express; and desktop software may restrict the number of collaborators contributing to an ontology at any one time. There is a wide variety of tools available for developing ontologies and this section does not attempt to provide an extensive survey of existing ontology software, which would inevitably become dated very quickly, but rather explores some of the factors that might be considered when selecting software. When considering software requirements it is important to recognize that no single piece of software will necessarily fulfil all of an ontology developer’s requirements. For example, it may be that one piece of software is necessary for transforming an existing ontology or vocabulary into a format that can be ingested, a second piece for the main development process, and a third for the sharing and receiving of feedback on the ontology.

Free versus subscription ere is a wide range of soware available for ontology building and, as is oen the case, it comes with differing pricing and funding models. e three most popular editors in order of popularity are, according to Warren et al. (2014), Protégé (http://protege.stanford.edu), TopBraid (www.topquadrant.com/tools/ide-topbraidcomposer-maestro-edition/), and NeOn (http://neon-toolkit.org). Whereas Protégé is an open source project developed by Stanford University, and NeOn was funded under the European Commission’s Sixth Framework Programme and is freely available, TopBraid Composer is a commercial product with different editions including different features at different prices. No funding model is necessarily universally advantageous. Although the cost of a TopBraid Composer Maestro Edition licence (US$3450 at the time of writing) may be prohibitively expensive for some organizations, another organization may see it as good value for money due to the particular functionality it provides, or the included technical support and maintenance. A subscription model may also offer a level of assurance about the long-term sustainability of the soware. NeOn, for example, was tied to a specific funded research project, and the last version of the soware dates from 2011.

Popularity and extensibility

It is also important to consider the popularity and extensibility of ontology editing software. Popularity not only contributes to the sustainability of the software, but where the software is open source or extensible it may mean that bugs are fixed quickly, help is available online and additional functionality is available through plugins and extensions. As has been mentioned, Protégé is the most widely used ontology editor (Warren et al., 2014), and a number of useful plugins have been developed for it. These include: ProtégéLOV, to ease the process of reusing existing ontologies in the Linked Open Vocabularies ontology library discussed in Chapter 4; and ProtégéVOWL, which provides a graphical representation of an ontology using the Visual Notation for OWL Ontologies (VOWL 2) discussed in the documentation step below.

It is worth noting, however, that not all plugins are available for every version of Protégé. As with so much of the semantic web, the plugins are often the result of public research projects, and once the project has ended they are not necessarily updated for the most recent versions of Protégé.


Graphical versus text-based user interface

The most distinctive difference between two ontology editors can be the interface. On the one hand, as linked data increases interest in the development of ontologies amongst a wider range of users, there is a need for more user-friendly approaches to ontology development; on the other, there continues to be a need for ontology editing software that can scale for the development of increasingly large ontologies. This has resulted in ontology editors as diverse as OWLGrEd (http://owlgred.lumii.lv), with its focus on a graphical user interface, FluentEditor (www.cognitum.eu/semantics/FluentEditor), with a controlled natural-language interface, and Tawny-OWL (https://github.com/phillord/tawny-owl), a full programming environment for the construction of OWL ontologies.

It is also worth noting that most readers will already have software on their computers suitable for the creation of an ontology in the form of a simple text editor (e.g., Notepad for Windows and TextEdit on Apple computers). The creation of complex integrated ontologies in a text editor is not particularly recommended, any more than it is recommended that a complex website is created in a text editor rather than a more extensive content management system. But the point is, it is possible, and may indeed be tempting occasionally after struggling with some of the idiosyncrasies or limitations of a particular piece of software. Although there are inevitably trade-offs between the different types of software, there is little point in an ontology developer swapping a highly visual editor for a command line editor if they are unable to adapt to the new environment, irrespective of any perceived performance advantages.

Single-user versus collaborative development

If an ontology is to be developed by more than one person, it is important to consider the collaborative functionality of an ontology editor. Whereas some are designed with the single user in mind, others enable multiple users to collaborate on an ontology at the same time (e.g., WebProtégé, http://webprotege.stanford.edu). Although collaborative development may be synonymous with server-side applications, desktop software may also provide such functionality.

The importance of whether ontology editing software enables multiple editors depends heavily on the type of ontology that is being developed. If a small ontology element set is being developed, it may not be excessively burdensome for files to be shared back and forth via e-mail or through a shared virtual drive. If, however, an ontology is to be populated with hundreds or thousands of instances, and developers need to work on the file concurrently, then software which is designed to support collaborative functionality is likely to be necessary if workflows are to be manageable.


Language support

Language support may also need to be a consideration in ontology development. Does an ontology editor need to be able to cope with multiple scripts? Is it designed to ease the creation of a multilingual ontology? And, most importantly, does it support the required ontology language? In most cases the library and information professional will wish to create an ontology in RDFS or OWL, and whilst these are the most widely supported languages by far, it may be worth checking the version of OWL that is supported. Some software and plugins are also available for supporting the creation of SKOS vocabularies.

Integration of existing knowledge

It is also important to consider whether the ontology editor allows for the simple ingestion of existing ontologies and structured data. It may be that an ontology editor allows the ingestion and editing of a single ontology, rather than enabling the ingestion and alignment of multiple ontologies. It is also important to recognize that the data an ontology developer wishes to ingest may not be in the form of a structured ontology, but may, for example, include data captured through a natural language processing tool. How easily does an ontology editor fit into the expected workflow of ontology construction?

Domain-specific ontology editor

Associated with the idea of integrating existing knowledge and language support is the idea of a domain-specific ontology editor. OBO-Edit (http://oboedit.org) is one such editor, specifically designed for writing ontologies in the OBO biological ontology file format. Most domains don’t currently have a sufficiently large community of users to make a domain-specific ontology editor sustainable for long-term development, although this may change in the future as ontologies become more widely adopted.

4. Knowledge acquisition

Knowledge acquisition refers to the process of capturing knowledge to form the basis of the ontology, and can be broadly broken down into two types: knowledge elicitation, which involves getting associated concepts from engaging with people; and knowledge discovery, by extracting the necessary information from existing resources. Knowledge acquisition is often given relatively little attention within ontology design methodologies, although it incorporates a wide variety of different approaches, and multiple approaches may be used in the creation of a single ontology. For example, in the creation of a chemical ontology Fernández-López, Gómez-Pérez and Juristo (1997) included non-structured interviews with experts, informal text analysis, formal text analysis, and structured interviews with experts.

Knowledge elicitation

In their discussion of knowledge elicitation Shadbolt and Smart (2015) identify a number of different methodologies and techniques that may be used, which may be broadly categorized as either natural or contrived. The natural methods include:

• interviews with domain experts
• protocol analysis, which involves capturing a record of experts doing their work and an explanation of why they are doing what they are doing
• critical decision method, in which a domain expert recalls previous situations that required difficult or unusual decisions to be made.

The contrived methods include:

• concept sorting, which may involve the sorting of cards with a set of concepts written on them into different piles
• repertory grids, revealing the concept map of a domain through triadic elicitation, which involves selecting three concepts so that two are similar and both are different from a third
• laddered grids, through which a concept map is created by starting with a seed item and asking the domain expert to move around the domain map with appropriate questions: for example, can you give examples of x? what are the necessary properties of x?
• limited information task, which involves the expert being provided with a problem and having to ask questions to get the information necessary to solve the task
• concept/process mapping.

Additional methods identified by Gavrilova and Andreeva (2012) include questionnaires, round-table discussions and brainstorming. The suitability of a particular methodology will depend heavily on the type of information that is being elicited (e.g., tacit or explicit), as well as the resources that are available. For example, whilst protocol analysis may potentially provide a rich source of information about how domain experts go about their work, far beyond what may be achieved in a simple interview, it is generally far more resource intensive. Here we consider two of the most popular methods in more detail, one natural method and one contrived method: interviews and concept mapping.

Interviews are extremely popular due to their perceived simplicity, but as Gavrilova and Andreeva (2012, 529) point out: ‘best practices in interviewing need years of training and practical fieldwork’. Interviews may be broadly categorized as unstructured, semi-structured, or structured. Unstructured interviews are popular because of the limited amount of preparation or domain knowledge necessary – instead the domain expert is invited to discuss the domain without restrictions about the content that may be discussed. Unstructured interviews can be inefficient, however, producing vast quantities of unwieldy information (Cooke, 1994) that may have dwelt on irrelevant areas whilst not spending long enough on important areas (Shadbolt and Smart, 2015).

Many techniques have been developed, however, for eliciting information within a more or less structured framework. Cooke (1994), in her review of knowledge elicitation techniques, identified a wide range of different techniques from many different disciplines, including case study analyses, forward scenario simulations, critical incident methods, teachback, role play, twenty questions and questionnaires. Questionnaires may be used to solicit responses from a far larger number of domain experts than many of the other interview techniques. They may include open-ended questions about concepts, classes and their attributes (Cooke, 1994), and have been used to elicit knowledge in a range of ontologies: Chandramouli et al. (2008) used a questionnaire to create an ontology of cultural behaviour in adaptive education; Fransson et al. (2015) had 123 experts participate in a survey of definitions for a biobanking vocabulary; and Danial-Saad et al. (2013) built an ontology for assistive technology using the Delphi method with three rounds of questioning.
By their nature interviews are more suited to the capturing of explicit knowledge, the sort that may be readily expressed in words, and this is especially true of unmediated questionnaires. However, modern communication technology (e.g., Skype or Google Hangouts) means that more in-depth interviews with experts in even the most specific of fields are easier to carry out than ever before. Of course, depending on the nature of the ontology, the ontologists themselves may also be a source of knowledge for the ontology development.

Concept mapping was developed in the 1970s as a tool for representing changes in children’s conceptual understanding during the learning process (Novak and Cañas, 2008), and has also been used in the mapping of expert knowledge. Examples of expert knowledge concept mapping include modelling knowledge of weather forecasting (Coffey et al., 2002), of electronics troubleshooting in the US Navy (Coffey et al., 2003), and of the running of the bulls in Pamplona (González García and Zuasti Urbano, 2008). Knowledge related to a particular focus question is mapped hierarchically through the expression of associated concepts which are then also linked together through cross-links. Novak and Cañas (2008) provide a simple methodology for developing concept maps:

1 Establish the focus question. The focus question has an important role in the form the concept map will take, and it should be a question rather than a topic. Derbentseva, Safayeni and Cañas (2007) found that a dynamic question (i.e., a ‘how’ question) encouraged more dynamic relationships to be built between the concepts. A good focus question keeps the concept map on topic, and helps with the selection of relationships between the concepts.
2 Identify key concepts. The domain expert should then identify key concepts associated with the domain or focus question. These concepts are then ranked in order from the most general to the most specific, and placed to one side in a ‘parking lot’, from where they are used to build the preliminary concept map.
3 Build preliminary concept map. The key concepts should be used to build a preliminary concept map, with propositions created to reflect the relationships between the concepts. The construction of appropriate propositions can be the most difficult part of creating a concept map (Novak and Cañas, 2008), especially as the hierarchical assumption of concept maps has been questioned (Derbentseva, Safayeni and Cañas, 2007), providing scope for a far wider range of propositions. It is a ‘preliminary’ concept map, in that it may be revised a number of times, new concepts added, and may never be said to be truly finished.
4 Add cross-links. Cross-links are links between concepts on different parts of the map, and break down the rigidity of the hierarchical structure.
If there are to be a number of revisions of a concept map, or an expert skeleton is provided onto which a number of additional insights are to be added, or a particularly large concept map is expected to be created, then using free concept-mapping software (e.g., Cmap, http://cmap.ihmc.us) is probably more efficient than moving around a large number of Post-it notes. Software can facilitate the collaborative creation of concept maps and the sharing of concepts and propositions.

The concept map methodology is a more human-centric version of the larger ontology-building process, with each of the concept map-building steps having an equivalent (or near-equivalent) ontology-building step. Concept mapping talks of a focus question rather than ontology scope, identifying key concepts rather than important terms, and building a preliminary concept map and adding cross-links rather than identifying additional terms, attributes and relationships. The concept map is, nonetheless, not an ontology – it is neither appropriately encoded nor standardized – but as Starr and de Oliveira (2013) found, conceptual mapping can provide an intermediate knowledge representation language.
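One way to see how a concept map can act as an intermediate representation is to record each proposition as a concept, a linking phrase and a second concept, and then derive candidate triples from them. The following is a minimal sketch under stated assumptions: the `Proposition` class, the helper functions and the sample map are all hypothetical, and the camelCase predicate names are only suggestions for an ontologist to review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    """One concept-map proposition: concept, linking phrase, concept."""
    source: str
    link: str
    target: str
    cross_link: bool = False  # True when the link joins different branches


def concepts(propositions):
    """Return every concept mentioned in the map, sorted for stable output."""
    found = set()
    for p in propositions:
        found.update((p.source, p.target))
    return sorted(found)


def candidate_triples(propositions):
    """Turn propositions into (subject, predicate, object) candidates.

    Linking phrases become camelCase predicate names; these are only
    suggestions for review, not an encoded ontology.
    """
    triples = []
    for p in propositions:
        words = p.link.split()
        predicate = words[0].lower() + "".join(w.capitalize() for w in words[1:])
        triples.append((p.source, predicate, p.target))
    return triples


# Hypothetical map for the focus question "How is an ontology built?"
concept_map = [
    Proposition("ontology", "is built from", "terms"),
    Proposition("ontology", "is limited by", "scope"),
    Proposition("terms", "are gathered through", "knowledge acquisition"),
    Proposition("scope", "guides", "knowledge acquisition", cross_link=True),
]

for triple in candidate_triples(concept_map):
    print(triple)
```

The output is still some distance from an ontology: the concepts have no URIs and the predicates are not drawn from any standard vocabulary, which is the gap the later steps are intended to close.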


Knowledge elicitation is particularly important as, in the words of Hoffman and Lintern (2006, 215): ‘The gold is not in the documents’. However, access to domain experts may be limited, either because of their availability or because of limited resources. Either way, knowledge acquisition is likely to be enhanced through knowledge discovery.

Knowledge discovery

Knowledge discovery refers to the extracting of information from existing resources. Increasingly these resources are digital, although this will not always be the case. The resources that are available will vary considerably according to the nature of the ontology that is being created. An enterprise ontology, designed to facilitate information retrieval across different departments, may make use of internal policies and procedures and other internal documents; whereas an ontology designed to act as a knowledge base for a particular academic field is likely to make use of the literature that has been published in the field, as well as a host of grey literature and other informal publications. As Stewart (2011, 78–9) notes in Building Enterprise Taxonomies: ‘. . . do not exclude any source of content a stakeholder deems critical just because it doesn’t come packaged in a format you were expecting’. Terms may be taken from file structures, search engines and a host of other non-traditional information resources. Of course existing ontologies may also be a source of enlightenment, and the vast quantity of data available on the web is unlikely to be ignored.

Unstructured text may be analysed manually, semi-automatically or automatically. Manual analysis of unstructured content may be as technologically simple as using a pencil to mark terms and relationships on a physical document, whereas automatic analysis will require the implementation of natural language processing (NLP) software in the workflow to extract the concepts and relationships that are of interest, and possibly additional software and technologies to ensure the unstructured text is in an appropriate format to begin with.
Natural language processing uses computers to ‘understand and manipulate natural language text or speech to do useful things’ (Chowdhury, 2003, 51), within which there are a wide range of different research areas and techniques, including speech recognition, sentiment analysis, machine translation, named entity recognition, and relationship extraction. Named entity recognition (NER) refers to the identification and classification of named entities within unstructured content (Marrero et al., 2013), and relationship extraction refers to the identification of predefined relationships between these identified entities; whereas a named entity will be a subset of words in a text, a relationship will not be – rather, it is an association between these subsets of words (Sarawagi, 2007).

The extraction of entities and relationships has obvious application in ontology population, and may be particularly useful where the sources for knowledge acquisition are primarily printed texts in English (the language within which most of the NLP research has been done). However, as van Hooland and Verborgh (2014, 172) point out: ‘Because these services can be applied in a quick and low-cost manner, we should not assume they offer any added value’. Although the problem of NER may be claimed to be solved, with studies often reporting success ratios of over 95%, as Marrero et al. (2013) point out, there is a lack of agreement about what a named entity is, and how we define named entities has obvious implications for the success in identifying those named entities. Also, whereas a success rate of 95%, or even 99%, may sound sufficiently high to replace the human, if the most valuable knowledge is lost in that 5% or 1%, then adopting automatic techniques may prove to be a false economy.

The only way of ascertaining the suitability of NER and relationship extraction with any degree of certainty is to test a tool, or set of tools, against a gold-standard sample of human annotated documents. However, certain factors may help in making a decision on whether NER and relationship extraction is suitable for knowledge discovery: quantity, quality and format of the content. If the quantity of documents that are to be used for knowledge discovery is relatively small, then it may be quicker to do the analysis by hand than go through the rigmarole of installing and testing a variety of technological solutions. Where a large number of documents are available, the quality and format of the text may influence the decision as to whether to implement automatic information extraction methods.
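Testing a tool against a gold-standard sample usually comes down to comparing two sets of annotations. The sketch below is illustrative only: it assumes exact matching of surface form and entity type, uses precision, recall and F1 as the measures (one common choice rather than anything prescribed here), and both annotation sets are invented.

```python
def evaluate(gold, predicted):
    """Compare two sets of (surface form, entity type) annotations.

    Returns precision, recall and F1 for exact matches only.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical human annotations for one sentence.
gold = {
    ("Samuel Johnson", "Person"),
    ("Lichfield", "Place"),
    ("Staffordshire", "Place"),
    ("St. Mary's", "Place"),
}
# Hypothetical tool output: one entity missed, one wrongly typed.
predicted = {
    ("Samuel Johnson", "Person"),
    ("Lichfield", "Place"),
    ("Staffordshire", "Person"),
}

precision, recall, f1 = evaluate(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the result depends on what counts as a named entity and as a match, the same tool can score very differently against different gold standards, which is one reason reported success ratios need to be read with care.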
Informal documents such as e-mail and web pages are not generally suitable for automatic extraction, whereas if documents are of good quality and incorporate controlled vocabularies, for example in technical standards, NLP has been found to be less error-prone than a manually generated ontology and may provide a ‘baseline ontology’ that may then be edited by domain experts (Omoronyia et al., 2010).

There is now a wide range of NER tools available, from simple online services to full-blown frameworks, from free and open source options to paid-for and proprietary services. DBpedia Spotlight (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki) and GATE (https://gate.ac.uk/download) are discussed in more detail below.

DBpedia Spotlight provides a web application (http://dbpedia-spotlight.github.io/demo), a web service and installable code, which enable the annotation of content mentioning DBpedia resources as well as wider NER. cURL (http://curl.haxx.se) is a widely used tool for transferring data across the internet, and the example below shows how it may be used for sending requests to the DBpedia Spotlight web service via the command line:


curl "http://spotlight.dbpedia.org/rest/annotate" \
  --data-urlencode "text=Samuel Johnson was born at Lichfield, in Staffordshire, on the 18th of September, N.S. 1709; and his initiation into the Christian church was not delayed; for his baptism is recorded, in the register of St. Mary's parish in that city, to have been performed on the day of his birth."

The example above is sent to the /annotate web service, which combines spotting and disambiguation: /spot, the spotting web service, takes a text and identifies entities to annotate; /disambiguate, the disambiguation web service, takes spotted text and chooses an identifier for the entity. The service requires either a text parameter including the text to be annotated, or a URL for the text to be annotated. In this case the text is a short passage from Boswell’s Life of Samuel Johnson. The default spotting algorithm identifies 16 concepts and entities: Samuel Johnson, born, Lichfield, Staffordshire, 18th of September, initiation, Christian church, baptism, recorded, register, St. Mary’s, parish, city, performed, day, birth.

DBpedia Spotlight incorporates a number of different algorithms, as well as recognizing entities that have been marked up by other spotting software, and these may be simply used by adding the spotter parameter to the command (e.g., adding --data "spotter=LingPipeSpotter" to the end of the command above). Table 5.2 shows the spotted concepts and entities according to five of the different spotter algorithms integrated into DBpedia Spotlight. They can be seen to differ significantly in terms of both the concepts and entities that are identified, and the number of concepts and entities that are identified.

The default output type is text/xml, although the application/xhtml+xml type comes with embedded RDF in the form of RDFa, which may then be extracted. For each resource spotted and disambiguated, information is returned on the form of the concept/entity in the original text (surfaceForm=), its position in the text (offset=), the associated DBpedia identifier (URI=), the type of resource (types=), and information on why a particular match was made (support=, similarityScore=, and percentageOfSecondRank=). The default text/xml response for the ‘Samuel Johnson’ entity is provided below:

Table 5.2 Different entities and concepts identified with different spotter algorithms

KeyphraseSpotter    LingPipeSpotter    AtLeastOneNounSelector    CoOccurrenceBasedSelector    NESpotter

Samuel Johnson

Samuel Johnson

Samuel Johnson

Samuel Samuel Johnson

Samuel Johnson

Johnson was born Johnson born

born

born at Lichfield Lichfield

Lichfield

Lichfield

Lichfield

Staffordshire

Staffordshire

Staffordshire

Staffordshire

18th of September

18th of September

18th of September

18th of September

N.S.

N.

N.

N.

parish

parish

parish

initiation

initiation

initiation

Christian church

Christian church

Christian church

baptism

baptism

baptism

18th September N.S.

Christian Christian church church delayed baptism

baptism is recorded recorded register of St register

register

register

city

city

city

day

day

day

birth

birth

birth

birth

Mary’s

St. Mary’s

St. Mary’s

St. Mary’s

performed

performed

city

recorded
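The attributes returned for each resource, described before Table 5.2, can be read with any standard XML parser. The sketch below is an illustration rather than a record of a real response: the attribute names follow the description in the text, but the `Annotation`, `Resources` and `Resource` element layout is an assumption about the text/xml output, and every value in the sample is invented.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only: the attribute names follow
# the description in the text, but the layout is assumed and every value
# here is invented.
SAMPLE_RESPONSE = """
<Annotation text="Samuel Johnson was born at Lichfield">
  <Resources>
    <Resource URI="http://dbpedia.org/resource/Samuel_Johnson"
              support="100" types="DBpedia:Person"
              surfaceForm="Samuel Johnson" offset="0"
              similarityScore="0.99" percentageOfSecondRank="0.01"/>
    <Resource URI="http://dbpedia.org/resource/Max_Born"
              support="50" types="DBpedia:Person"
              surfaceForm="born" offset="19"
              similarityScore="0.40" percentageOfSecondRank="0.80"/>
  </Resources>
</Annotation>
"""


def confident_resources(xml_text, min_score):
    """Return (surface form, URI) pairs scoring at least min_score."""
    root = ET.fromstring(xml_text)
    kept = []
    for resource in root.iter("Resource"):
        if float(resource.get("similarityScore")) >= min_score:
            kept.append((resource.get("surfaceForm"), resource.get("URI")))
    return kept


print(confident_resources(SAMPLE_RESPONSE, min_score=0.9))
```

Filtering on the client side in this way is only one option; where the service itself accepts restrictions as request parameters, applying them there avoids transferring matches that will be discarded.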

Restrictions may be made in the parameters to limit matches according to specific types, prominent DBpedia entities, or a high similarity level. In this instance the default parameters match the ‘born’ concept with the German physicist Max Born and the ‘day’ concept with the American activist Dorothy Day, suggesting they require some adjustment. The /candidates web service is similar to /annotate but provides a ranked list of candidates rather than selecting one.

GATE (General Architecture for Text Engineering) provides a far more extensive set of tools than DBpedia Spotlight, designed as it is for a wider range of text analysis than just NER. It consists of: GATE Developer, an application for information extraction; GATE Teamware, an online collaboration environment; and GATE Embedded, a library for building more extensive applications. As the biggest open source language processing project it has had a large number of plugins developed. Included in the distribution are plugins for dealing with content in different formats and languages, for machine learning, and a variety of information extraction algorithms. There are also ontology-specific plugins such as Ontology_Tools, which provides an ontology editor, and an OwlExporter for GATE has also been developed (www.semanticsoftware.info/owlexporter). The more extensive functionality, however, comes with a steeper learning curve, and its suitability will depend heavily on the size of the ontology that is being created and the skills and resources available for NLP.

As with the results from DBpedia Spotlight, it is important to recognize that the resulting set of annotations is not the same as an ontology. For example, GATE exports annotations in two formats, GATE XML and Inline XML. As the name suggests, inline XML integrates annotation code into the text, whereas GATE XML puts nodes in the text and then the elements refer to the nodes. The snippet below shows inline XML annotation of the person and location identified in the first six words of Boswell’s biography, “SAMUEL JOHNSON was born at Lichfield”, using GATE’s ANNIE (A Nearly New Information Extraction system):

SAMUEL JOHNSON was born at Lichfield

However, oen much of the information extracted might be considered superfluous to an ontology, which may only require the important concepts, entities and the relationships between them. ere are a wide range of data models that could be used for annotating a document, and by default ANNIE identifies organizations, people, locations and dates. A new annotation schema can be quite simply added, although a large number of additional steps are necessary if the schema is to be applied automatically. e knowledge produced by an NLP needs to be transformed into an appropriate format. Although this same information could be integrated into triples, for example,

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:14 Page 115

BUILDIng OnTOLOgIES

115

using the Open Annotation Data Model (www.openannotation.org/spec/core), there is nonetheless still a lot of data manipulation necessary and the coining of unique URIs. is may be achieved, either through the creation of an appropriate XSLT (Extensible Stylesheet Language Transformations) document that transforms the NLP into an appropriate format, or, in the case of GATE, through the use of an additional plugin. Other NLP web services include OpenCalais (http://new.opencalais.com) and Alchemy API (www.alchemyapi.com), which may be considered alternatives to DBpedia Spotlight, and NLTK (Natural Language Toolkit) (www.nltk.org), a platform for NLP in the Python programming language that may be an alternative to GATE. Someone encoding the same passage as RDF triples, according to the Europeana Data Model, may encode the following triples: :person/samueljohnson

:person/samueljohnson

:person/samueljohnson

:person/samueljohnson

:place/lichfield

:place/lichfield

rdf:type

foaf:name

rdaGr2:gender

rdaGr2:placeOfBirth

rdf:type

skos:note

edm:Agent

“Samuel Johnson”

“Male”

:place/lichfield

edm:Place

“Is a city”

However, this encoding relies on external knowledge that is not explicit within the text, for example, the fact that Samuel Johnson is male and that Lichfield is a city. NLP will undoubtedly have an increasingly important role in the future development of ontologies and knowledge discovery, and the topic is returned to later. In most instances, however, the information scientist will want to build the structure of an ontology by hand, selecting parts of a wide range of formal and informal documents, and a lot of manual work is still required in the marking up of documents and the cleaning of the identified entities, particularly where a highly specialized language is being used.
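The coining of URIs mentioned earlier becomes concrete once the six triples listed above are written out in full. The sketch below expands the prefixed names into N-Triples using only the standard library. The namespace URIs are those commonly published for each vocabulary and should be checked against the current specifications; the base used for the bare ':' prefix is a placeholder, since the text does not give one.

```python
# Namespace URIs for the prefixes used in the triples; the base for the
# bare ":" prefix is a placeholder, not an address from the text.
PREFIXES = {
    "": "http://example.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "edm": "http://www.europeana.eu/schemas/edm/",
    "rdaGr2": "http://rdvocab.info/ElementsGr2/",
}

TRIPLES = [
    (":person/samueljohnson", "rdf:type", "edm:Agent"),
    (":person/samueljohnson", "foaf:name", '"Samuel Johnson"'),
    (":person/samueljohnson", "rdaGr2:gender", '"Male"'),
    (":person/samueljohnson", "rdaGr2:placeOfBirth", ":place/lichfield"),
    (":place/lichfield", "rdf:type", "edm:Place"),
    (":place/lichfield", "skos:note", '"Is a city"'),
]


def expand(term):
    """Expand a prefixed name to <full URI>; leave quoted literals alone."""
    if term.startswith('"'):
        return term
    prefix, local = term.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    return "\n".join(
        f"{expand(s)} {expand(p)} {expand(o)} ." for s, p, o in triples
    )


print(to_ntriples(TRIPLES))
```

In practice an RDF library would normally handle serialization and validation; the hand-rolled version is used here only to keep the example self-contained.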

5. Identify important terms

The process of knowledge acquisition is likely to elicit a large number of overlapping concepts and entities, of varying degrees of importance, with a wide variety of competing terms used to represent these concepts and entities. Which concepts and entities are important will depend heavily on the scope of the ontology (defined in step 1), and failure to have properly documented the scope of an ontology, or to adhere to the agreed scope, is likely to quickly lead to an uneven or unwieldy ontology.


Again, there are likely to be significant differences between those ontologies developed for the humanities and those for the sciences. Arp, Smith and Spear (2015) suggest that ontologies should only include those things for which there is evidence that instances exist; beyond the sciences, however, such a stipulation would probably be deemed too restrictive.

Ontologies are, by their nature, simplifications of the world, and it is important that an ontology should be no more complex than necessary. But equally, there are concerns about over-simplification and reduction to the lowest common denominator. What is too complex, and what is too simple, necessarily depend on context. On the one hand an element set such as Dublin Core might be considered overly simplistic for a rich bibliographic database; on the other, this same simplicity allows for its widespread adoption across the web. Equally, whereas Getty’s Art & Architecture Thesaurus is a rich thesaurus for describing multiple facets of art and architecture, there may be little value in even applying such a vocabulary to a small private collection, let alone creating such an extensive vocabulary. Ideally an ontology may allow for different levels of granularity.

Once the concepts and entities have been identified, we nonetheless have to identify appropriate terms to represent those concepts and entities. As a broad rule of thumb, the RDA principle for representing resources offers a working principle: ‘take what you see’ (Oliver, 2010). If the particular set of documents forming the basis of knowledge discovery all mention myocardial infarction then use the term myocardial infarction rather than heart attack. The former British Prime Minister should be referred to as Tony Blair rather than Anthony Charles Lynton Blair; it’s the United Kingdom rather than the United Kingdom of Great Britain and Northern Ireland; and water rather than dihydrogen monoxide.
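One rough way of applying ‘take what you see’ to a set of digital documents is simply to count how often each competing term appears. The sketch below is a minimal illustration under stated assumptions: the `preferred_term` function and the three snippets are hypothetical, matching is case-insensitive on whole phrases, and raw frequency is only one signal among several that an ontologist might weigh.

```python
import re
from collections import Counter


def preferred_term(candidates, documents):
    """Pick the candidate term that occurs most often in the documents.

    Matching is case-insensitive on whole phrases. Ties go to the
    candidate listed first. Returns (preferred, alternatives, counts).
    """
    counts = Counter()
    for term in candidates:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        counts[term] = sum(len(pattern.findall(doc)) for doc in documents)
    preferred = max(candidates, key=lambda term: counts[term])
    alternatives = [term for term in candidates if term != preferred]
    return preferred, alternatives, dict(counts)


# Hypothetical snippets standing in for the source documents.
documents = [
    "Patients admitted with myocardial infarction were monitored.",
    "Risk of myocardial infarction rises with age.",
    "A heart attack is known clinically as a myocardial infarction.",
]

preferred, alternatives, counts = preferred_term(
    ["heart attack", "myocardial infarction"], documents
)
print(preferred, alternatives, counts)
```

The terms that are not selected need not be discarded; they are natural candidates for alternative labels attached to the same concept.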
Unlike the RDA environment, where the terms are generally being transcribed from a physical item, in ontology design there are often multiple competing terms, each of which may have a claim for inclusion. In such situations it is useful to consult existing controlled vocabularies. For example, you may not wish to incorporate all of the controlled vocabulary of the Rare Books and Manuscripts Section of the Association of College and Research Libraries, but the fact that they prefer Account books to Ledgers (http://rbms.info/vocabularies/genre/tr24.htm) may sway the selection of Account books as the preferred term. In addition to professionally created controlled vocabularies for specific topics, Wikipedia should also not be overlooked in the identification of useful terms, whilst its policy on the selection of appropriate article titles also provides useful guidance for the selection of appropriate terms in an ontology that is designed for public consumption. The selection of an appropriate Wikipedia article title is based around five characteristics (Wikipedia, 2015): recognizability, the title people are familiar with; naturalness, the title people are likely to search for; precision, the title should be unambiguous; conciseness, the title should be no longer than necessary; and consistency, similar titles should follow similar patterns.

Term selection includes identifying how to represent entities and concepts. Consistency in naming patterns has been the focus of a lot of attention within the library community, especially in analogue systems, where a lack of consistency meant that works would either not be found or at least take longer to find. Guidelines specifically designed for the formatting of terms in controlled vocabularies include:

• ANSI/NISO Z39.19-2005, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, www.niso.org/kst/reports/standards
• ISO 25964-1:2011, Information and documentation – Thesauri and interoperability with other vocabularies – Part 1: Thesauri for information retrieval, available to buy at www.iso.org/iso/catalogue_detail.htm?csnumber=53657.

As has already been discussed in Chapter 2, in the creation of ontologies for the semantic web, the concepts and entities are not represented by labels alone, but also by URIs. The information professional needs to decide whether such URIs should be descriptive or opaque, and there are arguments on both sides.
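The choice between descriptive and opaque URIs can be made concrete with a small sketch. The Python below is illustrative only: the namespace http://example.org/vocab/ and both helper functions are hypothetical, and the numbering scheme for opaque identifiers is just one possible convention. A descriptive URI is easier for a person to read, whereas an opaque URI does not need to change if the preferred label is later revised.

```python
import re

BASE = "http://example.org/vocab/"  # hypothetical namespace, for illustration only


def descriptive_uri(label: str) -> str:
    """Build a human-readable URI from a preferred label."""
    # Lower-case the label and replace runs of other characters with hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    return BASE + slug


def opaque_uri(number: int) -> str:
    """Build a URI that carries no meaning, so it survives changes to the label."""
    return f"{BASE}c{number:06d}"


print(descriptive_uri("Account books"))  # http://example.org/vocab/account-books
print(opaque_uri(42))                    # http://example.org/vocab/c000042
```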

6. Identify additional terms, attributes and relationships

An ontology is not merely a set of terms in a controlled vocabulary: these terms (or rather the concepts and entities they represent) must be organized according to a useful data model with rich relationships between the concepts and entities and associated attributes. For example, an archive associated with a famous historical figure may wish to build a rich ontology associated with that figure. Knowledge acquisition will have identified a large set of concepts and entities, including concepts, agents, places and events, and these will have been whittled down to the most important. As the concepts and entities are entered into the ontology editor the gaps in the knowledge base begin to appear. Although the entities have names and labels there may be a distinct lack of other attributes: places may have no geo co-ordinates or demographic information; people may lack occupations, interests or dates of birth; and events may lack information on when they took place or details on what was involved. Most important to an ontology is the rich set of relationships between the concepts and entities; gaps in these relationships also start to emerge, and more knowledge may need to be acquired. The fact that ontology building is an iterative process cannot be repeated enough!
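Spotting such gaps need not wait for an ontology editor. As a minimal sketch, assuming the entities have been exported as simple records, the Python below lists the attributes that are still missing for each entity. The records, the identifiers and the table of expected attributes are all hypothetical; a real project would draw up its own list of attributes for each class.

```python
# Hypothetical records, as they might look part-way through data entry.
entities = {
    "ex:London": {"type": "Place", "label": "London"},
    "ex:JaneDoe": {"type": "Person", "label": "Jane Doe", "occupation": "Archivist"},
    "ex:Opening": {"type": "Event", "label": "Opening of the archive", "date": "1923-05-01"},
}

# Attributes each type is expected to carry (an assumption for illustration).
expected = {
    "Place": ["label", "latitude", "longitude"],
    "Person": ["label", "occupation", "birth_date"],
    "Event": ["label", "date"],
}


def find_gaps(entities, expected):
    """Return {entity_id: [missing attribute names]} for incomplete entities."""
    gaps = {}
    for entity_id, record in entities.items():
        wanted = expected.get(record["type"], [])
        missing = [name for name in wanted if name not in record]
        if missing:
            gaps[entity_id] = missing
    return gaps


for entity_id, missing in find_gaps(entities, expected).items():
    print(entity_id, "is missing:", ", ".join(missing))
```

The same approach could be extended to relationships, for example by listing people who are not yet linked to any place or event.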

7. Specify definitions

Controlled vocabularies are designed to reduce ambiguity and to be applied consistently, and so definitions are an important part of adding value to an ontology; ‘a natural language description of the concepts is essential to ensure their interpretation by humans’ (Rocha da Silva et al., 2014, 600). Arp, Smith and Spear (2015) suggest that definitions are possibly the most important part of ontologies, as they help to ensure consistent use.

The need for disambiguation in term names for concepts and entities has already been mentioned, but whilst entities may naturally have numerous attributes that help with disambiguation this is less likely to be the case with concepts. For example, an ontology that includes a number of people with the same name may use dates of birth and death to distinguish between them (e.g., ‘John Smith (1938–1994)’, ‘John Smith (1923–2007)’), but even if the user of the ontology doesn’t know the dates associated with the person they are interested in, a rich ontology will have other information that they can make use of to disambiguate (e.g., occupation, affiliation, place of residence). In comparison, concepts are likely to have few associated attributes (in an ontological sense), which may cause people difficulty in distinguishing between them and applying them consistently. Take, for example, the concepts of the seven deadly sins (wrath, greed, sloth, pride, lust, envy and gluttony): in many people’s minds there is some overlap between the terms, especially greed and gluttony. For such situations SKOS, the ontology element set used for the structuring of many of the existing knowledge organization systems (discussed in more detail in Chapter 2), includes seven elements specifically for documentary purposes. These are skos:note, a general documentary property, with skos:scopeNote, skos:definition, skos:example, skos:historyNote, skos:editorialNote and skos:changeNote as sub-properties.
Scope notes are typically used for one of three purposes: to establish the boundaries of a term, disambiguate between terms or advise on term usage (Hedden, 2010). skos:definition provides space for a more complete definition of a concept’s meaning. skos:example allows for the inclusion of examples. skos:historyNote enables notes to be kept on significant changes in the meaning or form of concepts, whereas skos:changeNote allows for the inclusion of less significant changes. Finally skos:editorialNote is designed for internal housekeeping purposes, for example, noting concepts that need to be revised.

For ontology element sets additional information needs to be made explicit so that the ontology can be applied consistently. As well as defining the properties and classes themselves, most often using rdfs:label and rdfs:comment, it may also be necessary to be explicit about the domain, range and cardinality of different properties, as well as the relationships between properties and the rules of inference (typically with RDFS and OWL 2 elements). The necessary formality of such definitions will depend on the nature of an ontology; for example, the circularity of the FOAF definition of the Document class (‘The Document class represents those things which are, broadly conceived, “documents”’) has been criticized (Arp, Smith and Spear, 2015), but such a definition may be considered highly appropriate for the informality of the web and an ontology that is designed to be as inclusive as possible. Equally, Horridge (2011) advises against the general application of domains and ranges for properties, and there are an increasing number of examples of ontologies that don’t include domain and range restrictions to encourage their reuse, but they may nonetheless be useful where the primary purpose is not for an ontology to become widely adopted.
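The SKOS documentation properties discussed in this section can be illustrated with a short sketch that writes a concept out in Turtle. It is a sketch under stated assumptions: the ex: prefix, the helper function and the wording of the definition and scope note are invented for illustration, and the note texts are assumed to contain no double quotation marks. Only the skos: property names are taken from SKOS itself.

```python
# The seven SKOS documentation properties named in the text.
DOCUMENTATION_PROPERTIES = {
    "note", "scopeNote", "definition", "example",
    "historyNote", "editorialNote", "changeNote",
}

# Prefix declarations needed before the fragment below is valid Turtle.
# The ex: namespace is hypothetical.
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix ex: <http://example.org/sins/> .\n"
)


def concept_to_turtle(local_name, pref_label, notes):
    """Serialize one concept; `notes` is a list of (property, text) pairs."""
    lines = [
        f"ex:{local_name} a skos:Concept ;",
        f'    skos:prefLabel "{pref_label}"@en',
    ]
    for prop, text in notes:
        if prop not in DOCUMENTATION_PROPERTIES:
            raise ValueError(f"not a SKOS documentation property: {prop}")
        lines[-1] += " ;"  # the previous statement continues
        lines.append(f'    skos:{prop} "{text}"@en')
    lines[-1] += " ."  # close the description of this concept
    return "\n".join(lines)


# Illustrative wording only, showing how a scope note can separate two concepts.
gluttony = concept_to_turtle("gluttony", "gluttony", [
    ("definition", "Excessive consumption of food or drink."),
    ("scopeNote", "For an excessive desire for wealth or possessions use greed."),
])
print(PREFIXES + "\n" + gluttony)
```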

8. Integration with existing ontologies

The idea behind linked open data is that there is additional value to be had from integrating data sets and ontologies into a wider web of data. Expressing such links between any two related classes, properties or instances is relatively simple, with a number of widely adopted properties:

• owl:equivalentClass: two classes in different element sets are semantically equivalent
• rdfs:subClassOf: one class is a subclass of another
• owl:equivalentProperty: two properties in different element sets are semantically equivalent
• rdfs:subPropertyOf: one property is the subproperty of another
• owl:sameAs: two different URIs represent the same concept or entity
• skos:broader/skos:narrower: one concept is broader or narrower than another concept
• dct:hasPart/dct:isPartOf: one entity is part of another entity.
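Each of the linking properties listed above results in a single triple. The following sketch, which uses only the Python standard library, expands prefixed names and writes such links as N-Triples statements. The owl, rdfs, skos and dct namespaces are the published ones; the ex and other namespaces, and every class and concept named in them, are hypothetical.

```python
NAMESPACES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dct": "http://purl.org/dc/terms/",
    "ex": "http://example.org/onto/",       # hypothetical local ontology
    "other": "http://example.com/schema/",  # hypothetical external ontology
}


def expand(prefixed_name):
    """Turn a prefixed name such as 'owl:sameAs' into a full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return NAMESPACES[prefix] + local


def link(subject, predicate, obj):
    """Return one N-Triples statement linking two resources."""
    return f"<{expand(subject)}> <{expand(predicate)}> <{expand(obj)}> ."


statements = [
    link("ex:Publication", "owl:equivalentClass", "other:Document"),
    link("ex:JournalArticle", "rdfs:subClassOf", "other:Document"),
    link("ex:account-books", "skos:broader", "other:business-records"),
]
print("\n".join(statements))
```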

However, when a data set consists of thousands or even millions of concepts, entities and instances, such relationships cannot be simply expressed by hand (unless an infrastructure is available for a crowdsourced approach). There are, however, a number of different tools and algorithms available for ontology alignment, and since 2004 the Ontology Alignment Evaluation Initiative (OAEI) (http://oaei.ontologymatching.org) has provided a forum for testing different approaches. The suitability of any particular tool or algorithm will depend heavily on the data that is being aligned and should be tested with sample data, but as with so much of the semantic web, there is often a trade-off between functionality and usability, and alignment software is often associated with short-term projects.

In their analysis of ontology matching systems, including those examples that had participated repeatedly in OAEI, Shvaiko and Euzenat (2013) distinguish between four types of data that the systems can make use of: strings (terminological), structure (structural), instances (extensional), and models (semantics). Terminological approaches make use of the descriptions of the ontology element set, and structural approaches make use of the relationship of these elements with other elements. Extensional approaches use the contents of the instances adhering to the element set, and systems that make use of semantic data in the alignment of ontologies incorporate logical reasoning and inferencing. The importance of the different types of data will depend on the ontologies being matched; if the two ontologies that are being matched primarily consist of concepts with labels and little hierarchical structure, then the similarity of the instance data is likely to be of greater interest than string, structure, or model data. Amongst the seven systems considered by Shvaiko and Euzenat (2013), terminological and structural approaches were universally available, and these are considered in more detail below.

There are a number of techniques for terminological alignment. Sun, Ma and Wang (2015) distinguish between language-based similarity measures and string-based similarity measures, with string-based similarity measures being further sub-divided into those that use character-based matching and those that are token-based. Character-based similarity measures simply compare the characters and the order in which they appear. Notable examples include the Levenshtein distance (also known as the edit distance), which is based on the number of edits required to transform one string into the other, and the I-SUB technique, which considers the commonalities as well as the differences between two strings and was specifically designed for string matching in ontologies (Stoilos, Stamou and Kollias, 2005).
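The Levenshtein distance is simple enough to sketch in a few lines. The Python below counts the insertions, deletions and substitutions needed to turn one label into another. The second function, which rescales the distance to a similarity between 0 and 1 by dividing by the length of the longer label and ignoring case, is a common convention assumed here for illustration rather than something prescribed by the sources cited above.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn `a` into `b`."""
    # previous[j] holds the distance between the first i-1 characters of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from `a`
                current[j - 1] + 1,      # insert a character into `a`
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in the range 0 to 1 (an assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a.lower(), b.lower()) / longest


print(levenshtein("Account books", "Account book"))  # 1
print(round(similarity("Ledger", "ledgers"), 2))     # 0.86
```

A character-based measure of this kind will score Account books and Account book as close, but it says nothing about Account books and Ledgers, which share a meaning rather than a spelling.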
Token-based similarity measures first split the texts into meaningful tokens (typically words) and then make a comparison between the two sets of tokens. Notable examples include the Jaccard similarity co-efficient, which is calculated by dividing the number of tokens shared by the two sets by the total number of distinct tokens across the two sets, and TF-IDF (term frequency-inverse document frequency), which is designed to weight terms according to how often they appear in the ontology.

Language-based similarity measures do not reduce texts to strings of characters, or sets of tokens, but rather provide a similarity based on the order and meaning of the words. Although all language-based similarity measures require some linguistic knowledge, Euzenat and Shvaiko (2013) distinguish between language-based similarity measures that are algorithm only and those that require external sources. The intrinsic, algorithm approach to language-based similarity revolves around the normalization of words through techniques such as stemming and lemmatization so that different forms of terms can be grouped together. Language-based similarity measures that require external sources typically make use of dictionaries and thesauri. The most widely used external resource is undoubtedly WordNet (http://wordnet.princeton.edu), although other more innovative external resources have also been used in ontology matching, for example, Bing and FAROO search data have both been used in the WeSeE-Match tool (Paulheim and Hertling, 2013).

Structural methods, as the name suggests, take into consideration the structure of entities rather than the associated labels. Euzenat and Shvaiko (2013) distinguish between internal structural methods, which only consider the relationships internal to a particular entity, and external structural methods, which consider the relationships between one entity and other entities. Internal structural methods are primarily of use as a preprocessing step (Euzenat and Shvaiko, 2013); numerous entities will have similar structures and many properties will have the same domain and range, but as a preliminary step it can facilitate more focused terminological processing.

There are a wide range of ontology matching and alignment software packages available, although many of them are part of increasingly complex matching systems with additional dependent software and services, and as such require varying levels of expertise. For example, some require the installation of a MySQL database, some take the form of Eclipse plugins, and others require signing up for access to APIs for external sources. Extensive lists of ontology matching software exist both online (e.g., www.mkbergman.com/1769/50-ontology-mapping-and-alignment-tools) and in the literature (e.g., Euzenat and Shvaiko, 2013), and new software is regularly being exhibited at OAEI (http://oaei.ontologymatching.org). Selection of appropriate software will require consideration of the software functionality and performance, as well as the technical competence of the ontology matcher. Examples of simple alignment software that may be simply installed on a desktop without additional components (with the exception of Java) include OnAGUI and AgreementMakerLight.
OnAGUI (Ontology Alignment Graphical User Interface) (http://sourceforge.net/projects/onagui) is included here for its simplicity. It is limited to aligning ontologies, or more precisely aligning ontology instances, through a graphical user interface with terminological methods. Two ontologies, adhering to an OWL or SKOS format, are uploaded and the similarity between the terms is calculated according to either I-SUB or Levenshtein distance algorithms, or whether they are an exact match. The suggested alignments are shown, and may be edited, and the results can be exported as RDF, CSV or SKOS.

AgreementMakerLight (http://somer.fc.ul.pt/aml.php) is a far more sophisticated application with greater functionality than OnAGUI, and has been particularly successful in recent alignment evaluations at OAEI. Based on the earlier AgreementMaker (https://github.com/agreementmaker/agreementmaker), AgreementMakerLight has been optimized for aligning very large ontologies without user interaction (Faria et al., 2013). It is available as both an Eclipse project and an executable file that can run with a GUI or via the command line. It combines both terminological alignment and structural alignment, incorporating a variety of string matching algorithms and external knowledge sources and even allows for the translation of terms with Microsoft’s Bing Translator after an access token has been acquired. Suggested alignments may be viewed as a graph, edited and saved as RDF.

It may also be worth considering, especially for the less technically minded, other data manipulation software not specifically designed for ontology alignment. For example, van Hooland and Verborgh (2014) use OpenRefine (http://openrefine.org), the data transformation and cleaning software, to reconcile data with approximate string matching.

When aligning ontologies it may be important that information about the alignment is also represented. What were the methods used in the alignment? What were the confidence levels used? Unsurprisingly, an ontology has been created, the Mediation Bridge Ontology, for representing such information (Khan et al., 2015).
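To make the token-based approach described earlier in this section concrete, the sketch below computes the Jaccard similarity co-efficient for pairs of labels and proposes an alignment wherever the best score reaches a threshold, recording the method and the confidence alongside each suggestion. The labels, the threshold of 0.5 and the dictionary keys are all assumptions for illustration; in particular, the dictionaries are plain Python records and do not follow the vocabulary of the Mediation Bridge Ontology or the behaviour of any of the tools named above.

```python
def tokens(label):
    """Split a label into a set of lower-cased word tokens."""
    return set(label.lower().split())


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens across both labels."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def align(source_labels, target_labels, threshold=0.5):
    """Suggest the best-scoring target for each source label, if good enough."""
    suggestions = []
    if not target_labels:
        return suggestions
    for source in source_labels:
        best = max(target_labels, key=lambda target: jaccard(source, target))
        score = jaccard(source, best)
        if score >= threshold:
            suggestions.append({
                "source": source,
                "target": best,
                "method": "jaccard-token",  # how the match was made
                "confidence": round(score, 2),
            })
    return suggestions


# Hypothetical labels from two vocabularies.
for suggestion in align(["Account books", "Maps"], ["Books of account", "Atlases"]):
    print(suggestion)
```

Suggestions produced in this way are candidates for a person to review rather than final alignments; a pair such as Maps and Atlases, which share no tokens, would need a language-based measure or human judgement.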

9. Implementation

As has already been mentioned many times, ontology development is an iterative process, and implementation is about drawing the many preceding threads together into a coherent whole suitable for meeting the original objectives of the ontology. By the implementation stage a first-version element set will have been defined, knowledge acquired and alignments with other ontologies identified, but this information may still be scattered in multiple formats in multiple folders on multiple computers. The disparate parts may still need to be integrated and decisions made about the final representation of URIs, but finally the ontologies will be ready to be implemented into the system they were designed for.

What this means in practice will depend heavily on the nature of the ontology. Implementation in the context of an ontology element set designed for general use on the semantic web will differ considerably from implementation of a proprietary enterprise ontology of concepts for document retrieval in a commercial organization. For an ontology element set for general use on the semantic web, implementation may consist of simply publishing the ontology and applying it to a small personal data set. Implementation of an enterprise ontology may involve the passing of the ontology to developers who will integrate it into enterprise systems for testing.

On the one hand implementation should be the simplest stage; after all, it is primarily simply fitting together the pieces of a jigsaw. On the other hand, by their very nature ontologies are not neat and tidy with a distinct cut-off point; there is always more that can be done, and knowing whether an ontology meets its original objectives can only be ascertained through extensive evaluation.

10. Evaluation

Ontology evaluation has an important role in the development of ontologies; the same domain may be modelled in different ways in different ontologies, and being able to determine which ontology best fits the particular criteria is an important part of the adoption of ontologies. Evaluation is an ongoing process throughout ontology development, but evaluation of the final ontology can only really commence once the disparate parts have been brought together and implemented.

There are a number of parts of an ontology that may be evaluated, and different criteria by which they may be judged. Vrandečić (2009) identifies six aspects of an ontology that may be evaluated: vocabulary; syntax; structure; semantics; representation; and context. Vocabulary refers to the labels and URIs that are used to represent concepts and entities. Syntax refers to the way the ontologies are serialized, for example, whether in RDF/XML or Manchester Syntax. Structure refers to the connections between the different parts of an ontology: for example, whether the ontology consists of one or more graphs, whether it forms a hierarchy, and how deep it is. Semantics refers to the underlying models that the structure is describing. Representation relates to how well the semantics are represented by the structure. Finally, context considers an ontology from the perspective of its particular situation: for example, the data it represents, alternative representations, and requirements of the ontology. Brank, Grobelnik and Mladenić (2005) similarly identify six aspects (or ‘levels’ in their terminology) that may be evaluated: lexical, vocabulary, or data layer; hierarchy or taxonomy; other semantic relations; context or application level; syntactic level; structure, architecture, design. There are both different approaches to the evaluation of these aspects, and different criteria that may be used in the evaluation.
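Some of these aspects lend themselves to simple measurement. As a minimal sketch of a structural check, the Python below takes a hierarchy recorded as links from each concept to its broader concept and reports the number of concepts, the top-level concepts and the maximum depth. The hierarchy is hypothetical, and these three figures are chosen for illustration; they are not a set of measures proposed by the authors cited above.

```python
# Each concept points to its broader concept; None marks a top-level concept.
# The hierarchy is hypothetical.
broader = {
    "Records": None,
    "Financial records": "Records",
    "Invoices": "Financial records",
    "Correspondence": "Records",
    "Maps": None,
}


def depth(concept, broader):
    """Number of steps from a concept up to its top-level concept."""
    steps = 0
    seen = {concept}
    while broader.get(concept) is not None:
        concept = broader[concept]
        if concept in seen:
            raise ValueError("cycle detected in hierarchy")
        seen.add(concept)
        steps += 1
    return steps


def structure_summary(broader):
    """Summarize the size, roots and depth of the hierarchy."""
    roots = sorted(c for c, parent in broader.items() if parent is None)
    return {
        "concepts": len(broader),
        "roots": roots,  # more than one root means more than one tree
        "max_depth": max(depth(c, broader) for c in broader),
    }


print(structure_summary(broader))
```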
Vrandečić (2009) identifies eight criteria by which an ontology may be evaluated: accuracy is how well an ontology reflects real world expertise; adaptability is whether an ontology is extensible and suitable for all its potential uses; clarity is whether an ontology is comprehensible to users; completeness is whether the domain is covered sufficiently and at a high enough level of granularity; computational efficiency is whether the ontology can be quickly and easily subjected to reasoning services; conciseness refers to whether an ontology excludes irrelevant axioms and instances; consistency/coherence is about whether an ontology is both understandable and logically consistent; organizational fitness refers to the suitability within the particular organization and systems in which it is to be deployed. The applicability of any particular criterion will depend on the purpose of the ontology, with certain criteria potentially being mutually exclusive. For example, completeness may hinder conciseness and computational efficiency, whilst organizational fitness may preclude adaptability.

The criteria by which an ontology is to be evaluated will inevitably influence the approach that is to be taken. Brank, Grobelnik and Mladenić (2005) consider four broad approaches to ontology evaluation: comparisons of an ontology to a gold standard; applying the ontology and analysing the results; comparing the ontology to another source of data; and human assessment as to how well an ontology meets a set of predefined criteria. For the information professional, focused on bridging the divide between users and the information resources they require, human assessment is likely to have a particularly important role to play. How well does it meet users’ needs? Have potential users adopted it? Is it sufficiently rich to facilitate the retrieval of associated documents?
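The first of these approaches, comparison with a gold standard, can be sketched at the level of term labels. The Python below reports precision (how many of the ontology's terms appear in the gold standard) and recall (how many of the gold standard's terms appear in the ontology). Precision and recall over lower-cased labels are one simple choice assumed here for illustration, and both term lists are hypothetical.

```python
def compare_to_gold(ontology_terms, gold_terms):
    """Compare two lists of labels, ignoring case."""
    found = {term.lower() for term in ontology_terms}
    gold = {term.lower() for term in gold_terms}
    shared = found & gold
    return {
        "precision": len(shared) / len(found) if found else 0.0,
        "recall": len(shared) / len(gold) if gold else 0.0,
        "missing": sorted(gold - found),  # in the gold standard only
        "extra": sorted(found - gold),    # in the ontology only
    }


# Hypothetical lists of metric names.
ontology_terms = ["h-index", "Impact factor", "g-index", "Altmetric score"]
gold_terms = ["h-index", "impact factor", "g-index", "citation count", "Eigenfactor"]

print(compare_to_gold(ontology_terms, gold_terms))
```

Such figures only address the lexical layer; they say nothing about whether the hierarchy or the other relationships match, and they are no substitute for the human assessment discussed above.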

11. Documentation

Documentation is an essential part of the use and reuse of an ontology. It is not merely about describing the end product, but also describing the policies that have been implemented along the way, and the policies that are necessary for its future development. It is about publishing the ontologies in a user-friendly and accessible format, so that they can be widely and consistently adopted rather than having ontologies for the same domain being reinvented time and again. This section considers two types of ontology documentation: documentation embedded within the ontology; and the publishing of accompanying documentation.

Ontologies are designed for encoding knowledge, and knowledge that some may consider documentation may also be found in the ontology itself. Term definitions and scope notes of how terms should be applied are widely incorporated within ontologies, and such documentation is a key part of evaluating the clarity of an ontology. The ontology itself is also the core of the documentation, and decisions about how it is published have implications for how understandable the ontology is. The simplest way to publish an ontology is to make a text file containing the ontology, in a widely adopted serialization such as RDF/XML, available on a website. This, however, is easiest for the ontology publisher rather than a prospective adopter of an ontology, who may become quickly overwhelmed by the coding of even the smallest ontology. If the ontology is designed for widespread adoption (for example, as part of the semantic web) or extension then it is important that user-friendly and innovative approaches are taken.
Although the publishing of many ontologies continues to revolve around the publishing of a static file, and plain HTML files generated from those serializations, there are more interactive examples: N8EO (N8 Equipment Ontology), an ontology describing equipment shared by the N8 consortium of universities, has a facet browser that enables browsing by equipment, function, technique or location (http://n8eo-browser.appspot.com); whilst the European Bioinformatics Institute offers an Ontology Lookup Service (www.ebi.ac.uk/ontology-lookup) that provides a single query interface to multiple biomedical ontologies and allows users to browse through the hierarchies of terms in ontologies and visually display the hierarchy associated with particular terms. The provision of a visual interactive element in the publishing of an ontology is now possible for anyone with tools such as the open source WebVOWL (http://vowl.visualdataweb.org/webvowl), providing web visualizations of ontologies using VOWL (Visual Notation for OWL Ontologies) – see Figure 5.1 – although visualizations that try to include everything can quickly become confusing with large ontologies.

Figure 5.1 WebVOWL visualization of FOAF

LODE (Live OWL Documentation Environment) (www.essepuntato.it/lode) also promises a simple way to generate simple documentation from an OWL document, extracting classes, properties and annotations from the OWL document and providing the results in a human-readable HTML format.

Not all documentation will necessarily be encoded in the ontology itself, nor should it be. The need for clarity must be balanced with the need for conciseness, and some knowledge is more suited to the narrative form of accompanying documentation than triples: for example, primer documents providing an introduction to an ontology, and mapping guidelines providing details on how an existing ontology may be mapped to a new ontology element set. It is important to recognize that there is no simple one-size-fits-all solution to documentation, but appropriate documentation is essential if an ontology is to be used widely and as intended.
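The general idea of producing human-readable pages from the labels and comments held in an ontology can be sketched with the Python standard library. The example below is not how LODE works internally; it simply assumes that the classes and properties have already been extracted into a list of records, and the records themselves (names, labels and comments) are hypothetical.

```python
from html import escape


def render_docs(title, terms):
    """Render a list of term records as a small HTML fragment."""
    parts = [f"<h1>{escape(title)}</h1>"]
    for kind, heading in (("Class", "Classes"), ("Property", "Properties")):
        selected = sorted(
            (term for term in terms if term["kind"] == kind),
            key=lambda term: term["name"],
        )
        if not selected:
            continue
        parts.append(f"<h2>{heading}</h2>")
        parts.append("<dl>")
        for term in selected:
            name = escape(term["name"])
            label = escape(term["label"])
            comment = escape(term["comment"])  # escape &, < and > in free text
            parts.append(f'  <dt id="{name}">{label}</dt>')
            parts.append(f"  <dd>{comment}</dd>")
        parts.append("</dl>")
    return "\n".join(parts)


# Hypothetical records, as might be extracted from rdfs:label and rdfs:comment.
terms = [
    {"kind": "Class", "name": "Publication", "label": "Publication",
     "comment": "A formal or informal publication."},
    {"kind": "Class", "name": "Metric", "label": "Metric",
     "comment": "A bibliometric metric or indicator."},
    {"kind": "Property", "name": "cites", "label": "cites",
     "comment": "Relates one publication to another (domain & range: Publication)."},
]

print(render_docs("Example element set", terms))
```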

12. Maintenance and sustainability

We live in a time of rapidly expanding knowledge, and it is important that an ontology is kept up to date: new terms are introduced, old terms are updated or deleted, and the changes that have been made are documented. One of the criticisms of ontologies is that they are hard to maintain and ‘will always be one step behind the current state of domain knowledge’ (Brewster and O’Hara, 2007, 565), and it is therefore important to consider sustainability of an ontology from the start. Chowdhury (2014) identifies three main dimensions of sustainability: economic sustainability, social sustainability and environmental sustainability. These provide a useful framework for considering the long-term sustainability of an ontology.

The information profession seems to be in perpetual financial straits, and economic sustainability must be a high priority in the provision of any new service. The creation of an ontology is an expensive and time-consuming process, and the ontology will continue to require resources for as long as it is to be maintained and updated: server space needs to be paid for, content management systems need to be kept up to date, and people revising the ontology will need to be paid. The size of these costs will differ significantly, depending on the nature of the ontology: the cost of sustaining a proprietary enterprise ontology with hundreds of thousands of concepts on a private server will differ considerably from the cost of sustaining a small ontology element set for wider adoption on the semantic web. Nonetheless, economic sustainability for any ontology requires a sufficient demonstration of the value of the ontology to those who are expected to contribute to it, whether they are people with the resources to finance an ontology’s continued maintenance or those who may volunteer to contribute to the necessary workload.
Social sustainability refers to the ‘maintenance and improvement of well-being of current and future generations’ (Chowdhury, 2014, 18). It is not sufficient for an ontology to be on a sound financial footing; it is also necessary that it continues to do the job it is designed to, as well as, if not better than, when it is first deployed. Do the ontology instances continue to provide sufficient coverage of the domain for people to be able to index and find the resources they want? Does the element set reflect current understanding of a particular domain? Do updates to an ontology involve a high work burden on users of the ontology? Does the ontology form the basis of a dynamic discussion about knowledge and knowledge representation in the field, or does it quietly gather dust until its more obvious deficiencies are recognized?

Finally, environmental sustainability is an increasingly important consideration for every individual and organization. Although this may be the least important of Chowdhury’s (2014) three dimensions of sustainability – after all, most ontologies are unlikely to have a significant environmental impact – it should nonetheless be a consideration. Even if the finances are available for monthly meetings on ontology development with participants from around the world, and the participants feel their well-being is enhanced from such meetings, from the perspective of environmental sustainability such meetings would obviously have a negative impact.

There is an understandable difficulty in achieving both economic and social sustainability: economic sustainability can be facilitated by reducing the maintenance costs, whilst the social sustainability is achieved by increasing the number of hours spent working on keeping the ontology up to date. In some situations it may be possible to overcome the seeming impasse through technology and crowdsourcing. For example, Murdock, Buckner and Allen (2012) have tried to automate the maintenance of InPhO (Indiana Philosophy Ontology) as much as possible using data mining and online expert feedback. However, where there is no continuing support for an ontology within an institution (e.g., where it was tied to a short-term research project) then it may be necessary to hand over responsibility to an external organization, or even the community as a whole. Placing an ontology on a platform such as GitHub, as the Library Holdings Ontology (http://dini-ag-kim.github.io/holding-ontology/holding.html) has been, with a sufficiently open licence for reuse and the development of derivative products, enables a community of users to develop an ontology as they wish.

Ontology development example: Bibliometric Metrics Ontology element set

The rest of this chapter details the creation of a bibliometric metrics ontology element set, designed to illustrate the ontology development process. The example is not designed to produce a fully polished ontology (although the end product has been published online), nor is it designed to provide highly detailed descriptions of how to use the tools that are utilized; rather, the focus is on the decision process that may occur at each stage.

Bibliometrics was originally defined as ‘the application of mathematics and statistical methods to books and other methods of communication’ (Pritchard, 1969, 349), and in recent years interest has exploded both within the information science community and amongst social scientists and public officials interested in science policy. Interest has been driven by both a wave of new forms of communication, and an interest in directing limited resources to those individuals, institutions or research areas with the most potential for return on investment. There is an increasingly large number of metrics and indicators, however, that have formed the basis of an increasing variety of research studies, and assimilation of the ever-expanding knowledge is increasingly difficult. Modelling bibliometric metrics and indicators would provide a way for current knowledge to be represented, gaps to be identified, and associated studies to be found.

1. Ontology scope

It is important in the creation of any ontology to start with a clear idea of what the scope of the ontology is. Bibliometrics can be defined in many ways, from narrow definitions focusing on the references in traditional forms of communication to inclusive definitions, including statistical methods applied to a wide range of communications. Bibliometric metrics and indicators will also be related to a variety of informal and formal publications, which in turn have associated authors at various institutions. The metrics and indicators will be based on different units of analysis, and studies will have applied them to various data sets, and compared them with various non-bibliometric metrics and indicators reaching various conclusions. The extent to which each of these aspects will be modelled by an ontology will depend heavily on the purpose of the ontology. In this case, the primary purpose is keeping track of the metrics and indicators, rather than detailed results of all the papers applying the metrics. Nonetheless the ontology should be able to represent both metrics and publications associated with those metrics.

Figure 5.2 provides a first draft of the Bibliometric Metrics Ontology based on the original ontology scope. There are two classes within the ontology: :Metric and :Publication. Although no data properties have yet been defined, some provisional object properties have. The model recognizes that publications can be related to metrics in multiple ways; a publication may mention, define or use a particular metric. Publications also cite one another, and metrics often extend work that has come before.

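[Figure 5.2 shows two classes, :Publication and :Metric. :Publication is linked to :Metric by :mentions, :defines and :uses; :Publication is linked to itself by :cites; :Metric is linked to itself by :extends.]

Figure 5.2 First draft of the Bibliometric Metrics Ontology, with two classes and provisional relationships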

2. Ontology reuse

There is a wide range of existing ontologies that may be reused in the development of an ontology. Choices regarding ontologies for representing an ontology (e.g., RDF, RDFS, OWL 2) or an upper ontology (e.g., BFO) will depend heavily on how widely the ontologies have been adopted in a particular community. The Bibliometric Metrics Ontology is not designed to be part of a proprietary system, but rather to facilitate the public sharing of information about bibliometric metrics, and as such would benefit from being represented using the vocabularies of the semantic web where necessary: RDF, RDFS, OWL. Upper ontologies have had limited uptake on the semantic web, seemingly due to the added level of complexity they introduce to the understanding and adoption of an ontology. Although schema.org provides a simpler solution to embedding an ontology into the wider web of data, there would be limited value in extending schema:Intangible or schema:StructuredValue with a Bibliometric Metrics Ontology, which would at the same time restrict the structuring of the ontology. Therefore this section focuses on the existence of suitable elements for the proposed :Publication and :Metric classes.

Publication class

There are already a large number of existing vocabularies that may be used to represent publications (e.g., DC, RDA, Bibliographic Ontology), and without any specialist requirements any of these might be suitable. In this case Dublin Core Terms (DC Terms) has been selected to provide elements for the :Publication class, as it is both simple and designed to incorporate a wide range of publication types. DC Terms is less suitable, however, for expressing relationships between publications and metrics. Although there is a dct:references property (alongside the inverse dct:isReferencedBy), this may not be considered to provide a sufficiently refined level of granularity. More specialist ontologies are available, however, most notably CiTO (Citation Typing Ontology) (http://purl.org/spar/cito). As well as the general citation property cito:cites, CiTO also includes a large number of citation sub-properties, including cito:extends, cito:usesMethodIn, cito:describes, and cito:discusses. Importantly, domain and range restrictions were removed from version 2.0 of CiTO, allowing the properties to be used in other contexts as well as traditional bibliographic citations, making it potentially suitable for the relationships between publications and metrics, and between metrics.

Metric class

There are not as many well-known ontology element sets to describe a metric as there are to describe publications. Nevertheless, a search for ‘metric’ in the Linked Open Vocabularies (http://lov.okfn.org) ontology library reveals a metrics class in CERIF (Common European Research Information Format). CERIF, however, is a complex model for describing a wide range of entities associated with the research process, and is an XML schema rather than an RDF ontology. Version 1.3 of CERIF was published in an RDF format (https://code.google.com/p/cerif-linked-data), but a metrics class was not included in the encoding. Nonetheless it is important to consider the CERIF format, as it is beginning to be used by some in the bibliometrics community. Snowball Metrics, an initiative to establish robust bottom-up metrics for institutional metrics, currently publishes its metrics in CERIF to encourage the sharing of metrics data (www.snowballmetrics.com/eurocris-cerif-xml-for-snowball-metrics).

Most of the CERIF properties are for sharing metric data rather than describing metrics. The indicators, and the measures that comprise them, are primarily described by unique IDs, names, descriptions and keywords using elements from CERIF’s own namespace, whilst the CERIF approach of having separate entities for any potential shared attribute means that the indicator and measurements section consists of thirteen different types of entity. A metrics ontology aimed primarily at the bibliometrics community is able to simplify the model without loss of information, thus making it accessible to a wider range of users.

The distinction between indicators and measurements is an important one, however. Within the bibliometrics community the need to distinguish between metrics and indicators is recognized (e.g., HEFCE, 2015), as there is often a leap between what is measured and what the measure is being used to indicate. The term ‘indicator’ may be more appropriate for both the ontology and class name, and is used throughout the rest of this example.

Figure 5.3 shows an updated version of the draft ontology, incorporating those classes and properties that are being reused from elsewhere. Only two of the properties for a bibliographic resource have been incorporated in Figure 5.3, due to space restrictions.

[Figure 5.3 shows the classes dct:BibliographicResource and :Indicator, the object properties cito:describes, cito:discusses, cito:usesMethodIn, cito:extends and cito:cites, and the properties dct:title and dct:creator, each pointing to a Literal.]

Figure 5.3 Second draft of the renamed Bibliometric Indicators Ontology


3. Identify appropriate software for ontology development

As has already been detailed, there is a wide range of ontology development software available. As there are no special requirements for the Bibliometric Indicators Ontology (it is neither particularly large nor requires multiple collaborators), the most widely adopted ontology editor, Protégé, is sufficient. Figure 5.4 provides a screenshot of the basic installed Protégé instance, with the four default tabs: Active Ontology; Entities; Individuals by class; and DL Query. ‘Individuals by class’ is fairly self-explanatory, relating to the constituent parts of an ontology in Protégé. The ‘Entities’ tab provides a mixed overview of the content of the ontology, and ‘Active Ontology’ is for the provision of information about the ontology. ‘DL Query’ provides an interface for searching the ontology.

Figure 5.4 Screenshot of Protégé 5.0 with the Entities tab selected

Figure 5.4 shows Protégé with the Entities tab selected. The ontology’s IRI has been changed on the ‘Active Ontology’ tab, and the classes and properties identified in step 2 added. The prefix for DC terms (dcterms:) is widely used and is automatically recognized by Protégé to produce the relevant IRI and label. CiTO is not such a widely used ontology and it is therefore necessary to add an appropriate prefix (cito) and a base URI (http://purl.org/spar/cito/), as well as a prefix for the Bibliometric Indicators Ontology (bino) and its base URI (http://www.davidstuart.co.uk/ontologies/bino#). The view has been changed to ‘Render by prefixed name’, clearly displaying the associated sources for each of the elements.


Only some of the properties are displayed in Figure 5.4: a distinction is made between data properties and object properties, and only the object properties are shown because the ‘Object property hierarchy’ tab has been selected within the ‘Entities’ tab. The distinction concerns the target of the properties: data properties have literals as the target; object properties have other resources as the target. The title and creator of the bibliographic resource are data properties.

4. Knowledge acquisition

The process of knowledge acquisition began with the identification of existing ontologies and vocabularies for reuse, but the most important source of information on bibliometric indicators is undoubtedly the research papers defining them. Identifying relevant research papers may require an interview with one or more domain experts, who will be able to highlight specific indicators, research papers or associated keywords. Taking an inclusive definition of bibliometrics that includes non-traditional forms of communication (e.g., web pages and social media) means there are hundreds of indicators that need to be accommodated within the ontology. For example, in a review of the characteristics of author-level bibliometric indicators, Wildgaard, Schneider and Larsen (2014) analysed 108 indicators, and this did not include altmetric and webometric indicators, whilst there are equally numerous indicators at a journal, article and institutional level.

5. Identify important terms

Analysis of a small sample of indicators quickly identifies some of their most important properties: indicators have names or labels; indicators are designed for the analysis of different object types; indicators are calculated using different variables (e.g., the number of citations) and with some constraints (e.g., a two-year period of time).

6. Identify additional terms, attributes and relationships

:name, :objectOfAnalysis, :hasVariable and :hasConstraint may be considered the minimum associated properties. Additional properties could include alternative names for the indicator, a description of the indicator, the type of indicator (e.g., webometric, altmetric, bibliometric), and whether the indicators are associated with, or dependent on, a particular data source (e.g., number of retweets is associated with the Twitter social network site). In Figure 5.5 :objectOfAnalysis, :hasVariable and :hasConstraint have all been designated data properties. In fact, they may be better represented as object properties, enabling the encoding of relationships between the concepts and consistency in the terms used.

[Figure 5.5 shows dct:BibliographicResource and :Indicator connected by cito:describes, cito:discusses, cito:usesMethodIn and cito:extends, with :Indicator having the properties :name, :objectOfAnalysis, :hasVariable and :hasConstraint, each pointing to a Literal.]

Figure 5.5 Properties associated with the Bibliometric Indicators Ontology

7. Specify definitions

If the ontology element set is to be widely and consistently adopted it is important that there are clear definitions of the properties and how they should be applied. For example, explicitly stating the distinction between a variable and a constraint:

• :hasVariable associates an :Indicator with an input that is used in the calculation of that indicator and is expected to vary for different objects of analysis. Cardinality: min 0, max unbounded.
• :hasConstraint associates an :Indicator with a limitation that is placed on an :Indicator. For example, the Journal Impact Factor :hasConstraint time, as only citations from a two-year period are counted. Cardinality: min 0, max unbounded.

8. Integration with existing ontologies

For the Bibliometric Indicators Ontology there is no need for specialist alignment software, or for exploring the appropriateness of particular text-matching algorithms. The ontology is of a sufficiently small size for integration decisions to be made on human judgement alone.


The proposed :name element may be replaced by skos:prefLabel, with other SKOS elements suitable for the alternative names, descriptions, concepts and the relationships between concepts (see Figure 5.6 – again only two of the properties for a bibliographic resource have been incorporated due to space restrictions). Relationships might also be encoded to the elements in the CiTO schema.

[Figure 5.6 shows three classes: dct:BibliographicResource, :Indicator and skos:Concept. Properties shown: cito:describes, cito:discusses, cito:usesMethodIn, cito:extends and cito:cites; dct:title and dct:creator (to Literal); skos:prefLabel, skos:altLabel and skos:note (to Literal) on both :Indicator and skos:Concept; dct:type, :objectOfAnalysis, :hasVariable and :hasConstraint (to skos:Concept); and skos:broader, skos:narrower and skos:related on skos:Concept.]

Figure 5.6 Bibliometric Indicators Ontology (BInO) – v. 0.1
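As a rough illustration of how data encoded with version 0.1 of the model might later be interrogated, the sketch below uses SPARQL (covered in Chapter 6) to list indicators alongside their constraints. It assumes that the published element set uses the local names Indicator and hasConstraint under the bino namespace given in step 3, and that both indicators and constraint concepts carry a skos:prefLabel as in Figure 5.6; these names are taken from the figure and should be checked against the published files before being relied upon.

```sparql
PREFIX bino: <http://www.davidstuart.co.uk/ontologies/bino#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Assumed local names (Indicator, hasConstraint) follow Figure 5.6.
SELECT ?label ?constraintLabel
WHERE {
  # Every indicator that has a preferred label and at least one constraint
  ?indicator a bino:Indicator ;
             skos:prefLabel ?label ;
             bino:hasConstraint ?constraint .
  # The constraint is modelled as a concept with its own label
  ?constraint skos:prefLabel ?constraintLabel .
}
```

Because :hasConstraint has a minimum cardinality of 0, indicators without a recorded constraint would not appear in the results of a query written this way.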

9–11. Implementation, evaluation and documentation

Implementation of the Bibliometric Indicators Ontology consists of publishing the ontology element set, so that it may be more widely adopted, and applying it to the subset of indicators that the ontology creators wish to encode. The Bibliometric Indicators Ontology element set has been published online using GitHub (https://github.com/dpstuart/bino), alongside the subset of indicators that have already been encoded. These indicators also provide concrete examples of the application of the ontology element set. Much of the narrative of the documentation has been encoded within the OWL file, and HTML files can be generated automatically from the RDF/XML using LODE (http://www.essepuntato.it/lode/http://www.davidstuart.co.uk/ontologies/bino.owl).


12. Maintenance and sustainability

Although the Bibliometric Indicators Ontology element set undoubtedly has value beyond this book, it has primarily been created as an example for the book, and the author is not planning to develop the ontology in the future. By publication of the ontology element set and accompanying instances on GitHub, and with an appropriate copyright licence, the ontology can nonetheless be developed and reused by others.

Conclusion

Information professionals are far more likely to use an ontology or knowledge base than develop one, but it is nonetheless important that they are aware of some of the decisions that will have been made in the development process. Allemang and Hendler (2011, 312) identify four common errors when modelling for the semantic web: rampant classism; creeping conceptualization; exclusivity; and objectification.

The first two are both easy to understand and easy to rectify. Rampant classism refers to the habit of treating everything as a class, even when it would be better modelled as an instance or a property. Creeping conceptualization refers to continuing to extend an ontology beyond what is actually required; just because there are additional concepts that may be included in an ontology, it doesn’t mean that they should be included.

The last two should be less of a problem for the library and information professional focusing more on the development of informal ontologies and application profiles for linked data, than the more formal ontologies of those involved with large-scale knowledge bases and artificial intelligence. Nevertheless, they are important to consider for the information professional interested in the development of more formal ontologies. Exclusivity refers to the mistaken belief that because all known members of a subclass belong to a superclass, all future members will also belong to the superclass. For example, if AcademicInstitution is defined as a subclass of Library, then other academic institutions may be mistakenly inferred to be libraries even if they are not. Objectification refers to the mistaken belief that a class is a template for creating instances, when in fact an instance may be a member of multiple classes.
Although highly intricate ontologies may enable more inferences to be drawn from a data set, the emergence of linked data also shows that there can be a lot of value in the publication of data with relatively informal relationships between concepts. Most library and information professionals are likely to wish to start with the development of small informal ontologies, with only a small proportion progressing to the development of extensive formal ontologies. The Bibliometric Indicators Ontology is an overly simplistic example, demonstrating the process rather than the final object, yet it should nevertheless have demonstrated how attainable such ontologies are for the information professional. Even if information professionals don’t develop ontologies, they will still play an important role in the development of an increasingly semantic web.


CHAPTER 6

Interrogating ontologies

Introduction

Ontologies are for use, and generally such use extends beyond that of the creator of the ontology. It is therefore important to consider the different ways ontologies can be explored, or rather, interrogated. ‘Interrogate’ carries associations of a closer and more formal investigation of an ontology, which is indeed what ontologies are designed for. Ontologies will be queried by different users in different ways, and there is a wide range of tools of varying complexity for interacting with different ontologies, but ontologies are formal representations of knowledge and extracting the full value from an ontology requires queries of equal formality.

Ontologies, and the associated instance data, are generally interrogated for one of three reasons: to determine whether an ontology is suitable for reuse; to extract information from an ontology/data set; and to gather information about an ontology’s use. This chapter considers each of these reasons in turn, and discusses some of the most appropriate technologies for each:

• interrogating ontologies for reuse
  – ontology editors and viewers
  – ontology reasoners
• interrogating a knowledge base
  – search engines and personal assistants
  – SPARQL
• understanding an ontology’s use
  – ontology search engines
  – semantic crawlers.

Of course the tools are not restricted to one particular use; there is inevitably some overlap between the tools and they are discussed as appropriate.


Interrogating ontologies for reuse

One of the main reasons for wanting to understand an ontology is so that it can be reused. This may involve reusing an ontology element set to structure your own information, or reusing a vocabulary for the indexing of a set of documents.

Ontology editors and viewers

A simple way to investigate an ontology is to read the documentation and explore the ontology in one of the tools mentioned in Chapter 5 for developing ontologies (e.g., TopBraid, Protégé) or publishing ontologies (e.g., WebVOWL, SKOS Play). LodLive (http://en.lodlive.it) also provides a simple web-based visual interface for exploring an ontology. A simple ontology may require little more than a text editor, although most encoded ontologies would benefit from being viewed through an editor or viewer, especially complex ontologies that make use of unions, intersections and functional properties.

There are limits, however, as to what can be clearly understood about an ontology by viewing it in an editor or viewer. Even if an ontology can be clearly displayed, and semantic web plugins for graph visualization software (e.g., SemanticWebImport for Gephi https://marketplace.gephi.org/plugin/semanticwebimport) enable many thousands of nodes to be viewed at once, the logical implications of the nature of relationships are not necessarily clear.

Ontology reasoners

All ontologies are formal, according to the definition at the start of this book, but some are more formal than others. Ontologies fall on a continuum from the informal to the extremely formal; from those with few constraints to those with strict rules. Until now this book has focused primarily on the development of quite informal ontologies. It is worth considering, however, more formal ontologies, and the importance of logical consistency when deciding whether or not to reuse an ontology, or determining the robustness of an ontology that has been created.

Reasoners are designed to make inferences based on explicit rules. Sometimes a distinction is made between reasoners and inference engines, with a reasoner including a wider range of techniques for the identification of new statements based on existing knowledge. Here, however, the terms are treated as synonymous. Reasoners are often included within a triplestore’s query engine, enabling the querying of inferred knowledge, but they can also be used to report on inconsistencies in an ontology.

Take, for example, a familial property :hasParent, potentially used for encoding the relationship between a child and its parents. Such a relationship is an asymmetric property, i.e., if person A has parent person B, then person B can never have the parent person A. Of course there is nothing to stop one or more people encoding such information, and as has already been mentioned, on the semantic web ‘anyone can make statements about any resource’ (W3C, 2004), but if they do the information will be logically inconsistent. The following triples may be added to an ontology in Protégé:

:hasParent  rdf:type    owl:AsymmetricProperty .
:John       :hasParent  :Daniel .
:Daniel     :hasParent  :John .

But running an OWL reasoner such as FaCT++ or HermiT on such an ontology will produce a warning that ‘Your ontology is inconsistent . . .’. There is a wide range of reasoners, supporting different ontology and rule languages, providing different interfaces on different platforms, and being made available under different licences, and there have been a number of comparisons of the reasoners (e.g., Singh and Karwayun, 2010; Dalwadi, Nagar and Makwana, 2012). In most cases, however, as long as the reasoner is OWL 2-compliant, the choice of reasoner for the information professional will primarily be based on which reasoner is integrated or compatible with technologies that are already being used, whether that is an ontology editor such as Protégé or an RDF framework such as Jena. Where an ontology is highly formal, and there are potentially significant consequences from an inconsistent ontology, it may, however, be prudent to test an ontology with multiple reasoners and explore any differences in the results, as discrepancies between reasoners have been found (Lee et al., 2015).
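The logic behind the inconsistency can be illustrated without a reasoner. The sketch below uses a SPARQL ASK query (SPARQL is introduced later in this chapter) to test whether any two individuals have been asserted to be each other's parent. The namespace is a hypothetical one chosen for illustration, and the query is only a rough stand-in for a reasoner: it inspects explicitly stated triples, whereas a reasoner also takes inferred statements into account.

```sparql
# Hypothetical namespace for the family example; substitute the ontology's own IRI.
PREFIX : <http://example.org/family#>

# Answers true if some pair of individuals are asserted to be each other's
# parent, which contradicts :hasParent being an owl:AsymmetricProperty.
ASK {
  ?a :hasParent ?b .
  ?b :hasParent ?a .
}
```

Run against the three triples about :John and :Daniel, a query of this kind would be expected to answer true, flagging the same pair of statements that causes the reasoner to report an inconsistency.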

Interrogating a knowledge base

Search engines and personal assistants

The semantic web is not something separate from the web that people use every day, but rather is an increasingly integrated part of it. Millions of people now interact with semantic content every day through search engines and automated assistants (e.g., Cortana and Siri) without even considering it; in fact the company that created Siri was co-founded by the same Tom Gruber whose definition of ontology is so widely adopted: ‘an explicit specification of a conceptualization’ (Gruber, 1993, 199). Such applications are designed for the general user, with simple query or natural language interfaces. This inevitably limits the complexity of the query that can be created. Although a simple question such as ‘Which is the highest lake in Europe?’ may be successfully answered, a more complex question such as ‘Which is the highest lake in Europe that has a size greater than that of Windermere?’ is more likely to receive an erroneous answer, if any at all. Such questions require a more formal query language.


SPARQL

SPARQL queries typically represent small graphs of triples, the parts of which may be either known or unknown, which are then compared with a larger graph or graphs to identify the unknown elements. Creating SPARQL queries is more complicated than entering a few keywords into Google, and relies on an understanding of the associated ontology element set, although this understanding may be built from more general queries.

The development of SPARQL queries is a topic that could be the subject of a book in its own right, and indeed has been (e.g., Ducharme, 2013), and there isn’t room for a complete introduction to the topic here. An introduction is therefore provided by example, leading the reader from the simplest query through increasingly complicated examples. The examples make use of three public ontologies, each with SPARQL query forms so that readers may enter and adapt the queries if they wish. There are many additional sources of information on SPARQL, from books and online tutorials, to the SPARQL standard itself:

• Allemang, D. and Hendler, J. (2011) Semantic Web for the Working Ontologist, 2nd edn, Morgan Kaufmann
• Ducharme, B. (2013) Learning SPARQL, 2nd edn, O’Reilly Media
• SPARQL 1.1 Query Language (2013) www.w3.org/TR/sparql11-query.

Allemang and Hendler (2011) provide an accessible and fairly thorough introduction to SPARQL, sufficient for most users, as part of a wider introduction to ontologies and the semantic web, whilst Ducharme (2013) provides the sort of thoroughness that can be expected from 350 pages devoted to a query language. SPARQL 1.1 introduced a wide range of additional functionality, including aggregate functions, subqueries and negation, and so it is important to ensure that resources relate to the appropriate version.

A universal SPARQL example

The simplest useful SPARQL example is one that retrieves all the triples from a triplestore, and is suitable for any SPARQL endpoint:

SELECT * WHERE {
  ?s ?p ?o .
}

The triple pattern syntax of SPARQL (i.e., the section between the WHERE curly brackets) is similar to the Turtle serialization of RDF, although one of the main differences is that, unlike in Turtle, variables may be substituted for each part of the triple, whether subject, predicate or object. This triple pattern is matched against a sub-graph of the RDF data, and the values that match the variables may be returned. In this case three variables are used, and the query will match every triple in the graph that is being queried.

The SELECT part of the query identifies which of the variables to return. SELECT * selects all of the associated variables, and the same results could have been achieved in the above query with SELECT ?s ?p ?o. It may seem as though the SELECT part of the query is unnecessary; however, a query may use variables that the user does not wish to collect information about. For example, someone exploring the properties of a data set may only wish to retrieve the list of distinct predicates used within it. The DISTINCT modifier following SELECT indicates that only differing results should be shown.

Often triplestores have limits on the number of results that can be downloaded for any particular query, generally to conserve processing power or to prevent perceived abuse of a data set. A user of a data set may also wish to limit the number of results that are returned, either to reduce the time the query takes or, when sampling, because they only wish to collect a certain number of results. This may be achieved by adding a LIMIT clause stating the maximum number of results returned:

SELECT * WHERE {
  ?s ?p ?o .
}
LIMIT 10
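The case of retrieving only the distinct predicates in a data set, mentioned above, might be written along the following lines. This is a minimal sketch rather than a query taken from any particular endpoint's documentation, and the limit of 100 is an arbitrary illustrative value.

```sparql
# ?s and ?o are needed to form the triple pattern but are not returned;
# DISTINCT ensures each predicate is listed once, however many triples use it.
SELECT DISTINCT ?p
WHERE {
  ?s ?p ?o .
}
LIMIT 100   # illustrative cap on the number of predicates returned
```

Only ?p appears after SELECT, which shows why the SELECT clause is useful: the subject and object variables take part in the matching without cluttering the results.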

Simple SPARQL matches

The first SPARQL query provided above didn’t require any knowledge of the ontology or knowledge base. A slightly more complicated query might make use of knowledge of the ontology element set. The following query makes use of knowledge of the British Library Data Model (www.bl.uk/bibliographic/pdfs/bldatamodelbook.pdf) to list those resources written by the author of this book:

PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?book WHERE {
  ?book dct:creator  .
}


PREFIX is used so that namespaces can be abbreviated within the triple pattern syntax. In this case the search is for the creator property from the Dublin Core terms, which the British Library uses to express the relationship between books and their authors. The property doesn’t link the name(s) of the books and the name(s) of the author, but rather URIs representing the books with URIs representing the authors. In this case the URI representing the author has been taken from the British Library site. The results from entering the above query in the British National Bibliography SPARQL editor (http://bnb.data.bl.uk/flint-sparql) are two URIs for the resources:

http://bnb.data.bl.uk/id/resource/015855235
http://bnb.data.bl.uk/id/resource/016295302

Retrieving the names of the books requires the addition of another triple pattern to the query:

PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?title WHERE {
  ?book dct:creator  ;
        dct:title ?title .
}

The query now identifies books that have both the author URI and a title, and returns the titles:

Facilitating access to the web of data : a guide for librarians
Web metrics for library and information professionals

Within the Turtle serialization the use of the semi-colon at the end of a statement means that the same subject is applied to the following statement. So in the above query the ?book variable is implied in the second statement (i.e., ?book dct:title ?title.). The use of a comma means that both the subject and the predicate are repeated in the next statement. For example, the query below returns the URIs for books in the British National Bibliography that have both Library Science and Information Science as subjects:


PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?book WHERE {
  ?book dct:subject  ,  .
}
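The effect of the semi-colon and comma shorthands can be seen by writing one pattern both ways. In the sketch below the shorthand form is shown in comments and the equivalent long form is given as the query itself. The example.org URIs are hypothetical stand-ins for an author and two subjects, not identifiers from the British National Bibliography.

```sparql
PREFIX dct: <http://purl.org/dc/terms/>

# Shorthand form (hypothetical URIs):
#   ?book dct:creator <http://example.org/person/a> ;
#         dct:title   ?title ;
#         dct:subject <http://example.org/subject/x> ,
#                     <http://example.org/subject/y> .
#
# The same pattern written out in full, one complete triple per line:
SELECT ?book ?title
WHERE {
  ?book dct:creator <http://example.org/person/a> .
  ?book dct:title   ?title .
  ?book dct:subject <http://example.org/subject/x> .
  ?book dct:subject <http://example.org/subject/y> .
}
```

Both forms describe the same four triple patterns, so the choice between them is one of readability rather than meaning.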

Matching literals in a SPARQL query

In the previous SPARQL queries URIs have been used for the predicates and the author of the book. In most instances, however, the URI will not be known, or the value may be a literal rather than a URI. SPARQL includes a number of functions for querying literals, whether string, numeric or date, and in conjunction with the FILTER keyword these enable the set of results returned to be restricted to those meeting particular criteria. Of interest to many potential searchers will be whether a text contains a particular term or phrase, and there are two functions in SPARQL for this: CONTAINS and REGEX.

As the name would suggest, the CONTAINS function tests whether one string is contained within another. The example query below has been constructed for the Europeana data set, metadata on 43 million objects collected by Europeana. The data adheres to the Europeana Data Model (http://pro.europeana.eu/page/edm-documentation), and the query may be entered in Europeana’s SPARQL form (http://europeana.ontotext.com/sparql):

PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?item ?title WHERE {
  ?item dc:title ?title .
  FILTER (CONTAINS (?title, "library"))
}
LIMIT 10

The above query returns the first ten items that are found with ‘library’ somewhere in the title. Although the Dublin Core prefix has been included in the query, in this case it is unnecessary as the Europeana SPARQL endpoint has predefined namespaces for those which are part of the Europeana Data Model. The inclusion of limits can be important where there are a large number of items in a triplestore, as text matching can be computationally intensive. It is also important to recognize that the query is case-sensitive (i.e., the above query matches ‘library’ and not ‘Library’).

A more powerful (and accordingly more computationally intensive) matching function is REGEX, which enables the matching of a text against a regular expression pattern. This means that rather than matching a specific string of characters it is possible to match against a particular pattern. For example, rather than searching for a specific e-mail address it is possible to search for any e-mail address by defining the pattern e-mail addresses follow. Regular expressions are now supported in many programming languages, and there are a large number of tutorials and introductions to the subject (e.g., www.zytrax.com/tech/web/regex.htm). The SPARQL REGEX function also has an optional parameter (‘i’) to make a search case-insensitive. The following query filters the Europeana results according to whether or not there is a singular ‘library’ or plural ‘libraries’ in the title, insensitive to case. It would not match the term ‘librarian’:

SELECT ?item ?title WHERE {
  ?item dc:title ?title .
  FILTER (REGEX (?title, "librar(y|ies)", "i"))
}
LIMIT 10

Regular expressions can quickly become unwieldy, although there are simple online tools for testing that an expression is making the expected matches (e.g., www.regexpal.com).
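Where only case-insensitivity is wanted, rather than the full power of a pattern, one possible alternative is sketched below: the title is lower-cased before CONTAINS is applied. This is an illustration rather than a query taken from the Europeana documentation. It assumes the same endpoint and dc:title property as the queries above, and uses the standard SPARQL 1.1 functions STR and LCASE.

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?item ?title WHERE {
  ?item dc:title ?title .
  # STR drops any language tag; LCASE lower-cases the title,
  # so that 'Library' and 'library' are both matched
  FILTER (CONTAINS (LCASE (STR (?title)), "library"))
}
LIMIT 10
```

Unlike the REGEX query, this sketch would also match ‘librarian’ and ‘libraries’, as it tests only for the substring.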

OPTIONAL and UNION matches

Data is not always structured in the manner that a searcher requires, or in a uniform manner. This is especially true where the data has been crowdsourced, as is the case with DBpedia. DBpedia is derived from the structured data within Wikipedia, which is the accumulation of millions of people making over 800 million page edits. Unsurprisingly, Wikipedia pages for the same type of entity differ considerably in terms of the extent of entries and the terms used. It would be more surprising, in fact, if we found the page describing the President of the United States was only as detailed as that of a second-rate academic at a provincial university barely passing Wikipedia’s test of notability, or that every notable person’s occupation adhered to ISCO (the International Standard Classification of Occupations). It is therefore useful to be able to combine queries, and to include information only where it is available.

The example below retrieves a set of information professionals represented in DBpedia and can be entered in the DBpedia SPARQL query form (http://dbpedia.org/sparql). There are many different types of information professional and they may be referred to using a seemingly endless variety of different titles, including librarian, information scientist, cataloguer and bibliometrician. Using ISCO each of these titles could be subsumed under ‘Librarians and related information professionals’, but Wikipedia editors are not restricted to such classifications. The example below combines those people with the dbo:occupation property linking to dbr:Librarian with those linking to dbr:Information_Scientist:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    SELECT ?person WHERE {
      { ?person dbo:occupation dbr:Information_Scientist }
      UNION
      { ?person dbo:occupation dbr:Librarian }
    }

The importance of union queries is illustrated by some of the notable information professionals who have not been identified by the above query. For example, Anthony Panizzi, the influential head of the British Museum in the nineteenth century, doesn’t have a dbo:occupation attribute, but rather he has a dc:description ‘British librarian’. Eugene Garfield, one of the founders of bibliometrics, doesn’t have dbo:occupation dbr:Information_Scientist but rather dbo:occupation dbr:Scientist. Creating an exhaustive list of information professionals in DBpedia would require a far more extensive query.

There is a wide variety of additional information about the set of librarians that may also be retrieved from DBpedia. OPTIONAL pattern matching means the information will be retrieved if it is available, but it will not lead to the exclusion of a resource if it is not available. For example, the query below will retrieve employer information about those with the dbo:occupation dbr:Librarian, but the existence of an employer is not a condition for a person to be matched:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    SELECT ?person ?employer
    WHERE {
      ?person dbo:occupation dbr:Librarian .
      OPTIONAL { ?person dbo:employer ?employer . }
    }


CONSTRUCT, DESCRIBE, and ASK

All of the above SPARQL examples have been based around a SELECT query. The other SPARQL query form the information professional is most likely to make use of is CONSTRUCT. This allows the creation of a new RDF graph based on the results of a query on an existing graph. This can be particularly useful if you wish to reuse part of a data set, or create new triples based on information in an existing graph. For example, the following query uses DBpedia’s dbo:parent property to return an RDF graph in which a grandparent relationship is stated between people in DBpedia:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    # The fo: namespace URI is a placeholder (an assumption);
    # substitute the URI of the family ontology actually in use
    PREFIX fo: <http://example.org/family#>

    CONSTRUCT { ?person fo:hasGrandparent ?grandparent }
    WHERE {
      ?person dbo:parent ?parent .
      ?parent dbo:parent ?grandparent .
    }

There are also two other types of query that the information professional will use less often: ASK and DESCRIBE. ASK allows for testing whether a query pattern has a solution or not, rather than providing the solution. DESCRIBE returns a single RDF graph as a response to a query.
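By way of a minimal sketch, an ASK query might take the following form. It reuses the dbo:occupation and dbo:employer properties from the earlier DBpedia examples; the particular question asked is chosen purely for illustration.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Is there at least one person recorded as a librarian
# who also has an employer recorded?
ASK {
  ?person dbo:occupation dbr:Librarian ;
          dbo:employer ?employer .
}
```

The response is a single true or false value rather than a table of results. A DESCRIBE query is similarly brief (for example, DESCRIBE dbr:Librarian), although the triples that are returned are decided by the endpoint rather than by the query.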

COUNT and GROUP

COUNT is a SPARQL 1.1 aggregate function that counts the number of values. For instance, in DBpedia the employer of people is often listed, and the query below counts the number of times an organization is listed as someone’s dbo:employer and orders the results in descending order:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    SELECT ?org (COUNT (?org) AS ?count) WHERE {
      ?person dbo:employer ?org .
    }
    # SPARQL 1.1 requires GROUP BY when a plain variable
    # is selected alongside an aggregate
    GROUP BY ?org
    ORDER BY DESC(COUNT(?org))

The most mentioned employer is the BBC, which is listed as an employer of 158 people. The validity of such answers relies on the knowledge that is encoded, and the consistency with which it is encoded. Other organizations have undoubtedly employed more than 158 people passing the criteria of Wikipedia notability, but their relationships are not encoded as dbo:employer. Take, for example, the football club Manchester United. It has employed a large number of footballers over the years, although only two people have the triple dbo:employer dbr:Manchester_United_F.C. recorded. The COUNT function can be used to explore the other predicates that are used to link people to the football club:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT ?predicate (COUNT (?predicate) AS ?count) WHERE {
      ?person rdf:type dbo:Person .
      ?person ?predicate <http://dbpedia.org/resource/Manchester_United_F.C.> .
    }
    GROUP BY ?predicate
    ORDER BY DESC(COUNT(?predicate))

Rather than people having the football team as their employer, it can be seen that most football players actually have it as a dbo:team, for which it is the object of 2036 triples.

Aggregate functions such as COUNT, as well as AVG and SUM, are often used in conjunction with the keyword GROUP BY. As the name suggests, GROUP BY groups results together according to particular criteria. For example, the query below does not count the people linked by dbo:team to a single football team; rather, it counts every person with a dbo:team and groups the counts by team:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT ?team (COUNT (?person) AS ?count) WHERE {
      ?person rdf:type dbo:Person .
      ?person dbo:team ?team .
    }
    GROUP BY ?team
    ORDER BY DESC(COUNT(?person))

Additional functionality

This part of the chapter has provided a very brief introduction to SPARQL. The query language includes many additional functions and keywords that have not been mentioned, or only mentioned in passing, and more will undoubtedly emerge in the future. Different triplestores have developed their own extensions to SPARQL, whilst some triplestores also offer the ability for users to extend the functionality themselves. For example, the author of this book has previously built extensions for the SPARQL processor ARQ that measure the similarity between two strings (rather than necessitating they match exactly) and for applying named entity recognition. Those extensions that are successful may be adopted in future formal specifications.

SPARQL is an increasingly powerful query language, and the library and information professional interested in ontologies and the semantic web will benefit from exploring both its query language and the update functions introduced in version 1.1, which provide a standard way of deleting and adding triples to an existing graph. The power of SPARQL comes at a price, however. SPARQL is computationally intensive and SPARQL endpoints often suffer from either being slow or down completely. In an ideal world it would be possible to query the whole of the web of data, but in reality we can only hope to query small parts of it with the full functionality of SPARQL in the near future, with far more limited functionality for large-scale parts.
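The update functions mentioned above are not illustrated elsewhere in this chapter, so a minimal sketch of a SPARQL 1.1 Update request is given here. The ex: namespace and the record it describes are hypothetical, invented for illustration only, and the request assumes a local triplestore on which updates are permitted; public endpoints such as DBpedia’s are generally read-only.

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>
# Hypothetical namespace, for illustration only
PREFIX ex: <http://example.org/library/>

# Add two triples describing a (hypothetical) catalogue record
INSERT DATA {
  ex:record1 dc:title   "Practical Ontologies" ;
             dc:creator "Stuart, David" .
} ;

# Remove one of those triples again
DELETE DATA {
  ex:record1 dc:creator "Stuart, David" .
}
```

The two operations are separated by a semicolon and carried out in order. Where the triples to be changed are not known in advance, the DELETE and INSERT forms with a WHERE clause would be the more likely choice.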

Understanding ontology use

Understanding an ontology’s use helps not only with making a decision as to whether to adopt an ontology or not, but also with larger scientometric questions about the ontologies that are growing in use and the domains that are encoding information.

Ontology libraries and search engines

Ontology libraries and search engines generally have quite limited ontology metrics, restricted to those intrinsic to the ontologies themselves (e.g., number of classes or properties) or the number of ontologies indexed. Some, however, provide additional useful metrics. For example, Linked Open Vocabularies provides information on the number of incoming and outgoing links, to provide insights into how the different ontologies link to one another. An ontology has an incoming link if another ontology is making use of one of its terms, whereas it has an outgoing link if it is making use of a term from another ontology. Widely used generalist vocabularies such as RDF, Dublin Core and SKOS unsurprisingly have a large number of incoming links, whilst the majority of vocabularies generate few or no incoming links. This is shown in Figure 6.1, which plots the number of vocabularies reusing each vocabulary, in rank order. It is based on the results of the following SPARQL query:

    PREFIX voaf: <http://purl.org/vocommons/voaf#>

    SELECT DISTINCT ?vocab ?o WHERE {
      # Graph IRI assumed to be that of the LOV data set at the time of writing
      GRAPH <http://lov.okfn.org/dataset/lov> {
        ?vocab voaf:reusedByVocabularies ?o .
      }
    }

Figure 6.1 Number of reusing vocabularies in rank order

The number of ‘reusing vocabularies’ has a clear power law distribution, with only the first 21 ranked vocabularies being reused by more than ten other vocabularies.

Although the reuse of one ontology by another is undoubtedly useful information, it is only one example of reuse. Elements are also reused in large data sets, or may be used to encode information on millions of sites across the web. Linked Open Vocabularies provides information about the number of data sets a vocabulary appears in (voaf:reusedByDatasets), and the number of times a term occurs in data sets (voaf:occurrencesInDatasets), taking the information from LODstats (http://stats.lod2.eu). LODstats contains information on data sets in the DataHub (https://datahub.io), but this is just a small section of the semantic web. There are many differences between the top classes and properties as viewed through LODstats and as seen on the semantic web. This can be seen in Bizer et al.’s (2013) finding that Facebook’s OGP accounted for six of the most frequently used classes, but it is not identified as being used by any linked open data sets on LOV. There are also differences with vocab.cc (http://vocab.cc), which provides a similar service to LODstats, enabling people to search for and find statistics on linked data vocabularies. The difference, however, is that vocab.cc is based on information from the Billion Triple Challenge Dataset from the semantic web (Stadtmüller, Harth and Grobelnik, 2013).

Services such as LODstats and vocab.cc nonetheless provide limited functionality. It is possible to see the top-ranking properties on the particular data set they have indexed, but further analysis is more limited: understanding the use of ontologies within a particular community, or the sites that are using a particular ontology, is best realized through the use of a semantic web crawl or crawler.

Semantic web crawls and crawlers

Scientometrics refers to studies into ‘all quantitative aspects of the science of science, communication in science, and science policy’ (Hood and Wilson, 2001, 293), and the semantic web potentially provides a particularly useful resource for scientometric investigations. Widespread adoption of particular terms or ontologies may provide insights into the growth of scientific domains or ideas in different organizations or geographic regions, whilst the adoption of the ontologies themselves may provide evidence of impact for research evaluation purposes.

Such studies, however, are likely to require a finer level of detail than is provided by simple web services. For example, for scientometric purposes it is probably not sufficient to know that a particular ontology has been reused x number of times within a particular set of triples; rather, it is necessary to know the pages where the ontologies are reused, so that use may be aggregated in different ways and investigations of how an ontology is used can be carried out. It should also be possible to investigate which ontologies are used together, to gain further insights into the relationships between different domains, and be able to specify which resources are included in an analysis (e.g., analysing use of ontologies in the UK’s academic domain or on a particular website) with a clear understanding of how the data has been gathered. Such detailed information can either be gathered through the use of a semantic web crawler, or by making use of a publicly available crawl data file.

Web Data Commons – RDFa, Microdata and Microformat Data Set

Web Data Commons (http://webdatacommons.org) at the University of Mannheim provides access to semantic content that is embedded within the billions of web pages that form the Common Crawl data set, a crawl of the web that has been made freely available online (http://commoncrawl.org). The Common Crawl data set and the Web Data Commons data set have significantly reduced the resources necessary for a researcher to investigate the nature of structured data embedded in web pages, and they have formed the basis of a number of studies (e.g., Bizer et al., 2013; Luczak-Rosch and Tolksdorf, 2013; Wylot, Cudré-Mauroux and Groth, 2015).

Web Data Commons provides access to the embedded semantic content in the N-Quads format. This is like the N-Triples format, but each statement additionally carries the URI of the web page where the data was found. In total the December 2014 data set consisted of 20 billion such quads. These quads are available in compressed text files listed according to format: RDFa, Microdata, and separate lists for each of the microformats and schema.org classes. As would be expected, the amount of data available for each format and class varies considerably. For example, whereas there are over 287 million entities listed as having the type http://schema.org/Product, there are only seven having the type http://schema.org/SeaBodyOfWater.

The total size of the 2014 Web Data Commons crawl is, even when compressed, 64 terabytes. Although it may require significant computational resources, 20 billion triples may nonetheless be put into an appropriate triplestore. In fact, over a trillion triples have been placed and queried in a triplestore (Oracle, 2014), and even TDB, a triplestore that can be used to store RDF on a single desktop, can store 1.7 billion triples.
Even people with relatively modest computing power and programming knowledge may download the files, extract those elements they are interested in, and, if the result amounts to fewer than 1.7 billion triples, analyse it through a SPARQL endpoint on their desktop.

By way of illustration, the http://schema.org/Book data set from the 2014 Web Data Commons crawl was ingested into a triplestore for further analysis. At 68,817,749 quads from 1982 web hosts, it is a far larger sample than could have been simply captured by a researcher on their own, but it is nonetheless small enough to be ingested into the TDB triplestore without requiring the extraction of a smaller sample. The http://schema.org/Book data set includes both instances of the schema:Book class and other data found on the same web pages. Table 6.1 shows the 15 most popular properties within the data set. Although it is, unsurprisingly, dominated by properties associated with the book class, the commercial nature of much of this data is expressed through the large number of properties associated with the schema:Offer class.

Table 6.1 The most common properties associated with schema:Book

Predicate                                     Count
www.w3.org/1999/02/22-rdf-syntax-ns#type      12437304
http://schema.org/Book/name                   7292118
http://schema.org/Book/author                 5472972
http://schema.org/Book/isbn                   4578597
http://schema.org/Book/url                    3443898
http://schema.org/Book/image                  3224493
http://schema.org/Book/offers                 2642211
http://schema.org/Book/description            2526446
http://schema.org/Offer/price                 2438747
http://schema.org/Offer/priceCurrency         2130707
http://schema.org/Offer/availability          1910342
http://schema.org/Offer/itemCondition         1765114
http://schema.org/Offer/url                   1574005
http://schema.org/Book/publisher              1343083
http://schema.org/Book/contributor            1284528
http://schema.org/Book/reviews                1204839

Of course, Web Data Commons will not necessarily have indexed all the data that a person may wish to analyse, in which case it may be necessary for researchers to capture the data they need with a web crawler themselves.
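A count of predicates such as that in Table 6.1 could be produced with a query along the following lines. This is a sketch under stated assumptions rather than the query actually used for the table: it assumes the N-Quads files have been loaded so that the URI of each source web page becomes a named graph (here bound to ?page), and the cut-off of 20 rows is arbitrary.

```sparql
# Count how often each predicate occurs across all named graphs.
# Assumes each source page URI from the N-Quads data is a named graph.
SELECT ?predicate (COUNT (*) AS ?count)
WHERE {
  GRAPH ?page {
    ?s ?predicate ?o .
  }
}
GROUP BY ?predicate
ORDER BY DESC(?count)
LIMIT 20
```

Because the page URI is retained as the graph name, the same pattern could in principle be grouped by ?page instead, which is the kind of page-level aggregation that the simple web services described earlier do not offer.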

Semantic web crawlers

A web crawler is a computer program that is designed to automatically download a set of documents from the web. Given a list of ‘seed’ URIs the web crawler will download each of the associated documents in turn, extract any embedded URIs from the documents, and add them to a list of documents to download, in an iterative manner. Web crawlers are often used by both researchers and search engines. Search engines use web crawlers to gather the information they need to index the web, whilst researchers may download a particular portion of the web for archiving or analysis. Such crawlers do not produce a comprehensive index of the web, however. They can be limited in three ways: a web crawler may have technical limits; a crawler owner may impose restrictions on a crawl; and there are topological limitations caused by the web itself.

Theoretically the technical limitations of a crawler are less of a problem when crawling the semantic web than they may have been in the past for the web of documents, as the semantic web is primarily designed to be read and understood by machines. Early web crawlers struggled with reading and extracting links from non-HTML documents on the web, such as Flash websites and PDF pages, whereas semantic content on the web is published on web pages that are designed to be machine-readable. Nevertheless, semantic standards change over time, and a crawler may not be able to extract all the various formats within which semantic content may be created.

A crawler owner may also impose restrictions on a web crawler, limiting a crawler to a certain number of requests or depth of crawl on a particular site or domain. This may be to achieve a broad crawl as quickly as possible, to respect the terms and conditions of a website or to prevent a crawler getting stuck in an infinite loop. Web Data Commons extracts structured data from the Common Crawl, which respects the robots exclusion protocol (allowing a website owner to exclude their site from being indexed) and limits the number of pages that can be downloaded from a website to a maximum of two per second.

The topology of the web means that a page on the web can only be indexed if its URI is known, or it is linked to by a page that is known. This means that ontologies and linked data published on an internationally well known and popular site such as www.bbc.co.uk are far more likely to be indexed than a personal FOAF page hosted on someone’s personal website. It also means that extremely large search engines, such as Google, may include a lot of RDF pages that may not be identified otherwise, and therefore may provide a useful starting point for RDF data collection (e.g., Ding et al., 2004).

There have been a number of studies where researchers have collected semantic data for themselves. Ashraf and Hussain (2013) and Ashraf, Hussain and Hussain (2013) crawled part of the semantic web to analyse the usage of different ontologies, compiling a list of seed URIs from the semantic search engines Swoogle (http://swoogle.umbc.edu) and (the now defunct) Sindice. Schmachtenberg, Bizer and Paulheim (2014) used a crawl to explore linked data practices in different domains, and Kämpgen, O’Riain and Harth (2015) used a crawler to collect data for testing a new interface to linked data. Most of the studies that have used a publicly available crawler (as opposed to developing one’s own) have used LDSpider (https://github.com/ldspider/ldspider). LDSpider is designed for both standalone RDF pages and embedded RDF, supporting the use of Any23 (http://any23.apache.org) to extract data from web pages. It offers both breadth-first and load-balancing crawling strategies, and comes in the form of a compact jar file that can be used from the command line (Isele et al., 2010).

Increasingly, however, there is a focus on the development of more focused crawlers, crawling relevant documents rather than everything. Anthelion (https://github.com/yahoo/anthelion) is a crawler designed to be more efficient in identifying data-rich HTML pages (Meusel, Mika and Blanco, 2014), and CRAWLER-LD is a crawler that searches for triplesets that include terms related to an initial set of terms (Gomes et al., 2014). Other crawlers are designed to deal with a specific problem; for example, OPAC (Ontology Property-based Adaptive Crawler for Linked Data) is designed for dealing with the problem of recrawling data to ensure an index is up-to-date by considering frequency of changes at both the document and entity level (An et al., 2013).

Conclusion ere are a wide range of tools available for exploring ontologies, and the semantic web more widely, although many of them continue to be aimed at computer scientists or professional ontologists rather than the casual user. e appropriateness of different tools will depend heavily on the size of the ontology, its complexity and the associated documentation, as well as the number of classes and properties. Where a simple ontology of one or two classes, and a dozen or so associated properties, has been published as an explicit OWL document, then the easy way to explore it may be within a text editor (e.g., Notepad for Windows, or TextEdit for Apple). For more complicated ontologies, however, a text editor might provide little help, especially if the structure has not been explicitly stated anywhere, and visualization functionality will be important for understanding the different relationships. As is so oen the case, it is about identifying the right tool for the job, and unfortunately that only comes with time. As a rule of thumb, however, don’t do anything more complicated than necessary.


CHAPTER 7

The future of ontologies and the information professional

Introduction

In summarizing the law of the instrument Kaplan (1964, 28) wrote ‘Give a small boy a hammer, and he will find that everything he encounters needs pounding’ and such a law applies equally to the application of ontologies. By this stage of the book, if it has achieved its aim of demonstrating the value of ontologies, readers should see the potential of developing and applying ontologies in an ever-increasing number of situations in their working life and possibly even their personal life. This will, undoubtedly, include scenarios where other technologies may be more appropriate, or where the formality of the required ontology demands a skill level beyond that of the information professional. This final chapter tries to rein in the sprawling potential of ontologies by considering the future of ontologies within the context of the original aims of this work, considering:

• the future of ontologies for knowledge discovery
• the future contribution of information professionals to the development of ontologies
• the practical development of ontologies.

The adoption (or non-adoption) of any technology is not based on the intrinsic qualities of the technology alone, but is reflective of a wide variety of factors in the associated ecosystem, and the widespread adoption and development of ontologies, and the nature of these ontologies, will depend on competing technologies and culture as well as limitations of the ontologies themselves. Inevitably such a chapter raises more questions than it provides answers.

The future of ontologies for knowledge discovery

Within the information ecosystem ontologies must compete for territory with alternative technologies and approaches, and the innate value of a technology does not necessarily lead to its domination. For example, the value of controlled vocabularies is well known amongst the community of information professionals, but controlled vocabularies are by no means the only or even the principal way that people retrieve information. For the majority of information enquiries, that is those typed by the average user into the search box of Google.com, the power of full-text indexing surpasses the value of controlled vocabularies. In much the same way, the power of formal structured ontologies and knowledge bases sits alongside the increasingly powerful natural language processing (NLP) tools that develop by brute force what could not be achieved through human efforts alone.

As with a controlled vocabulary for information retrieval, the future of knowledge discovery is not likely to be one or the other, but both, and on occasion a combination of the two. Whilst NLP and machine learning can have an important role in populating a knowledge base, there is nonetheless added value in the reduction in ambiguity that may be achieved by having a human (preferably working in conjunction with the producer of a data set) explicitly encoding knowledge or information. In comparison to the painstaking process of constructing an ontology associated with a particular domain, NLP seems to offer a simple solution, but as van Hooland and Verborgh (2014, 172) have pointed out, ‘Because these services can be applied in a quick and low-cost manner, we should not assume they offer any added value.’

Returning to Swanson’s (1986) three forms of undiscovered public knowledge, it is easy to see the challenges that remain in the automated discovery of undiscovered public knowledge that would undoubtedly be eased if some of that knowledge were appropriately encoded by a person, either in a single knowledge base or across the semantic web. Whether considering hidden refutations (where a hypothesis and its refutation are known but not to the same person), missing links (where consecutive logical links are not known to the same person), or combining multiple weak tests to produce a strong result, much of the knowledge is currently contained within the freeform text of books and journal articles across different domains using different styles and languages. If documents are to be subjected to NLP for the creation of an ontology, then a crowdsourced approach to cleaning a proposed draft ontology is likely to be a prerequisite to ensuring the value of the ontology; whether such a crowdsourced approach is suitable as the number of ontologies increases remains to be seen.

It is therefore important to recognize the limits imposed by what has been encoded. Irrespective of whether the future of knowledge discovery makes use of the brute force of NLP, ontologies, or a combination of the two, it can only provide answers to what has been encoded. Ontologies are imperfect partial representations of the world, rather than reflecting the world’s true complexity (Davis, Shrobe and Szolovits, 1993), and it is important to understand what is lost in the encoding of knowledge. As has been noted, ‘data’ would have been better named ‘capta’, as it would have been more appropriately derived from the Latin capere (‘to take’) rather than dare (‘to give’) (Kitchin, 2014). Data collection involves numerous decisions on what data to capture and how to structure it, and having an understanding of the various ways data can be structured, and how other similar data has been structured, will undoubtedly impact the potential value of the data.

In the short term, when the focus may be more on the encoding of metadata and information retrieval rather than the encoding of knowledge itself, what is not encoded is more obvious. As knowledge itself becomes encoded, however, what is lost as something is broken into its constituent parts may be less clear. Take, for example, the encoding of a narrative work, such as the Bible. On the one hand an ontology may provide an exciting new method for exploring the roles and relationships of different concepts and entities; on the other, the question remains as to how much is lost as the coherent narrative whole is reduced to explicit triples. Is there a loss of tacit knowledge? What about opportunities for serendipitous discovery and skim reading? What about exegesis, the critical reading and interpretation of a text?

It is also important to remember that when something is encoded, it is encoded with a particular world view. Even deciding what is to be encoded and what isn’t has implications for the representation of marginalized voices and perspectives. Of course this has always been the case, from the first manuscripts to the web of documents, but encoded data can have an air of scientific objectivity about it, even when the decisions or encodings are anything but scientific.
Although there is undoubtedly huge potential for the common sharing of ontologies, the future of knowledge discovery, at least in the immediate future, seems likely to be focused on the prosaic. Undoubtedly aggregating the commonplace has the potential to provide numerous novel insights, from public opinion on genetically modified crops (Thelwall, 2009) to the spread of diseases (Eysenbach, 2006), but such insights are tangential to the primary purpose of many of the dominant data companies such as Google and Facebook, where some of the best minds in the world are currently trying to increase the rate with which people click on advertisements rather than searching for new knowledge. It is important that we move beyond the limited ontologies of the commercial sector, and for that to happen there is an important role for the library and information professional.

The future role of library and information professionals

The role of the library and information professional as double expert was emphasized in the very first chapter of this book, and the library and information professional reaching this stage of the book should recognize the potential of an increasingly semantic web, the importance of ontologies in the sharing of data and, most importantly, the natural extension of the library and information professionals’ work in the development and application of ontologies.

Of course library and information professionals are not the only people who may be involved with the development of ontologies: there are already many with a background in knowledge representation and ontology engineering from the field of computer science who will continue to make important contributions to ontology development. However, as has already been seen in the adoption of many computing technologies, what was once the preserve of computer scientists soon becomes more widely embedded as new user-friendly tools and technologies emerge. Inevitably the role of information professionals in the development of ontologies will depend heavily on the nature of ontologies themselves: for example, whether they are increasingly formal or informal, whether they make heavy use of NLP technologies, and how the interfaces are built for users to interact with them. Nonetheless, library and information professionals are ideally placed to acquire many of the necessary ontology development skills, and those who do not wish to develop even the most informal of ontologies themselves can nonetheless contribute to a more semantic web, and will increasingly be expected to be able to do so.

Using ontologies

Undoubtedly the primary way that library and information professionals will engage with ontologies will be through the use of ontologies that already exist, in the same way that they reuse other forms of knowledge organization systems rather than create their own all the time. Ontologies will be increasingly embedded within many of the traditional roles of the information professional, such as cataloguing, classifying and indexing, as well as newer roles such as supporting the publication of data and gathering new indicators of impact.

Cataloguing, classifying and indexing have long been core activities within the information profession, and whether or not information professionals ever get their hands dirty with raw RDF or OWL, they will undoubtedly come into contact with some of the many bibliographic ontologies that have been developed for use in bibliographic cataloguing. In many cases this will be through RDA, which is increasingly being implemented in library catalogues throughout the world, even if it is not always being utilized to its fullest extent; for example, the Bodleian Library, Oxford has implemented RDA within a traditional MARC structure rather than according to the FRBR model (O’Reilly, 2013). However, as more FRBR catalogues are created and installed RDA will undoubtedly be used to a fuller extent, and more services will be built on top of the data that is made available. The focus of the library cataloguer will not be limited to the catalogue data, however, but will include linking this information to the other resources on the web, where a range of data will exist in a multitude of formats. As well as many bibliographic ontologies, there is also a wide range of domain-specific ontologies that may be used to provide index terms, terms with a far richer set of relationships than simple keywords.

Ontologies are also likely to be an increasingly important part of some of the growth areas of the information profession. In recent years the remit of many libraries has expanded to include responsibility for an organization’s institutional repository and, increasingly, for measuring the impact of an organization’s research outputs through bibliometrics and a wide range of altmetrics. Both will once again bring the information professional into contact with a wide range of ontologies and structured data. Institutional repositories are not only hosting digital versions of traditional publications but also outputs that had limited opportunity for sharing and accessing before the web era, such as computer code and data sets. Although the lack of sharing of research data has been described as the ‘dirty little secret’ of open science data promotion (Borgman, 2012, 1059), its potential both to increase the robustness of science and to allow valuable new research to make use of existing research data has meant that there is growing support for open data amongst many of the actors involved within the research ecosystem, including policy makers, funders, publishers, and researchers themselves.
It is important to recognize, however, that there is a wide variety of data that is collected in the research process, from small social science surveys at one end of the spectrum to the petabytes of data created by the Large Hadron Collider (Brumfiel, 2011). Some of this data may never be collected under the same conditions again (e.g., weather data and social surveys), whilst other data may be prohibitively expensive for another set of researchers to recapture (e.g., data from the Large Hadron Collider). Best practice in terms of which data needs to be published and the best ways of publishing it is still to emerge. There is an increasing range of options available as new data repositories and publications emerge and explicit policies are made regarding the publication of data accompanying published research, although the difficulty in getting authors to comply with these requirements has been highlighted (van Noorden, 2014). There is a great variety in the ways data can be published and not all data should be directly encoded within an ontology or knowledge base, although ontologies will nonetheless have an important role to play in making sure the data is accessible through the creation of metadata and in the cataloguing process.

In many cases, however, data will be suitable for publishing as linked data, and whilst some large research projects will include experienced computer or information scientists taking responsibility for publishing the data, in many cases the responsibility will fall as a secondary task to researchers with limited experience. It has been suggested that information professionals should play a more active role in the research process, not merely publishing the data that has already been captured at the end of a project but contributing to discussions about the publishing of research data from the start (Stuart, 2011). The potential need for assistance in using even a simple ontology can be seen with SKOS, where even the terms skos:broader and skos:narrower can create confusion as to whether the subject or object of a triple is the broader or narrower term; more formal ontologies incorporating disjunctions, negations and functional properties may quickly leave both new and established users of ontologies confused (Warren et al., 2015). Even if it is possible to introduce semantically rich information later in the research process, it might take longer and be more difficult to get researcher buy-in when the project is already finished. Although contributing to discussions about the publishing of data from the start of a project may require cultural changes in many institutions, there will undoubtedly be benefits in terms of the quality and usability of the data, and the growth in the amount of data that will need to be published in the future is likely to necessitate a more formalized approach to the process.

Irrespective of whether information professionals wish to actively contribute to the development of ontologies, they are likely to have little choice about using ontologies in the future. Therefore understanding how they are constructed, and the added value that comes from a network of concepts as against a list of terms, is essential if information professionals are to do their job properly. Of course, such an understanding is likely to be a minimum; the conscientious information professional has far more to add to the development of ontologies.
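
The confusion over the direction of skos:broader and skos:narrower mentioned above can be illustrated with a small sketch. In SKOS, a triple of the form "A skos:broader B" states that B is the broader concept of A, so the object is the broader term; with skos:narrower the subject is the broader term. The example below is only an illustration in plain Python: the triples are represented as simple tuples rather than with an RDF library, and the ex: concept names and the broader_concepts helper are hypothetical.

```python
# Triples as (subject, predicate, object) tuples.
# The ex: concepts are hypothetical and used for illustration only.
SKOS_BROADER = "skos:broader"
SKOS_NARROWER = "skos:narrower"

triples = [
    ("ex:Poodle", SKOS_BROADER, "ex:Dog"),      # Dog is broader than Poodle
    ("ex:Dog", SKOS_BROADER, "ex:Mammal"),      # Mammal is broader than Dog
    ("ex:Mammal", SKOS_NARROWER, "ex:Dog"),     # the same fact, stated from the other side
]


def broader_concepts(concept, triples):
    """Return the concepts directly asserted to be broader than `concept`."""
    found = set()
    for subject, predicate, obj in triples:
        if predicate == SKOS_BROADER and subject == concept:
            # "subject skos:broader obj": the OBJECT is the broader concept.
            found.add(obj)
        elif predicate == SKOS_NARROWER and obj == concept:
            # "subject skos:narrower obj": the SUBJECT is the broader concept.
            found.add(subject)
    return found


if __name__ == "__main__":
    for concept in ("ex:Poodle", "ex:Dog", "ex:Mammal"):
        print(concept, "has broader:", sorted(broader_concepts(concept, triples)))
```

The helper only follows directly asserted links; it does not infer that ex:Mammal is an ancestor of ex:Poodle, which reflects the fact that skos:broader itself is not a transitive property.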

Building ontologies

As well as making use of existing ontologies in their current form, there are likely to be increasing opportunities for the library and information professional to contribute to the development of ontologies, either through the support and development of existing ontologies or the creation of new ontologies. As the number of ontologies increases in the future, so will the need for information professionals to contribute to their development. Those ontologies that are available today have been developed with a wide variety of funding models and opportunities for contributions, and such diversity in ontology publication may be expected to remain in the future.

Most existing ontologies need to be actively developed, or at least need to be maintained; ontologies need to reflect the current knowledge in a field, which is regularly being revised, and if this level of support is not possible it is nonetheless necessary to maintain an ontology so that it continues to be accessible to others in its original state. Some ontologies will have dynamic user groups that information professionals may choose to contribute to; others may require information professionals (or more often their institution) to take more responsibility for the development and sustainability of an ontology created within an organization, so that those ontologies created in association with a short-term research project have some minimal support in the long term. At a minimum this may involve taking responsibility for the publication of the ontology under a suitable licence, easing the continued development of the ontology. If, however, an ontology aligns with an information service’s core subjects and activities, the information service may decide to take a more active role in moderating the development of the ontology. Those information professionals within commercial organizations and special libraries will have a role to play in the development of ontologies to meet the needs of their organizations, whilst it is equally important that information professionals aligned with public and academic libraries are involved in the development of ontologies that have a wider application than that of a single organization. As Facebook’s Open Graph Protocol demonstrates, a single commercial organization can be very successful in the promotion of one particular ontology, but that doesn’t mean that the ontology is particularly well designed or widely applicable. Nonetheless such an ontology may be widely used because it is available and requires no development work.
All information professionals, but especially those in public and academic libraries, need to critically engage with these ontologies, to recognize their limitations and what they fail to capture. As well as extending and contributing to existing ontologies, there is also a role for the information professional in developing new ontologies. As discussed below, it seems highly unlikely that there will ever be a single ontology to rule them all; there will always be the need for new ontologies reflecting new objects, as well as new perspectives on existing objects. Although ontologies are likely to become increasingly important to a greater proportion of the population, we cannot expect that the skills necessary to develop such ontologies will become equally widespread. The library and information profession, with its traditional experience of mediating between information resources and users, is ideally placed to facilitate the development of new ontologies from a wide range of fields. This is especially important if the ontologies are to reflect the needs of the fields, and not be a quick solution to a problem from the nearest data model to hand.

There are plenty of opportunities for the information professional in the development of ontologies, and as the value of ontologies and linked data is increasingly recognized the information professional wishing to enhance his or her CV would undoubtedly benefit from demonstrating a sustained contribution to the development of an ontology or two. There are also challenges to be overcome, both cultural and technical. Although there is increasing recognition of the importance of data, and the library catalogue is increasingly becoming FRBR-ized and integrated into the web of data, the library profession has nonetheless developed around the concept of the document and more established traditions of authority. There is also a wide range of technical skills within the information profession, from those who consider basic programming skills increasingly essential for the information professional at the beginning of the 21st century to those whose interest in the profession revolves primarily around traditional printed materials, and the widespread contribution of information professionals to ontology development will undoubtedly require new, more accessible tools for building ontologies, ontology alignment and querying data sets. Users also come with a wide range of technical abilities, and many won’t have the technical knowledge necessary for interfacing with a SPARQL endpoint. One way to overcome this problem will be through the use of natural language interfaces (e.g., Paredes-Valverde et al., 2015).

The practical development of ontologies

Along with questions about the future role of ontologies in knowledge discovery, and the role of the library and information professional in the development of ontologies, questions still remain about the nature of ontologies in the future. Will the most important ontologies be formal or informal? Will a single player or upper ontology dominate the future of ontology development? Will the ontologies that are developed be primarily open or closed?

There are vast differences in the formality of the ontologies that have been discussed within this book, from the very formal or heavy, such as the Gene Ontology, to the very informal or lightweight, such as the Open Graph Protocol. The original vision of the semantic web was in many ways far richer than its current incarnation as linked data, and rather than the semantic web becoming increasingly mainstream, it could be argued that it has split in two. On the one hand there are the widely adopted but informal ontologies that enable data to be encoded and gathered from across the web; on the other there are the rich situation-specific ontologies that enable inferences to be made and new knowledge discovered within small domains of knowledge. Van Hooland and Verborgh (2014, 110–11) have suggested that ‘time will tell whether large-scale initiatives such as Facebook’s Open Graph, Google’s Knowledge Graph, and projects such as schema.org will find the sweet spot between elegant but too ambitious semantics and the brutal force of full-text indexing’. Rather than an end in themselves, however, these ontologies should be seen as a small stepping stone in the direction of a semantic web, to which additional semantics may be added at a later date.

The question as to whether any single ontology is in a position to dominate the web is associated with that of whether ontologies will be primarily formal or informal. After all, if schema.org or Basic Formal Ontology dominated future ontologies, there would be a tendency for ontologies to be either informal or formal respectively, and the monopoly of either would not be particularly welcomed. Arp, Smith and Spear (2015, xvii) point out that ‘the very success of ontology-based approaches to the integration of data has led to a multiplication of ontologies in ways that threaten to create a new form of the very problems of interoperability that ontologies were themselves designed to solve’. However, there are very different ontology requirements from different communities of users, and this would weigh against the chance of there being a single dominant ontology or even approach to ontologies. Whilst a common upper ontology may enable the alignment of knowledge from different fields, we should be wary of being prescriptive in the adoption of any particular ontology, as they all limit what may be expressed to a certain extent.

As to whether the future of ontologies will be open or closed, the move towards increasingly open resources often appears to have an unstoppable momentum at present. Those in favour of closed access are often perceived as being on the wrong side of history; however, history is rarely inevitable.
This book has focused primarily on the open ontologies, as they are primarily the ones which information professionals will come into contact with; nevertheless, there are also many closed ontologies. These include both enterprise ontologies providing access to resources on private intranets and vast public resources such as Google’s Knowledge Vault. However, whilst there is undoubtedly a need for shared public ontologies, there is also a need to find models that make the development of ontologies financially sustainable. For that, one suggestion has been the provision of ontologies as a service, with a sub-ontology extracted from a main ontology and then extended for a particular domain (Flahive, Taniar and Rahayu, 2015). In practice each of the above questions puts forward a false dichotomy. It is not a choice of open or closed, formal or informal, or centralized or distributed ontology development; rather, ontologies will continue to come in a wide variety of forms, and information professionals will have the opportunity to work in those areas that most closely align with their philosophies and skills. As Allemang and Hendler (2011) point out, the semantic web is designed to support multiple solutions.

Conclusion e promise of the semantic web has yet to be realized in the manner described in the 2001 articles. at computers don’t understand most web content becomes quickly apparent whenever we use the web. Search engines still primarily return pages rather than answers, and many of those pages are false drops. Comparison services oen fail to be as comprehensive or as detailed as we wish, and multiple comparison sites oen need to be visited for the same transaction, whilst most of us would still not have sufficient trust in the semantic web to have agents carry out tasks on our behalf, even for something as simple comparing prices and buying a book. However, that doesn’t mean we haven’t gained a more semantic web, or there isn’t more semantic enrichment within organizations, and this looks set to continue in the future. e way we represent, publish, and retrieve data will continue to change as new technologies emerge and although new technologies oen promise to overturn older technologies and previous ways of doing things, more oen they find a niche alongside existing technologies. Ontologies won’t replace older forms of controlled vocabulary at any point in the near future, but neither will ontologies be replaced by natural language processing. ey are just one of a wide variety of tools that promise to enable new insights to be gleaned from the vast quantities of data that are increasingly being created. Many information challenges are posed by the coming of increasingly semantically rich data sets enabled by a web of data and an internet of things, not least with regard to our notions of privacy in a world of joined-up data that is exemplified by ontologies. Such challenges make it essential that library and information professionals play their part.

Bibliography

Ackoff, R. L. (1989) From Data to Wisdom, Journal of Applied Systems Analysis, 16, 3–9.
Ahmed, E. B., Tebourski, W., Karaa, W. B. A. and Gargouri, F. (2014) ONTOSSN: Scientific Social Network Ontology. In Jo, J. Y. and Takahashi, S. (eds), 2014 IEEE/ACIS 15th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, IEEE Computer Society, 25–28.
Alexander, K. (2008) RDF/JSON: a specification for serializing RDF in JSON, Talis Information Ltd, http://ceur-ws.org/Vol-368/paper16.pdf.
Allemang, D. and Hendler, J. (2011) Semantic Web for the Working Ontologist: effective modeling in RDFS and OWL, 2nd edn, Morgan Kaufmann.
An, J., Kim, Y., Lee, M. and Lee, Y. (2013) Ontology Property-based Adaptive Crawler for Linked Data (OPAC). In Fourth International Conference on the Network of the Future: October 23–25, 2013, IEEE.
Arbesman, S. (2012) The Half-Life of Facts, Current.
Arp, R., Smith, B. and Spear, A. D. (2015) Building Ontologies with Basic Formal Ontology, MIT Press.
Ashraf, J. and Hussain, O. K. (2013) Ontology Usage Network Analysis Framework. In Ishikawa, Y., Li, J., Wang, W., Zhang, R. and Zhang, W. (eds), Web Technologies and Application: proceedings of the 15th Asia-Pacific Web Conference, APWeb 2013, Sydney, Australia, Lecture Notes in Computer Science 7808, Springer, 19–30.
Ashraf, J., Hussain, O. K. and Hussain, F. H. (2013) A Framework for Measuring Ontology Usage on the Web, The Computer Journal, 56 (9), 1083–1101.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R. and Ives, Z. (2007) DBpedia: a nucleus for a web of open data. In Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L., Goldbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G. and Cudré-Mauroux, P. (eds), The Semantic Web: proceedings of the 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11–15, 2007, Lecture Notes in Computer Science 4825, Springer, 722–35.

Auer, S., Dietzold, S., Lehmann, J., Hellmann, S. and Aumueller, D. (2009) Triplify: lightweight linked data publication from relational databases. In WWW ’09: proceedings of the 18th International Conference on World Wide Web, 621–30.
Baker, T. (2012) Libraries, Languages of Description, and Linked Data: a Dublin Core perspective, Library Hi Tech, 30 (1), 116–33.
Berners-Lee, T. (2006) Linked Data, www.w3.org/DesignIssues/LinkedData.html.
Berners-Lee, T. (2007) Giant Global Graph, timbl’s blog, http://dig.csail.mit.edu/breadcrumbs/node/215.
Berners-Lee, T. and Hendler, J. (2001) Publishing on the Semantic Web, Nature, 26 April, 1023–5.
Berners-Lee, T., Hendler, J. and Lassila, O. (2001) The Semantic Web, Scientific American, May, 29–37.
Bizer, C., Eckert, K., Meusel, R., Mühleisen, H., Schuhmacher, M. and Völker, J. (2013) Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis. In Alani, H., Kagal, L., Fokoue, A., Groth, P., Biemann, C., Parreira, X., Aroyo, L., Noy, N., Welty, C. and Janowicz, K. (eds), The Semantic Web – ISWC 2013, Part II, Lecture Notes in Computer Science 8219, Springer, 17–32.
Borgman, C. L. (2012) The Conundrum of Sharing Research Data, Journal of the American Society for Information Science and Technology, 63 (6), 1059–78.
Borst, W. M. (1997) Construction of Engineering Ontologies for Knowledge Sharing and Reuse, SIKS, the Dutch Graduate School for Information and Knowledge Systems, http://eprints.eemcs.utwente.nl/17377/01/t0000004.pdf.
Bowker (2014) Self-Publishing in the United States, 2008–2013: print and ebook, www.bowker.com/assets/downloads/products/bowker_selfpublishing_report2013.pdf.
Brank, J., Grobelnik, M. and Mladenić, D. (2005) A Survey of Ontology Evaluation Techniques. In Proceedings of the Conference on Data Mining and Data Warehouses, http://ailab.ijs.si/dunja/sikdd2005/papers/BrankEvaluationSiKDD2005.pdf.
Brewster, C. and O’Hara, K. (2007) Knowledge Representation with Ontologies: present challenges – future possibilities, International Journal of Human-Computer Studies, 65 (7), 563–8.
Brumfiel, G. (2011) Down the Petabyte Highway, Nature, 20 January, 282–3.
Cabinet Office (2013) G8 Open Data Charter and Technical Annex, www.gov.uk/government/publications/open-data-charter/g8-open-data-charter-andtechnical-annex.
CENDARI (2011) Guidelines for Ontology Building, www.cendari.eu/sites/default/files/CENDARI%20_D6.3%20Guidelines%20for%20Ontology%20Building.pdf.
Ceusters, W. (2011) An Information Artifact Ontology Perspective on Data Collections and Associated Representational Artifacts. In Mantas, J., Andersen, S. K., Mazzoleni, M. C., Blobel, B., Quaglini, S. and Moen, A. (eds), Quality of Life through Quality of Information, IOS Press.

Chandramouli, K., Stewart, C., Brailsford, T. and Izquierdo, E. (2008) CAE-L: an ontology modeling cultural behavior in adaptive education. In Mylonas, P., Wallace, M. and Angelidas, M. (eds), Proceedings – 3rd International Workshop on Semantic Media Adaptation and Personalization, SMAP 2008, IEEE Computer Society, 183–8.
Chen, R.-C., Huang, Y.-H., Bau, C.-T. and Chen, S.-M. (2012) A Recommendation System Based on Domain Ontology and SWRL for Anti-Diabetic Drugs Selection, Expert Systems with Applications, 39 (4), 3995–4006.
Chen, X. (2014) Open Access in 2013: reaching the 50% milestone, Serials Review, 40 (1), 21–7.
Cheng, G., Ji, F., Luo, S., Ge, W. and Qu, Y. (2011) BipRank: ranking and summarizing RDF vocabulary descriptions. In Pan, J. Z., Chen, H., Kim, H.-G., Li, J., Wu, Z., Horrocks, I., Mizoguchi, R. and Wu, Z. (eds), The Semantic Web: proceedings of the Joint International Semantic Technology Conference, JIST 2011, Hangzhou, China, December 4–11, 2011, Lecture Notes in Computer Science 7185, Springer, 226–41.
Chowdhury, G. G. (2003) Natural Language Processing. In Cronin, B. (ed.) Annual Review of Information Science and Technology, 37, 51–89.
Chowdhury, G. G. (2014) Sustainability of Scholarly Communication, Facet Publishing.
Coffey, J. W., Cañas, A. J., Hill, G., Carff, R., Reichherzer, T. and Suri, N. (2003) Knowledge Modeling and the Creation of El-Tech: a performance support system for electronic technicians, Expert Systems with Applications, 25 (4), 483–92.
Coffey, J. W., Hoffman, R. R., Cañas, A. J. and Ford, K. M. (2002) A Concept-Map Based Knowledge Modeling Approach to Expert Knowledge Sharing. In Boumedine, M. (ed.), Proceedings of IKS 2002 – the IASTED international conference on information and knowledge sharing, Acta Press, 212–17.
Cooke, N. J. (1994) Varieties of Knowledge Elicitation Techniques, International Journal of Human-Computer Studies, 41 (6), 801–49.
Corcho, O. (2006) Ontology Based Documentation Annotation: trends and open research problems, International Journal of Metadata, Semantics and Ontologies, 1 (1), 47–57.
Dalwadi, N., Nagar, B. and Makwana, A. (2012) Semantic Web and Comparative Analysis of Inference Engines, International Journal of Computer Science and Information Technologies, 3 (3), 3843–7.
Danial-Saad, A., Kuflik, T., Tamar Weiss, P. L. and Schreuer, N. (2013) Building an Ontology for Assistive Technology Using the Delphi Method, Disability and Rehabilitation: assistive technology, 8 (4), 275–86.
Davis, R., Shrobe, H. and Szolovits, P. (1993) What is Knowledge Representation?, AI Magazine, 14 (1), 17–33.
Derbentseva, N., Safayeni, F. and Cañas, A. J. (2007) Concept Maps: experiments on dynamic thinking, Journal of Research in Science Teaching, 44 (3), 448–65.
Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R. S., Peng, Y., Reddivari, P., Doshi, V. and Sachs, J. (2004) Swoogle: a search and metadata engine for the semantic web. In CIKM ’04: proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, ACM Press, 652–9.
Ding, L., Zhou, L., Finin, T. and Joshi, A. (2005) How the Semantic Web is Being Used: an analysis of FOAF documents. In HICSS ’05: proceedings of the 38th Hawaii International Conference on System Sciences, Vol. 4, IEEE Computer Society, 113c.
Doerr, M. (2011) Harmonized EDM-CRM-FRBRoo, http://cidoc-crm.org/docs/EDM-DC-ORE-CRM-FRBR_Integration_ORE_fix.ppt.
Ducharme, B. (2013) Learning SPARQL, 2nd edn, O’Reilly Media.
Euzenat, J. and Shvaiko, P. (2013) Ontology Matching, 2nd edn, Springer.
Eysenbach, G. (2006) Infodemiology: tracking flu-related searches on the web for syndromic surveillance. In American Medical Informatics Association Annual Symposium Proceedings, 244–8, www.ncbi.nlm.nih.gov/pmc/articles/PMC1839505.
Facebook (2015) Company Info, http://newsroom.fb.com/company-info.
Faria, D., Pesquita, C., Santos, E., Palmonari, M., Cruz, I. F. and Couto, F. M. (2013) The AgreementMakerLight Ontology Matching System. In Meersman, R. et al. (eds), OTM 2013, Lecture Notes in Computer Science 8185, Springer, 527–41.
Fernández-López, M., Gómez-Pérez, A. and Juristo, N. (1997) METHONTOLOGY: from ontological art towards ontological engineering. In Farquhar, A. and Grüninger, M. (eds), Ontological Engineering: papers from the AAAI Spring Symposium, AAAI Press, 33–40.
Flahive, A., Taniar, D. and Rahayu, W. (2015) Ontology as a Service (OaaS): extending subontologies on the cloud, Concurrency and Computation: practice and experience, 27 (8), 2028–40.
Fransson, M. N., Rial-Sebbag, E., Brochhausen, M. and Litton, J.-E. (2015) Toward a common language for biobanking, European Journal of Human Genetics, 23 (1), 22–8.
Franz Inc. (2015) Overview of RDFS++, http://franz.com/agraph/support/learning/Overview-of-RDFS++.lhtml.
Fürber, C. and Hepp, M. (2010) Using SPARQL and SPIN for Data Quality Management on the Semantic Web. In Abramowicz, W. and Tolksdorf, R. (eds), Business Information Systems: proceedings of the 13th International Conference, BIS 2010, Berlin, Germany, May 3–5, 2010, Lecture Notes in Business Information Processing 47, Springer, 35–46.
Gavrilova, T. and Andreeva, T. (2012) Knowledge Elicitation Techniques in a Knowledge Management Context, Journal of Knowledge Management, 16 (4), 523–37.
Golbeck, J. and Rothstein, M. (2008) Linking Social Networks on the Web with FOAF: a semantic web case study. In Fox, D. and Gomes, C. P. (eds) Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI Press, 1138–43.
Gomes, R. do V. A., Casanova, M. A., Lopes, G. R. and Leme, L. A. P. P. (2014) CRAWLER-LD: a multilevel metadata focused crawler framework for linked data. In Cordeiro, J., Hammoudi, S., Maciaszek, L., Camp, O. and Filipe, J. (eds), Enterprise Information Systems: 16th international conference, ICEIS 2014, Lisbon, Portugal, April 27–30, 2014, revised selected papers, Lecture Notes in Business Information Processing 227, Springer, 302–19.
Gómez-Pérez, A. and Corcho, O. (2002) Ontology Languages for the Semantic Web, IEEE Intelligent Systems, January/February 2002, 54–60.
González García, F. M. and Zuasti Urbano, J. (2008) The Running of the Bulls: a practical use of concept mapping to capture expert knowledge. In Proceedings of 3rd International Conference on Concept Mapping, Tallinn, Helsinki, 242–5.
Google (2011) Introducing schema.org: search engines come together for a richer web, http://googleblog.blogspot.co.uk/2011/06/introducing-schemaorg-search-engines.html.
Grau, B. C., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P. and Sattler, U. (2008) OWL 2: the next step for OWL, Journal of Web Semantics, 6 (4), 309–22.
Greenberg, J. and Méndez, E. (eds) (2012) Knitting the Semantic Web, Routledge.
Gruber, T. R. (1993) A Translation Approach to Portable Ontology Specifications, Knowledge Acquisition, 5 (2), 199–220.
Gruber, T. R. (1995) Toward Principles for the Design of Ontologies Used for Knowledge Sharing, International Journal of Human-Computer Studies, 43 (5–6), 907–28.
Gruber, T. R. (2009) Ontology. In Liu, L. and Özsu, M. T. (eds), The Encyclopedia of Database Systems, Springer.
Grüninger, M. and Fox, M. S. (1995) Methodology for the Design and Evaluation of Ontologies, Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI-95, Montreal, www.eil.utoronto.ca/wp-content/uploads/enterprise-modelling/papers/gruninger-ijcai95.pdf.
Harpring, P. (2013) Introduction to Controlled Vocabularies: terminology for art, architecture, and other cultural work, updated edn, Getty Publishing.
Hayden, E. C. (2014) Technology: the $1,000 genome, Nature, 20 March, 294–5.
Heath, T. and Bizer, C. (2011) Linked Data: evolving the web in a global data space, Morgan & Claypool.
Hedden, H. (2010) The Accidental Taxonomist, Information Today.
Heery, R. and Patel, M. (2000) Application Profiles: mixing and matching metadata schemas, Ariadne, 25, www.ariadne.ac.uk/issue25/app-profiles.
HEFCE (2015) The Metric Tide: report of the independent review of the role of metrics in research assessment and management, www.hefce.ac.uk/pubs/rereports/Year/2015/metrictide/Title,104463,en.html.
Hey, T., Tansley, S. and Tolle, K. (eds) (2009) The Fourth Paradigm: data-intensive scientific discovery, Microsoft Research, http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf.
Hodson, H. (2014) Google’s Fact-checking Bots Build Vast Knowledge Bank, New Scientist, 23 August, www.newscientist.com/article/mg22329832.700-googles-factchecking-botsbuild-vast-knowledge-bank.html.

Stuart_Practical ontologies_TEXT PROOF_04 28/07/2016 09:15 Page 170

170

pRACTICAL OnTOLOgIES

Hoffman, R. R. and Lintern, G. (2006) Eliciting and representing the knowledge of experts. In Ericsson, K. A., Charness, N., Feltovich, P. and Hoffman, R. (eds), Cambridge Handbook of Expertise and Expert Performance, Cambridge University Press, 203–22.
Hood, W. W. and Wilson, C. S. (2001) The Literature of Bibliometrics, Scientometrics, and Informetrics, Scientometrics, 52 (2), 291–314.
Horridge, M. (2011) A Practical Guide to Building OWL Ontologies Using Protégé 4 and CO-ODE tools, http://mowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf.
Idrees, H. (2012) Library Classification Systems and Organization of Islamic Knowledge, Library Resources & Technical Services, 56 (3), 171–82.
Instagram (2015) Stats, http://instagram.com/press. [Accessed 19 February 2015]
International Working Group on FRBR and CIDOC-CRM Harmonisation (2015) FRBR: object-orientated definition and mapping from FRBRER, FRAD and FRSAD (version 2.2), www.ifla.org/files/assets/cataloguing/frbr/frbroo_v2.2.pdf.
Isele, R., Umbrich, J., Bizer, C. and Harth, A. (2010) LDSpider: an open-source crawling framework for the web of linked data. In International Semantic Web Conference 2010, Shanghai, China, November 7–11, http://iswc2010.semanticweb.org/pdf/495.pdf.
Jakus, G., Milutinovic, V., Omerovic, S. and Tomažic, S. (2013) Concepts, Ontologies, and Knowledge Representation, Springer.
Jayadianti, H., Pinto, C. S., Nugrohu, L. E., Santosa, P. I. and Widayat, W. (2013) Leveraging Knowledge from Different Communities using Ontologies. In Rocha, Á., Correia, A. M., Wilson, T. and Stroetmann, K. A. (eds), Advances in Information Systems and Technologies, Advances in Intelligent Systems and Computing 206, Springer, 67–76.
Kalender, W. A. (2011) Computed Tomography: fundamentals, system technology, image quality, applications, 3rd edn, Wiley VCH.
Kalibatiene, D. and Vasilecas, O. (2011) Survey on Ontology Languages. In Grabis, J. and Kirikova, M. (eds), Perspectives in Business Informatics Research: 10th international conference, BIR 2011, Riga, Latvia, October 6–8, 2011, proceedings, Lecture Notes in Business Information Processing 90, Springer, 124–41.
Kämpgen, B., O’Riain, S. and Harth, A. (2015) Interacting with Statistical Linked Data via OLAP Operations. In Simperl, E., Norton, B., Mladenic, D., Valle, E. D., Fundulaki, I., Passant, A. and Troncy, R. (eds), The Semantic Web: ESWC 2012 Satellite Events, Heraklion, Crete, Greece, May 27–31, 2012, Lecture Notes in Computer Science 7540, Springer, 87–101.
Kaplan, A. (1964) The Conduct of Inquiry: methodology for behavioral science, Chandler Publishing.
Khan, W. A., Amin, M. B., Khattak, A. M., Hussain, M., Afzal, M., Lee, S. and Kim, E. S. (2015) Object Orientated and Ontology Alignment Patterns-based Expressive Mediation Bridge Ontology (MBO), Journal of Information Science, 41 (3), 296–314.
Kitchin, R. (2014) The Data Revolution: big data, open data, data infrastructures and their consequences, Sage Publishing.
Knublauch, H., Hendler, J. A. and Idehen, K. (2011) SPIN – Overview and Motivation, www.w3.org/Submission/2011/SUBM-spin-overview-20110222.
Kozaki, K., Kitamura, Y., Ikeda, M. and Mizoguchi, R. (2002) Hozo: an environment for building/using ontologies on a fundamental consideration of ‘Role’ and ‘Relationship’. In Gómez-Pérez, A. and Benjamins, V. R. (eds), Knowledge Engineering and Knowledge Management: ontologies and the semantic web: 13th international conference, EKAW 2002, Sigüenza, Spain, October 1–4, 2002 proceedings, Lecture Notes in Computer Science 2473, Springer, 213–18.
Kuhn, T. S. (1970) The Structure of Scientific Revolutions, 2nd edn, University of Chicago Press.
Lanthaler, M. and Gütl, C. (2012) On Using JSON-LD to Create Evolvable RESTful Services. In WS-REST ’12: Third International Workshop on RESTful Design, Lyon, France, April 16, 2012, ACM, 25–32.
LaPolla, F. (2013) Perceptions of Librarians Regarding Semantic Web and Linked Data Technologies, Journal of Library Metadata, 13 (2–3), 114–40.
Lee, M., Matentzoglu, N., Parsia, B. and Sattler, U. (2015) A Multi-reasoner, Justification-based Approach to Reasoner Correctness. In Arenas, M., Corcho, O., Simperl, E., Strohmaier, M., d’Aquin, M., Srinivas, K., Groth, P., Dumontier, M., Heflin, J., Krishnaprasad, T. and Staab, S. (eds), The Semantic Web – ISWC 2015: 14th international semantic web conference, Bethlehem, PA, USA, October 11–15, 2015: proceedings, part 2, Lecture Notes in Computer Science 9367, Springer, 393–408.
Li, J., Shi, P. and Cheng, M. (2014) Ranking Ontologies Based on Formal Concept Analysis, Journal of Computers, 9 (1), 215–21.
Luczak-Rosch, M. and Tolksdorf, R. (2013) On the Topology of the World Wide Web. In HT ’13: 24th Conference on Hypertext and Social Media, Paris, France, May 1–3, ACM, 253–7.
Malik, S., Goel, A. and Maniktala, S. (2010) A Comparative Study of Various Variants of SPARQL in Semantic Web. In Computer Information Systems and Industrial Management Applications (CISIM), 2010 International Conference, IEEE, 471–4.
Marrero, M., Urbano, J., Sánchez-Cuadrado, S., Morato, J. and Gómez-Berbís, J. M. (2013) Named Entity Recognition: fallacies, challenges and opportunities, Computer Standards & Interfaces, 35 (5), 482–9.
Matuszek, C., Cabral, J., Witbrock, M. and DeOliveira, J. (2006) An Introduction to the Syntax and Content of Cyc. In Proceedings of the 2006 AAAI Spring Symposium on Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, AAAI Press, 44–9.
Meusel, R., Mika, P. and Blanco, R. (2014) Focused Crawling for Structured Data. In CIKM ’14 Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, ACM, 1039–48.
Microformats (2014) Microformats.org Turns 9 – Upgrade to Microformats2 and More, http://microformats.org/2014/06/20/microformats-org-turns-9-upgrade-tomicroformats2.
Mika, P. (2015) On Schema.org and Why it Matters for the Web, IEEE Internet Computing, 19 (4), 52–5.
Mizoguchi, R. (2004) Tutorial on Ontological Engineering – Part 2: Ontology Development, Tools and Languages, New Generation Computing, 22 (1), 61–96.
Murdock, J., Buckner, C. and Allen, C. (2012) Containing the Semantic Explosion. Philoweb Workshop, 2012, www.jamram.net/docs/philoweb12-paper.pdf.
Naisbitt, J. (1984) Megatrends: ten new directions transforming our lives, Grand Central Publishing.
Nature (2014) Code Share: papers in Nature journals should make computer code accessible where possible, Nature, 30 October, 536.
Neches, R., Fikes, R., Finin, T., Gruber, T., Patil, R., Senator, T. and Swartout, W. R. (1991) Enabling Technology for Knowledge Sharing, AI Magazine, 12 (3), 36–56.
Niles, I. and Pease, A. (2001) Towards a Standard Upper Ontology. In FOIS ’01, October 17–19, 2001, Ogunquit, Maine, USA, IOS Press.
Nisheva-Pavlova, M., Spyratos, N. and Stanchev, P. (2014) Museum Collections and the Semantic Web, Digital Presentation and Preservation of Cultural and Scientific Heritage, 4, 33–9.
Novak, J. D. and Cañas, A. J. (2008) The Theory Underlying Concept Maps and How to Construct and Use Them, Technical Report IHMC CmapTools 2006–01 Rev 01–2008, Florida Institute for Human and Machine Cognition, 2008, http://cmap.ihmc.us/Publications/ResearchPapers/TheoryUnderlyingConceptMaps.pdf.
Noy, N. F. and d’Aquin, M. (2011) Where to Publish and Find Ontologies?: a survey of ontology libraries, Journal of Web Semantics: science, services and agents on the world wide web, 11, 96–111.
Noy, N. F. and McGuinness, D. L. (2001) Ontology Development 101: a guide to creating your first ontology, Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, March 2001, http://protege.stanford.edu/publications/ontology_development/ontology101.pdf.
Nurseitov, N., Paulson, M., Reynolds, R. and Izurieta, C. (2009) Comparison of JSON and XML Data Interchange Formats: a case study. In CAINE 2009, 157–62.
Oliver, C. (2010) Introducing RDA: a guide to the basics, ALA Editions.
Omoronyia, I., Sindre, G., Stålhane, T., Biffl, S., Moser, T. and Sunindyo, W. (2010) A Domain Ontology Building Process for Guiding Requirement Elicitation. In Wieringa, R. and Persson, A. (eds) Requirements Engineering: foundations for software quality, 16th international working conference, REFSQ 2010, Essen, Germany, June 30–July 2, 2010, Lecture Notes in Computer Science 6182, Springer, 188–202.
Opalički, I. and Lovrenčić, S. (2012) How Well are Domain and Upper Ontologies Connected? In Hunjak, T., Lovrenčić, S. and Tomičić, I. (eds), Central European Conference on Information and Intelligent Systems, September 19–21, 2012, University of Zagreb, 17–22.
Oracle (2014) Oracle Spatial Graph: benchmarking a trillion edges RDF graph, http://download.oracle.com/otndocs/tech/semantic_web/pdf/OracleSpatialGraph_RDFgraph_1_trillion_Benchmark.pdf.
O’Reilly, B. (2013) RDA at Oxford University, Catalogue and Index, 173, December, 50–6.
O’Riain, S., McCrae, J., Cimiano, P. and Spohr, D. (2015) Using SPIN to Formalise XBRL Accounting Regulations on the Semantic Web. In Simperl, E., Norton, B., Mladenic, D., Valle, E. D., Fundulaki, I., Passant, A. and Troncy, R. (eds), The Semantic Web: ESWC 2012 satellite events, Heraklion, Crete, Greece, May 27–31, 2012, Lecture Notes in Computer Science 7540, Springer, 58–72.
Paredes-Valverde, M. A., Rodríguez-Garcia, M. A., Ruiz-Martínez, A., Valencia-Garcia, R. and Alor-Hernandez, G. (2015) ONLI: an ontology-based system for querying DBpedia using natural language paradigm, Expert Systems with Applications, 42 (12), 5163–76.
Paulheim, H. and Hertling, S. (2013) WeSeE-Match Results for OAEI 2013. In Shvaiko, P., Euzenat, J., Srinivas, K., Mao, M. and Jiménez-Ruiz, E. (eds), Proceedings of the 8th International Workshop on Ontology Matching co-located with the 12th International Semantic Web Conference (ISWC 2013), ACM, 197–202.
Pike, W. and Gahegan, M. (2007) Beyond Ontologies: toward situated representation of scientific knowledge, International Journal of Human-Computer Studies, 65 (7), 674–88.
Polanyi, M. (1966) The Tacit Dimension, University of Chicago Press.
Powers, S. (2003) Practical RDF, O’Reilly.
Price, D. J. S. (1963) Little Science, Big Science, Columbia University Press.
Pritchard, A. (1969) Statistical Bibliography or Bibliometrics? Journal of Documentation, 25 (4), 348–9.
Raman, B. (2012) The Rhetoric and Reality of Transparency: transparent information, opaque city spaces and the empowerment question, The Journal of Community Informatics, 8 (2), http://ci-journal.net/index.php/ciej/article/view/866/909.
Ranganathan, S. R. (1931) The Five Laws of Library Science, Madras Library Association.
Rattanasawad, T., Saikaew, K. R., Buranarach, M. and Supnithi, T. (2013) A Review and Comparison of Rule Languages and Rule-based Inference Engines for the Semantic Web. In International Computer Science and Engineering Conference, ICSEC 2013, IEEE, 1–6.
Rizwan, I., Aida, M. and Zulkifli, M. Y. (2013) An Experience of Developing Quran Ontology with Contextual Information Support, Multicultural Education & Technology Journal, 7 (4), 333–43.
Rocha da Silva, J., Castro, J. A., Ribeiro, C., Honrado, J., Lomba, A. and Gonçalves, J. (2014) Beyond INSPIRE: an ontology for biodiversity metadata records. In Meersman, R., Panetto, H., Mishra, A., Valencia-Garcia, E., Soares, A. L., Ciuciu, I., Ferri, F., Weichhart, G., Moser, T., Bezzi, M. and Chan, H. (eds), On the Move to Meaningful Internet Systems: OTM 2014 Workshops, Lecture Notes in Computer Science 8842, Springer, 597–607.
Rowley, J. (2007) The Wisdom Hierarchy: representation of the DIKW hierarchy, Journal of Information Science, 33 (2), 163–80.
Sarawagi, S. (2007) Information Extraction, Foundations and Trends, 1 (3), 261–377.
Sauermann, L. and Cyganiak, R. (2008) Cool URIs for the Semantic Web, www.w3.org/TR/cooluris.
Schaible, J., Gottron, T., Scheglmann, S. and Scherp, A. (2013) LOVER: support for modeling data using linked open vocabularies. In Guerrini, G. (ed.) Proceedings of the Joint EDBT/ICDT 2013 Workshops, Genoa, Italy, March 18–22, 2013, 89–92.
Schema.org (2011) Yandex Now Supports Schema.org Markup, Schema Blog, 4 November, http://blog.schema.org/2011/11/yandex-now-supports-schemaorg-markup.html.
Schmachtenberg, M., Bizer, C. and Paulheim, H. (2014) Adoption of the Linked Data Best Practices in Different Topical Domains. In Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K. and Goble, C. (eds), The Semantic Web – ISWC 2014: 13th international semantic web conference, Riva del Garda, Italy, October 19–23, 2014: proceedings, part 1, Lecture Notes in Computer Science 8796, Springer, 245–60.
Shadbolt, N. R. and Smart, P. R. (2015) Knowledge Elicitation: methods, tools, and techniques. In Wilson, J. R. and Sharples, S. (eds), Evaluation of Human Work, 4th edn, CRC Press, 163–200.
Shvaiko, P. and Euzenat, J. (2013) Ontology Matching: state of the art and future challenges, IEEE Transactions on Knowledge and Data Engineering, 25 (1), 158–76.
Singer, R. (2009) Content Sources and Mashing Them Up. In Engard, N. (ed.), Library Mashups: exploring new ways to deliver library data, Facet Publishing.
Singh, S. and Karwayun, R. (2010) A Comparative Study of Inference Engines. In Latifi, S. (ed.) ITNG 2010: Seventh International Conference on Information Technology, IEEE, 53–7.
Smith, B. (2004) Beyond Concepts: ontology as reality representation. In Varzi, A. and Vieu, L. (eds), Proceedings of FOIS 2004, International Conference on Formal Ontology and Information Systems, Turin, 4–6 November, 2004.
Sridevi, K. and Umarani, R. (2014) A Novel and Hybrid Ontology Ranking Framework using Semantic Closeness Measure, International Journal of Computer Applications, 87 (5), 44–8.
Stadtmüller, S., Harth, A. and Grobelnik, M. (2013) Accessing Information About Linked Data Vocabularies with vocab.cc. In Li, J., Qi, G., Zhao, D., Nejdl, W. and Zheng, H.-T. (eds), Semantic Web and Web Science, Springer Proceedings in Complexity, Springer, 391–6.
Starr, R. R. and de Oliveira, J. M. P. (2013) Concept Maps as the First Step in an Ontology Construction Method, Information Systems, 38 (5), 771–83.
Stewart, D. L. (2011) Building Enterprise Taxonomies, Mokita Press.
Stock, K., Stojanovic, T., Reitsma, F., Ou, Y., Bishr, M., Ortmann, J. and Robertson, A. (2012) To Ontologise or not to Ontologise: an information model for a geospatial knowledge infrastructure, Computers and Geosciences, 45, 98–108.
Stoilos, G., Stamou, G. and Kollias, S. (2005) A String Metric for Ontology Alignment. In Gil, Y., Motta, E., Benjamins, V. R. and Musen, M. A. (eds), The Semantic Web – ISWC 2005: 4th international semantic web conference, ISWC 2005, Galway, Ireland, November 6–10, Lecture Notes in Computer Science 3729, Springer, 624–37.
Stuart, D. (2011) Facilitating Access to the Web of Data: a guide for librarians, Facet Publishing.
Stuart, D. (2012) FOAF Within UK Academic Web Space: a webometric analysis of the semantic web. In Widen, G. and Holmberg, K. (eds), Social Information Research, Emerald, 173–91.
Suda, B. (2006) Using Microformats, O’Reilly Short Cuts.
Sun, Q. and Liang, Y. (2014) Study on Pear Diseases Query System Based on Ontology and SWRL. In Li, D. and Chen, Y. (eds), Computer and Computing Technologies in Agriculture VII: 7th IFIP WG 5.14 International Conference, CCTA 2013, Beijing, China, September 18–20, 2013, revised selected papers, part II, IFIP Advances in Information and Communication Technology 420, 24–33.
Sun, Y., Ma, L. and Wang, S. (2015) A Comparative Evaluation of String Similarity Metrics for Ontology Alignment, Journal of Information & Computational Science, 12 (3), 957–64.
Swanson, D. R. (1986) Undiscovered Public Knowledge, The Library Quarterly, 56 (2), 103–18.
Szekely, P., Knoblock, C. A., Young, F., Zhu, X., Fink, E. E., Allen, R. and Goodlander, G. (2013) Connecting the Smithsonian American Art Museum to the Linked Data Cloud. In Cimiano, P., Corcho, O., Presutti, V., Hollink, L. and Rudolph, S. (eds), The Semantic Web: semantics and big data: 10th international conference, ESWC 2013, Montpellier, France, May 26–30, 2013, Lecture Notes in Computer Science 7882, Springer, 593–607.
Techcrunch (2014) Snapchat Has Raised $485 million More from 23 Investors, at Valuation of up to $20B, http://techcrunch.com/2014/12/31/snapchat-485m.
Thelwall, M. (2009) Introduction to Webometrics: quantitative web research for the social sciences, Morgan & Claypool.
Tonkin, E. and Pfeiffer, H. D. (2010) Data-driven or Background Knowledge Ontology Development. In International Conference on Knowledge Management (ICKM), 22–23 October 2010, Pittsburgh, USA, http://opus.bath.ac.uk/20523/1/ickm10_hdp_et-v2.1.1.pdf.
Tonkin, E., Pfeiffer, H. and Hewson, A. (2010) An Evidence-based Approach to Collaborative Ontology Development. In Workshop on Matching and Meaning 2010, 31 March–1 April 2010, http://opus.bath.ac.uk/18033/1/tonkin-pfeiffer.pdf.
Twitter Engineering Blog (2013) New Tweets Per Second Record, and How! https://blog.twitter.com/2013/new-tweets-per-second-record-and-how.
Uschold, M. and King, M. (1995) Towards a Methodology for Building Ontologies. Presented at Workshop on Basic Ontological Issues in Knowledge Sharing, www.aiai.ed.ac.uk/project/oplan/documents/1995/95-ont-ijcai95-ont-method.pdf.
van Hooland, S. and Verborgh, R. (2014) Linked Data for Libraries, Archives, and Museums: how to clean, link and publish your metadata, Facet Publishing.
van Noorden, R. (2014) Confusion Over Publisher’s Pioneering Open-data Rules, Nature, 27 November, 478.
Vrandečić, D. (2009) Ontology Evaluation. In Staab, S. and Studer, R. (eds), Handbook of Ontologies, Springer, 293–314.
W3C (2002) Resource Description Framework (RDF): concepts and abstract data model, www.w3.org/TR/2002/WD-rdf-concepts-20020829.
W3C (2004) Resource Description Framework (RDF): concepts and abstract syntax, www.w3.org/TR/2004/REC-rdf-concepts-20040210.
W3C (2010) Ontology Dowsing, www.w3.org/wiki/Ontology_Dowsing.
W3C (2012) OWL 2 Web Ontology Language Document Overview, 2nd edn, www.w3.org/TR/owl2-overview.
W3C (2013) Vocabularies, www.w3.org/standards/semanticweb/ontology.
W3C (2014) JSON-LD 1.0 Processing Algorithms and API, www.w3.org/TR/2014/REC-json-ld-api-20140116.
W3C Library Linked Data Incubator Group (2011) Datasets, Value Vocabularies and Metadata Element Sets, www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset-20111025.
Warren, P., Mulholland, P., Collins, T. and Motta, E. (2014) Using Ontologies: understanding the user experience. In Janowicz, K., Schlobach, S., Lambrix, P. and Hyvönen, E. (eds), Knowledge Engineering and Knowledge Management, 19th International Conference, EKAW 2014, Linköping, Sweden, November 24–28, Proceedings, Lecture Notes in Computer Science 8876, Springer, 579–90.
Warren, P., Mulholland, P., Collins, T. and Motta, E. (2015) Making Sense of Description Logics. In Hellmann, S., Parreira, J. X. and Polleres, A. (eds), SEMANTICS: Vienna 2015, Proceedings of the 11th International Conference on Semantic Systems, ACM, 49–56.
Weibel, S. L. (2009) Dublin Core Metadata Initiative: a personal history. In Bates, M. J. and Maack, M. N. (eds), Encyclopedia of Library and Information Science, 3rd edn, CRC Press.
Welsh, A. and Batley, S. (2012) Practical Cataloguing, Facet Publishing.
Wikipedia (2015) Article Titles, https://en.wikipedia.org/wiki/Wikipedia:Article_titles.
Wildgaard, L., Schneider, J. W. and Larsen, B. (2014) A Review of the Characteristics of 108 Author-level Bibliometric Indicators, Scientometrics, 101 (1), 125–58.
Willer, M. and Dunshire, G. (2013) Bibliographic Information Organization in the Semantic Web, Chandos.
Wylot, M., Cudré-Mauroux, P. and Groth, P. (2015) Executing Provenance-enabled Queries over Web Data. In WWW 2015, May 18–22, 2015, Florence, Italy, ACM, 1275–85.
YouTube (2015) Statistics, www.youtube.com/yt/press/statistics.html.
Ziman, J. M. (1969) Information, Communication, Knowledge, Nature, 25 October, 318–24.
Zins, C. (2007) Conceptual Approaches for Defining Data, Information, and Knowledge, Journal of the American Society for Information Science and Technology, 58 (4), 479–93.

Index

AACR2 44, 49, 66–7 Ackoff, R. L. 5 AgreementMakerLight 121 Ahmed, E. B. 82 Aida, M. 68 Alchemy API 115 Alexander, K. 39 Allemang, D. 12, 37, 39–40, 61, 63, 78, 82, 101, 135, 140, 163 Allen, C. 14, 16, 22, 48, 127 Amazon 19, 20, 36 An, J. 154 Andreeva, T. 107–8 Anthelion 153 Any23 153 Apache-Jena 37, 139 APIs (application programming interfaces) 36, 84, 115, 121 Arbesman, S. 94 archives 9, 16, 47–50, 59, 63, 70, 77, 93–4, 98, 102, 117 Arp, R. 13, 80, 116, 118–19, 163 Art & Architecture Thesaurus 8, 80, 116 artificial intelligence 9, 135 Ashraf, J. 153 Association of College and Research Libraries 116 Auer, S. 72, 87 authority files 7–8 authority list 13 automated assistants 137, 139 automatic agents 28, 80 Baker, T. 15 BARTOC (Basel Register of Thesauri, Ontologies & Classifications) 84–5 Basic Formal Ontology 14–15, 53, 68–70, 128, 163 Batley, S. 65, 100 BBC Programmes Ontology 20, 33 Berners-Lee, T. 16, 27, 73, 77 Bibframe 84 Bible Ontology 53, 67–8 Bibliographic Ontology 64, 76, 129 bibliometrics 127–8, 130, 132, 145, 159 Bibliometrics Indicators Ontology 130–5 Bibliometrics Metrics Ontology 127–9 Bibliothèque Nationale de France 8 big data 4

Billion Triple Challenge 150 Bing 74, 121 Bing Translator 122 BioPortal 85 Bizer, C. 33, 46, 77, 86, 150–1, 153 Blanco, R. 153 Bodleian Library 158 Boolean operators 18 Borgman, C. 2, 159 Borst, W.M. 11 Bowker 2 Brank, J. 123–4 Brewster, C. 14, 22–3, 126 British Library 7, 48, 74, 90, 141–2 British Museum 50, 71, 145 British National Bibliography 10–11, 36, 48, 142 Brumfiel, G. 3, 159 Buckner, C. 14, 16, 22, 48, 127 Cabinet Office (UK Government) 2 Cañas, A. 108–9 Carroll, Lewis 4 cataloguing 12–13, 16–19, 22–4, 48, 62, 65, 67, 79, 86, 145, 158–60, 162 Categories for the Description of Works of Art (CDWA) 16 CENDARI 70, 98 CERIF 82, 129–30 Ceusters, W. 68 Chandramouli, K. 108 CHEBI 34 Chen, R.-C. 42 Chen, X. 2 Cheng, G. 88 Cheng, M. 90 Chowdhury, G. G. 110, 126 CIDOC-CRM 15, 53, 65–6, 70–1 CiTO (Citation Typing Ontology) 64, 129–31, 133–4 CLAROS 71 classification 6, 8, 11, 16, 23, 58, 79, 84, 91, 95, 98, 110, 145, 158 Cmap 109 Coffey, J. 108 Common Crawl 151, 153 computed tomography 3 computer code 2, 159 (see also open code) content negotiation 20, 33, 42

controlled vocabularies 1, 5–9, 13–14, 17–18, 20, 22–3, 47, 59, 84–5, 88–9, 91–2, 95, 102, 111, 116–18, 156, 164 Cooke, N. J. 108 Corcho, O. 15, 43 Cortana 139 CRAWLER-LD 153 Creative Commons 48, 95 crowdsourcing 51, 72, 92, 119, 127, 144, 156 Cudré-Mauroux, P. 151 cURL 111–12 Cycl 15 Cyganiak, R. 33 Dalwadi, N. 139 DAML 94 DAML+OIL 60 Danial-Saad, A. 108 d’Aquin, M. 83 data deluge 1 data, information, knowledge hierarchy 5 DataHub 149 Davis, R. 83, 156 DBpedia 21, 33–4, 53, 71–2, 102, 111–13, 144–7 (see also Wikipedia) DBpedia Spotlight 111–12, 114–15 de Oliveira, J. M. P. 109 Dead-Journal 74 Derbentseva, N. 109 Dewey Decimal Classification (DDC) 7, 48, 84, 91 digital cameras 2 digital trails 3 Digitized Manuscripts to Europeana (DM2E) 70 Ding, L. 73, 86, 88, 153 Doerr, M. 71 Dublin Core (DC) 6, 13, 15, 38, 44, 63–4, 78, 80, 82, 84, 88, 93–4, 116, 129, 131, 142–3, 148 Ducharme, B. 140 Dunshire, G. 12 EAC-CPF 50 EAD (Encoded Archival Description) 16, 50 eBiquity 86 ERIC (Education Resources Information Center) thesaurus 8 Europeana Data Model (EDM) 15, 53, 67, 70–1, 81, 115, 143–4 Eurovoc 84 Euzenat, J. 120–1 Event Ontology 92 Eysenbach, G. 157 FaBiO 65, 66 Facebook 2, 49, 73, 77–8, 82, 86, 94, 150, 157, 161, 163 (see also Open Graph Protocol) faceted search 18, 84 FaCT++ 139 Falcons 87

Faria, D. 121 FAROO 121 Fernández-López, M. 98–9, 106 Flahive, A. 98, 163 Flickr 8 FluentEditor 105 FOAF (Friend of a Friend) 30–1, 43–5, 47, 54–5, 64, 72–7, 82, 88, 90, 94, 115, 119, 125, 153 FOAF-a-Matic 72 Fox, M. S. 98–9 Fransson, M. N. 108 Franz Inc. 61 FRBR (Functional Requirements for Bibliographic Records) 15, 65–9, 159, 162 FRBRoo 65–6 Fürber, C. 42 Gahegan, M. 22–3 galleries 50 GATE (General Architecture for Text Engineering) 111, 114–15 Gavrilova, T. 107–8 Gene Ontology 80, 162 German National Library 59 Getty Thesaurus of Geographic Names 8 Getty Vocabularies 59, 95 GitHub 77, 127, 134–5 Goel, A. 41 Golbeck, J. 74 Gomes, R. do V. A. 154 Gómez-Pérez, A. 15, 98–9, 106 González García, F. M. 108 GoodRelations Ontology 75 Google 28, 41, 74–5, 78, 88, 140, 153, 156–7, 163 Google Base 74 Google Data 74 Google Glass 3 Google Hangouts 108 Google Scholar 4 Knowledge Graph 28, 163 Knowledge Vault 28, 163 graph matching 18–19 Grau, B.C. 40 Graves Ontology 93 Greenberg, J. 48 Grobelnik, M. 123–4, 150 Groth, P. 151 Gruber, T.R. 9, 12, 83, 92–3, 139 Grüninger, M. 98–9 Gütl, C. 39 Harpring, P. 5–6, 12 Harth, A. 150, 153 Hayden, E.C. 3 Heath, T. 33 Hedden, H. 5–6, 9, 14, 17, 80, 100, 118 Heery, R. 15, 80, 82, 93 HEFCE 130

Hendler, J. 12, 27, 37, 39, 40, 42, 61, 63, 78, 82, 101, 135, 140, 163 Hepp, M. 42, 75 HermiT 139 Hertling, S. 121 Hewson, A. 97, 100 Hey, T. 2 hierarchical term tree 6 Hillman, D. 82 Hodson, H. 28 Hoffman, R. R. 110 Hood, W.W. 150 Horridge, M. 119 HTML (Hypertext Markup Language) 20, 33, 37, 39, 42–7, 74, 77, 124–5, 134, 145, 153 HTTP (Hypertext Transfer Protocol) 16, 33 human genome 3 Hussain, F. H. 153 Hussain, O. K. 153 Idehen, K. 42 Idrees, H. 7 indexing 4–5, 9, 12, 17–18, 21, 23, 59, 83, 86, 88, 90, 101, 126, 138, 148, 150–6, 158–9, 163 inference 11, 18–19, 21, 39–40, 42, 59, 92, 118, 120, 135, 138, 162 information discovery 4–5, 48 half-life 94 overload 1, 4, 24 retrieval 1, 2, 7, 12, 18, 21, 23, 48, 79, 101, 110, 156–7 Information Artifact Ontology (IAO) 68–70, 103 InPho 59, 127 Instagram 2, 8 institutional repositories 159 International Working Group on FRBR and CIDOC-CRM Harmonization 66 IRIs see URIs ISAD(g) 94 ISCO (International Standard Classification of Occupations) 144–5 Isele, R. 153 Jakus, G. 11, 29 JavaScript 39 Jayadianti, H. 87 Jena see Apache Jena JSON 37, 39 Juristo, N. 98–9, 106 (KA)2 43 Kalender, W.A. 3 Kalibatiene, D. 15 Kämpgen, B. 153 Kaplan, A. 155 Karwayun, R. 139 Khan, W. A. 122

King, M. 98–100 Kitchin, R. 5, 49, 157 knowledge acquisition 98–9, 106, 110–11, 115, 117, 132 bases 1, 9, 13, 17, 20–1, 28, 42, 51, 72, 80, 100–1, 110, 117, 135, 137, 139, 141, 156, 159 discovery 9, 24, 102, 106, 110–11, 115–16, 155–7, 162 elicitation 48, 106–8, 110 brainstorming 107; concept sorting 107; concept/process mapping 107–9; critical decision method 107; Delphi method 108; interviews 107–8; laddered grids 107; limited information task 107; protocol analysis 107; questionnaires 107–8; repertory grids 107; round-table discussions 107 organization systems 5, 6, 9, 13, 22, 58–9, 79, 83, 88, 118, 158 representation 12, 14, 23, 109, 126, 158 Knowledge Interchange Format (KIF) 15 Knublauch, H. 42 Kollias, S. 120 Kozaki, K. 98 Kuhn, T.S. 22 Lanthaler, M. 39 LaPolla, F. 24, 48 Large Hadron Collider 3, 159 Larsen, B. 132 Lassila, O. 27 LD Spider 153 Lee, M. 139 Li, J. 90 Liang, Y. 42 Library Holdings Ontology 127 Library of Congress 7, 8, 36–7, 60 Name Authority File 6 Subject Headings 6–7, 58–9 life streaming 3 linked data 14–17, 23–4, 27–8, 33, 37, 48, 50, 68, 72, 77, 82, 84, 87, 105, 135, 150, 153, 160, 162 Linked Open Vocabularies (LOV) 82, 84, 86, 104, 129, 148–9 Linking Open Data Cloud 72 Linnaean taxonomy of biological classification 6 Lintern, G. 110 LiveJournal 74 LODE (Linking Open Descriptions of Events) 64, 92, 94 LODE (Live OWL Documentation Environment) 125, 134 LODLAM 47, 49 LodLive 138 LODStats 86, 149–50 log files 3 Lovrenčić, S. 14 Luczak-Rosch, M. 151

Ma, L. 120 machine learning 114, 156 machine translation 110 MADS (Metadata Authority Description Schema) 60 Makwana, A. 139 Malik, S. 41 Maniktala, S. 41 MARC 60, 158 Marrero, M. 110–11 Matuszek, C. 15 McGuinness, D. L. 97–100 Mediation Bridge Ontology 122 Medical Subject Headings (MESH) 7, 95, 101 Méndez, E. 48 metadata 1, 4, 13, 15–16, 22, 48–50, 63, 64, 70, 77, 82, 84–6, 88, 93, 143, 157, 160 Meusel, R. 153 microdata 46–7, 74–5, 151 microformats 46, 74, 151 Mika, P. 77, 153 Mizoguchi, R. 100 Mladenić, D. 123–4 mobile phones 2–3, 75 Morgan, E.L. 50 Murdock, J. 14, 16, 22, 48, 127 museums 16, 50, 70 N8 Equipment Ontology 81, 124 Nagar, B. 139 Naisbitt, J. 1 named entity recognition (NER) 110–11, 140 National Agricultural Library Thesaurus 59 natural language processing (NLP) 23, 106, 110–11, 114–15, 156, 158, 164 Natural Language Toolkit 115 Nature 2, 27, 59, 60 Neches, R. 5, 9 NeOn 104 Niles, I. 15 Nisheva-Pavlova, M. 50 Novak, J. 29, 108–9 Noy, N. F. 83, 97–100 Nurseitov, N. 39 OAD 50 OAI-ORE 63 OBO-Edit 106 OCLC 48 O’Hara, K. 14, 22–3, 126 Oliver, C. 116 Omoronyia, I. 111 OnAGUI 121 ontologies alignment 119–22, 133, 162–3 building 97–8, 100, 104, 110, 160 definition 1, 4–5, 9, 11–12, 15, 88

documentation 15, 53, 88, 91, 93, 95, 98–9, 100, 104, 124–5, 134, 138, 154 evaluation 89, 92, 99–100, 103, 122–4, 134 history 9 limitations 22, 83, 155, 161 maintenance and sustainability 1, 22, 92, 95, 99–100, 104, 106, 126–7, 135, 161, 163 metrics 85, 89–90, 148 naming conventions 31, 116–17 purpose 4–5, 17, 18–21, 137, 158 (see also indexing; knowledge bases; and information retrieval) ranking 90–1 reasoners 138 reuse 79, 81, 102–3, 128, 137–8 role of information professionals 1, 21–2, 24, 47–8, 158, 161 scope 48, 98–102, 109, 115, 128 software 91, 103–6, 131, 138, 154 structure 12–14 classes 13, 30–1, 55; element set 13, 17; instances 13–14; properties 30–1, 55; types 14–15; application profiles 14–15, 63, 71, 78–80, 96, 135; domain 14, 22, 53; enterprise 22, 51, 97–8, 102, 110, 122, 126, 163; formal/informal 9, 61, 101, 135, 138, 158, 162–3; languages 14–15, 41, 106, 139; lightweight 14, 40, 101, 162; middle–level 15; open/closed 162–3; upper 14–17, 53, 68–70, 128–9, 162–368, 101; web 71 Ontology Alignment Evaluation Initiative (OAEI) 119–21 ontology libraries 83–6, 88–92, 95, 129, 148 Ontology Lookup Service 124 ontology search engines 83, 86–7, 137, 148 ONTOSSN 82 OPAC (Ontology Property-based Adaptive Crawler for Linked Data) 154 Opalički, I 14 open access 1–2 Open Annotation Data Model 115 open code 2 open data 2 Open Data Commons 48 open government 2, 50 Open Graph Protocol (OGP) 77, 82, 94, 150, 161–3 Open Knowledge Foundation 84 Open Library 36 open licences 48, 127 open science 2, 159 open source 53, 104, 111, 114, 125 OpenCalais 115 OpenCyc 14–15, 23 OpenRefine 122 Oracle 151 ORCID 81 O’Reilly, B. 159 O’Riain, S. 42, 153

OWL (Web Ontology Language) 15, 24, 32, 40–2, 51, 60–4, 69, 88, 94, 103–6, 118–19, 121, 125, 128–9, 134, 139, 154, 158 OWL syntax 61 OWLGrEd 105 Paredes-Valverde, M. A. 162 Patel, M. 15, 80 Paulheim, H. 121, 153 Pease, A. 15 personal assistants see automated assistants Pfeiffer, H. 97–8, 100 Pike, W. 22–3 Polanyi, M. 23 Powers, S. 43 precision and recall 2, 6, 9 Price, D. J. S. 4 Pritchard, A. 127 Protégé 42, 59–60, 104, 131, 138–9 publishing 1, 2, 4 Rahayu, W. 98, 163 Raman, B. 2 Ranganathan, S.R. 10, 49 Rattanasawad, T. 41–2 RDA (Resource Description and Access) 15, 34, 44, 49, 66–7, 82, 94, 101, 116, 129, 158–9 RDF (Resource Description Framework) 15, 17, 28, 41, 46–7, 50, 51, 61, 64, 72, 77, 88–9, 94, 103, 121–2, 130, 139, 141, 146, 151, 153, 158 embedding 39, 42–3, 112, 153 serialization 33–4, 36–9, 47, 55, 61, 124, 140, 142; eRDF 43; JSON-LD 37–9, 43–6; NQuads 37–8, 55, 151; N-Triples 37–8, 151; N3 in HTML 43; RDF/JSON 39; RDF/XML 33–8, 42, 47, 55, 72, 123–4, 134; RDFa 43–4, 46–7, 75, 112, 151; RDFa Lite 44 translation 37–8, 45 triples 28–9, 34, 42–3, 115 validation 36–7 vocabulary 35–6, 39, 53–8, 62–3, 128–9, 148 RDF Book Mashup 36 RDF Gravity 11 RDF-Translator 37–8 RDF2RDF 37 RDFS (RDF Schema) 15, 29–30, 32, 39–40, 42, 51, 53–8, 61, 70, 103, 106, 118, 128–9 RDFS+ 61 RDFS++ 61 reasoning see inferencing recall see precision and recall REGEX 144 relationship extraction 110–11 RIF (Rule Interchange Format) 32, 41–2 Rizwan, I. 68 Rocha da Silva, J. 14, 118 Rothstein, M. 74 Rowley, J. 74


rule languages 41–2, 139 Safayeni, F. 109 Sarawagi, S. 111 Sauermann, L. 33 Schaible, J. 87 Schema.org 47, 64, 74–7, 81–2, 92–4, 129, 151–2, 163 Schema Bib Extend Community Group 76 Schmachtenberg, M. 153, 174 Schneider, J. W. 132 scientometrics 150 search engines 23, 28, 74, 77, 83, 86, 88–9, 95, 110, 137, 139, 152–3, 164 semantic crawlers 137, 150, 152–4 semantic web 11, 14–15, 17, 21–4, 27–34, 37, 39–44, 46–51, 54, 58, 61, 64–5, 71–3, 77–8, 83, 86–90, 92–4, 101, 104, 117, 119, 122, 124, 126, 129, 135–6, 138–40, 148, 150, 152–4, 156, 158, 162–4 Semantic Web in Libraries 47 semantic web stack 31–2, 42 sentiment analysis 110 Shadbolt, N.R. 23, 102, 107–8 Shi, P. 90 SHOE (Simple HTML Ontology Extensions) 43 Shrobe, H. 83, 156 Shvaiko, P. 120–1 Sindice 86, 153 Singer, R. 46 Singh, S. 139 SIOC Core Ontology 63, 73 Siri 139 SKOS (Simple Knowledge Organization Systems) 24, 53, 58–60, 94, 106, 118, 121, 134, 148, 160 SKOS Play 59, 60, 138 SKOS-XL 60 SKOSed 60 Skype 108 Smart, P.R. 23, 102, 107–8 Smith, B. 9, 13, 80, 116, 118–19, 163 Smithsonian 50 Snapchat 2 Snowball Metrics 130 social media 2, 8, 49, 71, 73, 75, 132 social network sites 4, 8, 73–4, 77, 132 SPARQL 32, 37–8, 40–2, 51, 72, 84, 137, 140–4, 146, 148–9, 151, 162 Spear, A.D. 13, 80, 116, 118–19, 163 speech recognition 110 SPIN 41, 42 Spyratos, N. 50 Sridevi, K. 90 Stack Overflow 54 Stadtmüller, S. 150 Stamou, G. 120 Stanchev, P. 50 Starr, R. R. 109


Stewart, D.L. 15, 102, 110 Stock, K. 21 Stoilos, G. 120 Stuart, D. 49, 74, 86, 160 STW Thesaurus for Economics 59 subject headings 7, 58, 84–5, 91 Suda, B. 46 Suggested Upper Merged Ontology (SUMO) 14–15 Sun, Q. 42 Sun, Y. 120 SUO-KIF 15 Swanson, D.R. 21, 156 Swoogle 86–7, 153 SWRL (Semantic Web Rule Language) 41–2 synonym ring 6, 90 Szekely, P. 50 Szolovits, P. 83, 156 tacit knowledge 23, 107, 157 tagging 8–9 Taniar, D. 98, 163 Tansley, S. 2 Tawny-OWL 105 taxonomies 1, 5, 8–9, 12, 58, 80, 100–1 Taxonomy Warehouse 85 TDB 151 Techcrunch 2 Thelwall, M. 157 thesaurus 6, 7–9, 12–13, 20, 23, 58, 84, 116 Thesaurus for Graphic Materials 8 Tolksdorf, R. 151 Tolle, K. 2 Tonkin, E. 97–8, 100 TopBraid 104, 138 triplestores 41–2, 138, 140–1, 143, 148, 151 Turtle 37–8, 44, 55, 140, 142 Twitter 2, 8, 73, 78 UK Archival Thesaurus 59 Umarani, R. 90 undiscovered public knowledge 1, 21, 156 Unicode 32 URIs 16, 29–30, 32–6, 38, 61, 67, 77, 117, 119, 123, 142–3, 152–3 Uschold, M. 98–100 van Hooland, S. 6, 16, 23, 28, 34, 37, 40, 46–8, 60, 111, 122, 156, 162

van Noorden, R. 159 Vasilecas, O. 15 Verborgh, R. 6, 16, 23, 28, 34, 37, 40, 46–8, 60, 111, 122, 156, 162 VIAF (Virtual International Authority File) 8, 48 vocab.cc 150 VOWL 54, 104, 125 Vrandečić, D. 61, 123 W3C (World Wide Web Consortium) 9, 11, 15, 29, 39–40, 42, 44, 60, 83, 93, 139 W3C Library Linked Data Incubator Group 13 W3C RDF Validator 36–7, 44 Wang, S. 120 Warren, P. 15, 42, 104, 160 Watson 87 Web 2.0 71 Web Data Commons 151, 153 WebProtégé 105 WebVOWL 125, 138 Weibel, S. L. 63 Welsh, A. 65, 100 Wikipedia (see also DBpedia) 32, 71, 72, 84, 116, 144–5, 147 Wildgaard, L. 132 Willer, M. 12 Wilson, C.S. 150 wisdom 5 WordNet 91, 120 WorldCat 36, 48 Wylot, M. 151 XBRL 42 XML 32, 34–5, 37, 39, 62, 84, 114, 130 XSLT 115 Yahoo 74, 77 Yandex 74 YouTube 2 Ziman, J.M. 4 Zins, C. 5 Zuasti Urbano, J. 108 Zulkifli, M. Y. 68