Dr. Anne E. Thessen (Ed.)
Application of Semantic Technology in Biodiversity Science
Studies on the Semantic Web
www.semantic-web-studies.net
Semantic Web has grown into a mature field of research. Its methods find innovative applications on and off the World Wide Web. Its underlying technologies have significant impact on adjacent fields of research and on industrial applications. This new book series reports on the state of the art in foundations, methods, and applications of Semantic Web and its underlying technologies. It is a central forum for the communication of recent developments and comprises research monographs, textbooks and edited volumes on all topics related to the Semantic Web.

Editor-in-Chief: Pascal Hitzler

Editorial Board: Diego Calvanese, Vinay Chaudhri, Fabio Ciravegna, Michel Dumontier, Dieter Fensel, Fausto Giunchiglia, Carole Goble, Asunción Gómez Pérez, Frank van Harmelen, Manfred Hauswirth, Ian Horrocks, Krzysztof Janowicz, Michael Kifer, Riichiro Mizoguchi, Mark Musen, Daniel Schwabe, Barry Smith, Steffen Staab, Rudi Studer

Publications:
Vol. 033 – Dr. Anne E. Thessen (Ed.), Application of Semantic Technology in Biodiversity Science
Vol. 032 – Pascal Hitzler et al. (Eds.), Advances in Ontology Design and Patterns
Vol. 031 – Michael Färber, Semantic Search for Novel Information
Vol. 030 – Hassan Saif, Semantic Sentiment Analysis in Social Streams
Vol. 029 – A. Ławrynowicz, Semantic Data Mining: An Ontology-Based Approach
Vol. 028 – R. Zese, Probabilistic Semantic Web: Reasoning and Learning
Vol. 027 – M. Kejriwal, Populating a Linked Data Entity Name System
Vol. 026 – Muhammad Saleem, Efficient Source Selection and Benchmarking for SPARQL
Vol. 025 – P. Hitzler et al. (Eds.), Ontology Engineering with Ontology Design Patterns. Foundations and Applications
Vol. 024 – Olaf Hartig, Querying a Web of Linked Data. Foundations and Query Execution
Vol. 023 – Natalia A. Diaz Rodriguez, Semantic and Fuzzy Modelling for Human Behaviour Recognition in Smart Spaces. A Case Study on Ambient Assisted Living
Vol. 022 – Juan F. Sequeda, Integrating Relational Databases with the Semantic Web
Vol. 021 – Laurens Rietveld, Publishing and Consuming Linked Data
Vol. 020 – Tom Narock and Peter Fox (Eds.), The Semantic Web in Earth and Space Science. Current Status and Future Directions
Vol. 019 – Aidan Hogan, Reasoning Techniques for the Web of Data
Vol. 018 – Jens Lehmann and Johanna Völker (Eds.), Perspectives on Ontology Learning
Vol. 017 – Jeff Z. Pan and Yuting Zhao (Eds.), Semantic Web Enabled Software Engineering
Vol. 016 – Gianluca Demartini, From People to Entities: New Semantic Search Paradigms for the Web
Vol. 015 – Carlos Buil-Aranda, Federated Query Processing for the Semantic Web
Vol. 014 – Sebastian Rohjans, Semantic Service Integration for Smart Grids

For more information see www.semantic-web-studies.net
Dr. Anne E. Thessen
The Ronin Institute for Independent Scholarship
190 Weston Street
Waltham, MA 02453
USA

Bibliographic Information published by the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie. Detailed bibliographic data are available on the Internet at https://portal.d-nb.de.

Publisher:
Akademische Verlagsgesellschaft AKA GmbH
P.O. Box 22 01 16
14061 Berlin
Germany
Tel.: 0049 (0)30 36 43 01 64
www.aka-verlag.com

Co-Publisher:
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
The Netherlands
www.iospress.nl

© 2018, Akademische Verlagsgesellschaft AKA GmbH, Berlin

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without prior permission from the publisher.

Reproduced from PDF supplied by the author
Printer: Bookstation GmbH, Anzing
Printed in Germany

ISSN 1868-1158 (print) / ISSN 2215-0870 (online)
ISBN 978-3-89838-733-0 (AKA, print)
ISBN 978-1-61499-853-2 (IOS Press, print)
ISBN 978-1-61499-854-9 (IOS Press, online)
Application of Semantic Technology in Biodiversity Science, Anne E. Thessen (Ed.), ISBN 978-3-89838-733-0, © 2018 AKA Verlag Berlin
Contents

Part I. Introduction

Chapter 1. Introduction
Anne E Thessen

Part II. Linked Biodiversity Data

Chapter 2. Darwin Core as an RDF Vocabulary
Steven J Baskauf and Joel Sachs

Chapter 3. Linked Open Data for Biodiversity
Rathachai Chawuthai, Utsugi Jinbo, and Hideaki Takeda

Chapter 4. Semantic Phenotypes
Matthew J Yoder, Michael B Twidale, Andrea K Thomer, Lars Vogt, Nico M Franz, Jinlong Guo, Andrew R Deans, and James P Balhoff

Part III. Biodiversity Ontologies

Chapter 5. The Biocollections Ontology
Ramona L Walls, Pier Luigi Buttigieg, John Deck, Rob Guralnick, and John Wieczorek

Chapter 6. Flora Phenotype Ontology
Robert Hoehndorf, Claus Weiland, Marco Schmidt, Quentin Groom, George Gosline, Stefan Dressler, and Thomas Hamann

Chapter 7. An Ontology for Fish Reproduction
Fabrice Teletchea

Part IV. Novel Insights Through Application

Chapter 8. Mapping with CartograTree
Nic Herndon, Taylor Falk, Emily S Grau, Sook Jung, Stephen Ficklin, Dorrie Main, Margaret E Staton, and Jill L Wegrzyn

Chapter 9. Large-Scale Study of Prokaryotic Phenotypes
Carrine E Blank, Hong Cui, Jin Mao, Lisa R Moore, Robert W Thacker, and Ramona L Walls

Chapter 10. Predictive Phenomics
Ian Braun, James P Balhoff, Tanya Z Berardini, Laurel Cooper, Georgios Gkoutos, Lisa Harper, Eva Huala, Pankaj Jaiswal, Toni Kazic, Hilmar Lapp, James A Macklin, Chelsea D Specht, Todd Vision, Ramona L Walls, and Carolyn J Lawrence-Dill

Chapter 11. Semantic Analysis of Traits and Genes
Paula M Mabee, Wasila M Dahdul, James P Balhoff, Hilmar Lapp, Prashanti Manda, Josef Uyeda, Todd Vision, and Monte Westerfield
Part I Introduction
Chapter 1
Application of Semantic Technology in Biodiversity Science: An Introduction

Anne E THESSEN (a, b)
(a) The Ronin Institute for Independent Scholarship, Montclair, New Jersey, USA
(b) The Data Detektiv, Waltham, Massachusetts, USA
1. Why apply semantic technology to biodiversity science?

Biodiversity is present all over the globe and has been for billions of years; thus, the scale of biodiversity data is large both spatially and temporally. Over the centuries, biodiversity data have been collected by an abundance of people for a variety of reasons using a vast array of equipment, making these data incredibly heterogeneous in form and quality. These data are notoriously difficult to work with, yet they are necessary for answering big, pressing questions such as those about the origin of life and how to mitigate the effects of climate change. The scale of the data is beyond what most humans can process in a lifetime. Modern computing can help with the scalability problem, but cannot cope with the extreme heterogeneity. A hybrid approach is needed that combines modern computing power with the human ability to understand ambiguity. Semantic technology is an approach that can help biodiversity science take full advantage of the computing power of the 21st century.
A.E. Thessen / Introduction
The aim of this interdisciplinary book is to introduce experts in biodiversity science and semantic technology to each other. As such, it is written for a mixed audience of semantic technology experts and biodiversity scientists. This introduction briefly explains the basics of both fields and relates concepts in each to chapters in this book. Our goal is to raise awareness of what semantic technology has to offer biodiversity science and to make semantic technology experts more aware of this application of their work.

Part II contains three chapters that introduce biodiversity standards and present semantic technology through the lens of biodiversity. The first chapter in this part (Chapter 2) discusses Darwin Core, a biodiversity data standard, and its translation into RDF, a semantic model. The second (Chapter 3) describes a Linked Open Data (LOD) framework for representing changes in taxonomic nomenclature, the system of rules governing the naming of organisms. LOD is a specific manifestation of the semantic web that publishes data in RDF. The third (Chapter 4) describes a taxonomic workflow that incorporates semantic technology in the context of a workbench that helps taxonomists create semantic phenotypes, which are machine-readable descriptions of species.

Part III describes the development, maintenance, and application of several ontologies in the biodiversity domain. The Biological Collections Ontology (Chapter 5) is an applied biodiversity ontology for integrating data from activities such as sampling and trait observation. The Flora Phenotype Ontology (Chapter 6) was built by mining floras to represent descriptive plant data in a way that is reusable across domains. STOREFISH (Chapter 7) is a database of fish reproductive information, still in its very early stages, that can be used to build a common ontology for use in aquaculture.

Each chapter in Part IV describes specific applications of semantic technology that advance biodiversity science.
CartograTree links genotype, phenotype, and environmental data for forest trees in a web-accessible framework for association mapping analysis (Chapter 8). Chapter 9 describes a large-scale study of prokaryotic phenotypes using Natural Language Processing (MicroPIE) and MicrO, an ontology of prokaryotic phenotypic terms. Chapter 10 describes the use of semantic reasoning to compare phenotypes across species and disciplines using examples from medical and agricultural applications. The Phenoscape KB, discussed in the last chapter, is a semantic knowledgebase that uses machine reasoning to link phenotypes to their candidate genes. The Phenoscape KB has been used to build innovative applications that apply to diverse species.

2. Brief description of semantic technology

Semantic technology is a two-decade-old subfield of computer science that can be difficult to define. This book focuses on the aspects of the field that are most relevant to biodiversity science, i.e., ways to encode meaning so that data can be efficiently shared, discovered, integrated, and reused. These technologies form the basis of the Semantic Web, which is described at a very basic level by Tim Berners-Lee's rules (https://www.w3.org/DesignIssues/LinkedData.html).

Tim Berners-Lee's Rules for the Semantic Web
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using standards.
4. Include links to other URIs so that they can discover more things.
The backbone of semantic technology is the URI (Uniform Resource Identifier), which acts as a persistent, unique identifier for any object or concept one wishes to identify. (URIs are discussed in Chapters 2 and 3.) The basic logic behind the need for semantic technology is that human language is ambiguous. We use the same word to mean different things, we use different words to mean the same thing, and the meanings of words change over time. Despite many amazing advances in algorithms designed to help computers read human language, machines still cannot cope with this ambiguity. A URI is unique, persistent, and unambiguous, which makes it perfect for computers, although URIs can be very ugly for humans to look at.

URI for "color": http://purl.obolibrary.org/obo/PATO_0000014

When two different data sets, supplied by people who may or may not know about each other, use the same URI to name the same thing, a computer can automatically link the two without any extra work from a person. When this happens over and over again, the result is a massive network of interlinked data sets. One exciting example of this is the Linked Open Data cloud (LOD; discussed in Chapter 3). Preparing a data set to be part of LOD is non-trivial and requires a specific set of data curation tasks, which have a corresponding star rating (http://5stardata.info/en/).

One star: make your stuff available on the web (in any format) under an open license
Two stars: make it available as structured data (for example, as .csv instead of .pdf)
Three stars: make it available in a non-proprietary, open format (for example, .csv instead of .xls)
Four stars: use URIs to denote things, so that people can point at your stuff
Five stars: link your data to other data to provide context
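The automatic linking described above can be illustrated with a minimal sketch. It assumes two invented data sets whose field names (`trait`, `characteristic`) and records are hypothetical; only the PATO URI for "color" comes from the text. Records are paired whenever their identifier fields hold exactly the same URI, and a free-text label is left unlinked.

```python
from collections import defaultdict

# The PATO URI for "color" is quoted in the text; every record below is invented.
COLOR = "http://purl.obolibrary.org/obo/PATO_0000014"

# Two hypothetical data sets whose authors never coordinated their field names.
museum_records = [
    {"specimen": "bird-001", "trait": COLOR, "value": "red"},
]
field_survey = [
    {"observation": "obs-17", "characteristic": COLOR, "reading": "dark red"},
    {"observation": "obs-18", "characteristic": "colour (free text)", "reading": "blue"},
]


def link_by_uri(left, left_key, right, right_key):
    """Pair records from two data sets whose identifier fields hold the same URI."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]


links = link_by_uri(museum_records, "trait", field_survey, "characteristic")
for museum_row, survey_row in links:
    print(museum_row["specimen"], "<->", survey_row["observation"])
```

The free-text value "colour (free text)" is never matched, which is the point of the four-star and five-star steps: exact URI equality replaces string cleaning and guesswork.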
When those URIs are part of a type of structured vocabulary called an ontology, computers can engage in reasoning. An ontology is a dictionary for a computer. (Specific ontologies are discussed in Part III.) It differs from a vocabulary in that, in addition to defining terms, it relates terms to each other. For example, an ontology about vertebrate anatomy would define the terms limb, forelimb, and hand, in addition to asserting that a forelimb is a type of limb and that a hand is part of a forelimb. Once ontologies are in place, a computer can perform reasoning by applying a piece of software called a "reasoner". For example, if an ontology states that (1) birds have feathers and (2) a robin is a bird, then a reasoner can infer that (3) a robin has feathers. A very common piece of software used to build and manage ontologies is Protégé (http://protege.stanford.edu/), which comes with built-in reasoners. In order to apply reasoning, data sets must be expressed in RDF or OWL, standards approved by the W3C that model knowledge and allow users to make machine-readable assertions, such as "birds have feathers". There are several examples of data in the RDF and OWL formats throughout this book (Chapters 3 and 6). Once the data are in RDF or OWL, they can be queried using a language called SPARQL. Again, there are several examples of SPARQL queries throughout this book (Chapters 2 and 6). A good resource for learning more about modeling data in RDF or OWL and using SPARQL is the book Semantic Web for the Working Ontologist [1].

3. Brief description of biodiversity science

Biodiversity science is a centuries-old study of organisms that includes fields such as taxonomy, systematics, demography, metagenomics, biogeography, and phylogenetics. Many additional natural science disciplines depend on biodiversity data, such as ecology, biogeochemistry, agricultural science, and population biology. The scale of biodiversity data is large and relies on equally large collections of specimens and data. There are billions of specimens in museums around the world [2]. Biodiversity databases like GBIF contain hundreds of millions of data points, and the Biodiversity Heritage Library has digitized tens of millions of pages.

The way biodiversity data are created today is still very similar to the way they have been created for centuries, when the state of the art was a published book with text, images, and data tables: humans go out into the world and make observations or do experiments in the laboratory. More recently, this work has been automated with in situ sensors or laboratory equipment. Data are now digitized in spreadsheets, databases, and text files, which allow us to automate complex queries, calculations, and figure generation. These technologies were broadly accepted across the biodiversity and natural science communities decades ago and are routinely used in place of manual calculations and figure drawing. Within the past decade we have seen the beginnings of the use of semantic technology in biodiversity science, enabled by the development and maturation of applicable data infrastructure, such as standards like Darwin Core [3] and collections of ontologies like those at the OBO Foundry [4].
The system for describing and naming taxa, which forms the basis for much of biodiversity data management, was developed centuries ago in Europe and is difficult to change. (This system is also known as Linnaean taxonomy or Linnaean nomenclature.) As a result, much of biodiversity science is not utilizing the best available technology. This is partly due to significant resistance from practitioners, but taxonomic nomenclature was also not designed for computation and, by nature, cannot be fully automated. (Chapter 3 discusses semantic applications in nomenclature.) In addition, taxonomy is still very active, and current knowledge is being revised while data and methods from 200 years ago must also be considered. The past several decades have seen a major push to digitize and make available the wealth of data that have been created, with the goal of making them usable within and outside of biodiversity science. Databases such as the Global Biodiversity Information Facility (GBIF), the Encyclopedia of Life (EOL), the Biodiversity Heritage Library (BHL), and GenBank all contain important pieces of the biodiversity puzzle [5].

Biodiversity science has a centuries-old research culture that does not readily translate to modern computing methods, but several efforts to digitize data and modernize methods are facing the challenge. The current "biodiversity landscape" has been mapped and includes data repositories, internet aggregators, nomenclatural authorities, and publishers [5]. Each of these elements holds an important piece of biodiversity data, and the pieces can be linked in part because of the work of Biodiversity Information Standards (TDWG; http://www.tdwg.org/), a standards body that has developed, approved, and maintained several biodiversity data standards. (The Darwin Core data standard is discussed in Chapter 2.) In addition, a "biodiversity knowledge graph" has been proposed that links taxa, taxonomic names, publications, people, species, sequences, images, and collections [6,7]. The first real implementation of the biodiversity knowledge graph, although partial, is OpenBiodiv-O, an ontology that aligns scholarly publishing and biological taxonomy [8]. The complete biodiversity knowledge graph, as proposed, has yet to be fully implemented, likely due to the social difficulties of coordinating so many projects and people and the technical difficulty of assigning identifiers to the enormous body of existing research products [9–11].
Partial implementations, achieved by linking identifiers across data repositories, do exist [6,12], but many internet aggregators, such as the Encyclopedia of Life, link data based on taxonomic names [13]. Despite this progress, many social and technical challenges remain.

4. Conclusion

This book has two important missions. The first is to introduce biodiversity scientists, and the community of scientists that use biodiversity data, to the utility of semantic technology and the importance of adopting good data practices to facilitate its use. If research practitioners see the increased analytical potential of their data when using semantic technology, they may stimulate a shift in the incentive structure that binds many academic researchers to an old-fashioned reward system that neglects data stewardship. The second is to introduce computer scientists working on the development of semantic technology to this application of their work and to familiarize them with the technology needs of the biodiversity community. I hope to convince a wide audience that these technologies can take biodiversity research to the next level by increasing the scale and scope of studies and thus further stimulate the ongoing "data revolution".

5. Acknowledgements

The NSF Phenotype Research Coordination Network (NSF 0956049) was instrumental in helping develop the use of semantics in the biodiversity sciences and in bringing together many of the collaborators across these chapters.

6. References

[1] Allemang D, Hendler J (2011) Semantic Web for the Working Ontologist. 2nd Edition. Elsevier. 354 pp.
[2] Vollmar A, Macklin JA, Ford L (2010) Natural history specimen digitization: Challenges and concerns. Biodiversity Informatics 7(2):93-112. https://doi.org/10.17161/bi.v7i2.3992
[3] Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, Robertson T, Vieglais D (2012) Darwin Core: An evolving community-developed biodiversity data standard. PLoS ONE 7(1):e29715. https://doi.org/10.1371/journal.pone.0029715
[4] Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ, The OBI Consortium, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone S-A, Scheuermann RH, Shah N, Whetzel PL, Lewis S (2007) The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration. Nature Biotech 25:1251-1255. https://doi.org/10.1038/nbt1346
[5] Bingham HC, Doudin M, Weatherdon LV, Despot-Belmonte K, Wetzel FT, Groom Q, Lewis E, Regan E, Appeltans W, Güntsch A, Mergen P, Agosti D, Penev L, Hoffmann A, Saarenmaa H, Geller G, Kim K, Kim H, Archambeau A-S, Häuser C, Schmeller DS, Geijzendorffer I, García Camacho A, Guerra C, Robertson T, Runnel V, Valland N, Martin CS (2017) The biodiversity landscape: Elements, connections and opportunities. RIO 3:e14059. https://doi.org/10.3897/rio.3.e14059
[6] Page R (2016) Towards a biodiversity knowledge graph. RIO 2:e8767. https://doi.org/10.3897/rio.2.e8767
[7] Senderov V, Penev L (2016) The open biodiversity knowledge management system in scholarly publishing. RIO 2:e7757. https://doi.org/10.3897/rio.2.e7757
[8] Senderov V, Simov K, Franz N, Stoev P, Catapano T, Agosti D, Sautter G, Morris RA, Penev L (2018) OpenBiodiv-O: ontology of the OpenBiodiv knowledge management system. J Biomed Semant 2018(9):5. https://doi.org/10.1186/s13326-017-0174-5
[9] Güntsch A, Hyam R, Hagedorn G, Chagnoux S, Röpert D, Casino A, Droege G, Glöckler F, Gödderz K, Groom Q, Hoffmann J, Holleman A, Kempa M, Koivula H, Marhold K, Nicolson N, Smith VS, Triebel D (2017) Actionable, long-term stable and semantic web compatible identifiers for access to biological collection objects. Database 2017:bax003. https://doi.org/10.1093/database/bax003
[10] McMurry JA, Juty N, Blomberg N, Burdett T, Conlin T, Conte N, Courtot M, Deck J, Dumontier M, Fellows DK, Gonzalez-Beltran A, Gormanns P, Grethe J, Hastings J, Hériché J-K, Hermjakob H, Ison JC, Jimenez RC, Jupp S, Kunze J, Laibe C, Le Novère N, Malone J, Martin MJ, McEntyre JR, Morris C, Muilu J, Müller W, Rocca-Serra P, Sansone S-A, Sariyar M, Snoep JL, Soiland-Reyes S, Stanford NJ, Swainston N, Washington N, Williams AR, Wimalaratne SM, Winfree LM, Wolstencroft K, Goble C, Mungall CJ, Haendel MA, Parkinson H (2017) Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol 15(6):e2001414. https://doi.org/10.1371/journal.pbio.2001414
[11] Page RDM (2008) Biodiversity informatics: The challenge of linking data and the role of shared identifiers. Brief Bioinform 9(5):345-354. https://doi.org/10.1093/bib/bbn022
[12] Page RDM (2013) BioNames: Linking taxonomy, texts, and trees. PeerJ 1:e190. https://doi.org/10.7717/peerj.190
[13] Parr CS, Wilson N, Leary P, Schulz KS, Lans K, Walley L, Hammock JA, Goddard A, Rice J, Studer M, Holmes JTG, Corrigan Jr. RJ (2014) The Encyclopedia of Life v2: Providing global access to knowledge about life on Earth. Biodivers Data J 2014(2):e1079. https://doi.org/10.3897/BDJ.2.e1079
Part II Linked Biodiversity Data
Chapter 2
Darwin Core as a Vocabulary for Expressing Biodiversity Data as RDF

Steven J. BASKAUF (a) and Joel SACHS (b)
(a) Vanderbilt University Department of Biological Sciences, Nashville, Tennessee, USA
(b) Agriculture and Agri-Food Canada, Ottawa, Ontario, Canada
Darwin Core is a glossary of terms intended to facilitate the sharing of biodiversity occurrence data and related information, such as specimen metadata, checklists, determination histories, and taxonomy. It is widely used by museums and herbaria to share specimen data with aggregators through fielded text files. This chapter discusses efforts to use Darwin Core terms to express these data as RDF. It describes the history and motivation of the Darwin Core RDF Guide, presents some of the difficulties involved in using Darwin Core in RDF, and gives examples that demonstrate the challenges that remain for the community to address.

1. Background

In 2005, Biodiversity Information Standards (TDWG) embarked on a mission to establish an umbrella architecture for TDWG standards. The goal of this effort was to enable data integration among providers of heterogeneous datasets in cases where the ultimate use and users were unknown [1]. By 2007, architecture development was focused on the creation of a high-level OWL ontology, the development of an XML-based exchange protocol (TAPIR [2]), and the adoption of Life Science Identifiers (LSIDs), a globally unique identifier system [3]. Within several years, it became apparent that this approach was not working. Developing a TDWG ontology was too complex and time-consuming, consensus was elusive, data transfer by the TAPIR protocol was too slow, and implementation of LSIDs carried too heavy a technical burden for them to be widely adopted.
By 2009, a simpler system had evolved to allow effective data transmission. The Darwin Core (DwC) Standard [4] provided a vocabulary of terms that could be used to describe occurrences and taxa, the core resources of interest to TDWG. Darwin Core Archives, a system for transmitting records via compressed delimited text files, proved highly efficient [5]. A consensus resolution to the problem of globally unique identifiers was not attained, and providers used ad hoc solutions such as "Darwin Core Triplets" or UUIDs. Although this system facilitated automated harvesting of provider data, it did not satisfy the original goal of enabling integration of heterogeneous data.
From 2009 to 2011, increased interest in Linked Open Data [6] within TDWG resulted in the suggestion that a system for machine-mediated data integration might be built around Darwin Core using standard Linked Data technologies. In 2011, TDWG chartered an RDF/OWL Task Group to explore best practices for the use of RDF in the biodiversity informatics realm. It quickly became clear that the semantics of Darwin Core, while sufficient for representing data in spreadsheets, were too ambiguous for the semantic web, where a human user cannot be counted on to interpret the intended meaning from the surrounding context. Some of the issues related to distinguishing between resources and descriptions of resources, a problem common to all domains migrating to Linked Data. Others were particular to biodiversity informatics and its historical grounding in natural history collections. Workshops were held in 2013 and 2014 to address these issues, and sufficient progress was made to enable the Task Group to deliver a Darwin Core RDF Guide, which was adopted as an addition to the Darwin Core standard [7].
Figure 1. An occurrence record expressed as “Simple Darwin Core”.
2. How the RDF guide makes Darwin Core usable in Linked Data

The simplest way to represent metadata as Darwin Core is in the form of a single spreadsheet or CSV table. Figure 1 shows an imaginary occurrence record for red oak. Each column header of the table contains a Darwin Core property from the namespace http://rs.tdwg.org/dwc/terms/ (commonly abbreviated dwc:), and the cells of the table contain the values of those properties. A particular row of the table contains information about a particular instance of an occurrence. Since the column headers contain abbreviated IRIs (CURIEs [8]), it would be a straightforward matter to convert the data in this table directly into an RDF graph. Graphically, the metadata for the row would look like Figure 2.
Figure 2. The occurrence record of Figure 1 converted directly into RDF.
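The direct conversion described above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than the procedure used by the authors: the table is an invented stand-in for Figure 1 (the assignment of values to columns is a guess, and `dwc:individualCount` is assumed as the home of the value "1"), the escaping is minimal, and the output is serialized as N-Triples. Each row becomes a blank node, each column header is expanded from a CURIE to a full IRI, and every cell becomes a literal.

```python
import csv
import io

DWC = "http://rs.tdwg.org/dwc/terms/"

# Invented stand-in for the table in Figure 1. The dwc: namespace and the sample
# values appear in the text; which column holds which value is an assumption.
TABLE = """dwc:occurrenceID,dwc:recordedBy,dwc:county,dwc:stateProvince,dwc:individualCount
apsc:plants:02346,Tim Robertson,Williamson,Tennessee,1
"""


def expand(curie):
    """Expand a dwc: CURIE from a column header into a full IRI."""
    prefix, local = curie.split(":", 1)
    if prefix != "dwc":
        raise ValueError(f"unknown prefix: {prefix}")
    return DWC + local


def escape(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(table_text):
    """Emit one blank-node subject per row and one literal-valued triple per cell."""
    lines = []
    for n, row in enumerate(csv.DictReader(io.StringIO(table_text))):
        subject = f"_:row{n}"  # the occurrence itself is unnamed: a blank node
        for header, value in row.items():
            if value:
                lines.append(f'{subject} <{expand(header)}> "{escape(value)}" .')
    return lines


triples = rows_to_ntriples(TABLE)
print("\n".join(triples))
```

Every triple shares the same blank-node subject and ends in a literal, so the result is a flat, star-shaped graph in which no node is linked to anything beyond the record itself.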
Figure 3. The occurrence record of Figure 1 converted into RDF in the spirit of Linked Data.
From the diagram in Figure 2, we can easily see several features of the resulting RDF graph. First, the occurrence instance itself is not named with an IRI; it is a blank node. Second, the graph is very "flat", meaning that the occurrence node has no more than a single edge connecting it to any other entity. Third, the object resource of every triple is a literal¹ (indicated by a rectangle). In some cases, the object resource is best represented by a literal, such as the identifier "apsc:plants:02346" or the number "1". But many of the resources would be better represented by non-literal objects, since they represent real-world entities (place, person) or abstract concepts (a taxon) to which URIs have been (or could be) assigned.

¹ In RDF, the object of a triple can be a blank node, an IRI, or a literal. Examples of literals are "19", "July 22, 2011", "Tim Robertson", and "London England". If the goal is data integration, best practice is to use IRIs where possible, and to use literals only for representing things like numbers and dates. Because RDF does not permit literals as subjects of triples, literals are always sinks in an RDF graph (i.e., they are nodes with no outgoing edges).

The graph in Figure 3 differs from Figure 2 in several significant ways. The focal occurrence instance is identified by an IRI. For triples where the subject is the occurrence instance and the object is a non-literal resource (ovals), the object is denoted by an IRI or by a blank node identifier. In many cases, these non-literal resources are themselves the subjects of additional triples that describe their properties. As a result, the graph is not "flat", but rather is a network. In the spirit of Linked Data, many of the linked non-literal objects are denoted by IRIs minted by authorities other than the source that created the occurrence record.

The primary purpose of the Darwin Core RDF Guide was to facilitate the transition from flat, literal-object-based RDF that might be directly exported from a relational database or spreadsheet to RDF that describes a more complex graph containing links to resources described more fully by other providers. That goal was achieved by establishing conventions for handling several idiosyncrasies that result from the fact that Darwin Core was designed with flat tables in mind.

Listed under each class in Darwin Core, there is a term whose local name ends in "ID" and whose purpose is to indicate an identifier for an instance of that class (e.g., dwc:occurrenceID in Figures 1 and 2 [9]). The ID terms can be used as either foreign keys or primary keys in Darwin Core records. For example, in an occurrence record, the occurrenceID field would be used to provide an identity for the record itself, while the identificationID field would be used to specify an identification record related to the occurrence. In RDF, however, triples should not depend on context for their semantics, so the RDF Guide specifies that the well-known term dcterms:identifier should be used to indicate the identifier associated with the subject resource, and that the standard term rdf:type should be used to link to the IRI of the class of which the subject is an instance (Fig. 3). This is standard practice in the Linked Data community.

Term definitions in generic Darwin Core are not specific about what kind of literal should be used to denote objects like people, places, and taxa.
The value might be a name, an identifier (including an IRI), or a string representing a controlled vocabulary term. For each instance, an aggregating client would probably be required to do some sort of string cleaning and matching with a list of known values in order to determine whether the object resource is already a known entity. There is also the possibility that the value is not unique to a particular resource. This places a large burden on the consumer. For example, in Figure 2, the value of dwc:recordedBy could denote any of several persons named “Tim Robertson” and would not match
with other records that might refer to the same person but use different forms of his name. The RDF Guide allows for terms in the generic dwc: namespace whose values denote non-literal resources to be used with literals in this way. However, the Guide creates alternative analogs of those terms whose values are expected to be IRIs. Those analogs have the same local name as the generic terms, but are placed in the namespace http://rs.tdwg.org/dwc/iri/ (commonly abbreviated as dwciri:). The burden on the aggregating client discovering a triple containing a dwciri: predicate should be less than for a triple containing a predicate from the dwc: namespace, since the value of a dwciri: predicate should be a globally unique IRI. If that IRI is dereferenceable, the client may be able to acquire additional information about the object resource that will make its identity clear, and which may also lead the client to discover useful related information. In Figure 3, the value of dwciri:recordedBy is an ORCID ID that is unique and unambiguous, and that will provide additional information when dereferenced. If a data provider or aggregator makes the effort to replace an ambiguous dwc: term literal value with an unambiguous dwciri: value, that effort need only be made once, rather than requiring the disambiguation to be done by every user.
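The difference in consumer burden can be illustrated with a small sketch. It is not taken from the RDF Guide: the blank node labels, the name variants, and the example.org person IRIs are invented for illustration (a real provider might use ORCID iDs instead), and triples are modeled as plain Python tuples rather than with an RDF library.

```python
# Minimal sketch: grouping occurrence records by collector.
# All subjects and person IRIs below are hypothetical placeholders.
DWC = "http://rs.tdwg.org/dwc/terms/"
DWCIRI = "http://rs.tdwg.org/dwc/iri/"

# Literal objects: two spellings for one person, one spelling shared by two people.
literal_triples = [
    ("_:occ1", DWC + "recordedBy", "Tim Robertson"),
    ("_:occ2", DWC + "recordedBy", "Robertson, T."),
    ("_:occ3", DWC + "recordedBy", "Tim Robertson"),
]

# IRI objects: occ1 and occ2 share a collector, occ3 has a different one.
iri_triples = [
    ("_:occ1", DWCIRI + "recordedBy", "http://example.org/person/1"),
    ("_:occ2", DWCIRI + "recordedBy", "http://example.org/person/1"),
    ("_:occ3", DWCIRI + "recordedBy", "http://example.org/person/2"),
]


def group_by_object(triples, predicate):
    """Map each object value of `predicate` to the sorted list of its subjects."""
    groups = {}
    for subject, pred, obj in triples:
        if pred == predicate:
            groups.setdefault(obj, []).append(subject)
    return {obj: sorted(subjects) for obj, subjects in groups.items()}


by_name = group_by_object(literal_triples, DWC + "recordedBy")
by_iri = group_by_object(iri_triples, DWCIRI + "recordedBy")

print(by_name)  # exact string matching merges occ1 with occ3 and isolates occ2
print(by_iri)   # IRI equality groups occ1 with occ2 and keeps occ3 separate
```

With literals, the consumer would need extra cleaning and matching logic to recover the grouping that simple IRI equality gives directly.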
Darwin Core defines a number of sets of terms that describe a hierarchy of literal values that, as a set, provide an unambiguous reference to some resource. For example, terms such as dwc:county, dwc:stateProvince, dwc:country, and dwc:continent can be used to describe the geographic hierarchy into which a location falls (Fig. 2). Such a hierarchy of terms is convenient for searching (hence the name “convenience terms”), but using those terms requires defining the hierarchy in every record of the dataset. The RDF Guide provides several new terms that facilitate linking to the IRI of the lowest level of a standardized hierarchy. For example, in Figure 3, the term dwciri:inDescribedPlace links to the IRI for Williamson County, Tennessee. Using this IRI eliminates ambiguity about which Williamson County is denoted and by dereferencing the IRI, a user can discover that Tennessee is part of the United States and that the United States is part of North America without having to repeat that information in every record.
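How a single link can stand in for the whole set of convenience terms is sketched below. Dereferencing is simulated with an in-memory dictionary; the example.org IRIs, the `PLACES` table, and the `part_of` key are assumptions made for illustration and do not correspond to any real gazetteer or vocabulary.

```python
# Minimal sketch: recovering a geographic hierarchy from one place IRI.
# PLACES simulates what dereferencing each (hypothetical) IRI might return.
EX = "http://example.org/place/"

PLACES = {
    EX + "williamson-tn": {"label": "Williamson County", "part_of": EX + "tennessee"},
    EX + "williamson-tx": {"label": "Williamson County", "part_of": EX + "texas"},
    EX + "tennessee": {"label": "Tennessee", "part_of": EX + "united-states"},
    EX + "texas": {"label": "Texas", "part_of": EX + "united-states"},
    EX + "united-states": {"label": "United States", "part_of": EX + "north-america"},
    EX + "north-america": {"label": "North America", "part_of": None},
}


def place_hierarchy(iri, places=PLACES):
    """Return labels from the given place up to the top of the hierarchy."""
    labels = []
    seen = set()  # guards against accidental cycles in the data
    while iri is not None and iri not in seen:
        seen.add(iri)
        place = places[iri]
        labels.append(place["label"])
        iri = place["part_of"]
    return labels


# The record stores only the lowest-level place, not the whole hierarchy.
record = {"http://rs.tdwg.org/dwc/iri/inDescribedPlace": EX + "williamson-tn"}

hierarchy = place_hierarchy(record["http://rs.tdwg.org/dwc/iri/inDescribedPlace"])
print(hierarchy)
```

The two counties share a label but have distinct IRIs, so the record is unambiguous even though it repeats none of the higher-level information.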
3. Barriers to implementation
3.1. Lack of IRIs
In addition to the difference in complexity between the graphs represented in Figures 1 and 2, another obvious difference is that the focal resource in Figure 1 is a blank node, whereas in Figure 2 that resource is identified with an HTTP IRI. A key tenet of the Linked Data paradigm is that resources should be identified by HTTP IRIs. But who should be responsible for minting those IRIs? One approach is to leave the minting to some centralized authority. The ARCTOS database [10] mints globally unique IRIs for all specimens from collections whose data it aggregates based on the specimen’s Darwin Core Triple. For example, the IRI http://arctos.database.museum/guid/MVZ:Mamm:165861 dereferences to a record from the Museum of Vertebrate Zoology. Another approach is to delegate the responsibility of minting IRIs to the institution holding the described resource. This approach was taken in the Stable Identifiers initiative of the Consortium of European Taxonomic Facilities (CETAF) [11]. The initiative describes best practices for identifier creation, but leaves it up to participating institutions to implement those practices. If an institution does not mint its own HTTP IRI identifiers or participate in an aggregation effort that mints IRIs for it, then there will be no well-known IRI for the resources described in that institution’s records. Of course, anyone could “make up” an IRI (as we did in Fig. 3), but such an IRI would not dereference, nor would it correspond to another ad hoc IRI for the same resource minted by someone else. This lack of HTTP IRI identifiers for core resources in the biodiversity informatics realm is a serious impediment to implementation of Linked Data principles. Given that there are many possible technical solutions for the problem of generating stable HTTP IRIs, this barrier is social, not technological.
3.2. Inconsistent graph models
RDF is fundamentally a system for expressing data as a graph. So although several providers may agree to follow the Darwin Core RDF Guide, there is no guarantee that the graphs they create could be
Figure 4. Graphical representation of some RDF triples related to the occurrence record http://coldb.mnhn.fr/catalognumber/mnhn/p/p00084058 from the Muséum national d’Histoire naturelle, Paris.
productively merged if they based their RDF on different graph models. Figures 4 and 5 illustrate RDF graphs of two specimen-related occurrences from different providers. Although both graphs provide essentially the same metadata about the occurrences, the graphs are radically different in their degree of complexity. In Figure 4, the properties that describe the collector name, the country in which the collection was made, and the scientific name are all linked directly to the occurrence record. Although it is not explicitly declared to be an instance of dwc:PreservedSpecimen, it is clear by the dcterms:type declaration of dctype:PhysicalObject that the data providers consider the occurrence to be the same entity as the specimen. (It should be noted that this usage is not consistent with the definition of dwc:Occurrence - An existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time.) In contrast, in Figure 5, none of these metadata are linked directly to the occurrence. Rather, the collector name is linked to a foaf:Person instance, the country code is linked to a dcterms:Location instance, and the scientific name is linked to a dcterms:Location instance. Additionally, in Figure 5 the actual
Figure 5. Graphical representation of some RDF triples related to the occurrence record http://bioimages.vanderbilt.edu/specimen/ncu592813#occ from University of North Carolina Herbarium.
preserved specimen is considered a distinct resource that provides evidence for the occurrence instance, but which is not considered equivalent to it. There is nothing that would prevent merging these two graphs, but constructing a simple query that would work for both datasets would be difficult. For example, using the graph model of Figure 4, finding occurrences from Algeria could be accomplished with this query:
    SELECT DISTINCT ?occurrence
    WHERE {
      ?occurrence dwc:countryCode "DZ" .
    }
Using the graph model of Figure 5, finding occurrences from the United States could be done with this query:
    SELECT DISTINCT ?occurrence
    WHERE {
      ?occurrence dsw:atEvent ?event .
      ?event dsw:locatedAt ?location .
      ?location dwc:countryCode "US" .
    }
At first glance, the graph model of Figure 5 seems unnecessarily complex. However, the dataset of which it is a part is designed to allow for multiple determinations per organism, multiple occurrences per organism, multiple forms of evidence serving as evidence for a single occurrence, multiple events at one location, and multiple occurrences at one event. None of these many-to-one relationships can be easily described using the model of Figure 4. The main point here is that the graph model that is right in a particular circumstance is one that is just complex enough to satisfy the use cases that are important to users of that model. Thus, it is not really possible to say what the “right” graph model is without first determining who the potential users are, and what use cases are important to those users.
3.3. Variety of datasets represented in the wild
We can get some sense of the scope of use cases that interest members of the biodiversity informatics community by examining the structure of a variety of publicly available datasets. Clearly, there are many preserved specimen datasets represented in a single table that can be modeled simply like the example in Figure 4. However, the structure of a Darwin Core Archive allows for a core table to be linked to multiple extension tables in an organizational pattern that has been called a “star schema” [12]. If represented as an RDF graph, such a pattern would be composed of a node of a central resource class linked by a single edge to nodes representing one or more additional classes (Fig. 6).
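The star pattern just described can be sketched in a few lines. This is only an illustration under stated assumptions: the rows, the example.org namespace, and the generated predicate names are hypothetical, and each extension row is assumed to carry the identifier of its core row in a column here called `coreid`.

```python
# Minimal sketch: turning a core table and its extension tables into the
# edges of a star-shaped graph. All identifiers and predicates are made up.
BASE = "http://example.org/"

core = {"class": "Occurrence", "rows": [{"id": "occ1"}, {"id": "occ2"}]}

extensions = {
    "Identification": [
        {"coreid": "occ1", "id": "det1"},
        {"coreid": "occ1", "id": "det2"},
    ],
    "StillImage": [{"coreid": "occ2", "id": "img1"}],
}


def star_edges(core, extensions):
    """Return (core IRI, predicate, extension IRI) triples, one per extension row."""
    core_ids = {row["id"] for row in core["rows"]}
    edges = []
    for ext_class, rows in extensions.items():
        predicate = BASE + "has" + ext_class  # hypothetical predicate name
        for row in rows:
            if row["coreid"] not in core_ids:
                raise ValueError(f"extension row {row['id']} has no core row")
            edges.append((BASE + row["coreid"], predicate, BASE + row["id"]))
    return edges


edges = star_edges(core, extensions)
for edge in edges:
    print(edge)
```

Every extension node ends up exactly one edge away from a core node. A dataset that needs a link between two extension records, or a chain of several resources, cannot be expressed in this shape.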
Although the core table frequently contains occurrence records linked to extension tables of images or identifications, we can also find Darwin Core archives with other structures:
• core table: event, extension table: occurrence [13]
• core table: taxon, extension tables: occurrence, specimen, distribution, reference, and description [14]
• core table: organism, extension tables: still image and identification [15]
Figure 6. Graph modeling the “star schema” pattern of Darwin Core core and extension tables.
• core table: preserved specimen, extension tables: PCR amplification, loan, material sample, permit, and preparation [16]
There are other more complex examples of datasets where it is not possible to use the Darwin Core Archive format because there is no single, central resource node of one type that can be related to all related resource nodes via a single edge. A more complicated graph model than the “star schema” model would be required to convert such datasets to RDF [17]. If TDWG’s goal is the same as it was in 2007 (to enable data integration among providers of heterogeneous datasets), then development of a graph model complex enough to handle the kinds of examples listed above is a necessary step to follow the Darwin Core RDF Guide. There have been attempts to construct more complex graph models to express biodiversity data as RDF, such as taxonconcept.org [18], Darwin-SW [19], and Filtered Push’s Darwin FP [20]. However, in order for efforts such as these to succeed, there needs to be an organized effort to involve all interested potential stakeholders
and to develop a list of use cases to be satisfied by the graph model. The new TDWG Vocabulary Maintenance Specification [21] lays out a formal process for developing “vocabulary enhancements” to vocabulary standards such as Darwin Core. That process could be used to create a consensus graph model sufficiently complex to model the heterogeneous datasets in our community in a way that would allow them to be merged into a single, easily queried graph. As with the problem of lack of consensus IRIs, this is fundamentally a social challenge, not a technological one.
4. References
[1] Hyam R (2007) TDWG Technical Roadmap 2007. Biodiversity Information Standards Technical Architecture Group.
[2] De Giovanni R, Döring M, Güntsch A, Vieglais D, Hobern D, de la Torre J, Wieczorek J, Robert G, Hyam R, Blum S, Perry S (2010) Access Protocol for Information Retrieval (TAPIR), Version 1.0. Biodiversity Information Standards (TDWG). http://www.tdwg.org/standards/449
[3] Object Management Group (2004) Life Sciences Identifiers Specification: OMG Adopted Specification. http://www.omg.org/cgi-bin/doc?dtc/04-05-01
[4] Darwin Core Task Group (2009) Darwin Core. Biodiversity Information Standards (TDWG). http://rs.tdwg.org/dwc/ (accessed on 2017-03-08).
[5] Robertson T, Döring M, Wieczorek J, De Giovanni R, Vieglais D (2009) Darwin Core Text Guide. Biodiversity Information Standards (TDWG).
[6] http://linkeddata.org/
[7] Baskauf SJ, Wieczorek J, Deck J, Webb C, Morris PJ, Schildhauer M (2015) Darwin Core RDF Guide. Biodiversity Information Standards (TDWG). http://rs.tdwg.org/dwc/terms/guides/rdf/index.htm
[8] RDF-in-HTML Task Force (2010) CURIE Syntax 1.0. World Wide Web Consortium (W3C). https://www.w3.org/TR/curie/
[9] Wieczorek J, Döring M, De Giovanni R, Robertson T, Vieglais D (2009) Darwin Core Terms: A quick reference guide. Biodiversity Information Standards (TDWG). http://rs.tdwg.org/dwc/terms/
[10] http://arctosdb.org/
[11] http://cetaf.org/cetaf-stable-identifiers
[12] Remsen D, Braak K, Döring M, Robertson T (2010) Darwin Core Archives - How-to Guide, version 1. Global Biodiversity Information Facility (GBIF). http://www.gbif.jp/v2/pdf/gbif_dwc-a_how_to_guide_en_v1.pdf
[13] Groom QJ, Durkin JL, O’Reilly J, Mclay A, Richards AJ, Angel J, Horsley A, Rogers M, Young G (2015) A benchmark survey of the common plants of South Northumberland and Durham, United Kingdom. Biodivers Data Journal 3:e7318. http://dx.doi.org/10.3897/BDJ.3.e7318 Dataset: Botanical Garden Meise: A common plants survey of vascular plants in South Northumberland and Durham, United Kingdom. Accessed via http://www.gbif.org/dataset/5d784d06-fa1d-4f00-8cdc-663d04d26061 on 2016-10-26.
[14] Agricultural Research Council (2016) Catalogue of Afrotropical Bees. http://doi.org/10.15468/u9ezbh Accessed via http://www.gbif.org/dataset/da38f103-4410-43d1-b716-ea6b1b92bbac on 2016-10-26.
[15] Bioimages (2016) http://bioimages.vanderbilt.edu/. 2016-05-07 release http://dx.doi.org/10.5281/zenodo.51121. Data as a Darwin Core archive in GBIF: http://doi.org/10.15468/jib4rt published 2014-07-14.
[16] GGBN (2016) Test dataset from Smithsonian National Museum of Natural History based on Global Genome Biodiversity Network (GGBN) Data Standard. http://collections.nmnh.si.edu/ipt/archive.do?r=nmnh_materialsample_test (accessed 2016-11-06).
[17] Baskauf SJ (2016) Guid-O-Matic meets the DwC-A RDF octopuses. http://baskauf.blogspot.com/2016/11/guid-o-matic-meets-dwc-rdf-octopus.html
[18] https://web.archive.org/web/20160413214827/http://www.taxonconcept.org/
[19] Baskauf SJ, Webb CO (2016) Darwin-SW: Darwin Core-based terms for expressing biodiversity data as RDF. Semantic Web Journal 7:629-243. http://dx.doi.org/10.3233/SW-150203
[20] Morris RA, Dou L, Hanken J, Kelly M, Lowery DB, Ludäscher B, Macklin JA, Morris PJ (2013) Semantic Annotation of Mutable Data. PLoS ONE 8:e76093. http://dx.doi.org/10.1371/journal.pone.0076093
[21] Vocabulary Maintenance Specification Task Group (2017) Vocabulary Maintenance Specification. Biodiversity Information Standards (TDWG). http://www.tdwg.org/standards/642
Application of Semantic Technology in Biodiversity Science, Anne E. Thessen (Ed.), ISBN 978-3-89838-733-0, © 2018 AKA Verlag Berlin
Chapter 3
Linked Open Data Model for Taxonomic Information
Rathachai CHAWUTHAI (a), Utsugi JINBO (b), and Hideaki TAKEDA (c)
(a) King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand
(b) National Museum of Nature and Science, Tokyo, Japan
(c) National Institute of Informatics, Tokyo, Japan
Taxonomy (or biological taxonomy) is the science of classifying and naming the hierarchical groups of organisms based on shared characteristics. The lack of a single globally accepted taxonomy creates a challenge for representing taxonomic knowledge, including groups of organisms, their names, and associations between taxonomic names [1]. Pieces of taxonomic knowledge, especially scientific names and classifications, change over time due to new taxonomic discoveries. Since the mid-18th century, millions of articles containing taxonomic knowledge have been published following rules for describing taxonomic knowledge, called taxonomic nomenclatures; however, the formats are not suitable for preserving, querying, and accessing taxonomic knowledge over the long term. In order to organize taxonomic knowledge in the age of the Internet by connecting the change in knowledge from different times and communities, we have to consider data association, context integration, and knowledge exchange. This chapter reviews the progress of linking taxonomic databases together and the use of Linked Open Data (LOD) for enhancing the exchangeability among knowledge repositories. The lessons learned from the data models described in
R. Chawuthai et al. / Linked Open Data for Biodiversity
this chapter will provide guidelines for the development of a knowledge management system for taxonomy.
1. Taxonomic knowledge
The term taxonomic knowledge refers to information about taxonomy including the groups of organisms, taxonomic names, and associations between names. Groups of organisms and the hierarchical structures of groups are organized by taxonomists, and each group is named following an official nomenclature. The naming rules and formats are different for different taxonomic groups. Scientific names of animals are controlled by the International Code of Zoological Nomenclature, while those of plants and fungi depend on the International Code of Nomenclature for Algae, Fungi, and Plants. Scientific names and taxonomies are a global standard and are key to accessing knowledge about organisms; however, taxonomies and names are not stable. The changes in taxonomies and names are a significant challenge for the design and implementation of biodiversity knowledge management.
1.1. Taxonomy
Taxonomists place organismal groups into a hierarchical structure on the basis of their shared common characteristics. Each group is called a “taxon” (plural taxa) and all taxa must be named. Every level in a hierarchy is called a “rank”. The primary ranks are kingdom, phylum, class, order, family, genus, and species, in order from the broadest level to the narrowest. Prefixes “super” and “sub” can be added to a primary rank to create a more general or specific rank, respectively, such as superspecies and subspecies [2]. The classification system is improved as technology progresses. For example, before the development of gene sequencing technology, taxonomic classifications were made based on morphology and life history of specimens such as the presence of organs, sizes, colors, behaviors, etc. Now, genetic analyses have become a key factor for developing taxonomic classifications and phylogenetic trees, which, in some cases, are more accurate than those solely based on morphology.
1.2. Scientific name
The “scientific name” is used for naming taxa such as Animalia (animals), Plantae (plants), Fungi (fungi), etc. It is the essential metadata for accessing reference and descriptive information [3]. Each group in the whole taxonomic hierarchy must be named in Latin [3]. For example, Animalia, Plantae and Fungi are all scientific names of kingdoms; kingdom is the highest rank in the popular taxonomic hierarchy. A scientific name of any taxon above the rank of species contains one part, while a scientific name of a species has two parts: a generic name (genus name) and a specific name. For example, the scientific name of the tiger is Panthera tigris. Panthera tigris is a species currently under the genus named Panthera. The species name can be shortened to P. tigris in text. For animals, the name of a subspecies (a subgroup of species) contains three parts, namely, generic, specific and subspecific names. For example, Panthera tigris corbetti is the scientific name of the Indochinese tiger. It can be shortened to P. tigris corbetti as well. Moreover, many scientific names include an author’s name and year of publication, for example P. t. corbetti Mazák, 1968, which means that this name was introduced by Mazák in 1968.
1.3. Change in taxonomy
Taxonomic knowledge was not completely shared among communities of biologists. Due to the differing viewpoints of taxonomists, improvement of technology, and the gradual accumulation of observable evidence, many pieces of taxonomic knowledge change over time and differ between communities. Thus, the name and the classification of a single species can be assigned differently, with no one clear correct interpretation [3]. The following cases show examples of interesting nomenclatural changes in taxonomy.
The first example is very simple and common. The Chinese yellow swallowtail, named Papilio xuthus Linnaeus, 1767, has had several different scientific names: P. xuthulus Bremer, 1861; P. chinensis Neuburger, 1900; P. koxinga Fruhstorfer, 1908; and P. neoxuthus Fruhstorfer, 1908 [3]. Authors subsequent to Linnaeus may have examined specimens, recognized some differences, and concluded that they belonged to a new species that could be distinguished from all known species. However, the series of species are currently regarded as variations of Papilio xuthus. In this case, the name Papilio xuthus is used for the Chinese yellow swallowtail,
and the remaining names are treated as junior synonyms of Papilio xuthus.
The next case is more complex and shows some background to the name change of the snowy owl. A species name consists of two words, namely, a genus name and a specific name (“specific name” is a term in the zoological nomenclature; “specific epithet” is used in botany). A genus is a species group established by a taxonomic study following the nomenclature. In this example, two genera were lumped into a single genus, so all species under both genera had to be transferred into the newly accepted genus [2]. Two genera of owls, Bubo and Nyctea, were merged into the prior genus named Bubo. Following the change, the scientific name of the snowy owl, Nyctea scandiaca, was changed to Bubo scandiacus in order to satisfy the zoological nomenclature [4]. A generic name is usually a Latin noun and a specific name is an adjective. When a species is transferred from one genus to another, the ending of the species name is often changed to follow the gender of the generic name.
The third case demonstrates a complicated change in taxonomy: the reclassification of the Baltimore oriole (Icterus galbula Linnaeus, 1758) and Bullock’s oriole (I. bullockii Swainson, 1827). Due to the similarity of birds in both groups, in 1964, Sibley and Short decided to merge these species under the former name, I. galbula [5]. Thus, all birds of both species were called I. galbula, and the name I. bullockii was just a synonym (junior synonym) of I. galbula. In 1995, genetic analysis of both birds showed that they should not be classified into the same group, so I. galbula was split back into I. galbula and I. bullockii [6]. This case shows that the circumscription of I. galbula was changed two times, and this name was associated with I. bullockii for a few decades. Thus, a comprehensive study of I. bullockii should pay attention to the knowledge of I. galbula written between 1964 and 1995 as well.
Sometimes there are multiple opinions on the classification of the same living organisms. The genus Actias is a moth group including the Luna Moth. In Japan, two species, Actias aliena and A. gnoma, have been known, and there are also some taxa currently regarded as synonyms. Zolotuhin [7] recognized six species from Japan, while Kishida [8] claimed that only two species should be recognized. Consequently, two opinions on the taxonomy of Japanese Actias coexist; some people adopt one opinion and some adopt the other.
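The oriole case suggests that a name relationship needs a time span attached to it. The following is a minimal sketch of that idea, not a model proposed in this chapter: the `Synonymy` class, its field names, and the `literature_scope` helper are hypothetical, and only the junior-synonym direction is handled.

```python
# Minimal sketch: synonymy relationships with validity periods.
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Synonymy:
    junior: str          # name treated as a synonym
    senior: str          # name under which the taxon was placed
    start: int           # year the synonymy was proposed
    end: Optional[int]   # year it was reversed; None if still accepted


# The merge and later split of the two orioles, as described in the text.
HISTORY = [Synonymy("Icterus bullockii", "Icterus galbula", 1964, 1995)]


def literature_scope(
    name: str, history: List[Synonymy]
) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """List the names, with year ranges, under which `name` may have been treated."""
    scope = [(name, None, None)]  # the name itself, with no year restriction
    for synonymy in history:
        if synonymy.junior == name:
            scope.append((synonymy.senior, synonymy.start, synonymy.end))
    return scope


scope = literature_scope("Icterus bullockii", HISTORY)
print(scope)
```

A record like this keeps the 1964 to 1995 association queryable after the split, which a database storing only the current classification would lose.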
2. Requirements of taxonomic databases
Changes in taxonomy are a result of published taxonomic works. An individual change is described in an article [3]. A comprehensive collection of species names in a specific group found in a certain locality, called a species checklist, is a summary of previous taxonomic studies and relationships among scientific names, written in a standardized format. The following is an example of a statement quoted from a species checklist in a book entitled “Moths in Japan”.
    Psecadioides Butler, 1881, gen. rev.
    Luffiodes Matsumura, 1931, syn. n.
    705. aspersus Butler, 1881, Trans. ent. Soc. Lond. 1881: 591, comb. rev.
    apicalis (Matsumura, 1931), 6000 illust. Insects Japan-Empire: 1107, fig. (Luffiodes), syn. n.
The statement shows information on two generic names, Psecadioides and Luffiodes, and two species names, aspersus and apicalis (each generic and specific name is written in italic). This checklist provides the author and published year for each generic name, while providing the author, published year and reference for each specific name (the number 705 is an ID for each species in this checklist). The first two lines explain the relationship and taxonomic treatment of two genus names Psecadioides and Luffiodes, which are names for the same taxon, meaning that they are recognized as synonyms. As stated above, the oldest name has priority. Thus, the name Psecadioides is selected for this genus (a valid name in Zoological Nomenclature); Luffiodes is regarded as a junior synonym and not used for this taxon. Similarly, aspersus is selected as the name for this species and apicalis is regarded as a junior synonym of aspersus. The parentheses surrounding the author and publishing year for apicalis mean that this species was described under a genus other than Psecadioides. The genus name in the original work is shown in the parentheses following reference information. In this case, we can understand that apicalis was described by Matsumura in 1931, and the name in the original reference is Luffiodes apicalis. The last word of each line (gen. rev., syn. n., and/or comb. rev.) is the declaration of changes of taxonomic knowledge introduced by this checklist. This example can be interpreted as follows: 1) the genus Psecadioides had been recognized as an invalid name or ignored, but is newly recognized as a valid genus again (gen. rev.), 2) the genus
Luffiodes is newly recognized as a junior synonym of Psecadioides (syn. n.), 3) the species aspersus is transferred to the genus Psecadioides and the combination of genus and species is changed again (comb. rev.), and 4) the name apicalis is newly recognized as a junior synonym of aspersus (syn. n.). However, the checklist is just a snapshot of information at a single time. There is no relationship between the names in different versions of checklists. In other words, there is poor information for presenting the change in taxonomy over time. In the example above, we can see that apicalis was described as a member of Luffiodes and then transferred to Psecadioides, but we cannot identify the reference that transferred apicalis from Luffiodes to Psecadioides. Some checklists provide the full history of a name (synonymic list), but some only provide a fraction of such taxonomic change. Taxonomic databases need to find a way to preserve the change in taxonomy over time and make data globally accessible. Thus, the points about data structure, ability to publish online, identification, data exchange, and data provenance should be considered in the design of a taxonomic database [9].
2.1. Machine-readable taxonomic data
Changes in taxonomy are described in many documents following traditional ways as shown above. Bringing together taxonomic information requires human effort to read and associate all pieces of knowledge such as finding relationships among scientific names. Employing a computer system can be helpful, but human-readable texts are not machine-readable. A computer needs to have machine-interpretable, structured data to be helpful. In this case, data can be stored using CSV (Comma-Separated Values), XML (Extensible Markup Language), JSON (JavaScript Object Notation), or an RDB (Relational Database), because these formats allow for an attribute-value system that is a basic knowledge representation framework.
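As an illustration of such an attribute-value representation, the checklist statement quoted earlier could be captured roughly as follows. This is a sketch only: the field names (`status`, `act`, `synonym_of`, and so on) and the `valid_name` helper are invented here and do not follow any published schema.

```python
# Minimal sketch: the "Moths in Japan" checklist entry as attribute-value records.
import json

checklist = [
    {"name": "Psecadioides", "rank": "genus", "author": "Butler", "year": 1881,
     "status": "valid", "act": "gen. rev.", "synonym_of": None},
    {"name": "Luffiodes", "rank": "genus", "author": "Matsumura", "year": 1931,
     "status": "junior synonym", "act": "syn. n.", "synonym_of": "Psecadioides"},
    {"name": "aspersus", "rank": "species", "checklist_id": 705,
     "author": "Butler", "year": 1881,
     "status": "valid", "act": "comb. rev.", "synonym_of": None},
    {"name": "apicalis", "rank": "species", "author": "Matsumura", "year": 1931,
     "original_genus": "Luffiodes",
     "status": "junior synonym", "act": "syn. n.", "synonym_of": "aspersus"},
]


def valid_name(name, records):
    """Follow synonym_of links until a record without one is reached."""
    by_name = {record["name"]: record for record in records}
    seen = set()
    while by_name[name]["synonym_of"] is not None:
        if name in seen:
            raise ValueError("cyclic synonymy")
        seen.add(name)
        name = by_name[name]["synonym_of"]
    return name


# Round-trip through JSON to show the records survive serialization unchanged.
serialized = json.dumps(checklist, indent=2)
restored = json.loads(serialized)

print(valid_name("apicalis", restored))
print(valid_name("Luffiodes", restored))
```

Once the statement is structured this way, a question such as "which name is valid for apicalis?" becomes a simple lookup instead of a reading task for a specialist.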
Many taxonomic publications can only be understood and interpreted by taxonomists, even though there are some formats for taxonomic documents. A structured data format might facilitate the understanding of contents in taxonomic documents by non-taxonomists. The first step is the digitization of taxonomic documents. For animals, the starting point of current taxonomy is the 10th edition of Systema Naturae, published in 1758 by Carl Linnaeus [10]. Since then, an enormous amount of taxonomic literature has been published as
journals and books for more than 250 years. Recently, much of the literature has been digitized. The largest digitization effort is the Biodiversity Heritage Library, a digitized library of taxonomic literature with 50 million pages [11]. This project provides the images, OCR (Optical Character Recognition) texts, and species names on each page. The Plazi project [12] maintains another digital library and appends machine-readable annotations for “taxonomic treatments” (the particular usage of a scientific name by an author at a given time) using an XML-based format (TaxonX) and RDF.
2.2. Public taxonomic resources
When data are structured, they should be accessible anywhere and anytime. To meet this need, Internet technology can help data providers publish their data via HTTP (HyperText Transfer Protocol), so online users can search and browse taxonomic knowledge conveniently. The most popular way to publish taxonomic information via the Internet is to launch an online taxonomic database in collaboration with taxonomists. Some taxonomists construct databases by themselves. In fact, many online taxonomic databases are currently available. Most of them are small, focusing only on a specific group of living organisms. Each small database is a product of research or educational activities and is maintained by a person or a small community. For example, the BINRAN database [13] is the species name list of Japanese butterflies, consisting of only 300 species names. A group of butterfly taxonomists maintains the contents of the BINRAN database, and thus each scientific name is reviewed by them [14]. These small databases are very important resources. Another way to publish taxonomic information via the Internet is the digital publication of journal articles. An advanced example is the publishing framework of Pensoft [15], a journal publisher for biodiversity studies. In this framework, each paper is openly published and taxonomic contents are extracted from the paper and automatically stored in data aggregators. Though electronic journals have become popular in recent years, the taxonomic nomenclatures long did not accept an online article as a proper way to propose new taxonomic knowledge, such as the establishment of a new taxon. For animals, the rule was changed in 2012 so as to accept online articles [16].
2.3. Sharing taxonomic name data
Names of organisms are the most important information provided by taxonomy. As stated above, taxonomic names such as species names are widely accepted as a standard way to specify organisms and are used globally in various activities related to biodiversity. Thus, all users around the world should be able to query and access the same taxon with a precise name, although scientific names of a taxon are assigned differently among local communities. To enable this type of query, data aggregators for taxonomic names are required. There are many types of repositories or data aggregators for taxonomic names. These global databases form an ecosystem of organism names together with small databases focusing on a taxonomic group or a specific location. Repositories aim to guarantee that a name is formally established. ZooBank [17] is the formal repository for animal names and Index Fungorum [18] is for fungal names. These repositories do not care if a name is widely used or not. Next are global name aggregators. The Catalogue of Life (CoL) [19] is the most popular taxonomic name database. CoL harvests information on each taxon, including its name, from collaborator databases and puts it together into a global database. The World Register of Marine Species (WoRMS) [20] is the de facto standard database for marine animals. These databases provide not only the spelling and authority of names, but also the relationships among names such as synonymies. These databases are updated as new taxonomic works are published. Nonetheless, these databases are just snapshots of taxonomic works. In other words, they do not provide full information on the history of each taxon, for example, who made a species a synonym of another species or divided a species into multiple taxa, and when.
2.4. Taxonomic data exchange
Integration of data among various databases can create value for users, because integrated knowledge is more likely to cover a wider and deeper scale. Enabling data exchange requires common vocabularies and common schemas; however, these things are not always available. Many datasets have different data structures and formats and they need additional work to align them. The biodiversity community has developed several standards for data exchange formats. The Biodiversity Information Standards
group (TDWG) [21], the community aiming to develop standards for biodiversity data exchange, proposed Darwin Core [22], a set of standard vocabularies for biodiversity data, including taxonomic information. This standard is widely used in the biodiversity informatics field, and many taxonomic data are described using its terms. The Global Biodiversity Information Facility (GBIF) [23], one of the largest aggregators of biodiversity data, defined Taxon Core [24], a data format for taxonomic names using Darwin Core terms. GBIF assembles taxonomic name data from various projects, including the Catalogue of Life, and provides a catalogue site for taxonomic name datasets named the GBIF Checklist Bank. TDWG also proposed the Taxonomic Concept Transfer Schema [25], a standard to describe the relationship between names and taxa, but this schema is not widely used.

2.5. Taxonomic data provenance

Many databases prefer to store only the most recent data, so users lose the opportunity to learn the history of relationships among taxa that is key to understanding biodiversity. These users can seek out and read the relevant publications to get this information, but it would be more convenient if data were versioned with an appropriate structure for performing queries.

Given the key features described above, taxonomic databases should move online and provide a flexible structure for knowledge exchange. Thus, linking data among taxonomic databases becomes a key technology for taxonomic knowledge management.

3. The advantage of linked open data

To promote data integration, all repositories should use the same schema or have semantic mappings between different schemas. Linked Open Data (LOD) is designed to integrate diverse data types across repositories via the Internet. The power of knowledge inference in the Semantic Web can create a smooth integration of data from different schemas [26,27]. Pieces of knowledge in LOD are linked in a knowledge graph.
The smallest unit of the graph is a triple that includes a subject, a predicate, and an object. The predicate is a name of a link from the subject to the object as written in Figure 1. In the figure,
Figure 1. Example triple of the taxon :Bubo_scandiacus
species:Bubo_scandiacus, lodac:SuperTaxon, and genus:Bubo are a subject, a predicate, and an object respectively. The triple in this figure means that the super taxon of the species Bubo scandiacus is the genus Bubo. The subject and the predicate must each be a resource represented by a Uniform Resource Identifier (URI) or an abbreviated URI (as in Fig. 1), whereas the object can be either a resource written as a URI or a literal written as free text. This is a machine-readable way to state that Bubo scandiacus has a parent, Bubo.

In order to link data with LOD, each repository needs to create its own knowledge graph using the Resource Description Framework (RDF [28]) and by reusing vocabularies and schemas from existing ontologies such as DBpedia [29] and SKOS [30]. Then, as long as the URIs are dereferenceable, the reasoning and querying mechanisms of the Semantic Web can link all local knowledge graphs into a single, queryable LOD Cloud. Although the Semantic Web and LOD address some pain points of developing a linkable taxonomic database, a taxonomic database that meets the needs of the biodiversity domain also needs stable, unique taxon identifiers and a data model that can preserve the history of taxonomic change.

4. Assigning taxonomic identifiers

Stable, unique taxonomic identifiers are the key to linking, querying, and accessing all pieces of taxonomic knowledge across repositories. Many repositories use an internally unique identifier to increase information granularity, whereas others use a name-based identifier to make the data human-readable. Both have advantages and disadvantages, and neither is best in all cases, so developers should consider the nature of both techniques when building a taxonomic knowledge management system that fits their own requirements.
Figure 2. Example triple of the taxon :Bubo_scandiacus identified by LSID
4.1. Unique identifier

In database design, a primary key is a unique identifier that is used for accessing any data related to an entity. The primary key must be locally unique, but for the purpose of integrating different databases, it should be globally unique. Thus, it cannot be a human-readable taxonomic name, because the same name may occur in other repositories or more than once within the same repository. For this reason, the Biodiversity Information Standards group (TDWG) [21,31] has promoted the use of the Life Science Identifier (LSID) as a Globally Unique Identifier for a taxon concept. For example, the LSID of Bubo scandiacus is

urn:lsid:catalogueoflife.org:taxon:14f3dbd3-d35f-11e6-9d3f-bc764e092680:col20170127
where “catalogueoflife.org” is the authority, “taxon” is the namespace, “14f3dbd3-d35f-11e6-9d3f-bc764e092680” is the object ID, and “col20170127” is an optional version number. Thus, the RDF statement displayed in Figure 1 can be represented with an LSID as shown in Figure 2. Several taxonomic repositories have adopted the LSID in their data models, such as the Global Biodiversity Information Facility (GBIF) [23], the Catalogue of Life (CoL) [32], and the Universal Biological Indexer and Organizer (uBio) [33]. LSIDs can be used in the Darwin Core schema [34]. Since LSIDs are not human-friendly labels, they are usually hidden from the user, but they have a key role in providing services and managing data. Biodiversity researchers still have to recognize and use LSIDs for communicating between taxonomic databases, and it can be complicated to map data between non-human-friendly identifiers and taxonomic names.
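The LSID structure described above can be illustrated with a minimal Python sketch. The function name `parse_lsid` and the `LSID` tuple are hypothetical helpers written for this illustration, not part of any LSID library, and the sketch only checks the colon-separated layout, not whether the identifier resolves.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    version: Optional[str]  # the version part is optional


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:version]' into its parts."""
    parts = text.split(":")
    well_formed = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
        and all(parts[2:])  # no empty component
    )
    if not well_formed:
        raise ValueError(f"not an LSID: {text!r}")
    version = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], version)


# The example LSID quoted in the text for Bubo scandiacus.
lsid = parse_lsid(
    "urn:lsid:catalogueoflife.org:taxon:"
    "14f3dbd3-d35f-11e6-9d3f-bc764e092680:col20170127"
)
print(lsid.authority, lsid.namespace, lsid.version)
```

Keeping the version as a separate optional field makes it easy to compare two LSIDs that refer to the same object in different releases of a database.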
4.2. Machine-readable identifier

A taxon name is not only related to one or more nomenclatural acts, such as the establishment of a new taxon, but is also used in various resources such as literature and databases. Thus, a framework to link nomenclatural acts and name usage is highly desired. For this purpose, Patterson et al. [35] introduced the Global Names Architecture (GNA). GNA provides two significant modules: the Global Names Index (GNI) [36] and the Global Names Usage Bank (GNUB). GNI is the collection of name strings that are used by the various resources in which taxon names appear. GNI bundles the set of name strings for one taxon into what is called a reconciliation group. GNUB describes name usages, i.e., it links a reconciled name to its reference; such a link is called a Taxonomic Name Unit. ZooBank [17] is the official register of taxon names for animals and is an initial part of the GNUB system [37,38]. ZooBank also uses LSIDs as persistent identifiers for name usages with nomenclatural acts, publications, and authors.

4.3. Name-centric identifier

A human-readable label for a machine-readable identifier is another approach to implementation. This approach is widely used in resources written as Linked Open Data (LOD). LOD is openly published linked data, and thus the identifier for an object must be a dereferenceable HTTP URI. For example, in the DBpedia ontology [29], a well-known LOD representation of Wikipedia, the identifier for Papilio xuthus is http://dbpedia.org/page/Papilio_xuthus. Linked Open Data for Academia (LODAC) [43] describes academic resources as LOD, and uses human-readable URIs similar to those used by DBpedia. LODAC covers biodiversity data and provides a linked data hub for biodiversity knowledge. To represent a name and a taxon concept, LODAC uses a URI with a human-readable part, for example http://lod.ac/species/Bubo_scandiacus (in shorthand, “species:Bubo_scandiacus”).
Compared to machine-readable identifiers, a human-readable identifier with the taxon name is more intuitive and helps to reduce the gap between experts of biodiversity informatics and non-experts including biologists and the general public. As mentioned above, a taxon name is globally used as a label or noun for a specific living organism, and it is very difficult for non-experts to properly apply
machine-readable identifiers to names. A human-readable URI using a taxon name is very similar to a DBpedia URI. This means that users can treat a taxon name and other general nouns in the same way. In addition, the human-readable URI can represent both a scientific name and a taxon concept, just as a DBpedia URI represents both the entry name and the explanation (concept). If necessary, a human-readable URI for a name can be separated from that for a concept.

This human-readable URI approach is flawed in the case of taxonomic updates. When a new publication with a new nomenclatural act is released and a URI following the new publication is accepted, new information will be recorded under the latest URI only. Thus, users have to check all URIs that fall under the same taxon concept and access all the information related to those names. If a concept has changed frequently, finding and verifying all related names becomes a time-consuming task. Fortunately, most taxa have not changed many times, so the name-centric identifier approach is still feasible for the biodiversity domain. It should be noted that an old URI should not be replaced by the new URI immediately when the name and/or the concept has changed, because the entry with the old URI is based on the old name and concept. There are two solutions: one is to update the original entries to refer to the new publication; the other is to add a link between the old and new concepts.

In conclusion, the name-centric identifier benefits a taxonomic system when most users are not experts in biodiversity informatics. Data are human-readable and the data model becomes simpler, as can be seen by comparing Figure 1 and Figure 2. If users prefer to access information by scientific name, data as in Figure 1 can be queried in one hop because names are used as keys, while data as in Figure 2 require two hops because the LSID sits between names and data.
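The one-hop versus two-hop difference can be sketched with triples held as plain Python tuples. Both graphs below are toy data: the property `ex:scientificName` is hypothetical, and reusing `genus:Bubo` as the object in the LSID graph is a simplification, since the actual content of Figure 2 is not reproduced here.

```python
# Toy graphs only; see the assumptions stated in the paragraph above.
BUBO_LSID = (
    "urn:lsid:catalogueoflife.org:taxon:"
    "14f3dbd3-d35f-11e6-9d3f-bc764e092680:col20170127"
)

# Figure 1 style: the name-centric URI is itself the subject.
name_centric_graph = [
    ("species:Bubo_scandiacus", "lodac:SuperTaxon", "genus:Bubo"),
]

# Figure 2 style: the LSID is the subject; the name is one more triple.
lsid_graph = [
    (BUBO_LSID, "ex:scientificName", "Bubo scandiacus"),
    (BUBO_LSID, "lodac:SuperTaxon", "genus:Bubo"),
]


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def subjects(graph, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in graph if p == predicate and o == obj]


def super_taxon_name_centric(name):
    # One hop: the name itself yields the subject URI.
    subject = "species:" + name.replace(" ", "_")
    return objects(name_centric_graph, subject, "lodac:SuperTaxon")


def super_taxon_via_lsid(name):
    # Hop 1: name -> LSID. Hop 2: LSID -> super taxon.
    found = []
    for lsid in subjects(lsid_graph, "ex:scientificName", name):
        found.extend(objects(lsid_graph, lsid, "lodac:SuperTaxon"))
    return found


print(super_taxon_name_centric("Bubo scandiacus"))
print(super_taxon_via_lsid("Bubo scandiacus"))
```

Both calls return the same answer; the second simply needs an extra lookup to resolve the name to its identifier first.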
Although using a name-centric identifier is advantageous in terms of information access, the encapsulation of a taxon concept and its name in a single entity decreases the granularity of information and exposes the system to the issues of name change and data redundancy.

5. Linked open data model for the change in taxonomy

Change in taxonomy is an issue that every taxonomic knowledge management system should take into account during design and implementation. Some systems, such as that of the International Organization for Plant Information (IOPI) [39], address this issue by linking the taxonomic name with the circumscription references in the database model, while taxonomic changes are handled by a separate transaction database. In order to make data maximally exchangeable, an LOD-friendly data model that incorporates a knowledge graph is recommended. Some ideas for future research include developing knowledge graphs for biodiversity, such as the work of Rod Page [40] and the Open Biodiversity Knowledge Management System in Scholarly Publishing [41]. The former [40] discussed the connectivity between biodiversity entities in order to form a biodiversity knowledge graph for cross-linking datasets. The latter [41] includes novel ways of publication and visualization that help taxonomists to reorganize their biodiversity knowledge into the Open Biodiversity Knowledge Management System. Both approaches focus on knowledge exchange rather than change in biodiversity knowledge. Thus, this chapter discusses data models for biodiversity knowledge graphs that accommodate change in taxonomy over time. In this section, two data models, Taxon Meta-Ontology (TaxMeOn) [42] and Linked Taxonomic Knowledge (LTK) [9], are reviewed.

5.1. Taxon Meta-Ontology (TaxMeOn)

TaxMeOn [42] is an ontology schema for managing taxonomic names over time and linking biodiversity information to the name. The data model is built on the Semantic Web RDF Schema (RDFS) and the Web Ontology Language (OWL). It contains classes and properties for managing the change in taxonomy and names, as shown in Figure 3. The taxonomic checklist collects taxon concepts and names. All names have metadata including nomenclature code, nomenclatural status, taxonomic status, vernacular name status, taxonomic rank, author, reference, etc. TaxMeOn attempts to develop a unified network of taxonomic names from multiple data sources.
The relationships among names include changes in taxonomy such as change in circumscription, lumping of taxa, splitting of a taxon, and change in classification. The following is an example implementation of TaxMeOn. The associations between names from different data sources are the main contribution of TaxMeOn. There are about 100 classes,
properties, and individuals in TaxMeOn. All of them are described at http://schema.onki.fi/taxmeon/ and Laurenne et al. [42] .
Figure 3. The core classes of the taxonomic meta-ontology (image from Laurenne et al. [42])
 1  @prefix taxmeon:         <...> .
 2  @prefix taxonomic-ranks: <...> .
 3  @prefix rdf:             <...> .
 4  @prefix rdfs:            <...> .
 5  @prefix dc:              <...> .
 6  @prefix xsd:             <...> .
 7  @prefix ex:              <...> .
 8  ex:p1  rdf:type taxmeon:Publication ; dc:date "2012-05-21"^^xsd:date .
 9  ex:p2  rdf:type taxmeon:Publication ; dc:date "2013-03-10"^^xsd:date .
10  ex:p3  rdf:type taxmeon:Valid ; taxmeon:publishedIn ex:p1 .
11  ex:p4  rdf:type taxmeon:Valid ; taxmeon:publishedIn ex:p2 .
12  ex:p5  rdf:type taxmeon:TaxonInChecklist, taxonomic-ranks:Genus ;
13         rdfs:label "Aus"^^xsd:string ;
14         taxmeon:completeTaxonName "Aus"^^xsd:string ;
15         taxmeon:hasNameStatus ex:p3 ;
16         taxmeon:congruentWithTaxonInt ex:p6 .
17  ex:p6  rdf:type taxmeon:TaxonInChecklist, taxonomic-ranks:Genus ;
18         rdfs:label "Aus"^^xsd:string ;
19         taxmeon:completeTaxonName "Aus"^^xsd:string ;
20         taxmeon:hasNameStatus ex:p4 ;
21         taxmeon:congruentWithTaxonInt ex:p5 .
• Lines 1-7: Namespaces used in this statement
• Line 8: Publication P1 was published on “2012-05-21”
• Line 9: Publication P2 was published on “2013-03-10”
• Line 10: Status P3 represents the valid status published in P1
• Line 11: Status P4 represents the valid status published in P2
• Lines 12-16: Genus P5, named “Aus”, has name status P3 and is congruent with taxon P6
• Lines 17-21: Genus P6, named “Aus”, has name status P4 and is congruent with taxon P5

5.2. Linked Taxonomic Knowledge (LTK)

Linked Taxonomic Knowledge (LTK) provides a data model and vocabularies for preserving and presenting change in biodiversity knowledge [9]. LTK introduces the Contextual Nominal Entity, a human-readable identifier for taxonomic knowledge that is used throughout the data model. It combines a scientific name with a version number. For example, http://rc.lodac.nii.ac.jp/taxon/genus/Bubo_1999 (in shorthand, genus:Bubo_1999) is the URI of the genus Bubo as changed in the year 1999. When the name was changed, a new version of the contextual nominal entity of this taxon was created and linked to the former one. Next, LTK provides an Event Entity that assures Operations of change and records provenance data: the Interval of time of the change, the Persons who performed or issued the change, and the Sources referred to, as shown in Figure 4. This model is also called the Event-Centric Model.

The Operation of change is a key entity of LTK. It records the change in Contextual Nominal Entities, which represent taxa as taxonomic names with version numbers. Some operations that are commonly found in publications about change in taxonomy are built into LTK. These are Merging Taxa, Splitting a Taxon, Replacing a Taxon, Changing a Higher Taxon, Dividing a Taxon, Combining Taxa, and Linking a Synonym. LTK also allows the representation of relations between operations, such as cause and effect, in order to demonstrate the reasoning behind a change.
For example, the species Nyctea scandiaca has been changed to Bubo scandiacus because the genus Nyctea was merged into the genus Bubo in
Figure 4. Event-Centric Model of LTK
1999. Below is the RDF statement of this example using the LTK model.

 1  @prefix rdf:     <...> .
 2  @prefix dct:     <...> .
 3  @prefix tl:      <...> .
 4  @prefix bibo:    <...> .
 5  @prefix cka:     <...> .
 6  @prefix ltk:     <...> .
 7  @prefix genus:   <...> .
 8  @prefix species: <...> .
 9  @prefix ex:      <...> .
10  ex:event1999  bibo:performer ex:Wing, ex:Heidrich ;
11                bibo:issuer ex:Richard ;
12                dct:source ex:5224773 ;
13                cka:interval [ tl:beginsAtDateTime "1999" ] ;
14                cka:assures ex:mg1, ex:rp1 .
15  ex:mg1  rdf:type ltk:TaxonMerger ;
16          ltk:taxonBefore genus:Bubo_1805 ;
17          ltk:taxonBefore genus:Nyctea_1826 ;
18          ltk:taxonAfter genus:Bubo_1999 .
19  ex:rp1  rdf:type ltk:TaxonReplacement ;
20          ltk:taxonBefore species:Nyctea_scandiaca_1826 ;
21          ltk:taxonAfter species:Bubo_scandiacus_1999 .
22  ex:mg1  cka:effect ex:rp1 .
• Lines 1-9: Namespaces used in this statement
• Lines 10-14: An event entity includes some operations and provenance data
• Lines 15-18: genus:Bubo_1805 and genus:Nyctea_1826 are merged into genus:Bubo_1999
• Lines 19-21: species:Nyctea_scandiaca_1826 is replaced by species:Bubo_scandiacus_1999
• Line 22: The merger of the two genera causes the replacement of that species

The Event-Centric Model uses binary relations to present context. A binary relation is just a named link between two entities. For example, “x belongs-to y” shows that the entities x and y are linked; in other words, this triple has x as a subject, belongs-to as a property, and y as an object. These data are simple; however, further information cannot be included in this single relation. If context is needed, e.g., the year of publication, more entities and links must be added. For example, in order to show that the fact “x belongs-to y” was published in 1900, a context entity, e.g., an event, is needed and more triples are added, such as “event has-subject x; event has-property belongs-to; event has-object y; event published-in 1900.” In this case, the data model becomes complicated by design. To obtain the usual triple structure of linked data, as in DBpedia [29], LODAC [43], and TaxMeOn [42], the Event-Centric Model should be transformed into a simpler one. LTK also provides Semantic Web rules, which are inference rules, to transform this model into a Transition Model and a Snapshot Model. The Transition Model is used for presenting the chronological changes of taxa, while the Snapshot Model presents the triples that are valid at a given point in time. The above statement can be transformed into

species:Nyctea_scandiaca_1826 ltk:replacedTo species:Bubo_scandiacus_1999
so it can be linked with other datasets. The main contribution of LTK is a flexible model that manages changes in taxonomy so that their history can be queried and accessed, because the context of changes is included and the chronological change of taxonomic names is presented. More details about the LTK framework, including formal expressions, ontology, vocabularies, transformation rules, and services, are given in [9].

TaxMeOn [42] and LTK [9] attempt to address the issue of change
in taxonomy. Both models can capture the change in taxonomy for Linked Open Data. TaxMeOn [42] focuses much more on the revision of taxonomic checklists and explains the background of a change by linking to a publication. Associations among names of taxa between checklists are a strong point of TaxMeOn, because users can easily read and understand the change. Due to the limitation of the binary relation used in TaxMeOn, the model is not suitable for making associations among changes in a knowledge graph, such as a species name change resulting from a change at the genus level. In contrast, LTK does not address taxonomic checklists. Instead it focuses on how to present changes in taxonomy and associations among changes in a knowledge graph. The Event-Centric Model of LTK offers pre-defined operations of change for capturing taxonomic change, links between the operations, and the background knowledge justifying the change. Thus, the change in taxonomy and its context are presented in a knowledge graph.

6. The future of taxonomic databases

This chapter discussed the data model for capturing changes in taxonomy and the importance of capturing those changes for exchanging data between repositories. The design of the data model is a fundamental part of engineering a knowledge management system for biodiversity. In order to make a complete system, the following points should be considered.

6.1. Big data

Millions of species are known to science and will require billions of triples to be described in RDF. Thus, selecting proper technology for handling big data should be considered at the beginning of the system development process.

6.2. Legacy data

Most biodiversity knowledge exists in the form of natural language text behind a subscription paywall; in other words, the data are unstructured and not machine-readable. In many cases the text is not digitized.
Biodiversity Heritage Library, which is a large online digital library for natural history, is a huge and global effort for
digitization of vast amounts of “legacy” literature. In this project, taxon names are automatically extracted from the digitized text and added to the text as annotations, using a tool named FindIT [44] developed by uBio. The GNA project also released a tool for parsing and atomizing a taxon name into detailed elements. Moreover, there are interesting Natural Language Processing (NLP) components that parse text into structured data, such as ClearTK [45] and CharaParser [46]. Although many NLP tools can produce machine-readable data, finding tools that specifically address taxonomic change is challenging. Thus, a tool that can convert text into a well-shaped knowledge graph describing change in taxonomy, used together with the existing tools and software mentioned above, would be very helpful.

6.3. Machine-human interaction

Knowledge graphs generated by humans, especially biological taxonomists, are important, but encouraging taxonomists to build Semantic Web infrastructure is challenging. To assemble taxonomic knowledge as graphs, mature collaborations between taxonomists and informatics researchers are indispensable. Most taxonomists are interested in living organisms in nature rather than in information technologies, and the reverse holds for informatics researchers. The first step is to increase communication across these disciplines. Some taxonomists are interested in building databases of taxonomic names with their changes, but have no opportunities to discuss this issue with informatics researchers. Another issue is the lack of a workflow to convert taxonomists’ knowledge into a graph. Some taxonomists assemble taxonomic names and their changes in documents, spreadsheets, or databases. A good first step would be to build converters that create RDF from existing data. These would be developed based on communications between taxonomists and informatics researchers.
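As a rough illustration of such a converter, the following Python sketch turns spreadsheet rows (here as CSV text) into N-Triples lines. The column names, the `http://example.org/` namespace, and the `superTaxon` predicate are all assumptions made for this sketch; a real converter would use the vocabulary agreed on by the taxonomists and informatics researchers involved.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/taxon/"                     # hypothetical namespace
SUPER_TAXON = "http://example.org/vocab/superTaxon"    # hypothetical predicate
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def taxon_uri(rank, name):
    """Build a name-centric URI such as .../species/Bubo_scandiacus."""
    return f"{BASE}{quote(rank.lower())}/{quote(name.replace(' ', '_'))}"


def literal(text):
    """Quote a plain string literal (only backslash and quote are escaped)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def rows_to_ntriples(csv_text):
    """Convert rows with name, rank, parent_name, parent_rank to N-Triples."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = taxon_uri(row["rank"], row["name"])
        lines.append(f"<{subject}> <{RDFS_LABEL}> {literal(row['name'])} .")
        if row["parent_name"]:  # top-level rows leave the parent columns empty
            parent = taxon_uri(row["parent_rank"], row["parent_name"])
            lines.append(f"<{subject}> <{SUPER_TAXON}> <{parent}> .")
    return lines


SHEET = """name,rank,parent_name,parent_rank
Bubo,genus,,
Bubo scandiacus,species,Bubo,genus
"""

for line in rows_to_ntriples(SHEET):
    print(line)
```

A fixed column layout like this one is also the kind of simple convention that could be proposed to taxonomists who already keep their data in spreadsheets.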
It would also be very useful to propose a standard and easy-to-use protocol for taxonomists to prepare taxonomic knowledge as spreadsheets. Encouraging their participation will require a tool with a rich user interface so that lay people can provide RDF data easily.

6.4. Linkable data

To create RDF data, one must give a stable, unique identifier to every resource, which is a challenging task. Giving local URIs from labels
in the unstructured text seems possible and straightforward, but such URIs are unlikely to be globally exchangeable. Producing five-star linked data requires effort to map local data to, or reuse, existing URIs from well-known ontologies already in the LOD Cloud. Links to GNA entities, which are more strictly defined by machine-readable identifiers, could act as “anchors” for LOD data with name-centric identifiers.
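The anchoring idea can be sketched as a simple lookup that emits link triples from local name-centric URIs to externally managed identifiers. This is only an illustration under several assumptions: the anchor table is hypothetical and reuses the Catalogue of Life LSID quoted in Section 4.1 as a stand-in for a GNA identifier, the local URI for “Aus bus” is invented, and `skos:exactMatch` is just one possible choice of link predicate.

```python
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Hypothetical anchor table: scientific name -> externally managed identifier.
ANCHORS = {
    "Bubo scandiacus": (
        "urn:lsid:catalogueoflife.org:taxon:"
        "14f3dbd3-d35f-11e6-9d3f-bc764e092680:col20170127"
    ),
}

# Local name-centric URIs; the second entry is invented for illustration.
LOCAL = {
    "Bubo scandiacus": "http://lod.ac/species/Bubo_scandiacus",
    "Aus bus": "http://example.org/species/Aus_bus",
}


def link_to_anchors(local_uris_by_name, anchors):
    """Return (link triples, names that have no anchor yet)."""
    links, unmatched = [], []
    for name, local_uri in sorted(local_uris_by_name.items()):
        anchor = anchors.get(name)
        if anchor is None:
            unmatched.append(name)  # needs manual review by a taxonomist
        else:
            links.append((local_uri, SKOS_EXACT_MATCH, anchor))
    return links, unmatched


links, unmatched = link_to_anchors(LOCAL, ANCHORS)
print(links)
print(unmatched)
```

Reporting the unmatched names separately keeps the mapping step auditable, since a purely string-based match cannot decide whether two sources mean the same taxon concept.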
7. Conclusion

This chapter discussed linked data models for representing changes in taxonomy and made recommendations for taxonomic databases. The change in taxonomic knowledge over time and the differences in taxonomic knowledge among local communities are common issues in the implementation of taxonomic databases. Because of the nature of taxonomy and taxonomic nomenclature, all nomenclatural acts about a taxon must be respected, making knowledge exchange and data integration between multiple sources necessary. Thus, the Semantic Web and LOD, with their ability to integrate data and build large networks of knowledge, are helpful technologies for building a taxonomic knowledge management system. There are two very important parts of a biodiversity data model: 1) stable, unique identifiers for taxa and 2) a data model that can capture changes in taxonomy. The use of a global identifier is good for data manipulation, whereas using a name-centric identifier can make a data model simpler and human-readable. This chapter described the concepts of TaxMeOn and LTK. These approaches help users to intuitively understand the history of each taxon. Each has advantages and disadvantages, depending on the practical requirements of users. It should be emphasized that good collaborations between taxonomists and informatics scientists are indispensable for assembling content and revising the data model. Harmonization between existing taxonomic databases and the proposed linked data models is also important. During the software development process, developers should reuse or combine useful parts of existing data models to build a powerful biodiversity knowledge management system that all users, especially biologists, are willing to use to contribute and consume knowledge.
8. References

[1] Franz N, Peet R (2009) Perspectives: Towards a language for mapping relationships among taxonomic concepts. Syst Biodivers 7(1):5-20.
[2] ICZN (1999) International Code of Zoological Nomenclature, 4th Edition. International Trust for Zoological Nomenclature, The Natural History Museum.
[3] Winston JE (1999) Describing species: practical taxonomic procedure for biologists. Columbia University Press. 541 pp.
[4] Wink M, Heidrich P (1999) Molecular evolution and systematics of owls (Strigiformes). In: Owls: A Guide to the Owls of the World p. 39-57.
[5] Sibley CG, Short LL (1964) Hybridization in the orioles of the Great Plains. Condor 66(2):130-150.
[6] Freeman S, Zink RM (1995) A phylogenetic study of the blackbirds based on variation in mitochondrial DNA restriction sites. Syst Biol 44(3):409-420.
[7] Zolotuhin VV (2011) The Actias Leach, 1815, in the Far East: how many species? Neue Entomologische Nachrichten 67:40-56.
[8] Kishida Y (2016) An opinion to Zolotuhin (2011) The Actias Leach, 1815, in the Far East: How many species? Japan Heterocerists’ Journal 278:88-89.
[9] Chawuthai R, Takeda H, Wuwongse V, Jinbo U (2016) Presenting and preserving the change in taxonomic knowledge for linked data. Semantic Web Journal 7(6):589-616.
[10] von Linné C, Tessin CG, Beer GE, Gmelin JF (1758) Systema naturae: Per regna tria naturae, secundum classes, ordines, genera, species cum characteribus, differentiis, synonymis, locis. Laurentii Salvii.
[11] Biodiversity Heritage Library http://www.biodiversitylibrary.org
[12] Plazi http://plazi.org
[13] BINRAN http://binran.lepimages.jp
[14] Jinbo U, Ueda K, Inomata T, Uémura Y, Yago M (2013) An overview of the website, The Current Checklist of Japanese Butterflies. Jpn J Entomol 16:122-125.
[15] Pensoft http://pensoft.net
[16] ICZN (2012) Amendment of Articles 8, 9, 10, 21 and 78 of the International Code of Zoological Nomenclature to expand and refine methods of publication. ZooKeys 219:1-10.
[17] ZooBank http://zoobank.org
[18] Index Fungorum http://www.indexfungorum.org
[19] Catalogue of Life (CoL) http://www.catalogueoflife.org
[20] World Register of Marine Species http://www.marinespecies.org/
[21] Biodiversity Information Standards (TDWG) http://www.tdwg.org
[22] Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, Robertson T, Vieglais D (2012) Darwin Core: An evolving community-developed biodiversity data standard. PLoS ONE 7(1):e29715. https://doi.org/10.1371/journal.pone.0029715
[23] The Global Biodiversity Information Facility (GBIF) http://www.gbif.org
[24] Darwin Core Taxon http://rs.gbif.org/core/dwc_taxon_2015-04-24.xml
[25] Taxonomic Concept Transfer Schema https://github.com/tdwg/tcs
[26] Hitzler P, Krotzsch M, Rudolph S (2009) Foundations of semantic web technologies. CRC Press. 456 pp.
[27] Heath T, Christian B (2011) Linked data: Evolving the web into a global data space. In: Synthesis lectures on the semantic web: theory and technology. 136 pp.
[28] Schreiber G, Raimond Y (2014) RDF 1.1 Primer https://www.w3.org/TR/rdf11-primer/
[29] Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, van Kleef P, Auer S, Bizer C (2015) DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal 6(2):167-195.
[30] Simple Knowledge Organization System http://www.w3.org/TR/skos-primer/
[31] Taxonomic Names and Concepts Interest Group (2005) Taxonomic concept transfer schema http://www.tdwg.org/standards/117/
[32] Jones AC, White RJ, Orme ER (2011) Identifying and relating biological concepts in the Catalogue of Life. J Biomed Semant 2(7). doi:10.1186/2041-1480-2-7
[33] Sarkar IN (2007) Biodiversity informatics: Organizing and linking information across the spectrum of life. Brief Bioinform 8(5):347-357.
[34] Darwin Core http://rs.tdwg.org/dwc/terms/
[35] Patterson DJ, Cooper J, Kirk PM, Pyle RL, Remsen DP (2010) Names are key to the big new biology. Trends Ecol Evol 25(12):686-691.
[36] Global Names Index http://gni.globalnames.org
[37] Pyle RL, Michel E (2008) ZooBank: Developing a nomenclatural tool for unifying 250 years of biological information. Zootaxa 1950:39-50.
[38] Pyle RL (2016) Towards a Global Names Architecture: The future of indexing scientific names. ZooKeys 550:261-281.
[39] Berendsohn WG (1997) A taxonomic information model for botanical databases: The IOPI model. Taxon 46(2):283-309.
[40] Page R (2016) Towards a biodiversity knowledge graph. RIO 2:e8767.
[41] Senderov V, Penev L (2016) The Open Biodiversity Knowledge Management System in Scholarly Publishing. RIO 2:e7757.
[42] Laurenne N, Tuominen J, Saarenmaa H, Hyvonen E (2014) Making species checklists understandable to machines–a shift from relational databases to ontologies. J Biomed Semant 5(1).
[43] Minami Y, Takeda H, Kato F, Ohmukai I, Arai N, Jinbo U, Ito M, Kobayashi S, Kawamoto S (2013) Towards a Data Hub for Biodiversity with LOD. In: The 2nd Joint International Semantic Technology Conference. p. 356-361.
[44] FindIT http://ubio.org/tools/recognize.php
[45] ClearTK https://cleartk.github.io/cleartk/index.html
[46] CharaParser http://phenoscape.org/wiki/CharaParser
Application of Semantic Technology in Biodiversity Science, Anne E. Thessen (Ed.), ISBN 978-3-89838-733-0, © 2018 AKA Verlag Berlin
Chapter 4
Taxonomy and the Production of Semantic Phenotypes

Matthew J. YODER (a), Michael B. TWIDALE (b), Andrea K. THOMER (c), Lars VOGT (d), Nico M. FRANZ (e), Jinlong GUO (b), Andrew R. DEANS (f), and James P. BALHOFF (g)

(a) Illinois Natural History Survey, University of Illinois, Champaign, Illinois, USA
(b) School of Information Sciences, University of Illinois, Champaign, Illinois, USA
(c) School of Information, University of Michigan, Ann Arbor, Michigan, USA
(d) Institut für Evolutionsbiologie und Ökologie, Universität Bonn, Bonn, Germany
(e) Arizona State University, School of Life Sciences, Tempe, Arizona, USA
(f) Department of Entomology, Penn State University, State College, Pennsylvania, USA
(g) Renaissance Computing Institute, University of North Carolina, Chapel Hill, North Carolina, USA

Taxonomists produce a myriad of phenotypic descriptions. Traditionally these are provided in terse (telegraphic) natural language. As in other fields of biology, researchers are exploring ways to formalize parts of the taxonomic process so that aspects of it become more computational in nature. We review the data formalizations, mechanisms for persisting data, applications, and computing approaches currently used in the production of semantic descriptions (phenotypes); they, and their adopters, are limited in number. In order to move forward we step back and characterize taxonomists with respect to their typical workflow and tendencies. We then use these characteristics as a basis for exploring how we might create software that taxonomists will find intuitive within their current workflows, providing interface examples as thought experiments.

1. Introduction

Taxonomists, those who describe and organize Earth’s biodiversity, offer a unique perspective on life’s phenotypes. With little exception, their work references phenotype information to circumscribe natural units of biodiversity. A taxonomist’s hypotheses (concepts) typically rely on phenotypes, and their descriptions of phenotypes therefore serve as core evidence of a taxonomist’s science. The primary product of taxonomy, however, is not seen as a set of phenotypes, but rather as the conclusion: the taxon concept. A taxonomist’s work is “validated” when their taxon concepts are applied, i.e., when others classify the world according to their conclusions. Some have argued [1] that taxonomists are under-selling their work by not recognizing the importance of their supporting data, particularly their phenotypic descriptions.

One way to increase the utility of a taxonomist’s anatomical observations is to generalize them into a “semantic phenotype” [1–3], i.e., a formalized, typically logically rooted, representation of an anatomical concept that is, at minimum, adapted to computational exploration. Phenotype data of this nature have recently been instrumental to a range of broader scientific explorations, both foundational/theoretical (e.g., [4–8]) and methodological (e.g., [9–17]). There are costs to producing semantic phenotypes, however: researchers must be trained in the concepts; tools and supporting infrastructures must be built; benefits must be outlined; and incentives must be determined and implemented.
A full cost-benefit study is beyond the scope of this work, in part because it must start with the baseline cost of taxonomic products as they currently exist, and this alone is an exceedingly difficult analysis (consider assigning costs to the complexity documented by Dahdul et al. and Franz [18,19], and see the potential model in ten Hoopen et al. [20]). Ultimately, regardless of whether semantic phenotypes are “cost-effective”,
their exploration in the context of the taxonomic process will help uncover the complexities, and therefore ultimately the costs, underlying their production. For our current purposes, we posit that taxonomists do care about the potential of semantic phenotypes.

In part, this work is an updated roadmap to ideas proposed in Deans et al. [1]; it also seeks to serve as a summary introduction for taxonomists who want background on the field. Our goal is to provide a critical assessment of where we now stand with respect to taxonomists (specifically) adopting the principles and practices surrounding the production of semantic phenotypes. With this background in place we then focus on the premise that new technologies, specifically software user interfaces, could catalyze the production of semantic phenotypes. The problem is therefore a general one: moving a community to adopt a new technology. It is thus best to start with a base-level understanding and characterization of what that community does in the absence of that technology. From this basis we propose specific technologies that we hope will seed future exploration and discussion.

2. Approach

Our approach is to leverage insights derived from three core areas: first, research undertaken during two NSF Advances in Biological Informatics projects; second, our day-to-day efforts as taxonomists who wish to adopt a philosophy that embraces the production of semantic phenotypes; and third, our day-to-day interactions with collaborators working in related fields (e.g., morphology). Here we refine these insights into summaries that represent core issues with respect to taxonomists producing semantic phenotypes.

We begin by briefly describing the technologies taxonomists have utilized to add semantic layers to their phenotype data (specifically, taxonomic descriptions). Many of these technologies are cited elsewhere in this book.
Our focus is to highlight issues specifically related to the field of taxonomy, and to clearly identify areas where taxonomists looking to enter the field would hit stumbling blocks.

We then step back and focus on how we might craft technologies that would support taxonomists in ultimately adopting a workflow that produces semantic phenotypes. The argument is as follows. The use and production of semantic phenotypes is dependent on technological advances in several aspects of computing. Therefore, if taxonomists are to produce semantic phenotypes they must adopt new technologies. With the goal of encouraging taxonomists to adopt new technologies we identify general characteristics of taxonomic work that lend themselves to technological solutions. In other words, any new tool for taxonomists must fit into, complement, and/or enhance their existing work practices. An in-depth understanding of those practices is therefore needed before we can build new technological solutions, such as tools that produce semantic phenotypes. This philosophy and approach draw from research in the fields of Human-Computer Interaction and Computer-Supported Cooperative Work as used by researchers in Information Science. We draw from interviews of over 35 taxonomists and our own experiences as taxonomists. Each identified characteristic is cross-referenced to perceived issues specific to the production of semantic phenotypes.

We conclude by briefly reviewing the role of software interfaces in the production of semantic phenotypes. Interfaces have largely been an afterthought in the development of scientific computing, yet some of us feel they might be key to developing new systems. The use and adoption of semantic phenotypes can be reduced to a more basic problem in software design, that of adopting very large and highly interconnected systems. Navigating, displaying, editing, and updating these types of networks are difficult problems that should have generalizable solutions elsewhere. We do not seek to propose specific solutions along these lines but rather point out potential avenues of exploration. Nearly all of the topics touched on here are worthy of their own review papers; this work is intended to provide an index to these as yet unrealized reviews.

3. Discussion

3.1. Current status

A system architected to produce semantic phenotypes for taxonomists must contain certain components.
These generally include the formalizations themselves, a means to persist one or more formalizations, and the wrapping applications. To fully realize the importance of semantic phenotypes we also require computing or reasoning engines. With respect to taxonomy we feel that all of these are truly in their
infancy. In the future, we may see the current systems completely replaced by alternative solutions seeking similar goals.

Of the formalizations that have been implemented specifically to treat phenotype descriptions, the majority have defined data structure rather than data meaning. For instance, early efforts like Delta [21] were developed to output taxonomic descriptions from matrix-like data, and the TDWG-based Structure of Descriptive Data (SDD) standard [22] was developed to “allow capture, transport, caching and archiving of descriptive data in all the forms shown above, using a platform- and application-independent, international standard” (from https://github.com/tdwg/sdd; we note that SDD has been adopted by only a few applications and is not currently exploited in any larger repositories). By far the most commonly used data structure is Nexus [23], which is used to store character matrices. Lucid [24] also utilizes a simple table format with additional markup formats. Other models like NeXML [25] have not yet been adopted beyond simple experiments. We believe that lack of adoption is a reflection of 1) the lack of applications that produce data of a given structure, and 2) the lack of repositories with specific capabilities for exploiting the underlying semantics.

Two approaches have been used to produce semantically based taxonomic descriptions, that is, descriptions whose formalization reflects meaning. Cui [26] introduced CharaParser, which uses an XML markup to represent the results of Natural Language Processing algorithms. Balhoff et al. [2] introduced an approach that links matrix-based data in NeXML to phenotype descriptions in OWL (Web Ontology Language), which reference anatomy ontologies (e.g., the Hymenoptera Anatomy Ontology [27]) and phenotype ontologies. The model was extended and “practiced” in a series of follow-up papers [28–31].
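To make the contrast between data structure and data meaning concrete, the following is a minimal Python sketch. It is not the NeXML/OWL model of Balhoff et al. [2], nor any of the formats named above. The `EX:` identifiers, the `IS_A` table, the `Phenotype` class, and `find_by_quality` are all hypothetical names introduced only for illustration; the entity-plus-quality shape of the statement is likewise an assumption of this sketch.

```python
from dataclasses import dataclass

# A matrix cell records structure only: what "character 3, state 1" means
# lives in a free-text label that software cannot interpret.
matrix_cell = {"taxon": "Species A", "character": 3, "state": 1}
character_labels = {3: {0: "head black", 1: "head red"}}

# Hypothetical ontology fragment (placeholder IDs, not real ontology terms).
IS_A = {
    "EX:red": "EX:color",
    "EX:black": "EX:color",
    "EX:curved": "EX:shape",
}


@dataclass(frozen=True)
class Phenotype:
    """A statement of the form 'the <entity> of <taxon> is <quality>'."""
    taxon: str
    entity: str   # anatomical term, e.g. drawn from an anatomy ontology
    quality: str  # quality term, e.g. drawn from a phenotype ontology


def ancestors(term):
    """Yield the term itself and each of its is_a ancestors."""
    while term is not None:
        yield term
        term = IS_A.get(term)


def find_by_quality(phenotypes, quality_class):
    """Return phenotypes whose quality is the given class or a subclass of it."""
    return [p for p in phenotypes if quality_class in ancestors(p.quality)]


phenotypes = [
    Phenotype("Species A", "EX:head", "EX:red"),
    Phenotype("Species B", "EX:head", "EX:black"),
    Phenotype("Species B", "EX:spine", "EX:curved"),
]

# "Which statements are about color?" can be answered from the terms alone.
color_statements = find_by_quality(phenotypes, "EX:color")
```

The matrix cell can only be interpreted by a person reading its label, whereas the term-based statements can be grouped by class without any text parsing.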
The two approaches have somewhat different goals, the former being focused on mining phenotypes from published works, the latter on providing de novo formalizations. No lossless translation between the two formats is available. A third approach, which seeks to describe individual part instances (e.g., a single individual’s head) using RDF, is under development based on ideas put forth by Vogt [8,32] and Vogt et al. [33–36]. While not targeting taxonomic descriptions specifically, it has clear potential to be applicable to them. Related performance metrics, specifically those that test issues of repeatability and cross-community compatibility, are key (see Bertone et al. and
Cui et al. [37,38]).

There are very few software applications that have been used by taxonomists to produce semantic phenotypes for the purpose of taxonomic description. Huang et al. [39] created the Ontology Term Organizer (OTO), a tool that lets users ontologize anatomical terms, with specific application to taxonomic characters. This evolved into the “Exploring Taxon Concepts” (ETC) project, which is arguably the most integrated approach to producing semantic phenotypes specifically by and for taxonomists [10]. Balhoff et al. [2] used the matrix editing functionality in mx [40] to export NeXML, which could be edited in Protégé (https://protege.stanford.edu/) and linked to phenotype descriptions. Both approaches have only been used by their creators and a limited number of collaborators. While not specifically used by taxonomists to produce descriptions, Phenex [41] has been the most extensively used software to generate semantic representations of data from published works [18]. Those matrices were primarily produced by taxonomists, but perhaps more specifically for evolutionary studies rather than taxon descriptions. Morph-D-Base, in active development, will produce highly semantic instance ontologies [42].

Datasets of semantic phenotypes are persisted locally (that is, stored on users’ machines) as XML documents, either in OWL format (approach of Balhoff et al. [2]) or in the ETC format (approach of Cui et al. [10]). The latter persists in the extremely generalized RDF format. There is no standard relational database schema for either approach. Both formats are machine-readable, but neither is manually editable or easily examined without an application that can read it. It is unclear whether a more or less human-readable file format for semantic phenotypes would encourage their adoption.

There are no repositories specifically aimed at serving taxonomists, for example archives of data and their underlying semantics.
Taxonomists who wish to share or publicly archive their semantic data currently have only general (rather than taxonomy-specific) solutions such as GitHub and Dryad, or must publish them as supplementary material in DOI-serving electronic journals. None of these archives yet exposes the underlying semantics as a queryable database. That said, Balhoff et al. developed Phenoscape KB (http://kb.phenoscape.org/; see Chapter 11), a SPARQL-based endpoint for data derived from Phenex. Vogt et al. [43] are re-developing Morph-D-Base as a knowledgebase for generating, storing, distributing and querying semantic phenotypes described as
“instance-anatomies”, i.e., highly semantic graphs. These latter two efforts have the most potential to be adopted as more general-purpose repositories for taxonomists’ semantic phenotypes.

One of the key (though sometimes only implied) benefits of producing semantic phenotypes is that they make data computable in a variety of ways. For example, they may be indexed and searched at a finer level of granularity than is possible with unstructured natural language, or reasoned across using logical inference. However, none of the existing taxonomic efforts have demonstrated such applications beyond simple use of OWL reasoners (e.g., Elk) to classify data into partonomy-based categories [2]. Franz [44] (Fig. 7) has experimented with reasoning via application of the Euler/X reasoning engine over phenotype data classified via the ETC framework. The work of Balhoff [9] and Dececchi et al. [45] may present the most convincing demonstration to taxonomists of the potential use of computation with respect to semantic phenotypes. They demonstrate that, in combination with an organismal classification, one could infer many gaps in a taxon-by-character presence/absence matrix. This work implies that with computational reasoning, taxonomists may be able to describe more taxa with fewer observations. Ramírez and Michalik [15] also present a methodological- and visualization-based approach to employing semantic phenotypes which may be particularly compelling to taxonomists.

In summary, the range of semantic phenotype technology currently available to support taxonomists producing or refining new anatomical descriptions, or seeking to exploit past descriptions via new analytics, is very narrow. We do not mean to imply that the field has not deeply explored important issues, but rather that much work is needed to make usable tools and thereby gain broader adoption.
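As a loose illustration of how logical inference can fill gaps in a presence/absence matrix, consider the Python sketch below. It does not reproduce the methods of the works cited above. It uses only a part-of hierarchy (an assumption of this example, with no organismal classification involved), and the structure names, `PART_OF`, and `infer_gaps` are hypothetical.

```python
# Hypothetical partonomy: part -> whole (placeholder names, not ontology terms).
PART_OF = {
    "tarsal claw": "tarsus",
    "tarsus": "leg",
}


def wholes(part):
    """Yield every structure the given part is (transitively) part of."""
    while part in PART_OF:
        part = PART_OF[part]
        yield part


def parts(whole):
    """Yield every structure that is (transitively) part of the given whole."""
    for part, parent in PART_OF.items():
        if parent == whole:
            yield part
            yield from parts(part)


def infer_gaps(observed):
    """Fill gaps in a {taxon: {structure: bool}} presence/absence matrix.

    Rules assumed here: a present part implies its wholes are present, and
    an absent whole implies its parts are absent. Observed values are never
    overwritten by inferred ones.
    """
    inferred = {}
    for taxon, row in observed.items():
        filled = dict(row)
        for structure, present in row.items():
            related = wholes(structure) if present else parts(structure)
            for other in related:
                filled.setdefault(other, present)
        inferred[taxon] = filled
    return inferred


observed = {
    "Species A": {"tarsal claw": True},  # a single observation
    "Species B": {"leg": False},         # e.g., a legless form
}
matrix = infer_gaps(observed)
```

In this toy case one observation per taxon yields three filled cells per taxon, which is the sense in which reasoning might let fewer observations describe more.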
While we have demonstrated [28,31] that the underlying approaches can be learnt and advanced by graduate students, we note that most of the key advances have been facilitated by a very small number of highly technical individuals. Thus there is a substantial bottleneck in training semantic phenotype “producers.”

3.2. Taxonomists adopting technologies: characterizing and supporting taxonomic work

One way to get past this bottleneck is to step back and acknowledge that catalyzing significant change in a scientific field will take
time. If we accept this idea then we can afford to pause and carefully consider how best to enhance taxonomists’ existing work platforms and practices. This means designing new technologies that first and foremost help taxonomists do what they already do. Once taxonomists are engaged, those technologies can be leveraged to slowly guide their users toward the production and application of semantic phenotypes.

In order to implement this strategy we must first generalize characteristics of taxonomists and the field of taxonomy. To that end, we interviewed over 30 taxonomists and held multiple workshops in conjunction with several NSF ABI-related grants (“Collaborative Research: ABI Innovation: Rapid prototyping of semantic enhancements to biodiversity informatics platforms”, “NSF Advances in Biological Informatics. The Hymenoptera Ontology: part_of a transformation in systematic and genome science.”). From the discussions and review of those interactions several themes consistently arose. The results of those efforts are being expanded upon in forthcoming publications, but are broadly summarized here in the specific context of semantic phenotype production. While we by no means claim that this list is comprehensive or derived without bias, we do feel that these particular characteristics of taxonomists and their work have consequences for the development of technologies. For each characteristic we briefly describe potential consequences, categorized into “pros” and “cons”, for the production of semantic phenotypes.

Taxonomists are integrators. The product of a taxonomist’s work typically summarizes everything that is known about a taxon, or unit of biodiversity. We have found that numerous taxonomists have independently developed their own complex systems for integrating their data, and there is a high degree of convergence towards the need for large, integrative software tools.
Pro: Taxonomists understand the difficulties of pulling together disparate types of data, and may view semantic phenotypes as just another kind of data they need to integrate and work with. Con: Existing approaches to semantic phenotype production require a cobbling together of software and techniques. Consequently, their integration is going to be seen by taxonomists as requiring more work than their existing workflows.

Taxonomists are illuminators of the never before seen. By the very nature of their work, taxonomists must frequently describe things that have never before been recognized. This has critical consequences for
workflows that reference semantic standards, in that those standards will almost certainly not express all that the taxonomist needs them to. Pro: Taxonomists are the perfect type of researchers to extend and expand underlying standards (e.g., anatomy ontologies). Con: Software and tools that build semantic phenotypes must allow users to formalize their data using temporary standards which become fully realized after the fact. This may be quite challenging to implement, given current systems’ challenges in handling semantic uncertainty.

Taxonomists are “within-ers”, not “betweeners”. By this we mean that taxonomists are primarily interested in defining taxa such that they are “locally” recognizable; there is an assumption that one will start with an existing set of potential unknowns, not all unknowns. This is commonly illustrated in their work via statements like “species A has a smaller head than species B”, or “A has a spine more curved than the spine of B”. In these examples, the description would be sufficient if a researcher only has A’s and B’s to look at, but insufficient if C’s, D’s and Z’s are introduced. In other words, a lot of taxonomic description is about relative values rather than absolute values, and relative to nearby species rather than relative to all species. Pro: Combining formalized semantics with natural language processing could help universalize this class of statements by identifying them to the taxonomists prior to their publication or by linking to broader ontologies that could appropriately contextualize relative terms and descriptors. Con: Taxonomists may push back against the need to make their semantic phenotypes more globally interpretable, because it may require changes to their methods or to the language they are more comfortable using.

Taxonomists work iteratively. Taxonomists’ workflows are decidedly non-linear. They continually return to past observations for refinement.
Pro: In an integrated system this tendency could lead them to continually refine the supporting semantics (e.g., reference ontologies). Con: Referenced formalizations such as anatomy ontologies cannot be built as a first step, but need to be iteratively updatable in real time. Addressing this properly requires a complex software design pattern.

Taxonomists implicitly reference the past. Through a lifetime of experience that includes not only exposure to published work, but also tacit knowledge gained through mentorship and collaboration,
taxonomists work with anatomical concepts that are understood but that may have never been fully or explicitly defined (semantically or otherwise). Pro: Taxonomists are a source of previously unpublished concepts that could potentially be drawn out during their workflow. Con: It is unclear whether semantic phenotypes can be defined to fully reflect the intent or meaning as understood by the taxonomist.

Taxonomists are set builders. Taxonomists build diverse kinds of sets (e.g., their character matrices), in the mathematical sense. Pro: Computers excel at manipulating sets. Interfaces which mimic the natural way taxonomists build and interact with sets, for example aggregating and sorting through physical specimens, are largely unexplored. Con: There is not nearly enough logical exploitation of this principle. There is great potential to reason or compute over the datasets that are basic elements of a taxonomist’s work. Semantic phenotypes should extend the types of sets a taxonomist can make and refine to explore their data.

Taxonomists are visually driven. Taxonomists transcribe concepts as text or annotated images for publications. Neither of these formats fully represents what the taxonomist understands and sees about these concepts. Pro: The space for developing novel visually based interfaces is for all intents and purposes completely unexplored. Con: Semantics for describing complex phenotypes may require extensions far beyond what has been done to date, such as models for 3D representations.

Taxonomists are short-term localists and long-term globalists. Taxonomic work at the level of species definition is about attempting precise delineations between similar species, by noting differences that allow a distinction to be made between very similar species, and not by describing in relation to all species.
However, this detailed work on often small and subtle differences, which helps avoid misclassification, is then published as a contribution to a global collaborative effort to describe all the world’s species, one that has been successfully sustained over decades, even centuries. That globalised effort benefits from different approaches to integration and standardization than if the effort were focused solely at the level of, say, genera. Pro: The way the work is done allows for both local and global work. Con: The needs of both perspectives can lead to contradictory requirements at different points of doing the work.
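The “within” versus “between” distinction drawn in the characteristics above can be illustrated with a small Python sketch. It treats relative statements such as “species A has a smaller head than species B” as ordered pairs and asks which taxa can be compared by following those pairs. The function names and the single-letter taxon labels are hypothetical, and the approach is an assumption of this example rather than a method from the text.

```python
from collections import defaultdict


def build_order(statements):
    """Build a 'smaller than' graph from (smaller, larger) taxon pairs."""
    graph = defaultdict(set)
    for smaller, larger in statements:
        graph[smaller].add(larger)
    return graph


def is_smaller(graph, a, b):
    """Return True if 'a is smaller than b' follows by transitivity."""
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def comparable(graph, a, b):
    """Return True if the statements order a and b in either direction."""
    return is_smaller(graph, a, b) or is_smaller(graph, b, a)


# "A has a smaller head than B" and "B has a smaller head than C".
head_size = build_order([("A", "B"), ("B", "C")])

local_ok = comparable(head_size, "A", "C")    # follows from the two statements
global_gap = comparable(head_size, "A", "Z")  # Z was never compared to anything
```

Relative statements order only the taxa they mention, so a newly introduced taxon such as Z cannot be placed until an absolute value or a further comparison is supplied.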
Figure 1. A virtual light table. The table space includes images and symbolic representations of specimens, collecting events, and an anatomy ontology arranged in a freeform open space. Basic functionality includes adding objects, highlighting objects, and saving the workspace (top), and selecting, manipulating, and annotating the current contents of the table (bottom). See also Fig. 2.
3.3. Interfaces

To illustrate how we might practically apply the generalized principles discussed above, we conclude by describing interface concepts for a taxonomist’s workbench that aids in the production and annotation of semantic phenotypes. These interfaces are premised on two of the principles identified above: that taxonomists are visual by nature, and that taxonomists often “localize” or describe their results in relative terms.

While there are a vast number of visually varying interfaces employed in software in general, those used by taxonomists tend to be simple: spreadsheets, documents, or database entry forms. As noted above, taxonomists are set builders; they build groupings of specimens, literature, anatomical descriptions, and images. Though some of this set-building takes place in a spreadsheet, it’s just as likely to take place on a physical table or in a lab. We propose a set-building interface that functions more like a table than a spreadsheet: individuals in these sets could be manipulated in virtual light tables (Fig. 1). The concept here is that a taxonomist’s physical workspace is often a table littered (literally) with specimens, pieces of paper, containers and tools. This physical space is configured for the task at hand: labelling specimens, exploring them for new diagnostic phenotypes, comparing them to descriptions in the past literature, and so on. Providing an analogous space within software may let taxonomists organize and examine their data in a more free-form, iterative manner and may support the iterative work of grouping, lumping and splitting, and determining the best distinguishing features within the software.

While the obvious use of a virtual light table is to examine images, including zooming, rotating, annotating them, and moving them side by side, we can also envision laying out specimens, labels, notes or other data in a symbolic fashion. These layouts could be visually enhanced by the semantics that relate the core data, for example by laying out specimens as circles on a light table, colored according to the number of measurements taken on them.

Taxonomists are constantly labelling these sets (e.g., paper attached to insect pins) both physically and virtually. A richer, more graphically immersive interface that facilitates annotation, i.e., labelling, is another potential means to bridge the relationship between physical and virtual workspaces. Radial flyout menus (Fig. 3) provide a unified framework for letting users quickly select from a range of types of inputs. We envision allowing the user to tag, take notes, add images, or rate the quality of existing data/observations.
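The colorization idea mentioned above, in which specimens drawn as circles are colored by the number of measurements taken on them, could be sketched as follows. This is a minimal Python illustration under assumed conventions: the white-to-blue linear ramp, the specimen identifiers, and the function names are all hypothetical, not part of any existing workbench.

```python
def lerp_color(start, end, t):
    """Linearly interpolate between two RGB tuples; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(s + (e - s) * t) for s, e in zip(start, end))


def to_hex(rgb):
    """Format an (r, g, b) tuple as a CSS-style hex color string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def colorize(measurement_counts, low=(255, 255, 255), high=(0, 0, 255)):
    """Map each specimen's measurement count to a fill color.

    Specimens with no measurements get the `low` color; the specimen(s)
    with the most measurements get the `high` color.
    """
    top = max(measurement_counts.values(), default=0)
    return {
        specimen: to_hex(lerp_color(low, high, count / top if top else 0.0))
        for specimen, count in measurement_counts.items()
    }


# Hypothetical specimen identifiers and measurement counts.
counts = {"specimen-1": 0, "specimen-2": 5, "specimen-3": 10}
fills = colorize(counts)
```

A rendering layer would then draw each specimen's circle with its fill color, so that poorly measured specimens stand out at a glance.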
When combined with a light table concept, the first stages of taxonomic discovery (e.g., “these hairs are interesting”, “red legs seem to group these specimens”, “these specimens need another look”) become more digitally integrated. Taxonomists link the physical and digital worlds with paper labels; a flyout annotator could further allow the user to indicate that they want to queue a physical version of their annotation for print.

Figure 2. A virtual light table with actionable interfaces. Menus are presented as simultaneously open for display purposes; in practice they would open independently. The interface is designed with touch-screen use in mind. A) Individual attributes can be selected and assigned a color; this highlights symbolic representations of the data. For example, here we see specimens and collecting events from particular geographic areas correspondingly highlighted. B) Images can be manipulated in freeform. Clicking them brings up a menu. C) The menu lets the user quickly add related data to the table, provide annotations, or add new links (semantics). D) Arbitrary groups of data can be defined in freeform by the user, and given simple textual annotations. This is particularly important during the discovery phase of a taxonomist’s research, in which novel phenotypes are being understood and circumscribed for the first time. E) As an object is selected or a function triggered, linkages between symbolic data dynamically appear, allowing the user to quickly see and assign new observations at many different levels. F) Phenotypes can be quickly defined based on set classes (e.g., size, shape, color, and relative nature to other phenotypes). G) Traditional (e.g., qualitative or quantitative) observations can be gathered as well; these are precursors to semantic representations. H) Objects (data) and their groups on the table can be quickly selected and annotated. I) Anatomy classes can be displayed in correspondence to their physical position on the specimen.

Figure 3. A radial menu. A) The initial menu is compact, and can be inserted within complex natural language statements without overburdening the interface. B) On click a radial menu is opened. Clicking a slice of the menu opens the corresponding form. When the user is done, the menu and form collapse in place. See the implementation within the light table (Fig. 2-C).

We know of only a few cases where taxonomists have sought to make 3D anatomical models a part of their workflow for describing taxa. While training taxonomists to describe taxa in 3D modellers (e.g., Blender; https://www.blender.org/) is likely some distance off, there is great potential for exploiting 3D spaces within workbench interfaces. We see their use falling into three categories: 1) spatially binding anatomical concepts to approximate their real-life position (e.g., Fig. 2-I); 2) using symbolic representations of these spatially bound concepts to report metadata (e.g., as heat-mapped values); and 3) exploiting these same models to permit the user to navigate into specific phenotype description templates (this being particularly important with the advent of virtual headset technologies). Collectively, these concepts could, again, provide a more intuitive parallel between the physical and digital worlds, closing the space between the abstractions in the mind of the taxonomist and their digital manifestations.

Figure 4. Symbolic interfaces. A) Statements like “edge curved near tip” are not globally comparable amongst taxa. Allowing the taxonomist to express a visual relationship via a simple interface (drag anchors along a line) quantifies the expression in a manner that is globally comparable. B) Statements like “hair sparse” are useful only in the context of extra metadata (images, figures). Simple interfaces that let the user choose an approximation of what they mean by “sparse”, or tune their own approximations, result in a globally comparable quantification.

A major promise of novel interfaces is to provide new ways to express information that is currently almost exclusively shared in telegraphic natural language or annotated images. We envision navigating from a symbolic 3D representation of a taxon’s anatomy into templates that map to specific phenotype types. To realize the full
use of phenotype templates they need to be classified into categories that taxonomists frequently think of, for example, color, size, shape (e.g. Fig. 2-F). There are various classifications that could be used as the foundation for deriving phenotype template categories [7,8,32,34,46] , though little has been done to provide a formal classification from which to base application development. Lessons from 3D modelling software (e.g., how that software lays out options for color, texture, size) should be explored. We anticipate that symbolically based interfaces (Fig. 4) hold potential for clarifying semantic phenotypes. These types of interfaces are specifically useful in cases where we are making “within” type comparisons (e.g., “edge curved near tip”, “protrusion far from edge”, “setae dense”, “setae sparse”). Here the idea is to give the taxonomists the confidence to express what is a relative NL statement into a visual expression that is numerically quantifiable (e.g., has richer semantics). These expressions are not necessarily perfect, but they are formalized, and as such ultimately comparable in “between” studies. A generalizable set of interface improvements are possible if we can exploit the semantics of existing data while the user is providing new data (sometimes referred to as reasoning on the fly). In this case
assertions made by the taxonomist, like “the specimen has an orange tongue”, are parsed for links to the underlying semantics. The results can be fed back to the user via multiple mechanisms, possibly including visualizations using symbolic representations. Feedback can be of the autocorrect type (“you said ‘hair’, did you mean ‘setae’?”), in which labels are tested against the concepts they are bound to (Fig. 5), or of a consistency-checking type, in which detected concepts are being described in a way that is logically inconsistent (e.g., user error; “the head is attached to the leg” should not be a legal expression according to some ontology). A third type of auto-feedback is more complex and reflects the within/between dichotomy. In this case we can imagine a taxonomist describing a new taxon; while they populate a set of phenotypes, those data are analyzed in real time against existing statements for related taxa. These analyses should be the basis for returning to the taxonomist and prompting them to 1) make new statements; 2) qualify existing statements; or 3) fix inconsistent statements regarding their phenotypes. For a trivial example, imagine a group of species presently diagnosed by head color. A taxonomist seeks to describe a new member of what he or she believes to be this clade. Upon completion, the software detects that the taxonomist has not provided a phenotype that references the head, and it suggests that one be added so that all taxa in the clade, both previously and newly described, can be cross-compared.
We conclude by imagining a more abstract interface that adds simple but powerful semantics to the underlying data (Fig. 6-B). The goal of this interface is to allow the taxonomist to give their “within” (local) phenotype a “between” (global) context. The interface maps specific concepts like “roundness”, “blueness”, “straightness”, “nearness”, “hairiness” 1:1 with a globally accessible endpoint (knowledgebase).
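Two of the feedback types described earlier in this section, label matching (“did you mean ‘setae’?”) and detection of a phenotype that is missing relative to related taxa, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, the species names, and the function names `suggest_concepts` and `missing_anatomy` are hypothetical, and simple string similarity stands in for whatever ontology lookup a real workbench would use.

```python
from difflib import get_close_matches

# Hypothetical mini-vocabulary: preferred label -> labels users might type.
VOCABULARY = {
    "seta": ["setae", "hair", "hairs", "bristle"],
    "head": ["cranium"],
    "leg": ["legs"],
}


def suggest_concepts(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return preferred labels whose label or synonyms resemble `term`."""
    lookup = {}
    for label, synonyms in vocabulary.items():
        for name in [label, *synonyms]:
            lookup[name] = label
    hits = get_close_matches(term.lower(), list(lookup), n=5, cutoff=cutoff)
    suggestions = []
    for hit in hits:  # keep best-first order, drop duplicate concepts
        if lookup[hit] not in suggestions:
            suggestions.append(lookup[hit])
    return suggestions


def missing_anatomy(new_description, clade_descriptions):
    """Anatomical concepts described for every existing member of the
    clade but not referenced in the new description."""
    if not clade_descriptions:
        return []
    shared = set.intersection(*(set(d) for d in clade_descriptions.values()))
    return sorted(shared - set(new_description))


# Hypothetical clade diagnosed by head color; the new description omits the head.
clade = {
    "species A": {"head": "orange", "leg": "brown"},
    "species B": {"head": "black"},
}
print(suggest_concepts("hair"))
print(missing_anatomy({"leg": "brown"}, clade))
```

Note that the user's own wording is never overwritten in this sketch; the functions only return suggestions, in keeping with the non-intrusive behavior described for Fig. 5.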
A taxonomist is prompted to slot his or her phenotype concept between existing concepts, in essence making assertions that “my phenotype is hairy, it is more hairy than this, but less hairy than that”. Within the interface they can step back and forth between nearby phenotypes, or make more radical jumps to something “much hairier”. Results from this type of character are not necessarily locally accurate, for example your concept of blue may differ from mine, but they are globally relative and scoped. The distribution of data in a particular endpoint (all blue things) can be broken into groups and given user-desired labels (Fig. 6-B). For example: 1) I’m using the label “light blue” for
Figure 5. A simple, non-intrusive real-time concept matcher. A) The text box/screen, user provides standard telegraphic natural language in the editor on the left of the split screen, on the right real-time feedback matches text to the current statement based on the cursor location (see arrow); B) As the user types, concepts from existing standards are suggested; C) Some concepts cannot be matched, and the user is prompted to select semantically similar ones, note that the user’s phrase is not automatically changed, this allows them to express themselves in the manner they see fit, while adding a general level of semantics to their statement; D) The user has selected a related concept and also elected to create a new concept; E) The interface could be extended in many ways, for example it could allow the user to express the “nestedness” of the related concepts by allowing them to be indented via a dragging action.
records 0...1000, 2) “blue” for records 1001...1020, and 3) “dark blue” for records 1021...20000. The user could choose to exclude records from a series if they don’t fit the concept. Because these labels (breaks in the distribution) are specifically bound to a position relative to other data, they are vastly more informative than had the data been provided without external reference. Furthermore, as a collective community expands the number of assertions within a particular endpoint, each particular assertion becomes more powerful (there are now X more things to compare it against) and more precise: “it’s blue somewhere between these 1,000 examples on one side and those 1,000 examples on the other” versus “it’s light blue”. This approach is inherently a consensus-building mechanism similar to CAPTCHA-based systems (when five users say the picture contains a dog, we can be confident that it does indeed contain a dog). If the system permits endpoints (e.g., “blueness”) to be cloned, or split into new endpoints, then as end-users find problems with the distribution or definitions they can easily make assertions that they feel more confident with, i.e., the system can evolve. The beauty behind the system is that it allows taxonomists to assert a broader context for their data by using a simple, intuitive interface with minimal decision points: 1) bump my phenotype to the left; 2) bump it to the right; 3) leave it here, I’m done!
4. Summary
The production of semantic phenotypes originates either from the processing of previously published data (at this point exclusively natural language, though image post-processing is conceivable) or from taxonomists as they produce never-before-recorded observations.
While in the former case the resultant utility of a given semantic phenotype is greatly limited by the ability of the parsing algorithm to interpret NL, or by the annotator deriving a semantic phenotype from their understanding of an NL statement, in the latter case what can be expressed is bound only by the interface. In other words, the roadblock preventing taxonomists from producing de novo semantic phenotypes is the lack of novel, imaginative interfaces. These interfaces must reflect the general principles that govern what a taxonomist does if they are to provide a system that resonates with the taxonomist.
Figure 6. An interface to define “curved” using a global context. A) User selects “curved” and their target is randomly placed within the knowledgebase of curved things (as represented by images, textual descriptions, or other interpretable data). By sliding their target between other objects (concepts) that are asserted to “be curved” the user expresses the nature of their curve. B) If the user wishes they can provide labels for what they mean by the degree of “curved”. The endpoints of these ranges are globally referenced within the knowledgebase, in theory automatically accessioning the user’s data as a URI referenced object if desired.
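The slotting interface of Fig. 6 can be approximated by a very small data structure: an ordered series of records for one endpoint, with labels bound to particular records rather than to absolute positions. The sketch below is a minimal illustration, not an implementation from this chapter; the class name `Endpoint`, its methods, and the example records are all hypothetical.

```python
class Endpoint:
    """An ordered series of records for one concept, e.g. "blueness"."""

    def __init__(self, name, records):
        self.name = name
        self.records = list(records)  # ordered from "least" to "most"
        self.breaks = []              # (first record of a group, label)

    def slot(self, record, before):
        """Place `record` immediately before the existing record `before`."""
        position = self.records.index(before)
        self.records.insert(position, record)
        return position

    def bump(self, record, step):
        """Move `record` by `step` places (negative = left), clamped to the ends."""
        old = self.records.index(record)
        new = max(0, min(len(self.records) - 1, old + step))
        self.records.insert(new, self.records.pop(old))
        return new

    def set_labels(self, breaks):
        """Bind each label to the record that opens its group."""
        self.breaks = list(breaks)

    def label_of(self, record):
        """Label of the group that `record` currently falls into."""
        position = self.records.index(record)
        label = None
        ordered = sorted(self.breaks, key=lambda b: self.records.index(b[0]))
        for first_record, name in ordered:
            if self.records.index(first_record) <= position:
                label = name
        return label


blueness = Endpoint("blueness", ["sky", "cornflower", "cobalt", "navy"])
blueness.set_labels([("sky", "light blue"), ("cobalt", "blue"), ("navy", "dark blue")])
blueness.slot("my specimen", before="cobalt")
print(blueness.label_of("my specimen"))
blueness.bump("my specimen", +1)   # "bump my phenotype to the right"
print(blueness.label_of("my specimen"))
```

Binding breaks to records is one possible design choice: it keeps a label stable when other users insert new records elsewhere in the series, which matches the idea that each assertion is positioned relative to other data.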
The interface and functionality ideas outlined above are just that: points in a larger design space. It is, however, a region of the design space that we believe is worth exploring. The taxonomic work process is extremely visual, involving numerous different comparisons by a trained expert eye in order to make appropriate, useful, actionable and replicable distinctions. This visual, side-by-side comparative work is finally translated into a textual description, which has conventionally been free-form natural language, albeit using standardized terminology and structural conventions. We believe that software for taxonomists should support not only the production of structured text, but also the process of getting to that text, acknowledging its visual, comparative and iterative aspects. This somewhat structured natural language is relatively easily shared in worldwide databases aggregating the work of taxonomists across space and time. Semantic phenotypes offer great potential as a way to use even greater structure to support inferencing. Tools that make it easier for taxonomists to produce both textual species definitions and semantic phenotypes, without substantially increasing the work that the taxonomist must do, are highly desirable. Additionally, a tool that supports comparative iterative work enables a recording of all the steps along the way. Such a history makes it easier for a taxonomist to review her work process, recover from dead ends and revert to earlier possibilities, and benefit from reuse of work on very similar species. It also at least offers the possibility of helping others to learn by making more visible their own work practices and the work practices of more expert practitioners through visualizations of the twists and turns of the taxonomic work process. In conclusion, we see no lack of interest from our fellow taxonomists and researchers: they want to build anatomy ontologies, formalize descriptions, and take advantage of the quantitative potential of the data, which is the vision presented here and throughout this book. However, evolving the few existing methods for producing semantic phenotypes into methods that can scale to meet this interest and demand remains a major challenge.
5. Acknowledgements
The work was made possible by a grant from the U.S. National Science Foundation (grant # DBI-1356381). Conversations with participants of the Phenotype Research Coordination Network (NSF 0956049) resulted in the seeds of many of the ideas presented here. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Special thanks to Jim Woolley, John Heraty, István Mikó, and Roger Burks for their ideas and feedback regarding interfaces.
Images used here are from the Wikimedia Commons:
• Plant - By Glenda Wood [CC BY 3.0 au], via Wikimedia Commons
• Curved wrench - By Tomas Castelazo (Own work) [CC BY-SA 3.0] or GFDL (http://www.gnu.org/copyleft/fdl.html), via Wikimedia Commons
• Boomerang - Pic taken by Adrian Barnett (using Olympus C2000Z), from http://en.wikipedia.org/wiki/Image:Boomerang.jpg, Public Domain
• Arch - By Norio NAKAYAMA from Saitama, Japan (Arc de Triomphe Paris France) [CC BY-SA 2.0], via Wikimedia Commons
6. References
[1] Deans AR, Yoder MJ, Balhoff JP (2012) Time to change how we describe biodiversity. Trends Ecol Evol 27:78-84.
[2] Balhoff JP, Mikó I, Yoder MJ, Mullins PL, Deans AR (2013) A semantic model for species description applied to the ensign wasps (Hymenoptera: Evaniidae) of New Caledonia. Syst Biol 62:639-659.
[3] Deans AR, Mikó I, Wipfler B, Friedrich F (2012) Evolutionary phenomics and the emerging enlightenment of arthropod systematics. Invertebr Syst 26:323-330. doi:10.1071/IS12063.
[4] Deans AR, Lewis SE, Huala E, Anzaldo SS, Ashburner M, Balhoff JP, Blackburn DC, Blake JA, Burleigh JG, Chanet B, Cooper LD, Courtot M, Csösz S, Cui H, Dahdul W, Das S, Dececchi TA, Dettai A, Diogo R, Druzinsky RE, Dumontier M, Franz NM, Friedrich F, Gkoutos GV, Haendel M, Harmon LJ, Hayamizu TF, He Y, Hines HM, Ibrahim N, Jackson LM, Jaiswal P, James-Zorn C, Köhler S, Lecointre G, Lapp H, Lawrence CJ, Le Novère N, Lundberg JG, Macklin J, Mast AR, Midford PE, Mikó I, Mungall CJ, Oellrich A, Osumi-Sutherland D, Parkinson H, Ramírez MJ, Richter S, Robinson PN, Ruttenberg A, Schulz KS, Segerdell E, Seltmann KC, Sharkey MJ, Smith AD, Smith B, Specht CD, Squires RB, Thacker RW, Thessen A, Fernandez-Triana J, Vihinen M, Vize PD, Vogt L, Wall CE, Walls RL, Westerfield M, Wharton RA, Wirkner CS, Woolley JB, Yoder MJ, Zorn AM, Mabee P (2015) Finding our way through phenotypes. PLoS Biol 13(1):e1002033. doi:10.1371/journal.pbio.1002033
[5] Mikó I, Friedrich F, Yoder MJ, Hines HM, Deitz LL, Bertone MA, Seltmann KC, Wallace MS, Deans AR (2012) On dorsal prothoracic appendages in treehoppers (Hemiptera: Membracidae) and the nature of morphological evidence. PLoS ONE 7(1):e30137. doi:10.1371/journal.pone.0030137.
[6] Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA (2012) Uberon, an integrative multi-species anatomy ontology. Genome Biol 13:R5.
[7] Wirkner CS, Göpel T, Runge J, Keiler J, Klussmann-Fricke BJ, Huckstorf K, Scholz S, Mikó I, Yoder M, Richter S (2017) The first organ based ontology for arthropods (Ontology of Arthropod Circulatory Systems OArCS) and a semantic model for the formalization of morphological descriptions. Syst Biol 66(5):754-768. https://doi.org/10.1093/sysbio/syw108
[8] Vogt L (2016) Assessing similarity: On homology, characters and the need for a semantic approach to nonevolutionary comparative homology. Cladistics 33(5):513-539. doi:10.1111/cla.12179
[9] Balhoff JP, Dececchi TA, Mabee PM, Lapp H (2014) Presence-absence reasoning for evolutionary phenotypes. arXiv preprint arXiv:1410.3862.
[10] Cui H, Xu D, Chong SS, Ramírez M, Rodenhausen T, Macklin JA, Ludäscher B, Morris RA, Soto EM, Mongiardino Koch E (2016) Introducing Explorer of Taxon Concepts with a case study on spider measurement matrix building. BMC Bioinformatics 17(1):471.
[11] Edmunds RC, Su B, Balhoff JP, Eames BF, Dahdul WM, Lapp H, Lundberg JG, Vision TJ, Dunham RA, Mabee PM, Westerfield M (2016) Phenoscape: Identifying candidate genes for evolutionary phenotypes. Mol Biol Evol 33:13-24.
[12] Mabee PM, Balhoff JP, Dahdul WM, Lapp H, Midford PE, Vision TJ, Westerfield M (2012) 500,000 fish phenotypes: The new informatics landscape for evolutionary and developmental biology of the vertebrate skeleton. J Appl Ichthyol 28(3):300-305.
[13] Manda P, Balhoff JP, Lapp H, Mabee PM, Vision TJ (2015) Using the Phenoscape Knowledgebase to relate genetic perturbations to phenotypic evolution. Genesis 53(8):561-571.
[14] Ramírez MJ, Coddington JA, Maddison WP, Midford PE, Prendini L, Miller J, Griswold CE, Hormiga G, Sierwald P, Scharff N, Benjamin SP, Wheeler WC (2007) Linking of digital images to phylogenetic data matrices using a morphological ontology. Syst Biol 56(2):283-294.
[15] Ramírez MJ, Michalik P (2014) Calculating structural complexity in phylogenies using ancestral ontologies. Cladistics 30(6):635-649.
[16] Thessen AE, Bunker DE, Buttigieg PL, Cooper LD, Dahdul WM, Domisch S, Franz NM, Jaiswal P, Lawrence-Dill CJ, Midford PE, Mungall CJ, Ramírez MJ, Specht CD, Vogt L, Vos RA, Walls RL, White JW, Zhang G, Deans AR, Huala E, Lewis SE, Mabee PM (2015) Emerging semantics to link phenotype and environment. PeerJ 3:e1470. doi:10.7717/peerj.1470
[17] Washington NL, Haendel MA, Mungall CJ, Ashburner M, Westerfield M, Lewis SE (2009) Linking human diseases to animal models using ontology-based phenotype annotation. PLoS Biol 7(11):e1000247.
[18] Dahdul WM, Balhoff JP, Engeman J, Grande T, Hilton EJ, Kothari C, Lapp H, Lundberg JG, Midford PE, Vision TJ, Westerfield M, Mabee PM (2010) Evolutionary characters, phenotypes and ontologies: Curating data from the systematic biology literature. PLoS ONE 5(5):e10708.
[19] Franz NM (2014) Anatomy of a cladistic analysis. Cladistics 30(3):294-321.
[20] ten Hoopen P, Amid C, Buttigieg PL, Pafilis E, Bravakos P, Cerdeño-Tárraga AM, Gibson R, Kahlke T, Legaki A, Narayana Murthy K, Papastefanou G, Pereira E, Rossello M, Luisa Toribio A, Cochrane G (2016) Value, but high costs in post-deposition data curation. Database bav126. doi:10.1093/database/bav126
[21] Dallwitz M (1980) A general system for coding taxonomic descriptions. Taxon 29:41-46.
[22] Hagedorn G, Thiele K, Morris R, Heidorn PB (2005) Structured Descriptive Data (SDD) w3c-xml-schema, Version 1.0. Biodiversity Information Standards (TDWG) http://www.tdwg.org/standards/116
[23] Maddison D, Swofford D, Maddison W (1997) NEXUS: An extensible file format for systematic information. Syst Biol 46(4):590-621.
[24] Norton GA, Patterson DJ, Schneider M (2012) LucID: A multimedia educational tool for identification and diagnostics. Int J Innov Sci Math Ed (formerly CAL-laborate International) 4(1).
[25] Vos RA, Balhoff JP, Caravas JA, Holder MT, Lapp H, Maddison WP, Midford PE, Priyam A, Sukumaran J, Xia X, Stoltzfus A (2012) NeXML: Rich, extensible, and verifiable representation of comparative data and metadata. Syst Biol 61:675-689.
[26] Hong C (2012) CharaParser for fine-grained semantic annotation of organism morphological descriptions. J Assoc Inf Sci Tech 63(4):738-754.
[27] Yoder MJ, Mikó I, Seltmann KC, Bertone MA, Deans AR (2010) A gross anatomy ontology for Hymenoptera. PLoS ONE 5(12):e15991.
[28] Mikó I, Trietsch C, Sandall E, Yoder MJ, Hines H, Deans AR (2016) Malagasy Conostigmus and the secret of scutes. PeerJ 4:e2682. https://doi.org/10.7717/peerj.2682
[29] Mikó I, Masner L, Johannes E, Yoder MJ, Deans AR (2013) Male terminalia of Ceraphronoidea: Diversity in an otherwise monotonous taxon. Insect Syst Evol 44:261-347. doi:10.1163/1876312X-04402002
[30] Mikó I, Copeland RS, Balhoff JP, Yoder MJ, Deans AR (2014) Folding wings like a cockroach: A review of transverse wing folding ensign wasps (Hymenoptera: Evaniidae: Afrevania and Trissevania). PLoS ONE 9(5):e94056. doi:10.1371/journal.pone.0094056
[31] Trietsch C, Deans AR, Mikó I (2015) Redescription of Conostigmus albovarius Dodd, 1915 (Hymenoptera, Megaspilidae), a metallic ceraphronoid, with the first description of males. J Hymenopt Res 46:137. doi:10.3897/JHR.46.5534
[32] Vogt L (2010) Spatio-structural granularity of biological material entities. BMC Bioinformatics 11:289.
[33] Vogt L, Bartolomaeus T, Giribet G (2010) The linguistic problem of morphology: Structure versus homology and the standardization of morphological data. Cladistics 26(3):301-325.
[34] Vogt L, Grobe P, Quast B, Bartolomaeus T (2012) Fiat or bona fide boundary - a matter of granular perspective. PLoS ONE 7(12):e48603.
[35] Vogt L (2017) The logical basis for coding ontologically dependent characters. Cladistics EarlyView. doi:10.1111/cla.12209
[36] Vogt L (2017) Towards a semantic approach to numerical tree inference in phylogenetics. Cladistics 34(2):200-224. doi:10.1111/cla.12195
[37] Bertone MA, Mikó I, Yoder MJ, Seltmann KC, Balhoff JP, Deans AR (2013) Matching arthropod anatomy ontologies to the Hymenoptera Anatomy Ontology: Results from a manual alignment. Database 2013:bas057. doi:10.1093/database/bas057
[38] Cui H, Dahdul W, Dececchi AT, Ibrahim N, Mabee PM, Balhoff JP, Gopalakrishnan H (2015) CharaParser+ EQ: Performance evaluation without gold standard. Proceedings of the Association for Information Science and Technology 52(1):1-10.
[39] Huang F, Macklin JA, Cui H, Cole HA, Endara L (2015) OTO: Ontology Term Organizer. BMC Bioinformatics 16:47. doi:10.1186/s12859-015-0488-1
[40] Yoder MJ, Dole K, Deans AR “Mx” http://purl.oclc.org/NET/mx-database. Accessed 1 Sept. 2017
[41] Balhoff JP, Dahdul WM, Kothari CR, Lapp H, Lundberg JG, Mabee PM, Midford PE, Westerfield M, Vision TJ (2010) Phenex: Ontological annotation of phenotypic diversity. PLoS ONE 5(5):e10500.
[42] Meid S, Baum R, Bhatty P, Grobe P, Köhler C, Quast B, Vogt L (2017) Developing a Module for Generating Formalized Semantic Morphological Descriptions for Morph-D-Base. Proceedings of TDWG 1:e15141. https://doi.org/10.3897/tdwgproceedings.1.15141
[43] Vogt L (2017) Assessing similarity: On homology, characters and the need for a semantic approach to non-evolutionary comparative homology. Cladistics 33:513-539. doi:10.1111/cla.12179
[44] Franz NM, Pier NM, Reeder DM, Chen M, Yu S, Kianmajd P, Bowers S, Ludäscher B (2016) Two influential primate classifications logically aligned. Syst Biol 65(4):561-582.
[45] Dececchi TA, Balhoff JP, Lapp H, Mabee PM (2015) Toward synthesizing our knowledge of morphology: Using ontologies and machine reasoning to extract presence/absence evolutionary phenotypes across studies. Syst Biol 64(6):936-952.
[46] Vogt L (2008) Learning from Linnaeus: Towards developing the foundation for a general structure concept for morphology. Zootaxa 1950:123-152.
Part III Biodiversity Ontologies
Application of Semantic Technology in Biodiversity Science, Anne E. Thessen (Ed.), ISBN 978-3-89838-733-0, © 2018 AKA Verlag Berlin
Chapter 5
Integrating and Managing Biodiversity Data with the Biocollections Ontology
Ramona L. WALLS (a), Pier Luigi BUTTIGIEG (b), John DECK (c), Rob GURALNICK (d), and John WIECZOREK (e)
(a) CyVerse, Bio5 Institute, University of Arizona, Tucson, Arizona, USA
(b) Alfred-Wegener-Institut, Helmholtz-Zentrum für Polar- und Meeresforschung, Bremerhaven, Germany
(c) Berkeley Natural History Museums, University of California at Berkeley, Berkeley, California, USA
(d) Florida Museum of Natural History, University of Florida, Gainesville, Florida, USA
(e) Museum of Vertebrate Zoology, University of California at Berkeley, Berkeley, California, USA
The BioCollections Ontology (BCO) is a semantic model for applied biodiversity science, geared toward data integration. It is based on common semantics shared by ontologies in the Open Biological and Biomedical Ontologies (OBO) Foundry, thus enhancing interoperability with data linked to other life science ontologies. BCO is also compatible with more general semantic web ontologies. By clarifying the semantics of biodiversity science activities such as specimen collection, trait observation, and taxonomic identification (at a level that is useful for practical applications), the BCO supports the types of data management systems needed in the age of large, distributed, and diverse
R. L. Walls et al. / Biodiversity Ontologies
data. This chapter describes the key elements of the BCO and how it complements biodiversity standards efforts such as Darwin Core and MIxS, as well as other relevant ontologies. Concrete examples demonstrate how BCO, in conjunction with other tools and semantic technologies, is addressing scientific and informatic challenges through the enhancement of biodiversity data.
1. Introduction
Because of its multidisciplinary nature, biodiversity science depends on the integration of diverse datasets spanning phenotypes, traits, genes, taxonomy, location, and environment, as well as organismal and species-level interactions. This means that researchers must often perform various forms of data integration. For example, they may combine datasets of the same type from multiple sources for use in a single analysis, merge multiple types of data into a more complex dataset, or create a complex database for a specific domain. Many useful domain ontologies (i.e., ontologies that cover a specific discipline such as anatomy and morphology or environmental entities [1–3]) and standards [4–7] have been developed to support biological data integration, yet many important challenges remain. Domain ontologies can accurately describe the components of a dataset (what organism or tissue was measured, under what environmental conditions, etc.), and metadata standards can specify which aspects of a study should be captured for discoverability and reproducibility. While both of these functions are necessary to semantically describe data collection workflows and track the provenance of physical and digital entities, they are not sufficient. Yet the ability to describe workflows and track provenance is crucial for scalable, automated aggregation and integration of biodiversity data. The BioCollections Ontology (BCO) has been developed specifically to support biodiversity data integration by defining the processes (i.e., events or activities) that are commonly performed in biodiversity science.
These processes include specimen collection, trait measurement, inventorying of species using a defined search protocol, and taxonomic identification. Additionally, BCO provides structured semantics to describe the material and information-based entities that are the inputs and outputs of those processes. By clearly defining biodiversity processes and workflows, BCO can semantically describe biodiversity research and support use cases such as:
• preserving specimen-level information for trait data aggregated to the population or species level
• relating sequence data to taxonomy, even when specimen identifications change
• describing surveys and inventories and assessing their completeness, and
• recording data about observations of interactions between organisms.
This chapter begins with a description of BCO and how it fills key gaps that are not covered by existing biodiversity standards. It then offers three examples of how BCO is being used in biodiversity data integration workflows and ends with a brief discussion of future work.
2. Description of BCO
The origin and early development of BCO are well documented in several publications [8–12]. In brief, BCO grew out of a series of workshops that aimed to clarify the semantics of widely used terms describing activities central to biodiversity science, sourced from the Darwin Core (DwC [6]). Further, BCO set out to use ontological approaches to harmonize terminology from DwC and the Minimum Information for any (x) Sequence (MIxS) Standards [7], in order to help link emerging molecular standards with existing frameworks. The BCO developer team overlaps and works closely with developers of the Population and Community Ontology (PCO [9]) and the Environment Ontology (EnvO [3]), which complement BCO by describing collections of organisms and their interactions and their environmental context, respectively. BCO is housed in a GitHub repository at https://github.com/BiodiversityOntologies/bco and is available via its permanent URL (PURL) at http://purl.obolibrary.org/obo/bco.owl. BCO is part of the Open Biological and Biomedical Ontologies (OBO) Library [13,14] and follows community ontology development practices enumerated in the OBO Foundry Principles [15], including open-source development, licensing, documentation, and commitment to collaboration.
BCO is rooted in the upper level Basic Formal Ontology (BFO [16] ) and reuses terms and ontology design patterns from the Ontology for Biomedical Investigations (OBI [17] ) and the Information Artifact Ontology (IAO), originally developed as part of OBI. Despite
some overlap, BCO is maintained separately from OBI to make it more accessible to biodiversity scientists and to focus input from domain experts. This allows BCO to create biodiversity-specific terms and labels, maintain mappings to other biodiversity vocabularies such as DwC, and import terms from existing OBO Library ontologies with biodiversity-specific terms. Thus, the existence of BCO provides a single home for terms related to biodiversity data management and avoids forcing biodiversity scientists to search through OBI’s very large and complex hierarchy of over 3,000 primarily biomedical terms. A key component shared by BFO, OBI, and IAO that is reused in the BCO is the separation of material entities, qualities/traits, processes, and data about these three categories of entities (Fig. 1-A). OBI’s ontological model of specimen¹ and specimen collection process (to which BCO developers contributed) also plays a central role in BCO, detailed in the following paragraph. One design motivation of BCO is to clearly distinguish different types of activities or processes involved in biodiversity science and specify the types of entities (organisms, specimens, data) that serve as the inputs and outputs of those activities (Fig. 1-B). For example, following OBI, an OBI:specimen collection process has some BFO:material entity as an input and an OBI:specimen as output. In contrast, a BCO:observing process has a material entity as input (the observed thing, in BCO, is usually an organism, collection of organisms, or part of an organism), but its output is an IAO:information content entity, usually an IAO:data item [12]. Both specimen collection process and observing process are now subclasses of OBI:planned process in BCO, that is, they follow some OBI:plan specification. This is a change to BCO as it was described in 2014, enacted because BCO is primarily concerned with describing data from scientific observations.
In biodiversity sciences, plan specifications are often informal, as in the case of incidental or casual observations of organisms made in the course of other activities. Such observations are still said to follow a plan and to be instances of planned process, because opportunistic observations are planned by biodiversity scientists de facto. For improved data sharing and reproducibility, plan specifications of this type should be semantically captured and associated with data explicitly.
¹ Ontology terms in this chapter are italicized, and, if not clear from the context, prefixed with the name space of the ontology from which they come.
Figure 1. (A) At a high level, BCO distinguishes among material entities (subclasses of BFO:independent continuant), traits (subclasses of BFO:dependent continuant), processes (subclass of BFO:occurrent), and data (subclass of BFO:dependent continuant), which can be about a material entity, trait, or process. (B) Some common processes in biodiversity science, and their inputs and outputs, which can be either material entities or data. Colors correspond to the categories in part A.
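The input/output pattern summarized in Fig. 1-B can be illustrated with a small Python sketch that checks whether a recorded process instance fits the expected signature. This is an illustration only, not BCO or OBI tooling: the class labels come from the text, but the simplified subclass table, the `Entity` type, and the `check_process` function are assumptions made for the example.

```python
from dataclasses import dataclass

# Simplified, assumed subclass links among terms named in the text.
SUBCLASS_OF = {
    "OBI:specimen": "BFO:material entity",
    "IAO:data item": "IAO:information content entity",
    "OBI:specimen collection process": "OBI:planned process",
    "BCO:observing process": "OBI:planned process",
}

# Process class -> (class required of the input, class required of the output).
SIGNATURES = {
    "OBI:specimen collection process": ("BFO:material entity", "OBI:specimen"),
    "BCO:observing process": ("BFO:material entity", "IAO:information content entity"),
}


@dataclass(frozen=True)
class Entity:
    identifier: str  # hypothetical local identifier
    cls: str         # ontology class label


def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or is one of its (transitive) subclasses."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def check_process(process_cls, input_entity, output_entity):
    """Return a list of problems; an empty list means the instance fits."""
    if process_cls not in SIGNATURES:
        return [f"unknown process class: {process_cls}"]
    required_input, required_output = SIGNATURES[process_cls]
    problems = []
    if not is_a(input_entity.cls, required_input):
        problems.append(f"input {input_entity.identifier} is not a {required_input}")
    if not is_a(output_entity.cls, required_output):
        problems.append(f"output {output_entity.identifier} is not a {required_output}")
    return problems


organism = Entity("organism-1", "BFO:material entity")
specimen = Entity("specimen-1", "OBI:specimen")
count = Entity("count-1", "IAO:data item")

print(check_process("OBI:specimen collection process", organism, specimen))
print(check_process("BCO:observing process", organism, count))
print(check_process("BCO:observing process", organism, specimen))
```

In a real system this kind of check would normally be left to an OWL reasoner working from the ontology's axioms; the dictionary lookup here only mimics the idea.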
Instances of these practices can then be aggregated into best practices or manuals, which can be exposed online. BCO’s focus on processes and their inputs and outputs provides a flexible method for describing almost any biodiversity workflow in a way that preserves the provenance of data and materials. This type of workflow is shown in Fig. 2 (adapted from Fig. 4 in Walls et al. [10]) where, for example, a simplified metagenomic study includes the process chain: 1) seawater sampling event, 2) seawater filtration, 3) DNA extraction, 4) metagenomic sequencing, and 5) taxonomic identification. BCO clearly links the inputs and outputs of each process, thus creating a foundation for semantically enhanced querying using, for example, a triplestore (a collection of Resource Description Framework or RDF triples, also sometimes called a graph database). Triplestore technology allows the logical links (i.e., the axioms) between entities represented by BCO to be queried to flexibly discover any entity or combination of entities that satisfies user-defined constraints. For example, a user may construct a SPARQL query to find all entities that are associated with environment E and were preserved by method M, including the superclasses and subclasses of M and E. An example using real data is described in Walls et al. [11], and additional workflows are described in more detail below in the section “Using BCO to integrate data”.
3. BCO in relation to existing standards and other ontologies
Because early development of BCO focused on the integration of molecular and organismal biodiversity data, BCO paid special attention to the Darwin Core (DwC) and Minimum Information for any (x) Sequence (MIxS) standards. DwC [4,6] is a standard for exchanging biodiversity data that includes approximately 200 terms and is maintained by Biodiversity Information Standards (TDWG).
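To make the provenance chain from the metagenomic example in Section 2 concrete, the sketch below walks a chain of process steps backwards, from outputs to inputs, and filters records by a class and its subclasses. It is a plain-Python stand-in for what a triplestore and reasoner would do, under stated assumptions: all identifiers, the small class hierarchy, and the record fields are hypothetical, and only subclass expansion (not superclass expansion) is shown.

```python
# Each step: (process, specified input, specified output). Hypothetical identifiers.
STEPS = [
    ("seawater sampling", "ocean site", "seawater sample"),
    ("seawater filtration", "seawater sample", "filter"),
    ("DNA extraction", "filter", "DNA extract"),
    ("metagenomic sequencing", "DNA extract", "sequence data"),
    ("taxonomic identification", "sequence data", "taxon-labeled sequence data"),
]

# Hypothetical class hierarchy (child -> parent) for environments and methods.
SUBCLASS_OF = {
    "coastal sea water": "sea water",
    "sea water": "water",
    "ethanol fixation": "chemical preservation",
}


def derives_from(entity, steps=STEPS):
    """All upstream entities, nearest first, found by walking output -> input."""
    produced_from = {output: source for _, source, output in steps}
    chain = []
    while entity in produced_from and produced_from[entity] not in chain:
        entity = produced_from[entity]
        chain.append(entity)
    return chain


def with_subclasses(cls, hierarchy=SUBCLASS_OF):
    """The class itself plus every direct or indirect subclass."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for child, parent in hierarchy.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def find_entities(records, environment, method):
    """Ids of records whose environment and preservation method match the
    requested classes or any of their subclasses."""
    environments = with_subclasses(environment)
    methods = with_subclasses(method)
    return [
        r["id"] for r in records
        if r["environment"] in environments and r["preserved_by"] in methods
    ]


records = [
    {"id": "s1", "environment": "coastal sea water", "preserved_by": "ethanol fixation"},
    {"id": "s2", "environment": "soil", "preserved_by": "ethanol fixation"},
]
print(derives_from("taxon-labeled sequence data"))
print(find_entities(records, "water", "chemical preservation"))
```

In an RDF setting the same questions would more naturally be asked with SPARQL over the triplestore, with the reasoner supplying the transitive links.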
The DwC vocabulary has been formally described in RDF [18] in order to facilitate re-use, but it intentionally has limited semantics with no hierarchical class structure. Nearly all DwC terms are RDF properties (relations between two resources). While these properties are organized by corresponding classes, there are no formal domain and range specifications. DwC-compliant datasets are most often exchanged in the form of a Darwin Core Archive (DwC-A [19]), a self-described package consisting of one data file and optional “extension” files in text format (CSV, TXT), with content and “star structure” (one-to-many relationship between core and extensions) described in a simple XML file (meta.xml), and with a dataset metadata file in Ecological Metadata Language (EML [20,21]). BCO provides a semantic interpretation of DwC. To ease the conversion of DwC data to triples using BCO, an ontology called “dwcterms.owl” was created that translates DwC as RDF into OWL, with DwC classes interpreted as ontology classes and DwC properties interpreted as data properties [22].

R. L. Walls et al. / Biodiversity Ontologies

Figure 2. A graphical representation of data from a metagenomic workflow. Rectangular boxes represent instances of BCO classes (ovals) that are linked via the OBI relations ‘is specified input of’ and ‘has specified input’. Using the property chain ‘transitively derives from’ from the BCO, a reasoner can infer that a set of sequence data labeled with some microbial taxon ID came from a particular seawater sample.

MIxS [7] is a set of checklists published by the Genomic Standards Consortium (GSC). MIxS currently consists of three separate checklists for genomes, metagenomes, and marker genes. These checklists share a set of core descriptors, but also include checklist-specific descriptors. Each checklist includes a corresponding environmental package intended to describe the conditions under which the sequenced specimen was collected. A related checklist for plant specimens was recently published as a GSC standard by ten Hoopen et al. [23]. MIxS terms are available as RDF [24]. Coordination efforts that were part of the Research Coordination Network for GSC [8,22] eliminated redundancies between DwC and MIxS, so that MIxS now reuses several DwC terms. BCO does not yet import MIxS terms, but those terms can be used as data properties similar to the way DwC properties are used (see “Managing specimen data using BCO and community standards”, below, and Walls et al. [11]).

Many ontologies used to describe biological data are primarily hierarchies of classes (e.g., GO, PO, UBERON, EnvO [14]) meant to cover a specific domain. Ontologies of this type can be used as controlled vocabularies (CVs) to supply values (objects) for metadata properties described in standards such as DwC or MIxS. For example, the MIxS term ‘environmental material’ recommends using one of the subclasses of environmental material from EnvO [3]. Likewise, the DwC term ‘lifeStage’ recommends using a CV, the values of which can be pulled from development stage ontologies such as UBERON [1] for animals or the Plant Ontology [2,25]. Although BCO is not intended to be a class-rich domain ontology, it does include sets of ontology classes that can serve as CVs for data description when no other ontology exists and when subject matter is consistent with BCO. An example of this is a vocabulary for taxonomic inventory processes developed for the Humboldt Core, a draft standard for biological inventory data [26].
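As a minimal sketch of how a class hierarchy can act as a controlled vocabulary, the following Python fragment checks whether a supplied metadata value falls under a required parent class. The tiny hierarchy and its labels are invented for illustration and are not taken from EnvO, UBERON, or any other ontology; a real implementation would read the subclass relations from the ontology file rather than from a hand-written dictionary.

```python
# Toy subclass hierarchy: child label -> parent label.
# These labels are hypothetical placeholders, not real ontology terms.
TOY_SUBCLASS_OF = {
    "sea water": "water",
    "water": "environmental material",
    "soil": "environmental material",
    "juvenile stage": "life cycle stage",
}


def ancestors(term, subclass_of):
    """Return every superclass of `term`, nearest first."""
    found = []
    seen = {term}
    while term in subclass_of:
        term = subclass_of[term]
        if term in seen:  # guard against accidental cycles in the input
            break
        seen.add(term)
        found.append(term)
    return found


def is_valid_value(value, required_parent, subclass_of):
    """True if `value` is the required parent class or one of its subclasses."""
    return value == required_parent or required_parent in ancestors(value, subclass_of)
```

Under these assumptions, `is_valid_value("sea water", "environmental material", TOY_SUBCLASS_OF)` accepts the value because it is an indirect subclass of the required parent, whereas a life-stage label would be rejected for the same field.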
Outside the context of metadata, domain ontologies are also widely used in ‘omics communities to associate one data type (e.g., gene expression) with another (e.g., molecular function or cellular localization). Large datasets associated with ontologies like the Gene Ontology [27], Protein Ontology [28] and Plant Ontology are excellent examples of successful data integration, but their value depends largely on labor-intensive manual curation of data over many years [29]. Replicating this model for biodiversity data (which are more heterogeneous than ‘omics data) is not practical, and more automated methods of data integration are needed. By providing a process-based generalizable model for data integration, BCO aims to provide an extensible semantic base where different standards and ontologies can be brought together to meet the needs of the data at hand, thus complementing the goals of domain ontologies such as GO or PO rather than replacing them. In this respect, BCO is more similar to general observation ontologies than to domain ontologies. Examples of observation ontologies include the Extensible Observation Ontology (OBOE) [30,31] and the Semantic Sensor Network Ontology (SSN) [32], or even the higher-level Provenance Ontology (PROV-O) [33]. BCO differs from these other ontologies not only by specializing in biodiversity data, but also by being tightly aligned with other OBO Library ontologies such as OBI, PCO, and EnvO [10]. These features make BCO accessible to biodiversity researchers while maintaining compatibility with web standards as well as with other data (particularly genetic and genomic data) linked to biomedical, agricultural, or model-organism-oriented ontologies. Further, through alignment with BFO’s upper level semantics, BCO can be extended in a well-defined manner compatible with other OBO resources.

A set of ontologies and methods developed as part of the Research Object project [34,35] serves a purpose very similar to that of BCO: to track the processes, inputs, and outputs of workflows in order to preserve provenance and support reproducibility. However, that project focuses specifically on data processing or computational workflows, and is not geared toward the dual need in biodiversity science to track both material and data entities. Nonetheless, there is a great deal of overlap in the requirements between a Workflow Research Object and BCO use cases, suggesting that future alignment between the two approaches would be fruitful.
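To make the idea of tracking both material and data entities concrete, the metagenomic process chain of Fig. 2 can be sketched in plain Python. Each process records one specified input and one specified output, and a simple traversal stands in for the inference a reasoner would draw from a transitive derivation relation. The identifiers (for example `seawater_sample_1` and `sampling_site_1`) and the tuple layout are illustrative assumptions, not BCO or OBI terms, and a real workflow could have several inputs or outputs per process.

```python
# Each process is recorded as (process label, specified input, specified output).
# All identifiers below are made up for this sketch.
PROCESSES = [
    ("seawater sampling event", "sampling_site_1", "seawater_sample_1"),
    ("seawater filtration", "seawater_sample_1", "filter_1"),
    ("DNA extraction", "filter_1", "dna_extract_1"),
    ("metagenomic sequencing", "dna_extract_1", "sequence_data_1"),
    ("taxonomic identification", "sequence_data_1", "taxon_ids_1"),
]


def derives_from(entity, processes):
    """Return all upstream entities that `entity` was derived from, nearest first."""
    produced_from = {output: inp for _, inp, output in processes}
    chain = []
    seen = {entity}
    while entity in produced_from:
        entity = produced_from[entity]
        if entity in seen:  # stop if the records contain a loop
            break
        seen.add(entity)
        chain.append(entity)
    return chain


def provenance(entity, processes):
    """List the processes, most recent first, that led to `entity`."""
    by_output = {output: (label, inp) for label, inp, output in processes}
    steps = []
    seen = set()
    while entity in by_output and entity not in seen:
        seen.add(entity)
        label, entity = by_output[entity]
        steps.append(label)
    return steps
```

In this toy model, asking what `taxon_ids_1` derives from walks back through the data entities (sequence data) and the material entities (DNA extract, filter, seawater sample) in a single chain, which is the kind of mixed material-and-data provenance the text describes.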
4. Using BCO to integrate and manage data

A transition to the next era of biodiversity data management requires semantic workflows that structure data for machine readability and provenance tracking of both specific data elements and whole datasets. These workflows must be compatible with practices both inside and outside the biodiversity domain, including related earth science observation and W3C standards. The examples below (for data about traits, specimens, and interactions) show how BCO can support such semantic workflows.
4.1. Integrating trait data using BCO:observing process

Trait or phenotype data are a fundamental currency of biological research, along with genotype and environmental data. We use the term “trait” broadly to refer to any quality of an organism, including its body size and shape, biochemical capabilities (e.g., for microbial communities), phenological or developmental attributes, and adaptations to its habitat or frequently encountered environments. Unlike genetic and genomic data, which are encoded in a limited number of ways, trait data can take almost any form, from continuous or discrete numerical values to categorical labels (e.g., ontology terms) to free text descriptions. Primary trait data are the output of direct observations or measurements of organisms (a single organism, a population, or an ecological community). However, trait data are often published in an aggregated form, such as statistically summarized values for populations or species (e.g., individuals in population P have four to five offspring per litter, species X has white flowers, species Y has body length ranging from 10-20 cm). For aggregated trait data, the primary values usually are not published, and no accessible links are maintained between the raw and aggregate values. This practice restricts available information about variation within species and populations and makes it impossible to later “de-aggregate” the data, limiting their future value [36–38]. A growing list of robust domain ontologies can be used to consistently and semantically describe traits (e.g., [1,2,39–41]), but those ontologies provide little if any support for managing and integrating data about traits.
Thus, even while design patterns such as the Entity Quality (EQ) model for describing traits are gaining traction in both ontological studies and trait databases [26,39,42–45], there is as yet no consistent way of storing trait data, let alone one that works across life science domains and scales to very large datasets. BCO, with its process-based semantics, provides a general data model that can be used not only to ingest trait data, but also to record the provenance and relevant metadata about how data are collected. This model relies on instances of BCO:observing process or other OBI:planned processes to describe the activities that generated trait data (e.g., a direct observation of a museum specimen or organism in situ, extraction of data from published sources, averaging of trait values for a population, derivation of values from sensors, etc.) along with the inputs and outputs of those processes. A BCO-based workflow for ingesting trait data facilitates the flow of data (in any direction) between data producers and data aggregators by semantically describing the data and thus making them machine readable and discoverable. This workflow also makes manageable the difficult task of going from spreadsheets generated by different projects to a single combined data set, while at the same time maintaining the rich information content of the original data sources (Fig. 3).
Figure 3. Workflow for semantic annotation of biodiversity trait data. Arrows represent the flow of data. The key actors are data generators and aggregators, which interact with data processing services and a graph database. The logical structure of BCO serves as a background trait data model (gray oval).
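The annotation step of such a workflow can be illustrated with a small, self-contained sketch that turns one tabular record into subject-predicate-object triples. It follows the general pattern of an observing process with a specified input and a specified output; the column names, the `ex:` identifiers, the sample date, and the abbreviated predicate strings are assumptions made for illustration and do not reproduce the actual ingest code or the real term IRIs.

```python
import csv
import io

# A one-row stand-in for a project spreadsheet; columns and values are hypothetical.
SAMPLE_CSV = """record_id,scientific_name,event_date,trait,value
r1,Helianthus annuus,2015-07-01,flowers present,1
"""


def row_to_triples(row):
    """Build triples linking an observing process to its input and output."""
    rid = row["record_id"]
    process = f"ex:process/{rid}"
    organism = f"ex:organism/{rid}"
    datum = f"ex:datum/{rid}"
    return [
        (process, "rdf:type", "bco:observing process"),
        (process, "obi:has specified input", organism),
        (process, "obi:has specified output", datum),
        (process, "dwc:eventDate", row["event_date"]),
        (organism, "dwc:scientificName", row["scientific_name"]),
        (datum, "iao:is about", row["trait"]),
        (datum, "ex:value", row["value"]),
    ]


def ingest(csv_text):
    """Convert every row of a CSV document into a flat list of triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        triples.extend(row_to_triples(row))
    return triples


def records_for_species(triples, name):
    """Find the organisms with a given scientific name, whatever their source."""
    return [s for s, p, o in triples if p == "dwc:scientificName" and o == name]
```

Because every source spreadsheet is reduced to the same triple pattern, rows from different projects can be concatenated into one list and filtered with a single function, while the record identifier embedded in each subject keeps the link back to the original row.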
Data models aligned with BCO’s semantic model, and associated integration workflows, are being used in conjunction with the Plant Phenology Ontology (PPO) [46] to integrate three continental-scale plant phenology datasets: the USA National Phenology Network (NPN) [47], plant phenology reporting from the National Ecological Observatory Network (NEON) [48], and the Pan-European Phenology Network (PEP725) [49]. In this use case, PPO provides the semantics for defining phenological stages and traits, while BCO provides the
semantics for defining the data generation process. Plant phenology is the study of cyclical variation of plant growth and development, generally focusing on the timing of leafing out, flowering, or fruiting. Classes in the PPO describe plant phenological traits (e.g., flowers present or expanding leaves present) that are connected by logical relations and defined in terms of canonical plant structure and plant structure development stage classes from the PO [2]. Data from NPN, NEON and PEP725 are delivered in their native format (either via an API, relational database dump, or tabular extracts) and converted to comma-separated value (CSV) files using formatting pre-processors. Each processed dataset contains its own set of phenological descriptions and data collection protocols, which are mapped to PPO stages and traits. Using BCO, an instance of BCO:observing process is created for each spreadsheet row, along with an instance of PO:whole plant as the specified input to the observation, and some IAO:data item that is about a PPO:plant phenological trait as the output (Fig. 4). Additional data properties are used to add metadata about the observing process and other entities to the data graph, and the entire dataset is stored as RDF triples in a database. Once data are ingested, queries over the integrated dataset can return all records about a species, geographic area, or time period, regardless of the source, and those data can be traced back to the original source.

4.2. Managing specimen data using BCO and community standards

Museum specimen collections are the physical capital on which much of biodiversity science is built. Specimens are a tangible record of biodiversity through space and time and are of high value to society for numerous applications [50].
As biodiversity science expands beyond traditional eukaryotic bounds to encompass microbial and molecular diversity, the definition of a specimen for biodiversity has become broader and now includes environmental samples such as water, soil, or gut contents, as well as DNA extracts therefrom. Likewise, the increasing use of digitization and the ubiquity of sequencing and other molecular techniques have created a need to maintain links between physical specimens and their digital or physical derivatives. As shown in Fig. 2, a series of processes can be chained together, and a shortcut transitive derivation relationship from BCO can be used to identify,
Figure 4. A representation of the graph created by ingesting a single record from the USA-NPN database, which corresponds to a single observation of a sunflower, Helianthus annuus in the beginning of flowering stage. Phenophase status with a value of 1 means that the phenophase described is present or true. The data ingest process automatically creates instances of classes such as observing process and measurement datum. Data properties such as min value and max value (from PPO) or eventDate and scientificName (from DwC) are used to link values from the database to instances.
for example, the ocean water sample that is associated with a set of microbial taxonomic identifiers as well as any metadata associated with the ocean water sample, stored as data properties as shown in Fig. 4. This method of tracking data derived from specimens is described more fully in Walls et al. [10,11] . The first set of classes defined in BCO included material sample and material sampling process, developed specifically to support specimen collection and processing workflows. Although these classes were soon replaced by the OBI classes for specimen and specimen collection, the legacy of that work lives on through the adoption of MaterialSample (http://rs.tdwg.org/dwc/terms/MaterialSample) as a class in DwC and as the properties materialSampleID in DwC and source_mat_id in MIxS. Data stored in a Darwin Core Archive [19]
can be explicitly based on specimens through the use of a Darwin Core Material Sample core definition (https://tools.gbif.org/dwca-validator/extension.do?id=dwc:MaterialSample). Although it has not yet been implemented at scale, a workflow similar to the trait data workflow described in the preceding section can be used to automate the ingest of specimen-based data into RDF.

A major obstacle to integrating specimen data described using DwC or MIxS is the need for massive data cleanup and standardization before any meaningful analyses can be done [51]. This is a challenge that BCO cannot address on its own, but which could be reduced through improved knowledge capture at the point of data collection and better use of ontologies and controlled vocabularies as values for many of the metadata terms provided by DwC or MIxS, a step that both standards are moving toward.

Despite limitations to integration, resources exist to aggregate data from many sources. The largest aggregator of global biodiversity data is the Global Biodiversity Information Facility (GBIF; gbif.org; [52]), which ingests data from more than one thousand institutions and makes them available through a global portal. Other aggregators such as VertNet [53] or InvertNet [54] perform a similar function for selected taxa. The bulk of these aggregated data is about museum specimens and is in the form of Darwin Core Archives using an “Occurrence Core”, where the DWC:basisOfRecord (http://rs.tdwg.org/dwc/terms/basisOfRecord) is a DWC:Occurrence [http://rs.tdwg.org/dwc/terms/Occurrence; “an existence of an Organism (sensu http://rs.tdwg.org/dwc/terms/Organism) at a particular place at a particular time”]. A relatively simple step that would greatly clarify the semantics of Darwin Core records, and thus aggregated biodiversity data at large, would be the widespread adoption of BCO classes as the basisOfRecord.
The MaterialSample archive, based on OBI:specimen and described above, is one example of this, but other BCO classes could be used for different types of records, such as BCO:observing process in place of DWC:Event or the loosely defined classes DWC:MachineObservation and DWC:HumanObservation (see discussion at https://github.com/BiodiversityOntologies/bco/issues/54). DwC already recommends using a controlled vocabulary for basisOfRecord (https://terms.tdwg.org/wiki/dwc:basisOfRecord) in the form
of its own classes, but that list of classes is very limited and not rigorously defined using logical definitions or other semantic approaches. BCO has ingested DwC classes into its logical hierarchy, which provides a start at semantically defining the basisOfRecord terms, and work on this is continuing. We also suggest using BCO for creating new basisOfRecord terms, with the advantage that terms can be put to use quickly and tested by the community before adoption into DwC. BCO version control provides needed reproducibility, without the burden of the lengthy review process required before any use of new DwC classes [55].

4.3. Integrating ontology-based and relational data systems

Graph-based data stores, such as a triplestore, offer rich semantic representation of data, but they come with usability issues such as scalability and the need for special skills that are not yet core competencies for most bio- and ecoinformaticians (e.g., writing SPARQL queries). A great deal of code and infrastructure already exists to support relational databases (DBs), so we offer here a method by which BCO relations can be applied to a relational DB management system. Both relational models and ontologies are specifications intended to be consistent with first order logic, so constructing a relational schema that is consistent with BCO is not necessarily difficult. The challenge lies in constructing a schema that also works for the DB needs. Below we describe a biodiversity DB implementation based on BCO that was designed to store data on both traits and interactions.

In this example, two DBs were built using the same BCO-based schema. The first houses trait data collected by AmphibiaWeb (http://www.amphibiaweb.org/) from literature sources. It contains about 5,000 records, each of which might have values for multiple traits. The other dataset covers interactions between lepidopterans (butterflies and moths) and their host plants.
It contains data on 70-80K interactions compiled in the HOSTS database at the Natural History Museum of London and is being used in the ButterflyNET project. For both datasets, a primary challenge was building a DB that could store original observations of individuals and serve aggregated data to the species level. An additional challenge was the need to capture the source of trait data, which could be observations in nature, observations of specimens, or information from published literature. For ButterflyNET data, it was also necessary to record the two interacting partners along with the type of interaction.

The relational model shown in the top of Fig. 5 is a simplified version of the model in use by the DBs. It shows how the challenges described above may be solved by including tables for planned processes, evidence, (organismal) entities, measured values, and traits. This relational model maps to the ontological model used by BCO, shown in the bottom of Fig. 5, with the exception that the relational model considers the input to a planned process (i.e., a material sample or information artifact) as evidence of an entity or set of interacting entities. BCO, in contrast, specifies inputs to planned processes as either a BFO:material entity or IAO:information content entity, depending on the process type. We consider the concept of evidence to be central to biodiversity studies, and thus important for BCO, but further specification is needed for both the relational model and BCO. Such specification requires care, as the semantics of what different researchers call “evidence” are quite complex and currently under discussion in a number of ontology developer groups. We are participating in those discussions, which are beyond the scope of this chapter, and will incorporate appropriate classes and properties to describe evidence into BCO as they become more stable.

Despite small differences in the models, data from the relational DB could easily be ingested into a BCO-based triplestore by assigning instances to the ontology classes based on values from the relational tables and by substituting OWL object properties for the links between tables. However, the primary goal of this work was not to be able to convert between relational and graph-based data, but rather to impart clear semantics to the relational model to support its defined use cases.
Typical queries used with these DBs are to aggregate all trait values for a species or a genus of Amphibians, to list all host plants for a genus or species of Lepidoptera, or to list lepidopteran ‘pests’ on any genus or species of plants. In sum, relational database models that mimic the knowledge modeling of BCO and other observation and measurement ontologies provide a flexible way to maintain compatibilities with existing, well-established technologies while also being forward compatible with newer semantic tools.
Figure 5. A relational database model (top) that incorporates the same basic semantics as the BCO (bottom). Note that the relational model uses tables for “process evidence” and “entity interactions” in order to specify complex many-to-many relations, which can be more simply specified using object properties at the instance level in an ontology.
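A relational layout along these lines can be sketched with Python's built-in sqlite3 module. The table and column names below are hypothetical simplifications loosely inspired by the tables named in the text (planned processes, evidence, entities, measured values, and traits); they are not the schema used for the AmphibiaWeb or ButterflyNET data, and the two stored observations are invented. The query at the end shows the kind of species-level aggregation described above, computed on demand while the individual observations and their sources remain stored.

```python
import sqlite3

SCHEMA = """
CREATE TABLE entity (id INTEGER PRIMARY KEY, species TEXT NOT NULL);
CREATE TABLE trait (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
CREATE TABLE planned_process (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
-- evidence is the input to a planned process and points at an entity
CREATE TABLE evidence (
    id INTEGER PRIMARY KEY,
    process_id INTEGER NOT NULL REFERENCES planned_process(id),
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    source TEXT NOT NULL
);
CREATE TABLE measured_value (
    id INTEGER PRIMARY KEY,
    process_id INTEGER NOT NULL REFERENCES planned_process(id),
    trait_id INTEGER NOT NULL REFERENCES trait(id),
    value REAL NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding two invented observations."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.execute("INSERT INTO entity VALUES (1, 'Species Y')")
    con.execute("INSERT INTO trait VALUES (1, 'body length (cm)')")
    con.executemany(
        "INSERT INTO planned_process VALUES (?, ?)",
        [(1, "observation of specimen"), (2, "extraction from literature")],
    )
    con.executemany(
        "INSERT INTO evidence VALUES (?, ?, ?, ?)",
        [(1, 1, 1, "museum specimen"), (2, 2, 1, "published article")],
    )
    con.executemany(
        "INSERT INTO measured_value VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 10.0), (2, 2, 1, 20.0)],
    )
    return con


def trait_summary(con, species):
    """Aggregate stored observations to the species level."""
    return con.execute(
        """
        SELECT t.label, COUNT(*), MIN(m.value), MAX(m.value)
        FROM measured_value AS m
        JOIN trait AS t ON t.id = m.trait_id
        JOIN evidence AS e ON e.process_id = m.process_id
        JOIN entity AS en ON en.id = e.entity_id
        WHERE en.species = ?
        GROUP BY t.label
        """,
        (species,),
    ).fetchall()
```

Keeping the aggregate as a query rather than a stored value is a deliberate choice in this sketch: the raw measurements and the evidence rows that justify them stay available, so a summarized range can always be traced back to the observations behind it.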
5. Future directions for BCO

BCO has been available to use since 2015, and it continues to progress to meet new use cases. The case studies provided above describe some of the early adoptions of BCO for real-world data management. As projects using BCO grow and scale, BCO must also become more robust and scalable. To allow BCO developers to focus more on content development, a crucial next step is to refactor the ontology development process for scripted, reproducible editing, building, and releasing of the ontology. Since BCO’s inception, new and improved tools for ontology development have been developed [56,57], and we hope to take advantage of those tools to update BCO’s production protocols.
An important factor for the scalability and sustainability of BCO is the availability of tools for working with BCO and other ontologies that meet specific data management challenges faced by the biodiversity sciences community. In response to that need, we are actively developing tools for data ingest, such as that being used with PPO [58]. Direct phenology observations are fairly simple to model (a person observes a plant or a patch of plants and records data about it), making their ingest relatively straightforward. More complex observations and sampling processes, such as PhenoCam [59] or satellite data streams that are later translated into phenological data, complex inventory protocols as defined in Humboldt Core, or nested sampling protocols used in microbial diversity studies, produce more complex datasets. Building scalable data ingest workflows for those datatypes that can work for more than a single data generator will present many additional challenges, and consensus solutions should be sought together with other ontologies facing the same complexity.

One solution to scaling is a hybrid approach that combines graph databases with other data stores such as relational or object-oriented DBs, or other data storage systems such as iRODS [60]. A step in this direction is the prototype relational model described in this chapter. This schema could be extended to support a broad array of biodiversity databases, similar to the way that the CHADO schema from the Generic Model Organism Database (GMOD) project [61] has been able to serve the needs of many model organism databases by including modules for genes, phenotypes, organisms, and sequences. BCO could release a corresponding modular relational schema for use by the biodiversity community, but we would only pursue this step should a clear need arise.

Finally, to ensure interoperability of BCO-annotated data, we must ensure that BCO is interoperable with other ontologies.
Interoperability with other OBO Library ontologies is already well established, but there are many other widely used resources that are important for linked open data, data management, and scientific reproducibility. Alignment of BCO with OBOE and SSN (described above) is ongoing, and we have already confirmed that all three ontologies use compatible models of observations. Using BCO in conjunction with terms from semantic web ontologies such as FOAF [62] or metadata vocabularies such as Dublin Core [63] may make data described using BCO more compatible with data published outside
biology. Going forward, formal mapping (i.e., specification of OWL subclassOf, subPropertyOf, or sameAs relations) between BCO terms and the W3C ontologies PROV-O [33] and the Web Annotation Data Model [64] could allow BCO to function better within a wider range of data systems, including semantic web applications.

6. Acknowledgements

Funding for early development of BCO was provided by the Phenotype Ontology Research Coordination Network (NSF-DEB-0956049), EAGER: An Interoperable Information Infrastructure for Biodiversity Research (NSF-IIS-1255035), RCN4GSC: A Research Coordination Network for the Genomic Standards Consortium (NSF-DBI-0840989), and Collaborative Research: BiSciCol Tracker: Towards a tagging and tracking infrastructure for biodiversity science collections (NSF-DBI: 0956371, 0956350, 0956426). R. Walls was supported by CyVerse (NSF-DBI-0735191 and NSF-DBI-1265383).

7. References

[1] Mungall CJ, Torniai C, Gkoutos GV, Lewis SE, Haendel MA (2012) Uberon, an integrative multi-species anatomy ontology. Genome Biol 13:R5.
[2] Cooper L, Walls RL, Elser J, Gandolfo MA, Stevenson DW, Smith B, Preece J, Athreya B, Mungall CJ, Rensing S, Hiss M, Lang D, Reski R, Berardini TZ, Li D, Huala E, Schaeffer M, Menda N, Arnaud E, Shrestha R, Yamazaki Y, Jaiswal P (2013) The Plant Ontology as a tool for comparative plant anatomy and genomic analyses. Plant Cell Physiol 54:e1. doi:10.1093/pcp/pcs163
[3] Buttigieg PL, Morrison N, Smith B, Mungall CJ, Lewis SE (2013) The environment ontology: Contextualising biological and biomedical entities. J Biomed Semant 4:43.
[4] Darwin Core Task Group (2015) Darwin Core. http://rs.tdwg.org/dwc/
[5] Holetschek J, Dröge G, Güntsch A, Berendsohn WG (2012) The ABCD of primary biodiversity data access. Plant Biosystems - An International Journal Dealing with all Aspects of Plant Biology 146:771-779.
[6] Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, Giovanni R, Robertson T, Vieglais D (2012) Darwin Core: An evolving community-developed biodiversity data standard. PLoS ONE 7(1):e29715. https://doi.org/10.1371/journal.pone.0029715
[7] Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G, et al. (2011) Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnology 29:415-420.
[8] Deck J, Barker K, Beaman R, Buttigieg PL, Dröge G, Guralnick R, Miller C, Ó Tuama É, Murrell Z, Parr C, Robbins B, Schigel D, Stucky B, Walls R, Wieczorek J, Morrison N, Wooley J (2013) Clarifying Concepts and Terms in Biodiversity Informatics. Standards in Genomic Sciences 8:352.
[9] Deck J, Guralnick R, Walls R, Blum S, Haendel M, Matsunaga A, Wieczorek J (2015) Meeting report: Identifying practical applications of ontologies for biodiversity informatics. Standards in Genomic Sciences 10:25.
[10] Walls RL, Deck J, Guralnick R, Baskauf S, Beaman R, Blum S, Bowers S, Buttigieg PL, Davies N, Endresen D, et al. (2014) Semantics in support of biodiversity knowledge discovery: An introduction to the Biological Collections Ontology and related ontologies. PLoS ONE 9:e89606. doi:10.1371/journal.pone.0089606
[11] Walls RL, Guralnick R, Deck J, Buntzman A, Buttigieg PL, Davies N, Denslow MW, Gallery RE, Parnell JJ, Osumi-Sutherland D, Robbins RJ, Rocca-Serra P, Wieczorek J, Zheng J (2014) Meeting report: Advancing practical applications of biodiversity ontologies. Standards in Genomic Sciences 9:17.
[12] Walls RL, Guralnick R, Deck J, Matsunaga A (2014) A specimen-based view of the world: Using the Biological Collections Ontology to query biodiversity data. In: Joint Proceedings of The First International Workshop on Drug Interaction Knowledge Representation (DIKR 2014), The Second International Workshop on Definitions in Ontologies (IWOOD 2014), and The Starting an OBI-based Biobank Ontology Workshop (OBIB 2014). Houston, Texas, USA: CEUR Workshop Proceedings, 64-66.
[13] Population and Community Ontology (2017) http://www.obofoundry.org/ontology/pco.html
[14] The OBO Foundry (2017) http://www.obofoundry.org/
[15] The OBO Foundry (2017) OBO Foundry Principles. http://www.obofoundry.org/principles/fp-000-summary.html
[16] Arp R, Smith B, Spear AD (2015) Building Ontologies with Basic Formal Ontology. MIT Press.
[17] Bandrowski A, Brinkman R, Brochhausen M, Brush MH, Bug B, Chibucos MC, Clancy K, Courtot M, Derom D, Dumontier M, et al. (2016) The Ontology for Biomedical Investigations. PLoS ONE 11:e0154556. https://doi.org/10.1371/journal.pone.0154556
[18] Biodiversity Information Standards (TDWG) (2015) Darwin Core as RDF. http://rs.tdwg.org/dwc/terms/guides/rdf/index.htm
[19] Darwin Core Task Group (2015) Darwin Core Text Guide. http://rs.tdwg.org/dwc/terms/guides/text
[20] KNB (2017) https://knb.ecoinformatics.org/
[21] Michener WK, Brunt JW, Helly JJ, Kirchner TB, Stafford SG (1997) Nongeospatial Metadata for the Ecological Sciences. Ecological Applications 7:330-342.
[22] Ó Tuama É, Deck J, Dröge G, Döring M, Field D, Kottmann R, Ma J, Mori H, Morrison N, Sterk P, Sugawara H, Wieczorek J, Wu L, Yilmaz P (2012) Meeting Report: Hackathon-Workshop on Darwin Core and MIxS Standards Alignment (February 2012). Standards in Genomic Sciences 7:166-170.
[23] ten Hoopen P, Walls RL, Cannon EKS, Cochrane G, Cole JR, Johnston A, Karsch-Mizrachi I, Yilmaz P (2016) Plant specimen contextual data consensus. GigaScience 5(1):1-4. https://doi.org/10.1093/gigascience/giw002
[24] Yilmaz P (2017) Automatically exported from code.google.com/p/mixs-as-rdf.
[25] Walls RL, Athreya B, Cooper L, Elser J, Gandolfo MA, Jaiswal P, Mungall CJ, Preece J, Rensing S, Smith B, Stevenson DW (2012) Ontologies as integrative tools for plant science. American Journal of Botany 99:1263-1275.
[26] Guralnick R, Walls R, Jetz W (2017) Humboldt Core - toward a standardized capture of biological inventories for biodiversity
102
[27]
[28]
[29] [30]
[31]
[32]
[33] [34]
[35] [36]
[37]
R. L. Walls et al. / Biodiversity Ontologies
Application of Semantic Technology in Biodiversity Science, Anne E. Thessen (Ed.), ISBN 978-3-89838-733-0, © 2018 AKA Verlag Berlin
Chapter 6
The Flora Phenotype Ontology (FLOPO) and the FLOPO Knowledgebase

Robert HOEHNDORF (a), Claus WEILAND (b), Marco SCHMIDT (b, c), Quentin GROOM (d), George GOSLINE (e), Stefan DRESSLER (f), and Thomas HAMANN (g)

(a) King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
(b) Senckenberg Biodiversity and Climate Research Centre (SBiK-F), Frankfurt, Germany
(c) Palmengarten der Stadt Frankfurt am Main, Frankfurt, Germany
(d) Botanic Garden Meise, Meise, Belgium
(e) Royal Botanic Gardens, Kew, London, UK
(f) Senckenberg Research Institute and Museum, Frankfurt, Germany
(g) Naturalis Biodiversity Center, Leiden, The Netherlands

Plant morphology has been at the base of botanical systematics for centuries, resulting in a large body of descriptive plant data, including synoptic texts like monographs and floras and, in recent times, character matrices and trait databases. Systematic analysis of trait data is hampered by the lack of a standardized vocabulary and by difficulties in extracting structured information from these knowledge sources. Therefore, the Flora Phenotype Ontology (FLOPO) was built by text mining floras for morphological entities and corresponding attributes (using PO and PATO), automated reasoning and subsequent manual curation. Consequently, FLOPO comprises a
R. Hoehndorf et al. / Flora Phenotype Ontology
large number of morphological traits and phenotypes. By mobilizing such data from legacy literature, integration with data from other biodiversity data sources and images becomes possible. This includes (1) character matrices used for determination keys or morphology-based phylogenetics, (2) morphological traits extracted from other biodiversity literature such as taxonomic revisions, and (3) traits extracted from images by pattern recognition. In this way, vast amounts of data are accessible for interlinkage with other domains, and for use in morphological, evolutionary, genetic, and ecological studies including species interactions and macroecology. FLOPO can be applied beyond biodiversity research in fields such as plant breeding, ethnobotany, horticulture, and, through the frequent use of floral design, even in studies of arts and crafts. FLOPO can be browsed and downloaded from http://aber-owl.net/ontology/FLOPO. The FLOPO knowledgebase is accessible via http://flopo.senckenberg.de/sparql.

1. Introduction

Named after the Roman goddess of flowers, spring and fertility, a “flora” comprises all plant species of a certain region and/or time period and is, at the same time (first letter capitalized), a term for a book (or website) collecting the information on these plants. Although Floras are sometimes just simple lists of taxa, they usually contain identification keys and morphological descriptions of the plants, as well as additional information on their distribution, ecology and uses. Following the rediscovery of classical Greek and Roman botanical works by Renaissance scientists, the first Floras were written in the early 16th century: the herbal books by Otto Brunfels, William Turner, Leonhart Fuchs and Hieronymus Bock, with a focus on medicinal plants. The first local Floras appeared by the end of the 16th century and the first country flora, Flora Danica [1], soon after.
A significant change was the introduction of Linnaean nomenclature in the mid-18th century, which had consequences for the way Floras were written. During this period, knowledge of the global flora grew with the expansion of European exploration and colonization. Large-scale descriptive works appeared for many areas of the world, and by the early 19th century the Flore Française by Lamarck and de Candolle set standards for the modern Flora.
This format is still largely followed: it includes an introduction to the region and its habitats and an arrangement following hierarchical plant systematics, in most cases starting with higher taxa such as divisions and classes, often skipping orders, but using families and genera, and ending with the species as the basic unit. Keys are often provided at different taxonomic levels. Species accounts usually include synonyms and often also vernacular names. A morphological description is given that can be very detailed or reduced to only a few distinctive characters, followed by notes on habitat and other ecological information, geographic distribution, and medicinal and other uses. A comprehensive overview of the world’s Floras is given by Frodin [2].

Many Floras are available digitally through the Biodiversity Heritage Library (BHL), and an increasing number of Floras are freely available as eBooks (e.g., Flora of West Tropical Africa [3]) or via websites (e.g., Flora Zambesiaca [4], Flore d’Afrique Centrale [5]). For data mobilization from legacy literature, there is a huge corpus of Floras available; however, the names used for taxa have changed over time as plant systematics has progressed. Disambiguation of plant names can, to a certain extent, be addressed by applying known synonymies (e.g., from The Plant List [6]). However, morphological terms and species concepts have also been refined over time. To avoid naming problems, it is preferable to use the most recent Floras. Nevertheless, historical descriptions of plant ecology and habitats have value in understanding environmental change that has few other sources of documentation [7]. Floras contain many different types of data, but because they are used for plant identification and often contain descriptions of novel taxa, they are the most important source of descriptions of plant morphology for large sets of species.
The Flora Phenotype Ontology (FLOPO) specifically focuses on morphological traits, also known as entities and qualities, or characters and character states, as a plant morphologist would express it [8].

2. Methods

In a first step, we compiled a vocabulary of plant morphological terms in the form of entities, attributes, and attribute values (Fig. 1). We largely relied upon two ontologies, already widely used in biological
Figure 1. Workflow for the data-driven generation of the Flora Phenotype Ontology and the FLOPO knowledgebase: Text-mining is applied to Floras to extract Entity-Quality (EQ) statements and link them to taxa using standard domain ontologies (PO, PATO). The mobilized data is used to create both the ontology FLOPO and a knowledgebase containing FLOPO-annotated traits of taxa.
research: the Plant Ontology (PO [9]) for plant morphological entities and the Phenotype and Trait Ontology (PATO [10]) for their attributes and their values. We used these ontologies to extract Entity-Quality pairs from a corpus of digitized Floras, encompassing the Flora Malesiana, Flore du Gabon, Flore d’Afrique Centrale, Flore du Congo Belge et du Ruanda-Urundi, Flora Zambesiaca, Flora of Tropical East Africa, Flora of West Tropical Africa, Flora of Tropical Africa, Flora Capensis, and Useful Plants of West Tropical Africa. Most of these Floras were written in English, and for these we directly identified co-occurrences of the label of a class from PO and the label of a class from PATO in the same sentence. To apply PO and PATO terms to the Floras available in French, we used a bilingual botanical glossary from the Missouri Botanical Garden [11] to translate PO and PATO class labels, and identified those in the French text. Entity-Quality pairs corresponding to 20,584 distinct combinations of PO/PATO terms were extracted by text mining, and we used these to build the Flora Phenotype Ontology (FLOPO); single occurrences of these Entity-Quality pairs with the respective taxon they describe (502,693 records) became the foundation of the FLOPO knowledgebase.

In addition to the initial set of trait data from text mining the Floras, we integrated another set of morphological traits from the online database African Plants - a photo guide [12–14]. For each of the c. 6000 species documented there, a set of easily recognizable morphological traits including habit, leaf, flower, and fruit traits has been encoded using available literature as well as collection material and photo records. These traits have been matched to FLOPO terms and the species-trait records added to the knowledgebase.

We use the GBIF API to match taxon names in the Floras to their corresponding GBIF taxon identifiers.
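As a rough illustration of the sentence-level co-occurrence step used for text mining, the following Python sketch pairs every entity label with every quality label found in the same sentence. The two small label dictionaries, their placeholder identifiers, the example description, and the function names are assumptions made for illustration; the actual pipeline worked with the full PO and PATO label sets and is not reproduced here.

```python
import re
from itertools import product

# Hypothetical mini-vocabularies: label -> placeholder identifier.
# A real run would load all class labels from PO and PATO instead.
PO_LABELS = {"flower": "PO:EXAMPLE_FLOWER", "seed": "PO:EXAMPLE_SEED"}
PATO_LABELS = {"red": "PATO:EXAMPLE_RED", "black": "PATO:EXAMPLE_BLACK"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def labels_in(sentence, vocabulary):
    """Return identifiers whose label occurs as a whole word in the sentence."""
    lowered = sentence.lower()
    return [
        identifier
        for label, identifier in vocabulary.items()
        if re.search(r"\b" + re.escape(label) + r"\b", lowered)
    ]


def entity_quality_pairs(text):
    """Collect (entity, quality) pairs that co-occur within one sentence."""
    pairs = set()
    for sentence in split_sentences(text):
        entities = labels_in(sentence, PO_LABELS)
        qualities = labels_in(sentence, PATO_LABELS)
        pairs.update(product(entities, qualities))
    return pairs


# Invented description text, used only to exercise the functions.
description = "Flower red, in dense spikes. Seed black and shining."
pairs = entity_quality_pairs(description)
```

Whole-word matching of this kind misses plurals and inflected forms, and pairing everything within a sentence can link a quality to the wrong entity, which is one reason a curation step after extraction is useful.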
The GBIF taxon backbone seemed a good choice for us: it is widely used in the biodiversity community, covers a wide range of taxa, and offers an API that is easy to use (for an overview of taxon backbones, see Rees & Cranston [15]). Specifically, we used the GBIF API to find the taxa matching the taxon label in the Flora. We filter by a matching confidence of 95 to ensure that only high-quality matches are retained. We use the GBIF usage identifier and its corresponding URI to identify a taxon within our knowledgebase. For example, using the GBIF API, the label “Cyathula cylindrica” maps to the species [16] with the
label “Cyathula cylindrica Moq.” and the taxonomic rank “species”. We represent the taxonomic rank and label of the taxon in our RDF knowledgebase using the rdfs:label property and the relations and taxa from the Taxonomic Rank Ontology (a part of the Vertebrate Taxonomy Ontology):

    @prefix rdfs: .
    @prefix gbif: .
    @prefix obo: .
    gbif:5548600 rdfs:label "Cyathula cylindrica Moq."@en .
    gbif:5548600 obo:TAXRANK_1000000 obo:TAXRANK_0000006 .
The property obo:TAXRANK_1000000 is the has_rank relation used to associate an entity (here, a taxon) with its taxonomic rank; obo:TAXRANK_0000006 is the rank “species”. Within our knowledgebase, we only use “species” (obo:TAXRANK_0000006) and “family” (obo:TAXRANK_0000004).

Through text mining, we identified Entity-Quality pairs in a set of taxon descriptions and used these to generate FLOPO. Subsequently, FLOPO has evolved to improve its quality and coverage: new classes have been created and obsolete classes removed. Not every Entity-Quality pair will map directly to a FLOPO class, either because the class has been removed or because it was not yet required for annotation. Given an Entity-Quality pair with Entity E and Quality Q, we use the ELK OWL reasoner to identify a class equivalent to the complex class description has_part some (E and has_quality some Q), which corresponds to the definition pattern of classes in FLOPO. If an equivalent class is found, we assign this class to the taxon as its trait; otherwise, we assign the most specific superclass in FLOPO. We use the has_phenotype relation (RO:0002200) from the OBO Relation Ontology to associate a taxon with its FLOPO class. For example, to assign the FLOPO class “flower red” (FLOPO_0007599) to Cyathula cylindrica, we create the RDF statement:

    gbif:5548600 obo:RO_0002200 obo:FLOPO_0007599 .
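The assignment logic above can be pictured with a small Python sketch. It does not call an OWL reasoner: a plain dictionary stands in for the ELK equivalence query, and the superclass fallback is reduced to a single lookup. The taxon identifier, the has_phenotype relation and the class FLOPO_0007599 are taken from the text; the entity and quality keys, the fallback class `FLOPO_EXAMPLE_PARENT`, and all function names are hypothetical.

```python
# Stand-in for the reasoner: maps an (entity, quality) pair to the FLOPO
# class equivalent to "has_part some (E and has_quality some Q)".
# Only FLOPO_0007599 ("flower red") comes from the text; keys are placeholders.
EQUIVALENT_CLASS = {("flower", "red"): "FLOPO_0007599"}

# Hypothetical fallback table: most specific FLOPO superclass per entity.
MOST_SPECIFIC_SUPERCLASS = {"flower": "FLOPO_EXAMPLE_PARENT"}


def class_expression(entity, quality):
    """Render the FLOPO definition pattern as a readable string."""
    return f"has_part some ({entity} and has_quality some {quality})"


def flopo_class_for(entity, quality):
    """Prefer an equivalent class; otherwise fall back to a superclass."""
    exact = EQUIVALENT_CLASS.get((entity, quality))
    if exact is not None:
        return exact
    return MOST_SPECIFIC_SUPERCLASS.get(entity)


def phenotype_triple(gbif_id, flopo_class):
    """Format a taxon-trait statement using has_phenotype (RO_0002200)."""
    return f"gbif:{gbif_id} obo:RO_0002200 obo:{flopo_class} ."


triple = phenotype_triple("5548600", flopo_class_for("flower", "red"))
fallback = flopo_class_for("flower", "purple")
expression = class_expression("flower", "red")
```

Here `triple` reproduces the statement for Cyathula cylindrica, while `fallback` shows the path taken when no equivalent class exists for a pair.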
Our knowledgebase is generated from two data sources: text-mined information from Floras, and manually curated traits from
Table 1. Mobilized data from text-mining floras and the traits coded for the multi-entry identification key in African Plants - a photo guide.

                          Text-Mined Data    African Plants
                          From Floras        Identification Key
Triples                   3,528,410          960,776
Taxa                      16,913             5,796 (all spp. or below)
Taxon-Trait Pairs         388,298            105,326
FLOPO Classes (traits)    17,744             88
the African Plants Database. We make the kind of assignment explicit by using the Evidence Code Ontology (ECO) to assign an evidence code to individual RDF statements. Specifically, we reify the RDF triples and attach the evidence codes and the source to the reified statement. We assign the evidence codes “manual assertion” (ECO:0000218) and “traceable author statement” (ECO:0000033) to taxon-trait pairs derived from the African Plants Database, and “automatic assertion” (ECO:0000203) and “non-traceable author statement” (ECO:0000034) to text-mined taxon-trait pairs. In each case, we add the source of the statement as a direct reference to the reified taxon-trait assertion.

3. Results

We were able to mobilize a large amount of data through text mining and integration of data from the African Plants identification key (Table 1). Due to the different nature of the datasets, the Flora data are much more diverse than the 88 traits chosen for plant identification in African Plants. These data were prepared as outlined in the Methods section above and are available for download [17]. Furthermore, we set up a Virtuoso web interface to query the database [18]. To obtain all species with black seeds, for example, the interface can be queried as follows:

    PREFIX FLOPO:
    PREFIX RO:
    PREFIX rdfs:
    PREFIX dwc:
    PREFIX gbif:

    SELECT ?species_name ?taxon_uri
    FROM
    WHERE {
      ?taxon_uri RO:0002200 FLOPO:0006696 ;
                 rdfs:label ?species_name ;
                 # only species, no higher taxa:
                 dwc:taxonRank gbif:species .
    }
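For readers who prefer to script such requests, the following Python sketch assembles the same kind of query and encodes it as an HTTP GET request, without sending it. The endpoint address is the one given in the abstract. The prefix IRIs are assumptions (the OBO PURL pattern, the RDF Schema and Darwin Core namespaces, and a deliberately fake placeholder for gbif), the FROM clause is left out because the graph IRI is not given here, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as given in the chapter abstract.
ENDPOINT = "http://flopo.senckenberg.de/sparql"

# Assumed prefix IRIs; they may differ from those used by the knowledgebase.
# The gbif IRI in particular is a placeholder, not a real namespace.
PREFIXES = {
    "FLOPO": "http://purl.obolibrary.org/obo/FLOPO_",
    "RO": "http://purl.obolibrary.org/obo/RO_",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "gbif": "http://example.org/gbif/",
}


def build_query(flopo_id):
    """Assemble the species-by-trait query for one FLOPO class ID."""
    header = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in PREFIXES.items())
    body = (
        "SELECT ?species_name ?taxon_uri\n"
        "WHERE {\n"
        f"  ?taxon_uri RO:0002200 FLOPO:{flopo_id} ;\n"
        "             rdfs:label ?species_name ;\n"
        "             dwc:taxonRank gbif:species .\n"
        "}"
    )
    return header + "\n" + body


def request_url(query):
    """Encode the query as the 'query' parameter of an HTTP GET request."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = build_query("0006696")  # the black-seed class from the query above
url = request_url(query)
```

Keeping the class ID as a parameter makes it straightforward to reuse the same template for other traits; before any real use, the placeholder IRIs would need to be replaced with the ones the endpoint actually expects.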
It is also possible to use the (English) label of the FLOPO class in the query instead by replacing ?taxon_uri RO:0002200 FLOPO:0006696 in the query with the two lines:
    ?x rdfs:label "seed black"@en .
    ?taxon_uri RO:0002200 ?x ;

If necessary, this query can be refined further with fuzzy string matching to enable more sophisticated searches. The result is given as a scientific name and a URI, which is a resolvable URL of the respective taxon in the GBIF taxon backbone [19], and can be obtained in a variety of formats including Excel® spreadsheets, CSV, HTML, JSON and XML. Furthermore, the interface can be addressed programmatically, e.g., via R, making it easy for researchers to extract data from our SPARQL endpoint. The following routine creates a comma-separated text file using a similar query for red fruits, utilizing the R library ‘SPARQL’:

    library(SPARQL) # SPARQL querying package
    # Define the flopo endpoint
    endpoint