Understanding the Semantic Web : bibliographic data and metadata 9780838958070, 9780838958100




Library Technology Reports

May/June 2010 vol. 46 / no. 4 ISSN 0024-2586

Expert Guides to Library Systems and Services

www.alatechsource.org

a publishing unit of the American Library Association

ISBN 978-0-8389-5807-0

ISBN 978-0-8389-5810-0


Object Reuse and Exchange (OAI-ORE) Michael Witt

www.alatechsource.org Copyright © 2010 American Library Association All Rights Reserved.


American Library Association 50 East Huron St. Chicago, IL 60611-2795 USA www.alatechsource.org 800-545-2433, ext. 4299 312-944-6780 312-280-5275 (fax)

Advertising Representative Brian Searles, Ad Sales Manager ALA Publishing Dept. [email protected] 312-280-5282 1-800-545-2433, ext. 5282

ALA TechSource Editor Dan Freeman [email protected] 312-280-5413

Copy Editor Judith Lauber

Administrative Assistant Judy Foley [email protected] 800-545-2433, ext. 4272 312-280-5275 (fax)

About the Author Michael Witt is the Interdisciplinary Research Librarian and an assistant professor of library science at Purdue University in West Lafayette, Indiana. He is also a senior researcher at the Distributed Data Curation Center (http://d2c2.lib.purdue.edu). Michael has spoken about new roles for librarians in curating research datasets and applying library science principles to e-science in workshops and presentations at national conferences such as the Chronicle of Higher Education’s Technology Forum, Educause, Open Repositories, the Special Libraries Association, and the Coalition for Networked Information. In 2011, he will spend five months at the Bibliotheca Alexandrina as a Fulbright Scholar in Egypt. His research has been published in journals such as the International Journal of Digital Curation, Library Trends, College & Undergraduate Libraries, and the International Journal on Digital Libraries. He is a graduate of the School of Library and Information Science at Indiana University-Indianapolis, and he was named an Emerging Leader by the American Library Association in 2008.

Production and Design ALA Production Services: Troy D. Linker and Tim Clifford

Library Technology Reports (ISSN 0024-2586) is published eight times a year (January, March, April, June, July, September, October, and December) by American Library Association, 50 E. Huron St., Chicago, IL 60611. It is managed by ALA TechSource, a unit of the publishing department of ALA. Periodical postage paid at Chicago, Illinois, and at additional mailing offices. POSTMASTER: Send address changes to Library Technology Reports, 50 E. Huron St., Chicago, IL 60611. Trademarked names appear in the text of this journal. Rather than identify or insert a trademark symbol at the appearance of each name, the authors and the American Library Association state that the names are used for editorial purposes exclusively, to the ultimate benefit of the owners of the trademarks. There is absolutely no intention of infringement on the rights of the trademark owners.


Subscriptions For more information about subscriptions and individual issues for purchase, call the ALA Customer Service Center at 1-800-545-2433 and press 5 for assistance, or visit www.alatechsource.org.

Table of Contents

Chapter 1: An Introduction to ORE
  Basic Concepts
  Aggregations Are Collections
  History of the Open Archives Initiative
  What is the OAI-PMH?
  Object Reuse and Exchange Data Model
  Examples of Aggregations and Applications of ORE
  Notes

Chapter 2: Exploring Object Reuse and Exchange
  Building Blocks: RDF Triples
  National Digital Newspaper Program
  Aggregations and Resource Maps
  Adding Metadata
  Aggregated Resources
  Authoritative and Non-Authoritative Resource Maps
  Nested Aggregations
  Putting It All Together
  Notes

Chapter 3: Serializing and Exposing Resource Maps
  Serializing a Resource Map as RDF/XML
  Aggregations and Resource Maps
  Adding Metadata
  Aggregated Resources
  Nested Aggregations
  Other Serializations
  Exposing Resource Maps
  HTTP 303
  Hash URIs
  RDFa
  Discovering Resource Maps
  Resource Embedding
  Batch Discovery
  Exercise
  Notes

Chapter 4: Implementations of ORE
  Vireo: An ORE Implementation for DSpace
  Foresite
  Microsoft External Research
  Zentity
  Article Authoring Add-In for Microsoft Word
  U. S. National Virtual Observatory
  LORE: A Compound Object Authoring and Publishing Tool for Literary Scholars
  Notes
  Interview with Peter Patrick Hochstenbach, Ghent University

Chapter 5: Resources
  Selected ORE Implementations, Demonstrations, and Tools
  Official ORE Documentation
  Selected References
  OAI-ORE Community

Appendix
  Resource Map for a Batch Aggregation Serialized as RDF/XML
  Resource Map for a Title Aggregation Serialized as RDF/XML

Abstract

The Open Archives Initiative Object Reuse and Exchange (ORE) specification defines a set of new standards for the description and exchange of aggregations of Web resources. This presents an exciting opportunity for us to revisit how digital libraries are provisioned. ORE and its concept of aggregation (that a set of digital objects of different types and from different locations on the Web can be described and exposed together as a single, compound entity) may present the next major disruptive technology for librarians who develop and manage collections of digital information.

Currently, the management and presentation of digital library collections revolve mostly around the digital library systems that house them. A librarian decides what digital resources go together and then works within the capabilities of the system to present the resources in an appropriate and orderly context. The result is typically a series of webpages that human beings need to navigate in order to find and click on links to resources that meet their information needs. While the system may expose its metadata for harvesting or its index for federated searching, the digital resources themselves are tucked deeply inside proprietary silos. ORE presents the possibility of breaking down these silos by exposing the semantics of these resources and providing hooks to retrieve them without the need for a human being to read a webpage and click on a link. Liberating digital library content from these silos for reuse and exchange may very well explode the construct of the “collection” as we know it today because it will no longer be the exclusive domain of librarians to aggregate digital library resources and dictate the context of their presentation for use. Human beings and machines will be able to assemble their own “collections.”

The goal of this issue of Library Technology Reports is to present a tutorial on ORE to make it more approachable and understandable to information professionals who are not computer scientists or programmers. The report begins by presenting the general concepts of ORE and then works backwards to explain and fill in some of the supporting technical details. It introduces the basic concepts of ORE and its foundation and follows an example of implementation to illustrate the graphing of the ORE data model, exploring Aggregations and Aggregated Resources and the serialization and provisioning of Resource Maps. A series of ORE tools and implementations are presented to relate the specification to real-world application in libraries.

While the Semantic Web and ORE represent potentially disruptive technologies, the need for librarians to help make sense of interoperable digital information by provisioning resources with care and quality metadata and by connecting users to resources, and resources to resources, is greater than ever. In order to capitalize on these technologies, librarians must first understand them and be able to relate them to the professional practice of librarianship.

Acknowledgements

Thanks to the members of the ORE Executive, Technical, and Advisory Committees for their efforts in developing the ORE specification, especially those who were involved in its documentation. This issue of Library Technology Reports borrows liberally from the ORE Primer, Abstract Data Model, and other documents that were edited by Pete Johnston, Michael Nelson, Robert Sanderson, and Simeon Warner from the ORE Technical Committee and Carl Lagoze and Herbert Van de Sompel from the ORE Executive Committee.

I would like to thank the ORE implementers who contributed their time and information about their projects for inclusion in this report: Tim DiLauro from Johns Hopkins University, Patrick Hochstenbach from Universiteit Gent, Mark McFarland at the Texas Digital Library, Robert Sanderson at the Los Alamos National Labs, and Lee Dirks and Alex Wade from Microsoft External Research. Special thanks to Ed Summers and the Library of Congress for giving me permission to reproduce their work and use it as an example in chapters 2 and 3.

This report would not have been possible without the encouragement of the Purdue University Libraries, in particular from D. Scott Brandt and Dr. James L. Mullins. The author’s photograph was contributed by C. T. Pham. I am grateful for the support of my wife and family.

Chapter 1

An Introduction to ORE

Abstract This chapter of Object Reuse and Exchange (OAI-ORE) sets the stage for the report, providing introductions to the basic specification and basic concepts and explaining what readers will learn in subsequent chapters.

The Open Archives Initiative Object Reuse and Exchange (OAI-ORE, or simply ORE) specification defines a set of new standards for the description and exchange of aggregations of Web resources that presents an exciting opportunity for us to revisit how digital libraries are provisioned. ORE and its concept of aggregation (that a set of digital objects of different types and from different locations on the Web can be described and exposed together as a new, compound entity) may present the next major disruptive technology for librarians who develop and manage collections of digital information.

Currently, the management and presentation of digital library collections revolve mostly around the digital library systems that house them. A librarian decides what digital resources go together and then works within the capabilities of the system to present the resources in an appropriate and orderly context. The result is typically a series of webpages that human beings need to navigate to find and click on the links to the resources that meet their information needs. While the system may expose its metadata for harvesting or its index for federated searching, the digital resources themselves are tucked deeply inside proprietary silos. ORE presents the possibility of breaking down these silos by exposing the semantics of these resources and providing hooks to retrieve them without the need for a human being to read a webpage and click on a link.

Liberating digital library content from these silos for reuse and exchange may very well explode the construct of the “collection” as we know it today because it will no longer be the exclusive domain of librarians to bring together digital library resources and dictate the context of their presentation for use. Human beings and machines will be able to assemble their own “collections.”

If you’ve looked for information on the Web or attended a presentation about the ORE standard, it is likely that within the first five minutes, you were confronted with a large, complicated diagram with circles and lines and references to a half dozen other, different technologies. If you weren’t familiar with these other underlying technologies and tried to learn about them, you were probably confronted with even more diagrams and circles and lines. It can be overwhelming.

In the beginning of the ORE specification, it is suggested that the reader become familiar with:

• The architecture of the World Wide Web
• Semantic Web concepts such as RDF and RDFS
• Cool URIs and Linked Data1

If you haven’t already been working with these technologies or don’t come from a technical background, that’s a tall order! ORE can be difficult to approach because it is typically explained in terms of the various technologies that make up its foundation. The foundation is important, but someone new to the topic may quickly lose sight of the forest for the trees.

The goal of this issue of Library Technology Reports is to present a tutorial for librarians on ORE to make it more accessible and understandable. Our approach is to begin by presenting the general concepts of ORE and then work backwards to explain and fill in some of the supporting technical details. Our goal is not to present a comprehensive account of ORE but instead to make it approachable by people who are not programmers or computer scientists. If you’re interested in developing solutions with ORE for your library, this report will be a good starting point before you dig deeper into the references in chapter 5.


http://www.openarchives.org/ore is the official website for ORE. It maintains the specification and related documentation such as user guides, a primer, news releases, and other community resources.


We’ll begin in this first chapter by explaining the rationale for ORE and describing the basic components of the ORE abstract data model: Aggregations, Aggregated Resources, Resource Maps, and Proxies. Chapter 2 starts with an introduction to RDF (explained later in this chapter) before walking through a practical example, the National Digital Newspaper Program at the Library of Congress, and a series of simple graphs to illustrate a Resource Map and Aggregation, metadata, Aggregated Resources, and nested Aggregations. In chapter 3, we will explore Resource Map serialization by looking at examples from the same project in RDF/XML. A selection of current ORE implementations and tools that are relevant to libraries will be presented in chapter 4, including profiles of projects at the Los Alamos National Labs, Ghent University, Microsoft, Johns Hopkins University, and the Texas Digital Library. Chapter 5 provides a list of references and further reading.

While the Semantic Web and ORE represent potentially disruptive technologies, the need for librarians to help make sense of interoperable digital information by provisioning resources with care and quality metadata and by connecting users to resources, and resources to resources, is greater than ever. In order to capitalize on these technologies, we must first understand them and be able to relate them to our professional practice of librarianship.

Basic Concepts

Speaking in generic terms, an aggregation is simply a group or collection of things. For example, you may aggregate food to prepare a meal. You can begin with recipes that include lists of ingredients and descriptions of how to prepare the dishes you’ve chosen to make. Some of the ingredients may come from different places. You probably have some of them locally in your fridge or cabinet, but you may need to fetch some of them from various remote locations. For example, you may pick up a loaf of bread at the bakery or a bottle of Merlot from your local wine shop. You may even be interested in a particular instance of wine, perhaps from a specific year, that has been recommended to you by a friend.

Everything needed for your meal has been represented all together above as an aggregation, but you can also view the dishes and their recipes and ingredients as their own aggregations. Aggregations can include other aggregations. A salad may be an aggregation that includes different kinds of lettuce, tomatoes, and salad dressing. If you look at the label on a bottle of dressing, you’ll see a list of the ingredients in it: another aggregation! And so on. Once you’ve retrieved all of the items from your shopping list, the end result is that you have everything you need assembled in your kitchen to prepare the dishes and serve the meal.

Aggregations Are Collections

This concept of aggregation is not new to librarians, who have been aggregating content into library collections for centuries. Some librarians are bibliographers who create lists of information sources on various topics of interest. Some of the sources listed in a bibliography may be in the library’s local holdings, but some of them may be located elsewhere, perhaps at another library or online on a website. In the traditional analog practice of librarianship, collection management included purchasing books and subscribing to print journals, cataloging them, and arranging them on shelves for patrons to find and use. With the shift from print to digital technology, many of the same principles of collection management are now employed in the aggregation of electronic resources such as databases and e-journals. Libraries are also involved in collecting born-digital content on platforms such as institutional repositories, and many librarians are digitizing special collections and presenting and managing digital libraries.

In all of these activities, librarians define the information that constitutes a collection. This is typically guided by collection development policies that are informed by the mission of the library and the information needs of its patrons. The boundary of such an aggregation is usually established by a librarian as well, in a library’s catalog, for example. In other words, what separates a book that belongs to a collection from one that does not?

In most cases, the immediate user of a library collection is assumed to be a person, and librarians have designed their interfaces for people to use. This makes perfect sense in the analog world, where print collections are classified and physically arranged in a building with signs to direct patrons in navigating and using a collection. But in the case of digital libraries, some of a library’s collections may also be in electronic formats that are available over a network. The users of digital libraries can be human beings or computer programs.

The problem is that most digital libraries have been provisioned for people, not computer programs, to use. When you boil it down, the primary point of access for a digital library is usually a webpage that presents information along with links to other webpages. The language used to create webpages, Hypertext Markup Language (HTML), was developed for exactly that simple purpose: to mark up and present electronic information to human beings. It is obvious to a person who is accessing a webpage in a digital library that they are looking, for example, at a page in a book that has been digitized. Visual cues such as page numbers at the bottom of the screen or breadcrumb navigation through a table of contents convey meaning: the semantics of the object being viewed. It doesn’t matter if the page numbers are at the bottom left or bottom right of the page; people understand the construct of a “book” and can immediately recognize what it is that they are viewing and how to navigate and use it. People also understand the relationships implied in a book such as those between pages, chapters, works cited from other books, the title page, and the index at the back of the book.

Computer programs cannot understand the semantics of such Web resources unless the resources are exposed and expressed in a way that can be identified and understood by a machine. This is the goal of Object Reuse and Exchange: to provide a standard for identifying aggregations of Web resources and describing the constituents or the boundary of an aggregation.

Librarians invest a great deal of time and care in preparing and assigning metadata to describe digital objects and in presenting and managing objects in digital libraries. There is no question that this process adds considerable value to the digital objects. Unfortunately, most of this value is lost when the user is a computer program and not a human being because a computer program cannot identify and understand how to use the object or any of the valuable information added to it by the librarian. A computer program doesn’t know it is accessing a digitized book when the book is being represented in HTML; likewise, it has no way of figuring out the title of the book, its author, and other information about the book that may be obvious to a person who can recognize and read the book’s metadata.

The URL (or more generally, the Uniform Resource Identifier or URI) to a Web resource in a digital library typically does not link directly to the digital object itself but to a representation of the object. This representation is usually a “splash page” that is presented for a person, not a computer program, to read and comprehend. To illustrate the point, look at the digital object being represented in figure 1, which is a conference poster that has been deposited into a library’s institutional repository. It is obvious to you, a person, that clicking the Download button will download a copy of the conference poster. But suppose someone wanted to write a program to download all of the conference posters from the collection. The program could retrieve a list of the URIs of every object in the Library Research Publications collection (this piece of metadata could be harvested easily using the OAI-PMH), but the program would probably malfunction when it tried to download the first object because it would receive this splash page instead of a PDF or other file that constituted the actual poster. The program has no way of knowing that there is an additional step (to click the Download button) to download the file. It also can’t make much sense of the rest of the information being presented, such as the title of the poster, its authors, the document type (which identifies it as being a poster), the abstract, and the link to a supplementary report that provides context for the poster.

You could make an argument that, with some knowledge of this specific institutional repository and collection, you could write a program that is aware of the link behind the Download button and could accomplish this task. You might even be able to reassemble some of the structured metadata by indexing the page or applying some other heuristics, like those that Google uses for ranking relevant search results. But would this program work with a different digital library that presents different representations of its objects? Chances are it wouldn’t work with any precision because the splash pages that it would encounter would be constructed differently. For example, the Download button might be located somewhere else on the page, or instead of a Download button, the title of the object might be a link that the user is expected to click to download the object.

There are some other important questions that could be asked. Would such a program be able to differentiate between conference posters and other types of objects in the collection? What if you wanted your program to download and assemble all of the posters or their supplementary files from a particular conference and those files were archived across multiple institutional repositories? What if you wanted to move a set of objects from a digital library to a preservation repository or another digital library platform without losing their semantics?

History of the Open Archives Initiative

In 1999, Paul Ginsparg, Rick Luce, and Herbert Van de Sompel issued a Call For Participation to bring together developers and managers of e-print repositories to explore possible collaborations. The resulting Santa Fe Convention begat the Open Archives Initiative (OAI), whose goal was stated as being: “to transform scholarly communication by providing a technical and organizational framework to facilitate interoperability among repositories.”2 Under the leadership of Carl Lagoze from Cornell University and Herbert Van de Sompel from Los Alamos National Labs, the OAI collaboratively developed the OAI-PMH and ORE specifications and grew to include a diverse community of scientists, software developers, repository managers, publishers, and librarians who shared a common interest in facilitating scholarly communication.

The current mission statement of the OAI says that it “develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication. Continued support of this work remains a cornerstone of the Open Archives program. The fundamental technological framework and standards that are developing to support this work are, however, independent of the both the type of content offered and the economic mechanisms surrounding that content, and promise to have much broader relevance in opening up access to a range of digital materials. As a result, the Open Archives Initiative is currently an organization and an effort explicitly in transition, and is committed to exploring and enabling this new and broader range of applications. As we gain greater knowledge of the scope of applicability of the underlying technology and standards being developed, and begin to understand the structure and culture of the various adopter communities, we expect that we will have to make continued evolutionary changes to both the mission and organization of the Open Archives Initiative.”3

What Is the OAI-PMH?

The Open Archives Initiative Protocol for Metadata Harvesting, more fondly known as the OAI-PMH, defines a protocol for exposing and harvesting metadata records. OAI-PMH data providers expose their metadata to be harvested; service providers (also known as harvesters) query data providers and selectively harvest metadata records from them.4 Most data providers are archives of scholarly resources, such as institutional repositories, publishers, and digital libraries. A common application of the OAI-PMH is to harvest and index large quantities of metadata for the purpose of providing a portal to search across collections that are distributed in multiple remote data providers.

The OAI-PMH is defined as a standard Web service. The harvester sends a request to the data provider using HTTP, in much the same way a Web browser would request a webpage from a web server. The data provider then responds with its answer encoded in Extensible Markup Language (XML). At a minimum, unqualified Dublin Core metadata records are exchanged, although other additional formats can be provided.5 Unqualified Dublin Core provides a “common ground” for the purpose of basic metadata interoperability, although its generic nature sometimes limits its use in specialized applications. Requests can include one of six different OAI-PMH verbs:

• Identify
• ListSets
• ListMetadataFormats
• ListIdentifiers
• ListRecords
• GetRecord

The response to the Identify verb describes the data provider, including its name. In most implementations of OAI-PMH, a set corresponds to a collection; the ListSets verb returns a list of the collections hosted by the repository along with their descriptions. ListMetadataFormats returns the metadata formats available for the object that has been requested. The oai_dc XML schema is most common. Each record has a unique identifier, and a list of these is returned by the ListIdentifiers verb. This is often (but not always) the URL of a representation (e.g., “splash page”) of the digital object. The ListRecords verb returns more information about the records than simply their identifiers and supports parameters to limit the results. The most recent version of OAI-PMH, version 2, supports the use of resumption tokens to provide better flow control and avoid over-saturating data providers by requesting too much metadata at one time. Finally, the GetRecord verb returns an entire metadata record for an object from the data provider.6

An excellent tutorial describing the OAI-PMH is hosted by the OAI Forum,7 and a detailed transactional approach to learning the protocol from the perspective of coding a harvester in the Perl scripting language can be found in Building OAI-PMH Harvesters With Net::OAI::Harvester.8 The largest OAI-PMH service provider, OAIster, currently contains over 23,000,000 harvested metadata records from over 1,100 data providers.9 These records can be searched and accessed through OCLC’s free WorldCat service.10

How is the OAI-PMH different from ORE? Generally speaking, the focus of the OAI-PMH is on exchanging the metadata that describe digital objects, whereas the focus for ORE is on exchanging and using the digital objects themselves. ORE allows you to harvest objects and not just their metadata. Beyond harvesting, ORE enables a many-to-many web of relationships among objects to be discovered, linked, and utilized. Objects described and exposed using ORE are useful outside of an ORE context to the larger Semantic Web, for example, as Linked Data.11 By comparison, the OAI-PMH is limited to acting in a client-server manner that requires both the service and data providers to “speak” the same specialized protocol: the OAI-PMH.

To be fair, the OAI-PMH and ORE were created for different purposes using different paradigms, so they cannot (and should not) be compared as apples to apples. The Texas Digital Library, for one, uses the OAI-PMH and ORE together in a complementary fashion.
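The request and response cycle described in the “What Is the OAI-PMH?” sidebar can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the base URL (repository.example.org), the record identifiers, and the resumption token are hypothetical, and the sample response is abbreviated rather than copied from a real data provider. The verb names, the metadataPrefix argument, and the XML namespace are those of OAI-PMH version 2.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH version 2 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# The six verbs named in the sidebar.
VERBS = {"Identify", "ListSets", "ListMetadataFormats",
         "ListIdentifiers", "ListRecords", "GetRecord"}


def build_request(base_url, verb, **params):
    """Return the URL a harvester would fetch over HTTP for one request."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    return base_url + "?" + urlencode({"verb": verb, **params})


def parse_identifiers(xml_text):
    """Pull record identifiers and any resumption token out of a response."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default=None,
                          namespaces=OAI_NS)
    return ids, token


# Abbreviated, hypothetical response to a ListIdentifiers request.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header><identifier>oai:example.org:2</identifier></header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

# Hypothetical data provider; a real harvester would fetch this URL.
url = build_request("http://repository.example.org/oai", "ListIdentifiers",
                    metadataPrefix="oai_dc")
print(url)

ids, token = parse_identifiers(SAMPLE_RESPONSE)
print(ids, token)
```

Note that what comes back is a list of identifiers and metadata, not the digital objects themselves; a harvester that wants the next batch would repeat the request with the resumption token.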

Figure 1 A typical HTML “splash page” that represents a digital object in a digital library
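The fragility of scraping a splash page like the one in figure 1 can be sketched as follows. This is a minimal illustration, not a recommended harvesting technique: both HTML snippets and their file paths are hypothetical, and the heuristic (look for a link whose visible text is “Download”) is just one guess a programmer might make after studying a single repository.

```python
from html.parser import HTMLParser


class DownloadLinkFinder(HTMLParser):
    """Collect hrefs of <a> elements whose visible text is exactly 'Download'."""

    def __init__(self):
        super().__init__()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>
        self.links = []     # hrefs that matched the heuristic

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip().lower() == "download":
                self.links.append(self._href)
            self._href = None


def find_download_links(html):
    """Apply the 'Download' heuristic to one splash page."""
    finder = DownloadLinkFinder()
    finder.feed(html)
    return finder.links


# Two hypothetical splash pages for the same kind of object.
page_a = '<h1>Poster</h1><a href="/files/poster.pdf">Download</a>'
page_b = '<h1><a href="/bitstream/42/poster.pdf">Poster</a></h1>'

print(find_download_links(page_a))  # the heuristic finds the file
print(find_download_links(page_b))  # the title is the link, so nothing is found
```

The heuristic succeeds on the first page and silently fails on the second, even though a person would have no trouble with either. Nothing in the HTML tells the program which link leads to the object.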

Object Reuse and Exchange Data Model

The ORE e-mail discussion list is maintained as a Google Group. You can search and browse the mailing list archives and subscribe at http://groups.google.com/group/oai-ore.

Institutional repositories are beginning to include support for ORE. The Texas Digital Library has developed an ORE implementation for DSpace. The oreProvider project has produced an ORE add-on for Fedora, and Microsoft supports ORE natively as a part of its Zentity repository platform. See chapter 5 for a list of notable ORE implementations and tools.

ORE was developed to address these kinds of issues for objects on the Web. It introduces the concept of an Aggregation of Web resources. We’ll capitalize Aggregation when we’re speaking about ORE Aggregations to differentiate them from the generic use of the word aggregation. For example, we could imagine an Aggregation that contains the PDF of a conference poster, its descriptive metadata, and a Microsoft Word document that is the supplementary report. These things that are being aggregated are called, simply enough, Aggregated Resources. An Aggregation has a URI that is used to identify it, just like any other resource on the Web.12

Unlike a Web resource, an Aggregation is a conceptual construct. Even though it has a URI, it is not tangible; you can’t download it. An Aggregation is expressed by a Resource Map, or ReM for short. A Resource Map provides details about an Aggregation in a machine-readable format.13 In our first example of aggregating food to prepare a meal, you could think of the shopping list as being like a Resource Map. The Resource Map is something tangible: you can download it, and it will reference the Aggregation and list its Aggregated Resources. A Resource Map can also express relationships and properties pertaining to its Aggregated Resources as well as metadata about the Resource Map itself, such as who created it.14

A Resource Map also has its own URI that resolves to one or more serializations. You can think of serialization as being a way to write something down.

It lets you take an object or group of objects, put them on a disk or send them through a wire or wireless transport mechanism, then later, perhaps on another computer, reverse the process: resurrect the original object(s). The basic mechanisms are to flatten object(s) into a one-dimensional stream of bits, and to turn that stream of bits back into the original object(s). Like the Transporter on Star Trek, it’s all about taking something complicated and turning it into a flat sequence of 1s and 0s, then taking that sequence of 1s and 0s (possibly at another place, possibly at another time) and reconstructing the original complicated “something.”15

The three formats for serialization explained in the ORE specification are RDF/XML, RDFa, and Atom XML, although there are others.16 We’ll learn more about the serialization of Resource Maps in chapter 3.

Lastly, an Aggregation can include a Proxy, which is an Aggregated Resource in the context of a specific Aggregation.17 For example, it is not uncommon for journal articles to be republished as book chapters. For many situations, the context from which the object is being included in your Aggregation may not matter (i.e., from the article in the journal or from the chapter of the book). For some applications, such as providing citations for Web resources, it may be critical. When the context does matter, you have the option of designating an Aggregated Resource as being a Proxy so that you can make assertions about it in the context of a specific Aggregation. To summarize, the ORE Data Model is made up of four entities: Aggregations, Resource Maps, Aggregated Resources, and Proxies. An Aggregation contains a

Examples of Aggregations and Applications of ORE

Examples of Aggregations
• A simple unordered set, or bag, of Resources, such as a collection of favorite images from various web sites.
• A multi-page, HTML document where the pages are linked together by hyperlinks that provide “previous page” and “next page” access.
• Information available from “social networking” sites, which contain content and related social activity around that content. An example is Flickr, where each participant has an entry page providing access to images in multiple sizes and resolutions that are organized in sets and collections. All of these entities are separate Resources. These are then linked to additional Resources that are comments and annotations about the images.
• A scholarly publication stored in an ePrint repository such as arXiv or in a DSpace, ePrints, or Fedora repository. Such a publication may appear on the Web as multiple Resources, each with an individual URI. The set of Resources typically consists of a human readable “splash page”, that links to the body of the publication in multiple formats such as LaTeX, PDF, and HTML. In addition, the publication may have citation links to other publications, each existing as one or more Resources.
• An overlay journal issue that aggregates multiple scholarly publications as described above, each located in their origin repository, into an issue. Issues may be recursively aggregated themselves into volumes, and then into the journal itself.
• A semantically-linked group of cellular images, each available as a Resource resident in repositories from research laboratories, museums, libraries, and the like, in the manner implemented in the ImageWeb Project.
• Published scientific results such as those envisioned by Clifford Lynch that, in addition to the features of the scholarly publication described above, incorporate data plus the tools to visualize and analyze that data.

Examples of Applications
• Crawler-based search engines could use such descriptions to index information and provide search results sets at the granularity of the aggregations rather than in addition to their individual parts.
• Browsers could leverage them to provide users with navigation aids for the aggregated resources, in the same manner that machine-readable site maps provide navigation clues for crawlers.
• Other automated agents such as preservation systems could use these descriptions as guides to understand a “whole document” and determine the best preservation strategy for the document Compound Object.
• Systems that mine and analyze networked information for citation analysis/bibliometrics could achieve better accuracy with the knowledge of aggregation structure contained in these descriptions.
• Institutional repository applications could use them as the basis of interoperability for exchange and service interaction with other institutional repositories.
• These machine-readable descriptions could provide the foundation for advanced scholarly communication systems that allow the flexible reuse and refactoring of rich scholarly artifacts and their components Value Chains.

Excerpt from the ORE User Guide: Primer, http://www.openarchives.org/ore/1.0/primer.

Resource Map plus one or more Aggregated Resources, which can also be Proxies. In the next chapter we will introduce the Resource Description Framework (RDF), which forms the foundation of the Semantic Web and gives us a language to use for talking about Aggregations in greater detail. We’ll also visually explore Aggregations, Aggregated Resources, and Resource Maps by using graphs to illustrate how they relate to one another.

Notes
1. Carl Lagoze et al., “ORE User Guide: Primer,” Oct. 17, 2008, Open Archives Initiative Object Reuse and Exchange website, http://www.openarchives.org/ore/1.0/primer (accessed March 6, 2010).
2. Herbert Van de Sompel and Carl Lagoze, “The Santa Fe Convention of the Open Archives Initiative,” D-Lib Magazine 6, no. 2 (Feb. 2000), http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html (accessed March 6, 2010).
3. Open Archives Initiative, “Mission Statement,” About OAI page, http://www.openarchives.org/OAI/OAI-organization.php (accessed March 6, 2010).
4. Open Archives Initiative Protocol for Metadata Harvesting website, http://www.openarchives.org/pmh (accessed March 6, 2010).
5. Ibid.
6. Ibid.
7. Susanne Dobratz, Friederike Schimmelpfennig, and Peter Schirmbacher, “OAI for Beginners: The Open Archives Forum Online Tutorial,” 2002, http://www.oaforum.org/tutorial (accessed March 6, 2010).
8. Ed Summers, “OAI-Harvester-1.0,” 2004, CPAN Search website, http://search.cpan.org/~esummers/OAI-Harvester-1.0 (accessed March 10, 2010).
9. “The OAIster Database,” OCLC website, http://oaister.org (accessed March 10, 2010).
10. WorldCat website, http://www.worldcat.org (accessed March 10, 2010).
11. Linked Data website, http://linkeddata.org (accessed March 10, 2010).
12. Lagoze et al., “ORE User Guide: Primer.”
13. Ibid.
14. Ibid.
15. Marshall Cline, “What’s This ‘Serialization’ Thing All About?” Serialization and Unserialization: C++ FAQ Lite, 1991–2009, http://www.parashift.com/c++-faq-lite/serialization.html#faq-36.1 (accessed March 10, 2010).
16. Lagoze et al., “ORE User Guide: Primer.”
17. Ibid.


Chapter 2

Exploring Object Reuse and Exchange

Abstract


This chapter of “Object Reuse and Exchange (OAI-ORE)” will introduce and explain some basic elements of RDF in order to “bootstrap” our exploration of ORE. The National Digital Newspaper Program at the Library of Congress will be used as a real-world example to complement a sequence of graphs that will illustrate Aggregations and Resource Maps, metadata, Aggregated Resources, and nested Aggregations.


Building Blocks: RDF Triples

The Resource Description Framework (RDF) was originally designed as a standard for encoding metadata but has grown in its scope and application to be used more generally for modeling information.1 It is especially useful for describing entities in terms of their relationship with other entities. RDF statements are called “triples” and take the form of subject-predicate-object (see figure 2). In other words, something (the subject) is described by or related to (predicate) something else (an object). For example, you could describe a particular document that is a page in a newspaper by making a series of sentence-like statements about it:

1. The document was issued on December 14, 1918.
2. The document is titled “The St. Joseph observer. - 1918-12-14 - 1”
3. The document is ordered first.
4. The document belongs to a particular issue.
5. The document is a newspaper page.

Figure 2 Graphical representation of an RDF triple

Taken together, collections of these statements can provide robust and relational descriptions of resources. In dealing with Web resources using RDF, the subject is denoted by a URI. The predicate is also denoted by a URI. The object can be denoted by a URI or a literal (a string of text).2 If the newspaper page in our previous example were digitized and available on the Internet, our statements could be:

1. The document was issued on December 14, 1918.
   subject: http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   predicate: http://purl.org/dc/terms/issued
   object: 1918-12-14
2. The document is titled “The St. Joseph observer. - 1918-12-14 - 1”
   subject: http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   predicate: http://purl.org/dc/terms/title
   object: “The St. Joseph observer. - 1918-12-14 - 1”
3. The document is ordered first.
   subject: http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   predicate: http://chroniclingamerica.loc.gov/terms#sequence
   object: 1
4. The document belongs to a particular issue.
   subject: http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   predicate: http://www.openarchives.org/ore/terms/isAggregatedBy
   object: http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
5. The document is a newspaper page.
   subject: http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   predicate: rdf:type
   object: http://chroniclingamerica.loc.gov/terms#Page

Literals can be any plain text, or they may be formatted by referencing a data type, such as YYYY-MM-DD for a date. Furthermore, vocabularies such as those offered in Dublin Core (see “dcterms” in the examples in the next chapter) can be used to establish the meaning of the resource or the relationship between it and another resource in a manner that is formal and can be referenced. If an existing vocabulary does not meet your needs, you can create your own.

Continuing our newspaper example, let’s take a look at a real-world implementation of ORE with the National Digital Newspaper Program at the Library of Congress. After a summary of the program and an overview of its general workflow, we’ll see how Aggregations were defined in the program’s data model. We’ll graph the relationship between an Aggregation and its Resource Map and then add more information sequentially to illustrate the primary ORE entities through subsequent graphs.

National Digital Newspaper Program

The National Digital Newspaper Program (NDNP) is a partnership between the Library of Congress and the National Endowment for the Humanities (NEH) to digitize and preserve newspapers that were published in the United States between 1836 and 1922.3 The NEH grants funds to state projects to select and digitize newspapers of historical and cultural significance, which are then aggregated, preserved, and made accessible by the Library of Congress.4 The first phase of the project is complete, and the NDNP currently contains 214 newspaper titles with approximately 192,000 issues and 1.8 million pages of content.5 The digitized newspapers are made available through the Chronicling America website (see figure 3), which acts as a portal for searching and browsing the collection.

Chronicling America
http://chroniclingamerica.loc.gov

Figure 3 Chronicling America website at the Library of Congress

The Library of Congress has a fairly straightforward workflow and data model. Digitized content and metadata are physically mailed to them from the projects on hard drives, and the group of files on each hard drive is referred to as a “batch.” There are Aggregations defined for batches, newspaper titles, issues, and pages. For example, a page Aggregation includes a derivative JPEG 2000 image, the raw text output of an optical character recognition (OCR) scan of the page, a structured capture of the OCR encoded in XML, a PDF file of the page, and a thumbnail JPEG image.6 Pages are then aggregated by issues, and issues are aggregated by newspapers, which are aggregated by title. So, from our previous example, page 1 would be aggregated by the issue published on December 14, 1918, which is aggregated by the St. Joseph Observer (the title of the newspaper). This may seem to suggest a hierarchical data model; however, it is not: issues are also aggregated by batches. In other words, issues from a newspaper may come from more than one batch. When you think about large-scale digitization projects, this makes sense because all of the issues from a newspaper may not be digitized at the same time or by the same facility. Also, large runs of a given newspaper’s issues will span multiple drives. While end users may not care which batch an issue came from, this information is potentially important for the library’s maintenance of the technical provenance of the objects.

Figure 4 An Aggregation and Resource Map

RDF graphs, which are based on triples, can be used to explore ORE entities in an intuitive way. We will begin exploring ORE entities by graphing the relationship between an Aggregation and the Resource Map that describes it. Then we will add some metadata and Aggregated Resources. We’ll graph a nested Aggregation and conclude by putting all of these pieces together in a single graph. The examples are tied to the way that the NDNP mapped ORE to its data model, with Aggregations defined for pages, issues, newspaper titles, and batches. It may be helpful to flip ahead to the next chapter on Resource Maps to see these examples continued through to their serializations in RDF/XML.
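Because RDF graphs are built from triples, it can help to see the five statements about the newspaper page written out as data. The following is only a minimal sketch in Python and is not part of the NDNP implementation: the Triple type and the describe helper are hypothetical names introduced for illustration, and rdf:type is written out as its full URI in the RDF namespace. All other URIs are the ones used in the statements earlier in this chapter.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject and predicate are URIs; obj is a URI or a literal."""
    subject: str
    predicate: str
    obj: str


PAGE = "http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page"
ISSUE = "http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue"
DCTERMS = "http://purl.org/dc/terms/"
NDNP = "http://chroniclingamerica.loc.gov/terms#"
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The five statements about the newspaper page, one triple each.
page_triples = [
    Triple(PAGE, DCTERMS + "issued", "1918-12-14"),
    Triple(PAGE, DCTERMS + "title", "The St. Joseph observer. - 1918-12-14 - 1"),
    Triple(PAGE, NDNP + "sequence", "1"),
    Triple(PAGE, ORE + "isAggregatedBy", ISSUE),
    Triple(PAGE, RDF_TYPE, NDNP + "Page"),
]


def describe(subject, triples):
    """Group the statements about one subject as {predicate: [objects]}."""
    description = {}
    for triple in triples:
        if triple.subject == subject:
            description.setdefault(triple.predicate, []).append(triple.obj)
    return description


for predicate, objects in describe(PAGE, page_triples).items():
    print(predicate, "->", objects)
```

Keeping every statement in the same three-part shape is what lets a graph be assembled simply by collecting triples that share a subject or an object.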


Aggregations and Resource Maps


Generally speaking, an Aggregation can be considered to be a collection of Web resources, and a Resource Map is like a shopping list that describes what is inside an Aggregation. To be more specific, an Aggregation is an RDF Resource of type ore:Aggregation that is a collection of other Resources. An Aggregation must be associated with at least one Resource Map. An Aggregation is identified by a URI (such as a URL), and this URI should not be assigned to any other Resource or be used for any other purpose than referencing the Aggregation. For example, if someone used an Aggregation as a whole and wanted to cite it, they should use the URI of the Aggregation and not any other URI, such as the link to a PDF that is contained in the Aggregation or the “splash page” that represents the Aggregation on a website. Because an Aggregation is a conceptual construct, it cannot be downloaded. Instead, the webserver uses the Aggregation’s URI to provide access to a Resource Map, which is tangible and references the Aggregation and provides more information about it and the Resource Map itself.7 In the next chapter we’ll investigate how Resource Maps are discovered and served to client applications.

Figure 5 Adding metadata

In figure 4, we see an Aggregation that is a page of a newspaper from the NDNP and its Resource Map, which has been serialized into an RDF/XML file (seq-1.rdf) that can be downloaded from a webserver. ORE requires that a Resource Map express its relationship to the Aggregation (ore:describes), and the subject of this triple must be the URI of the Resource Map. The pattern of these triples (e.g., “The Resource Map describes the Aggregation”) will begin to sound familiar as you read these graphs and they are repeated over and over again.
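The relationship shown in figure 4 can be sketched as two triples. This is an illustrative sketch rather than the NDNP code: described_aggregation is a hypothetical helper, plain Python tuples stand in for an RDF library, and the check for exactly one ore:describes statement reflects the rule, discussed later in this chapter, that a Resource Map describes only one Aggregation.

```python
ORE = "http://www.openarchives.org/ore/terms/"
BASE = "http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/"
RESOURCE_MAP = BASE + "seq-1.rdf"   # the downloadable RDF/XML serialization
AGGREGATION = BASE + "seq-1#page"   # the conceptual page Aggregation

triples = [
    (RESOURCE_MAP, ORE + "describes", AGGREGATION),
    (AGGREGATION, ORE + "isDescribedBy", RESOURCE_MAP),
]


def described_aggregation(resource_map_uri, triples):
    """Return the Aggregation that a Resource Map describes.

    Raises ValueError unless exactly one ore:describes triple has the
    Resource Map URI as its subject.
    """
    targets = [obj for subj, pred, obj in triples
               if subj == resource_map_uri and pred == ORE + "describes"]
    if len(targets) != 1:
        raise ValueError(
            "expected exactly one ore:describes triple, found %d" % len(targets))
    return targets[0]


print(described_aggregation(RESOURCE_MAP, triples))
```

Note that the lookup starts from the Resource Map URI, because that URI, and not the Aggregation's, is the subject of the ore:describes triple.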

Adding Metadata

A Resource Map is required to express two basic metadata elements about itself. First, the Resource Map must state who created it by using dcterms:creator in a triple whose object is a Resource of type dcterms:Agent.8 This Resource can then be the subject of additional optional triples that express descriptive text about the creator or the creator’s e-mail address, for example, using the FOAF (Friend of a Friend) ontology.9 Second, it must express and maintain a timestamp (using dcterms:modified) that reflects the last time the Resource Map was changed. Besides these two required elements, a Resource Map can express additional metadata about either itself or about the Aggregation. For example, figure 5 shows the date the Resource Map was created. Some descriptive metadata about the Aggregation provides the date the newspaper page was issued, its title, its sequence in the newspaper (it is the first page), and a thumbnail image. Other common metadata include rights information (dcterms:rights) and the creator of the Aggregation, who may be different from the creator of the Resource Map.10
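The two required elements lend themselves to a simple check. The sketch below is an assumption-laden illustration, not a validator from the ORE specification: missing_metadata and last_modified are hypothetical helpers, triples are plain tuples, and the creator URI and timestamp are the values listed for seq-1.rdf in the next chapter.

```python
from datetime import datetime

DCTERMS = "http://purl.org/dc/terms/"
RESOURCE_MAP = ("http://chroniclingamerica.loc.gov/lccn/sn90061457/"
                "1918-12-14/ed-1/seq-1.rdf")
REQUIRED = ("creator", "modified")  # what a Resource Map must state about itself

triples = [
    (RESOURCE_MAP, DCTERMS + "creator",
     "http://chroniclingamerica.loc.gov/awardees/dlc#awardee"),
    (RESOURCE_MAP, DCTERMS + "modified", "2010-02-13T13:06:31-04:00"),
    (RESOURCE_MAP, DCTERMS + "created", "2010-02-13T13:06:31-04:00"),  # optional
]


def missing_metadata(resource_map_uri, triples):
    """List the required dcterms elements that the Resource Map does not state."""
    stated = {pred for subj, pred, _ in triples if subj == resource_map_uri}
    return [name for name in REQUIRED if DCTERMS + name not in stated]


def last_modified(resource_map_uri, triples):
    """Parse the dcterms:modified literal into a timezone-aware datetime."""
    for subj, pred, obj in triples:
        if subj == resource_map_uri and pred == DCTERMS + "modified":
            return datetime.fromisoformat(obj)
    return None


print(missing_metadata(RESOURCE_MAP, triples))
print(last_modified(RESOURCE_MAP, triples))
```

Parsing the timestamp rather than storing it as an opaque string makes it possible for a harvester to compare it against the time of its last visit.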

Aggregated Resources

A Resource Map can include one or more triples with a predicate of ore:aggregates to denote the Aggregated Resources that make up the Aggregation. Each Aggregated Resource must have its own URI, and the Resource Map must use this URI to reference the Aggregated Resource. A Resource Map may also include triples with the ore:isAggregatedBy predicate to assert that one or more Aggregated Resources belong to other Aggregations. A graph that displays an Aggregation with one or more ore:aggregates relationships to Aggregated Resources is known collectively as the Aggregation Graph. All authoritative Resource Maps are required to assert the same Aggregation Graph.11

In the example of the National Digital Newspaper Program, a “page” Aggregation has at least five Aggregated Resources (see figure 6):

• a JPEG 2000 image derived from scanning the newspaper page (seq-1.jp2)
• an Adobe Acrobat file of the page (seq-1.pdf)
• a small thumbnail image of the page (thumbnail.jpg)
• the raw text output from performing OCR on the page (ocr.txt)
• structured XML resulting from the OCR scan (ocr.xml)

Each of these files represents the page in a different way for different uses. The JPEG 2000 and PDF files are mainly intended for people to use as digital surrogates in place of the original newspaper page. The thumbnail image is small and can be downloaded quickly to enable browsing through newspaper pages. Finally, the OCR files are useful for indexing and searching, among other things. Regardless of their purposes, they can be thought of together as a “page” and as a compound digital object. By defining an Aggregation for a page and these files as Aggregated Resources, their semantics can be maintained and leveraged by ORE and other Semantic Web applications.
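As a rough illustration of how the five files relate to the page, the sketch below builds one ore:aggregates triple per file and reads the Aggregated Resources back out. The helper name aggregated_resources is hypothetical, and the file URIs are formed by appending each file name to the issue path, following the pattern in the next chapter's listing; treat them as illustrative rather than authoritative.

```python
ORE = "http://www.openarchives.org/ore/terms/"
ISSUE_BASE = "http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/"
PAGE = ISSUE_BASE + "seq-1#page"

# File name -> what the file is for, following the list above.
PAGE_FILES = {
    "seq-1.jp2": "JPEG 2000 image used as a digital surrogate",
    "seq-1.pdf": "PDF used as a digital surrogate",
    "thumbnail.jpg": "thumbnail for browsing",
    "ocr.txt": "raw OCR text for indexing and searching",
    "ocr.xml": "structured OCR output for indexing and searching",
}

# One ore:aggregates triple per file, with the page Aggregation as subject.
triples = [(PAGE, ORE + "aggregates", ISSUE_BASE + name) for name in PAGE_FILES]


def aggregated_resources(aggregation_uri, triples):
    """Return the set of URIs that the given Aggregation aggregates."""
    return {obj for subj, pred, obj in triples
            if subj == aggregation_uri and pred == ORE + "aggregates"}


for uri in sorted(aggregated_resources(PAGE, triples)):
    print(uri)
```

A set is used because the Aggregated Resources of an Aggregation carry no inherent order in this model; the page's position in the issue is recorded separately as metadata.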


Figure 6 Aggregated Resources


Figure 7 Nested Aggregations


Nested Aggregations


An Aggregation can contain other Aggregations. The result is a recursive nesting of Aggregations. When this nesting occurs, the Aggregation being nested can be thought of and treated as an Aggregated Resource. This nesting must be expressed in multiple Resource Maps because a Resource Map is limited (by definition) to describing only one Aggregation. A Resource Map can (but is not required to) assert that the Aggregation that is being nested as an Aggregated Resource is described by another Resource Map using the ore:isDescribedBy predicate. This informs clients of the first Resource Map that a nested Aggregation is described by its own Resource Map and points to it. Otherwise, all of the same semantics (e.g., ore:aggregates, ore:isAggregatedBy) apply to nested Aggregations.13

Just as a newspaper page is described in the NDNP as an Aggregation of various files, pages are aggregated into issues: Aggregations of Aggregations (see figure 7). This nesting becomes more complex as issues are aggregated both by newspaper titles and also by batches. The Library of Congress makes it easy to browse and understand these different Aggregations by instantiating them on the Chronicling America website, for example:

• Batch: http://chroniclingamerica.loc.gov/batches/batch_mohi_carver_ver01#
• Title: http://chroniclingamerica.loc.gov/lccn/sn90061457#
• Issue: http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#
• Page: http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#
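Nesting can be followed mechanically by moving from one Resource Map to the next. The sketch below is a simplified illustration under several assumptions: an in-memory dictionary stands in for fetching each Resource Map over the Web, walk is a hypothetical helper, only two of the page's files are listed for brevity, and the file URIs follow the pattern used in the next chapter's listing.

```python
ORE = "http://www.openarchives.org/ore/terms/"
BASE = "http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1"
ISSUE, ISSUE_REM = BASE + "#issue", BASE + ".rdf"
PAGE, PAGE_REM = BASE + "/seq-1#page", BASE + "/seq-1.rdf"

# Each Resource Map describes exactly one Aggregation, so the nesting
# is spread over two Resource Maps.
resource_maps = {
    ISSUE_REM: [
        (ISSUE_REM, ORE + "describes", ISSUE),
        (ISSUE, ORE + "aggregates", PAGE),
        (PAGE, ORE + "isDescribedBy", PAGE_REM),  # pointer to the nested map
    ],
    PAGE_REM: [
        (PAGE_REM, ORE + "describes", PAGE),
        (PAGE, ORE + "aggregates", BASE + "/seq-1.pdf"),
        (PAGE, ORE + "aggregates", BASE + "/seq-1.jp2"),
    ],
}


def walk(aggregation, rem_uri, depth=0, seen=None):
    """Yield (depth, uri) for every resource nested under an Aggregation."""
    seen = set() if seen is None else seen
    if rem_uri in seen:          # guard against circular nesting
        return
    seen.add(rem_uri)
    triples = resource_maps.get(rem_uri, [])
    for subj, pred, obj in triples:
        if subj == aggregation and pred == ORE + "aggregates":
            yield depth, obj
            # If the aggregated resource has its own Resource Map, descend.
            for s, p, nested_rem in triples:
                if s == obj and p == ORE + "isDescribedBy":
                    yield from walk(obj, nested_rem, depth + 1, seen)


for depth, uri in walk(ISSUE, ISSUE_REM):
    print("  " * depth + uri)
```

The optional ore:isDescribedBy triple is what makes the descent possible; without it a client would know that the page is aggregated by the issue but not where to find the page's own Resource Map.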

Authoritative and Non-Authoritative Resource Maps

While a Resource Map is limited to describing only one Aggregation, an Aggregation may be described by more than one Resource Map. In other words, a Resource Map can only have one ore:describes assertion and cannot describe more than one Aggregation. For example, a nested Aggregation may include a Resource Map describing the Aggregation along with Resource Maps describing the Aggregations that are included in it as Aggregated Resources. When a client provides a webserver with the URI of the Aggregation, the Resource Map it references in its response is called the authoritative Resource Map. ORE requires that there be at least one authoritative Resource Map, although there may be more than one, each having its own unique URI. For example, an Aggregation may have multiple serializations (e.g., RDF/XML, RDFa, Atom XML) of its Resource Maps. In any case, all authoritative Resource Maps that describe the same Aggregation are required to assert the same Aggregation Graph. A non-authoritative Resource Map may describe an Aggregation; however, the web server will not reference it in responding to a request for the URI of the Aggregation.12
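The requirement that authoritative Resource Maps agree can be pictured as a set comparison. The following is a minimal sketch under stated assumptions: the Aggregation Graph is simplified to the set of URIs reached through ore:aggregates, same_aggregation_graph is a hypothetical helper, and seq-1.atom is an invented URI for a second serialization that does not appear in the NDNP example.

```python
ORE = "http://www.openarchives.org/ore/terms/"
BASE = "http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/"
PAGE = BASE + "seq-1#page"


def aggregation_graph(aggregation_uri, triples):
    """Simplified Aggregation Graph: the URIs the Aggregation aggregates."""
    return frozenset(obj for subj, pred, obj in triples
                     if subj == aggregation_uri and pred == ORE + "aggregates")


def same_aggregation_graph(aggregation_uri, *resource_maps):
    """True if every Resource Map asserts the same set of Aggregated Resources."""
    graphs = {aggregation_graph(aggregation_uri, triples)
              for triples in resource_maps}
    return len(graphs) <= 1


rdfxml_map = [
    (BASE + "seq-1.rdf", ORE + "describes", PAGE),
    (PAGE, ORE + "aggregates", BASE + "seq-1.pdf"),
    (PAGE, ORE + "aggregates", BASE + "seq-1.jp2"),
]
atom_map = [
    (BASE + "seq-1.atom", ORE + "describes", PAGE),  # hypothetical URI
    (PAGE, ORE + "aggregates", BASE + "seq-1.jp2"),
    (PAGE, ORE + "aggregates", BASE + "seq-1.pdf"),
]
incomplete_map = atom_map[:2]  # drops one Aggregated Resource

print(same_aggregation_graph(PAGE, rdfxml_map, atom_map))        # True
print(same_aggregation_graph(PAGE, rdfxml_map, incomplete_map))  # False
```

The order in which the triples are written does not matter, which mirrors the point that different serializations may express the same semantics differently.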

Putting It All Together

Using and graphing RDF triples effectively demonstrates the ORE data model. In this chapter, we have explored the relationship between a Resource Map and its Aggregation, metadata, Aggregated Resources, and nested Aggregations. We’ve used an example, the National Digital Newspaper Program and the Library of Congress’s Chronicling America website, to illustrate these concepts in an implementation (see figure 8). Keep in mind that this is not a comprehensive account of the ORE specification (for example, Proxies and other advanced concepts have not been presented), so you are encouraged to reference the documents in the last chapter for more information. In the next chapter, we will look at one way to serialize a Resource Map, RDF/XML, and investigate how Resource Maps can be exposed to clients by a web server and discovered.

Figure 8 Putting it all together

Notes
1. Frank Manola and Eric Miller, “RDF Primer,” World Wide Web Consortium website, Feb. 10, 2004, http://www.w3.org/TR/rdf-primer (accessed March 6, 2010).
2. Ibid.
3. National Digital Newspaper Program website, http://www.neh.gov/projects/ndnp.html (accessed March 6, 2010).
4. Ibid.
5. Ed Summers, interview by the author, January 21, 2010.
6. Ibid.
7. Carl Lagoze et al., “ORE Specification: Abstract Data Model,” 2008, Open Archives Initiative Object Reuse and Exchange website, http://www.openarchives.org/ore/1.0/datamodel (accessed March 10, 2010).
8. Ibid.
9. “The Friend of a Friend (FOAF) Project,” FOAF Project website, http://www.foaf-project.org (accessed March 6, 2010).
10. Lagoze et al., “ORE Specification: Abstract Data Model.”
11. Ibid.
12. Ibid.
13. Ibid.

Chapter 3

Serializing and Exposing Resource Maps


Abstract


Serializing and exposing Resource Maps is where the “rubber meets the road” in ORE. As described in chapter 1, to serialize a Resource Map essentially means to write it down in a format that can be transmitted and read by a machine. The ORE specification documents three different formats for Resource Map serialization: RDF/XML, RDFa, and Atom XML, although there are others.1 In this chapter of “Object Reuse and Exchange,” we will continue the example of the National Digital Newspaper Program and demonstrate the serialization of Resource Maps in RDF/XML for the RDF Graphs of an Aggregation and its Resource Map, metadata, Aggregated Resources, and a nested Aggregation. In addition to RDF/XML, RDFa and Atom XML serialization will be introduced. Strategies for exposing Resource Maps using the World Wide Web architecture and HTTP will be explored, including 303 redirection (with and without content negotiation), hash URIs, and RDFa. Lastly, we will discuss some mechanisms that are suggested by the ORE specification for enabling the discovery of Resource Maps, including batch discovery using protocols such as the OAI-PMH and resource embedding in webpages.

Serializing a Resource Map as RDF/XML

A Resource Map describes an Aggregation and its Aggregated Resources and relationships, which we explored in the previous chapter as RDF Graphs. RDF Graphs can be expressed as RDF/XML by translating the triples represented by the circles and arrows into XML. The subjects, predicates, and objects of the RDF triples are represented in RDF/XML as XML elements and attributes with their associated names and values. The syntax of RDF/XML allows triples to be expressed in different ways, so the same Resource Map can be expressed differently in RDF/XML but contain the same semantics.2 More detailed recommendations and syntax can be found in the “ORE User Guide: Resource Map Implementation in RDF/XML.”

To demonstrate some of the more common elements, we will iterate the triples and then serialize the RDF Graphs from chapter 2 for page and issue Aggregations from the National Digital Newspaper Program’s implementation, which uses RDF/XML. It may be helpful to flip back to the graphs and reference them with the Resource Maps as we construct them.

Aggregations and Resource Maps

We begin with an Aggregation for a newspaper page and its Resource Map. The same requirements for the graphed representations apply to Resource Maps. For instance, a Resource Map is required to express its relationship to the Aggregation it describes using the ore:describes predicate, and the subject of the triple must be the URI of the Resource Map. In this example, the Aggregation and Resource Map also declare their types. You can compare these triples that express the relationship between the Resource Map and its Aggregation in figure 9, which displays the entire Resource Map for a page Aggregation, encoded in RDF/XML.

1. The Aggregation is a newspaper page.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   rdf:type
   rdf:resource="http://chroniclingamerica.loc.gov/terms#Page"
2. Seq-1.rdf is a Resource Map.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1.rdf
   rdf:type
   rdf:resource="http://www.openarchives.org/ore/terms/ResourceMap"
3. The page Aggregation is described by a Resource Map, seq-1.rdf.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   ore:isDescribedBy
   rdf:resource="http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1.rdf"
4. The Resource Map, seq-1.rdf, describes the page Aggregation.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1.rdf
   ore:describes
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page

Figure 9 Aggregation Graph for a newspaper page Aggregation serialized as a Resource Map in RDF/XML

Adding Metadata

A Resource Map is required to express some basic metadata about itself, such as its creator and the last time and date it was modified. Compare these triples that express metadata in figure 9, which displays the entire RDF/XML of the Resource Map for a page Aggregation.

1. The Resource Map, seq-1.rdf, has a creator of dlc (the OCLC symbol of the Library of Congress).
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1.rdf
   dcterms:creator
   rdf:resource="http://chroniclingamerica.loc.gov/awardees/dlc#awardee"
2. The Resource Map, seq-1.rdf, was last modified on February 13, 2010 at 1:06 p.m.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1.rdf
   dcterms:modified
   rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
   2010-02-13T13:06:31-04:00
3. The Resource Map, seq-1.rdf, was created on February 13, 2010 at 1:06 p.m.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1.rdf
   dcterms:created
   rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
   2010-02-13T13:06:31-04:00

A Resource Map can also express metadata about the Aggregation it describes.

1. The page Aggregation was issued on December 14, 1918.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   dcterms:issued
   rdf:datatype="http://www.w3.org/2001/XMLSchema#date"
   1918-12-14
2. The page Aggregation is titled “The St. Joseph observer. - 1918-12-14 - 1” (the title of the newspaper, its date, and the page number).
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   dcterms:title
   “The St. Joseph observer. - 1918-12-14 - 1”
3. The page Aggregation has a sequence number of 1 (i.e., it is the first page of the newspaper).
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   ndnp:sequence
   rdf:datatype="http://www.w3.org/2001/XMLSchema#long"
   1
4. The page Aggregation is depicted by thumbnail.jpg (a thumbnail image can serve as metadata about an object).
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   foaf:depiction
   rdf:resource="http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1/thumbnail.jpg"

Aggregated Resources

Now we change our focus to the relationship between the page Aggregation and its Aggregated Resources. Every page consists of a JPEG 2000 image, a PDF, a thumbnail JPEG, the raw text from the OCR, and the structured XML output of the OCR. The Resource Map states that these files are aggregated into a page. You can see how these triples have been encoded in RDF/XML in figure 9, which presents a Resource Map for a page Aggregation.

1. The page Aggregation aggregates a JPEG 2000 file, seq-1.jp2.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   ore:aggregates
   rdf:resource="http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1.jp2"
2. The page Aggregation aggregates a PDF file, seq-1.pdf.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   ore:aggregates
   rdf:resource="http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1.pdf"
3. The page Aggregation aggregates a thumbnail image file, thumbnail.jpg.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   ore:aggregates
   rdf:resource="http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/thumbnail.jpg"
4. The page Aggregation aggregates an OCR text file, ocr.txt.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   ore:aggregates
   rdf:resource="http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/ocr.txt"
5. The page Aggregation aggregates an OCR XML file, ocr.xml.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
   ore:aggregates
   rdf:resource="http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/ocr.xml"

Figure 10 Aggregation Graph for a newspaper issue Aggregation serialized as a Resource Map in RDF/XML

Nested Aggregations

The data model for the NDNP defines Aggregations for pages, issues, titles, and batches. Issues aggregate pages. Issues are aggregated by titles and batches. Let’s take a look at a set of triples that describe an issue Aggregation. You can see that a Resource Map for it would need to express that the issue is being aggregated by a title Aggregation and a batch Aggregation. The full Resource Map serialized in RDF/XML can be found in figure 10.

1. Ed-1.rdf describes an issue Aggregation.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1.rdf
   ore:describes
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
2. Ed-1.rdf is a Resource Map.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1.rdf
   rdf:type
   rdf:resource="http://www.openarchives.org/ore/terms/ResourceMap"
3. The Resource Map, ed-1.rdf, was modified on February 13, 2010 at 1:37 p.m.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1.rdf
   dcterms:modified
   rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime"
   2010-02-13T13:37:16-04:00
4. The Resource Map, ed-1.rdf, was created by dlc (the Library of Congress).
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1.rdf
   dcterms:creator
   rdf:resource="http://chroniclingamerica.loc.gov/awardees/dlc#awardee"


And now, some triples that describe the issue Aggregation:


1. The Aggregation is described by the Resource Map, ed-1.rdf.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
   ore:isDescribedBy
   rdf:resource="http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1.rdf"
2. The Aggregation is a newspaper issue.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
   rdf:type
   rdf:resource="http://purl.org/ontology/bibo/Issue"
3. The issue Aggregation was issued on December 14, 1918.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
   dcterms:issued
   rdf:datatype="http://www.w3.org/2001/XMLSchema#date" 1918-12-14
4. The issue Aggregation is titled “The St. Joseph observer. - 1918-12-14”.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
   dcterms:title
   “The St. Joseph observer. - 1918-12-14”
5. The issue Aggregation is aggregated by a title Aggregation.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
   ore:isAggregatedBy
   http://chroniclingamerica.loc.gov/lccn/sn90061457#title
6. The issue Aggregation is also aggregated by a batch Aggregation.
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
   ore:isAggregatedBy
   http://chroniclingamerica.loc.gov/lccn/batches/batch_mohi_carver_ver01#batch
7. The issue Aggregation aggregates a page Aggregation, seq-1 (first page).
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
   ore:aggregates
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-1#page
8. The issue Aggregation aggregates a page Aggregation, seq-2 (second page, and so on).
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1#issue
   ore:aggregates
   http://chroniclingamerica.loc.gov/lccn/sn90061457/1918-12-14/ed-1/seq-2#page
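The nesting described by the issue Aggregation triples can be made concrete with a minimal Python sketch. It simply stores a few of the triples listed above as (subject, predicate, object) tuples and looks up what the issue aggregates and what aggregates it. The `objects` helper and the tuple layout are illustrative assumptions, not part of the ORE specification or the NDNP software, and the prefixed predicate names are kept as plain strings rather than expanded to full URIs.

```python
# A sketch only: a handful of the issue Aggregation triples from the text,
# held as plain (subject, predicate, object) tuples.
BASE = "http://chroniclingamerica.loc.gov/lccn/sn90061457"
ISSUE = BASE + "/1918-12-14/ed-1#issue"

TRIPLES = [
    (ISSUE, "ore:isDescribedBy", BASE + "/1918-12-14/ed-1.rdf"),
    (ISSUE, "rdf:type", "http://purl.org/ontology/bibo/Issue"),
    (ISSUE, "dcterms:issued", "1918-12-14"),
    (ISSUE, "ore:isAggregatedBy", BASE + "#title"),
    (ISSUE, "ore:isAggregatedBy",
     "http://chroniclingamerica.loc.gov/lccn/batches/batch_mohi_carver_ver01#batch"),
    (ISSUE, "ore:aggregates", BASE + "/1918-12-14/ed-1/seq-1#page"),
    (ISSUE, "ore:aggregates", BASE + "/1918-12-14/ed-1/seq-2#page"),
]


def objects(triples, subject, predicate):
    """Return every object of the triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Pages nested inside the issue, and the Aggregations the issue is nested in.
pages = objects(TRIPLES, ISSUE, "ore:aggregates")
parents = objects(TRIPLES, ISSUE, "ore:isAggregatedBy")

for uri in pages:
    print("aggregates:", uri)
for uri in parents:
    print("is aggregated by:", uri)
```

A real client would use an RDF library and full predicate URIs; the point here is only that nesting is expressed by ordinary triples pointing in both directions.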

Other Serializations

RDF/XML was selected to demonstrate Resource Map serialization in this issue of Library Technology Reports because it was implemented in our example of the NDNP; however, this should not imply any endorsement of RDF/XML or give the impression that it is better or more widely adopted than other serializations. In fact, the ORE specification suggests that Atom may currently be the most widely used serialization.3 In addition to RDF/XML, the ORE specification documents two other formats, RDFa and Atom XML, although others may be used.

RDFa allows a Resource Map to be encoded in an XHTML document. One advantage to this approach is that the XHTML document can be presented as a webpage by a browser and can also be parsed by an ORE client to retrieve a machine-readable Resource Map. For example, the “splash page” for an Aggregation could be implemented as RDFa in XHTML and serve both a human user and a machine user-agent at the same time.

The Atom Syndication Format, part of the Atom standard, is an XML format that describes lists of related items as web feeds. Items in the feed are known as entries, and extensible metadata can be attached to an entry.4 Atom was originally created as an alternative to Really Simple Syndication (commonly known by its indefinite acronym, RSS). In ORE, Resource Maps may be exposed as Atom entries, or they may simply be linked from entries. In this way, Resource Maps can be discoverable in batches through Atom feeds. For more detailed information on RDFa and Atom XML serialization of Resource Maps, see the resources in chapter 5, namely the ORE User Guides for Resource Map Implementation in RDFa and Atom.

Exposing Resource Maps

Because of the pervasiveness of the World Wide Web, it is no surprise that the ORE specification recommends protocol-based URIs such as HTTP for Aggregations and Resource Maps.5 When most people think about HTTP, the scenario that comes to mind is a web browser requesting an HTML page from a webserver. HTTP has also been put to use in developing and deploying Web Services that use it to request and return XML and other kinds of data besides HTML webpages. The same architecture of the World Wide Web is employed in an HTTP implementation of ORE. In this section we’ll explore how Resource Maps can be made available by webservers to ORE clients after they have been serialized.

HTTP 303

Most Web users are familiar with the “404 Page Not Found” response code that you get when you experience a broken link or try to access a webpage that is no longer available from a webserver. In fact, there are a number of different codes that can be exchanged between an HTTP client (e.g., a web browser) and a webserver. One such response code is “303 See Other,” in which the webserver can provide the client with the location of another resource, perhaps in a different format or a different language than the resource that was requested. In this way, a web client could request the URI of an Aggregation, and the webserver could handle the request with a “303 See Other” that provides the URI of a Resource Map.

Information can be communicated between the web client and server in the HTTP headers; for example, the client may indicate it will accept particular formats by stating the acceptable MIME types in the header of its request.6 For Aggregations that have more than one serialized Resource Map, this content negotiation enables the webserver to provide the format that is preferred by the client. For example, a web client could specify that it will accept “application/atom+xml,” in which case the server may try to respond with the Atom XML serialization of the Aggregation’s Resource Map.

In the event that more than one acceptable format is available or no acceptable formats are available, it is left up to the webserver to determine which format to send in its response.7 The ORE specification recommends that the webserver be configured to attempt to honor the format that is requested by the web client for an Aggregation, but if it can’t, to respond with a Resource Map (as opposed to a non-machine-readable format or HTML). Furthermore, it recommends the webserver respond with a Resource Map in cases where an Aggregation is requested and no format preference is specified.8

A 303 redirection can also allow a human-readable “splash page” to be accessed as well as Resource Maps when one is available. The “splash page” should include HTML <link> elements to enable the discovery of Resource Maps that are associated with the Aggregation. Keep in mind that the client should be redirected to a Resource Map and not the splash page when the client does not specify an acceptable MIME type in its HTTP request.9

HTTP 303 with content negotiation is the recommended implementation when

• Your webserver supports 303 and content negotiation
• You require support for multiple Resource Maps
• You wish to include “splash pages” along with Resource Maps
• You wish to allow easy extensibility to future, additional Resource Maps or “splash pages”10

Note that 303s with content negotiation is the only recommended HTTP implementation that supports serving multiple Resource Maps for an Aggregation. If your webserver does not support content negotiation and you only have one Resource Map, the ORE specification recommends that 303 redirection without content negotiation be implemented. In this case, Resource Maps can become aware of each other by linking to each other using ore:isDescribedBy predicates and HTML <link> elements in HTML “splash pages.”11

Hash URIs

In some situations, your webserver may not support 303s or your Resource Maps may be hosted on a webserver that is not under your control. One simple work-around is to construct URIs for Aggregations using hash notation, where the URI to the Resource Map is followed by the pound sign (“#”) and a fragment that qualifies the URI as being for the Aggregation. For example:

• URI for Aggregation: http://example.com/resourcemap.rdf#aggregation
• URI for Resource Map: http://example.com/resourcemap.rdf
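As a minimal sketch of how a hash URI for an Aggregation relates to the URI of its Resource Map, the Python standard library’s `urldefrag` function splits a URI into the part a web client actually requests from the server and the fragment it keeps to itself. The example.com addresses are the placeholder URIs used in this chapter, not real resources.

```python
from urllib.parse import urldefrag

# Hash URI identifying the Aggregation (placeholder address from the text).
aggregation_uri = "http://example.com/resourcemap.rdf#aggregation"

# urldefrag separates the requestable URI from the client-side fragment.
request_uri, fragment = urldefrag(aggregation_uri)

print(request_uri)  # the URI sent to the webserver: the Resource Map
print(fragment)     # the part that stays with the client
```

Because the fragment is removed before the request is made, the webserver needs no special configuration: it only ever sees a request for the Resource Map file.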



This same notation is commonly used for “jump links” in HTML, to direct a browser to scroll down to a particular named anchor in a webpage. Because this is a client-side behavior, the web client does not send the fragment to the server in its request. And so, a client requesting or following a link to an Aggregation (http://example.com/resourcemap.rdf#aggregation) is effectively sending only the URI of the Resource Map (http://example.com/resourcemap.rdf) because it chops off the fragment (#aggregation). After the client receives a response that is the Resource Map, it may attempt to resolve the fragment but will fail, harmlessly. If the web client discards the fragment, you may be asking yourself, why bother with it in the first place? The reason is that the hash URI allows you to satisfy the requirement that an Aggregation have its own unique URI, while allowing an easy way to provide a Resource Map for it.12

RDFa

RDFa enables a Resource Map to be contained inside an XHTML document, which can be rendered by a web browser as a normal webpage and also be used by an ORE client as a Resource Map. The previous HTTP implementations (303s with or without content negotiation and hash URIs) work the same way with RDFa as other serializations of Resource Maps. The only difference is when there are multiple serializations available; the Resource Map in RDFa should be used as the default because it fulfills the requirement of being machine-readable with [...]

Discovering Resource Maps

There are a variety of ways in which Resource Maps (and thus, Aggregations) can be discovered on the Web after they have been serialized and exposed for ORE client applications, such as harvesters and crawlers, to find. The ORE specification suggests two categories of discovery, resource embedding and batch discovery, although new mechanisms or categories may emerge and evolve over time.14

Resource Embedding

One strategy for exposing Resource Maps for discovery is to link to them from within the <link> tags of an HTML webpage. Web crawlers that find the page can parse <link> elements, identify them as MIME-typed Resource Maps (e.g., application/rdf+xml), and follow their URIs to Resource Maps. Multiple Resource Maps and serializations of the same Resource Map can be provided with multiple <link> elements.15 This is illustrated in figure 11 for a webpage that displays an issue of a newspaper in the NDNP.

Figure 11
Enabling discovery of Resource Maps using the <link> element in HTML

In some cases, the webpage may itself be an Aggregated Resource that has one or more <link> elements that point to Resource Maps for Aggregations that include it. If the Resource Maps are available in an Atom feed, it may be exposed for autodiscovery using a <link> element. It is also possible to link to the URI of an Aggregation by specifying a <link> element without a type. Note that it is up to the crawler or other ORE client application to follow the links to the Resource Maps and process them in order to determine the relationships of the Aggregations and Aggregated Resources.16 This cannot be inferred by the placement of the <link> elements or any other information in the ORE documentation that encourages displaying the URIs of Aggregations in the content of web pages, for example, to allow users to copy and paste them into blog posts, e-mails, and other environments.17

Exercise

Resource Maps for newspaper title and batch Aggregations have been serialized from the National Digital Newspaper Program as additional examples in Appendix 1 and 2. Can you understand and graph them with pencil and paper?

Batch Discovery

Beyond linking or embedding Resource Maps into webpages, Resource Maps may be made discoverable in batches more explicitly. If Resource Maps have been serialized using Atom XML, they may be exposed directly in Atom feeds, which supply their own mechanism for autodiscovery. Resource Maps can be linked from Atom entries or may be supplied as entries themselves.18 The URIs for Resource Maps and Aggregations may also be included in a Sitemap, which is an XML document that lists all of the URLs in a particular website along with some metadata describing them to a web crawler or other web client.19 The last batch discovery mechanism suggested by the ORE specification is the OAI-PMH (explained in chapter 1), which is a protocol for metadata harvesting. The serialization format of the Resource Map can be specified as a metadataPrefix, enabling the harvest of Resource Maps instead of or in addition to Dublin Core metadata records. Even if the full Resource Maps aren’t exposed to be harvested, another type of metadata record may be harvested that includes links to Resource Maps or Aggregations and serves the purpose of facilitating their discovery.20
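Of the discovery mechanisms described in this chapter, resource embedding is the easiest to try out. The following is a minimal sketch, using only the Python standard library, of how a crawler might collect candidate Resource Map links from the <link> elements of a webpage by looking at their MIME types. The sample page, its example.com addresses, the `ResourceMapLinkFinder` class, and the `rel` values shown are all assumptions made for illustration; consult the ORE discovery guide for the attribute values it actually recommends.

```python
from html.parser import HTMLParser

# MIME types treated here as possible Resource Map serializations (assumption:
# an Atom link may also point to an ordinary feed, so a client must still check).
RESOURCE_MAP_TYPES = {"application/rdf+xml", "application/atom+xml"}


class ResourceMapLinkFinder(HTMLParser):
    """Collect (type, href) pairs from <link> elements with a matching type."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        if attributes.get("type") in RESOURCE_MAP_TYPES and attributes.get("href"):
            self.found.append((attributes["type"], attributes["href"]))


# Hypothetical page; the rel values are illustrative, not taken from the text.
SAMPLE_PAGE = """<html><head>
<title>Example issue page</title>
<link rel="stylesheet" type="text/css" href="/style.css">
<link rel="resourcemap" type="application/rdf+xml" href="http://example.com/issue.rdf">
<link rel="alternate" type="application/atom+xml" href="http://example.com/issues.atom">
</head><body></body></html>"""

finder = ResourceMapLinkFinder()
finder.feed(SAMPLE_PAGE)
for mime_type, href in finder.found:
    print(mime_type, href)
```

Finding the links is only the first step: as noted above, the client still has to retrieve and process each Resource Map to learn how the Aggregations and Aggregated Resources relate to one another.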

Notes


1. Carl Lagoze et al., “ORE User Guide: Primer,” Oct. 17, 2008, Open Archives Initiative Object Reuse and Exchange website, http://www.openarchives.org/ore/1.0/primer (accessed March 6, 2010).
2. Carl Lagoze et al., “ORE User Guide: Resource Map Implementation in RDF/XML,” Oct. 17, 2008, Open Archives Initiative Object Reuse and Exchange website, http://www.openarchives.org/ore/1.0/rdfxml (accessed March 6, 2010).
3. Carl Lagoze et al., “ORE User Guide: HTTP Implementation,” Oct. 17, 2008, Open Archives Initiative Object Reuse and Exchange website, http://www.openarchives.org/ore/1.0/http.html (accessed March 6, 2010).
4. M. Nottingham and R. Sayre, eds., “RFC 4287: The Atom Syndication Format,” Dec. 2005, The Internet Engineering Task Force, http://www.ietf.org/rfc/rfc4287.txt (accessed March 6, 2010).
5. Lagoze et al., “ORE User Guide: HTTP Implementation.”
6. Leo Sauermann and Richard Cyganiak, eds., “Cool URIs for the Semantic Web,” World Wide Web Consortium website, Dec. 3, 2008, http://www.w3.org/TR/cooluris (accessed March 6, 2010).
7. K. Holtman and A. Mutz, “RFC 2295: Transparent Content Negotiation in HTTP,” March 1998, The Internet Engineering Task Force, http://www.ietf.org/rfc/rfc2295.txt (accessed March 6, 2010).
8. Lagoze et al., “ORE User Guide: HTTP Implementation.”
9. Ibid.
10. Ibid.
11. Ibid.
12. Ibid.
13. Ibid.
14. Carl Lagoze et al., “ORE User Guide: Resource Map Discovery,” Oct. 17, 2008, Open Archives Initiative Object Reuse and Exchange website, http://www.openarchives.org/ore/1.0/discovery (accessed March 6, 2010).
15. Ibid.
16. Ibid.
17. Ibid.
18. Ibid.
19. “What Are Sitemaps?” sitemaps.org website, http://www.sitemaps.org (accessed March 6, 2010).
20. Lagoze et al., “ORE User Guide: Resource Map Discovery.”


Chapter 4

Implementations of ORE


Abstract


What are people doing with ORE in the real world? In this chapter we will explore eight different implementations of ORE that may be of interest to librarians. The Texas Digital Library created an implementation of ORE as a component of its digital library of electronic dissertations and theses. Microsoft External Research recently introduced the Zentity institutional repository and a plug-in for Word that generates Resource Maps. At Johns Hopkins University, librarians are participating in e-Science initiatives with the U. S. National Virtual Observatory to help astronomers manage massive data sets. In Australia, the LORE tool was created as an extension to the Mozilla Firefox web browser to enable literary scholars to encapsulate their digital resources and bibliographic metadata as ORE aggregations. Lastly, we speak with Patrick Hochstenbach about his thoughts on ORE and the Biblio institutional repository and academic bibliography at Ghent University in Belgium.

Vireo: An ORE Implementation for DSpace

Many academic libraries provide services to support students and faculty in the submission and archiving of electronic theses and dissertations (ETDs). In the state of Texas, the Texas Digital Library (TDL) is a consortium of eighteen universities and has a mission of providing common infrastructure, services, and training to support the scholarly communication needs of its member institutions.1 Among other services, TDL provides platforms for hosting open-access journals and wikis, and it supports a federation of institutional repositories.

These institutional repositories are running the DSpace software. Some members host their own DSpace repository; some share a repository at another member’s institution; and others use a shared repository hosted by TDL. These institutional repositories provide a publishing and archiving platform, typically for born-digital documents such as journal article preprints, conference papers, and technical reports. Some members were using their institutional repository for the submission of theses and dissertations, too.

Vireo http://www.tdl.org/etds

With support from the Institute of Museum and Library Services, TDL began a process of seeking input from its stakeholders to design a new system for managing and preserving ETDs. The project, named Vireo, sought to leverage existing infrastructure, implement new workflows, and scale up to a distributed, statewide ETD system. The Manakin software was used to create a customized user interface for DSpace to enable students to submit their dissertations.2 The dissertation and its related files and metadata are then stored in the local DSpace repository.

At a high level, DSpace repositories are organized into communities, collections, and items. Items are made up of metadata and bundles, which contain one or more bitstreams.3 For example, a university department may constitute a community, which has a collection of technical reports. Each individual technical report may be represented as an item that includes a metadata record that describes the report and a bundle of files, which can be thought of as bitstreams. In this case, there may be a bitstream that represents an Adobe Acrobat file (PDF) of the report.

In mapping the DSpace data model to ORE, TDL decided to define aggregations for communities, collections, and items. It wrote some code to enable each DSpace repository to generate and interpret Resource Maps for these kinds of aggregations and to expose them as metadata records using DSpace’s OAI-PMH interface. (Revisit chapter 1 for more information about the OAI-PMH.) The Resource Maps are serialized as Atom XML and can be harvested by an OAI-PMH service provider by specifying the proper metadata prefix.4

For its central DSpace repository, TDL employed an OAI-PMH harvester to harvest the Resource Maps and developed an ORE item importer. The ORE item importer resolves the URIs of the Aggregated Resources described in the harvested Resource Map, fetches them from the remote DSpace repository, and rebuilds the item with its bitstreams and metadata in the central DSpace repository. TDL also built a custom scheduling system to automate harvesting.5

Figure 12
Federating DSpace repositories of electronic theses and dissertations at the Texas Digital Library

In this way, TDL is using ORE to harvest all of the dissertations and metadata from its member institutions into a central repository where they can be more easily preserved and made accessible in a single location. Future plans include public syndication of the Resource Maps so that anyone on the Internet can access and use the ETDs in semantic applications. In fact, interest has already been expressed by water quality researchers in Texas who want to automate the harvest of data from dissertations that relate to their field.

TDL invested a great deal of time in developing and testing its software because it is implementing the software in a production environment with thousands of users. It is planning to release its ORE modifications to DSpace as open source software in February 2010.6

We wanted to operate at the object-level, with a protocol that is focused on objects and not just the metadata. The intent with Vireo was not to produce something that can harvest metadata but something that can harvest metadata and items inside of collections . . . and that’s one of the things that was exciting to us about ORE. We want to share everything we can possibly share in order to maximize the scholarship contained in the Texas Digital Library.
—Mark McFarland, Texas Digital Library

Foresite

Foresite began as a project funded by the Joint Information Systems Committee (JISC) in the United Kingdom to produce a demonstration of the ORE standard by creating Resource Maps of journals and their contents from the JSTOR archive of academic journals and delivering them as Atom documents to deposit in a DSpace repository using SWORD. The Resource Maps were ingested into DSpace as items that reference the content residing in JSTOR.7

Foresite libraries http://code.google.com/p/foresite-toolkit/

Foresite is probably more commonly known for producing open source Java and Python libraries for constructing, parsing, manipulating, and serializing Resource Maps. Both sets of libraries support the parsing and serialization of Resource Maps that are suggested in the ORE specification: Atom XML, RDF/XML, and RDFa. Additionally, they support serialization in Notation3 (N3), N-Triples, and Terse RDF Triple Language (Turtle).

ORE is used for describing compound/complex digital objects such as aggregations of journals, issues, articles, and pages within JSTOR and enabling the digital preservation of all of the copies of a resource. Of the two sets of libraries, Foresite’s implementation of ORE is more complete in Python than Java. In the Python libraries, Foresite hides the ORE data model (in RDF) underneath an object-oriented layer and familiar “pythonic” style. It was used to create ORE descriptions of the complete holdings of JSTOR, making available the graph of interconnected journals, issues, and articles, through structure as well as citations.8

JSTOR is currently modifying the Foresite code to use its own internal formats rather than the information exported in the original project. It hopes to make the Resource Maps available to users at some point in the future.9 The Foresite libraries are available for download from Google Code.
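Whether a Resource Map arrives through an OAI-PMH harvest, as in Vireo, or is produced by a toolkit such as Foresite, a client’s first step is usually to pull out the URIs of the Aggregated Resources so they can be fetched. The following minimal Python sketch shows that step with the standard library only. It is not TDL’s importer or the Foresite API: it assumes an RDF/XML serialization like the examples in chapter 3 (TDL itself serializes to Atom XML), the repository.example.org addresses and the `aggregated_resources` function are hypothetical, and it only handles ore:aggregates statements written with an rdf:resource attribute rather than the full range of RDF/XML syntax.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE_NS = "http://www.openarchives.org/ore/terms/"

# Hypothetical Resource Map for a repository item with two bitstreams.
SAMPLE_RESOURCE_MAP = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ore="http://www.openarchives.org/ore/terms/">
  <rdf:Description rdf:about="http://repository.example.org/item/1#aggregation">
    <ore:aggregates rdf:resource="http://repository.example.org/item/1/thesis.pdf"/>
    <ore:aggregates rdf:resource="http://repository.example.org/item/1/data.csv"/>
  </rdf:Description>
</rdf:RDF>"""


def aggregated_resources(rdf_xml):
    """Return the URIs named by ore:aggregates elements in an RDF/XML string."""
    root = ET.fromstring(rdf_xml)
    resource_attr = f"{{{RDF_NS}}}resource"
    return [
        element.attrib[resource_attr]
        for element in root.iter(f"{{{ORE_NS}}}aggregates")
        if resource_attr in element.attrib
    ]


for uri in aggregated_resources(SAMPLE_RESOURCE_MAP):
    print("would fetch:", uri)
```

An importer would then retrieve each URI and rebuild the item locally; a production client would more likely rely on an RDF parser than on element matching.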


Microsoft External Research


Microsoft External Research partners with universities to support research, traditionally in computer science, but also in other areas such as library science and e-Science. Along with supporting research projects that are directed outside of Microsoft, External Research engages in activi­ ties such as sponsoring academic conferences, providing fellowships and internships, and producing software tools to foster and improve the research process. Microsoft provided early support for the development of the ORE specification along with the National Science Foundation, the Andrew W. Mellon Foundation, and the Coalition for Networked Information.10

The most important thing to know about ORE is not ORE at all, but the web architecture on which it is based. Understanding how the web works, and then how RDF works, is more important than knowing the details of ORE. Once the fundamentals are understood, ORE is very straightforward. Understanding that it is not just another compound object format like METS or MPEG21 DIDL is the most important thing for libraries. It could be used for Archives in place of Encoded Archival Description (EAD), for compound objects in place of METS, for collections in place of proprietary databases or description formats. —Robert Sanderson, Los Alamos National Labs and ORE, to enable interoperability and integration with other tools and services. An included toolkit and code samples allow developers to present data in original ways, demonstrating, for example, the relationships between a published paper, authors, research data, associated lec­ tures, presentation slides, or PDFs.13

Open Repositories 2009 https://or09.library.gatech.edu

Zentity The creation of the Scholarly Communication program within Microsoft External Research by Tony Hey in 2007 has also yielded many valuable contributions. One example is the Zentity repository, which was launched at the Open Repositories conference in Atlanta in 2009. Microsoft sought to build a new repository platform from scratch on top of its product stack: Microsoft Windows, SQL Server, and the Microsoft Entity and .NET Frameworks. Zentity provides a turn-key repository solution with a default set of user interfaces, workflows, and a schema that defines typical repository entities and relationships.11 They made an effort to incorporate as many open community pro­ tocols as possible, including SWORD,12 the OAI-PMH, Object Reuse and Exchange (OAI-ORE)  Michael Witt

While Zentity is one of the newest players in the institutional repository space, it may be the most mature and tightly integrated ORE implementation that is cur­ rently available as a part of a repository platform.  Any time you have the URI for an entity, you can retrieve a Resource Map from the data store that describes all of the entities’ relationships. For example, if you have the URI of a person, you can see an aggregation of the papers that person has authored, or the lectures that person has given, or the papers that person had reviewed. Resource Maps for these aggregations are defined automatically and updated dynamically, essen­ tially by querying the store and then serializing them

as RDF/XML. Even though it is built atop SQL Server, Zentity is designed to behave more like an RDF triplestore than a relational database.14 Other well-known repository software such as DSpace, Fedora, and e-Prints are considered to be open source, which means that their source code is published and freely shared. Depending on the license being used, software developers may be free to create their own addi­ tions to or implementations of the software that can be proprietary. In fact, some companies have begun selling their own commercial implementations and extensions of these repositories. In this way, these repositories may be seen as open core software, because the base reposi­ tory Remains in the open source while the additions to it can be proprietary. Microsoft is planning on releasing the source code for Zentity as open edge software, which is exactly the opposite of open core software. In other words, the core of Zentity (e.g., SQL Server) is a propri­ etary, commercial product, but the extensions to create a repository application will be open source and can be

ORE is a way of assigning semantics to the web. The more data that is described and linked extends its capability and opens up more data to a long tail of creative and unintended uses. —Alex Wade, Microsoft External Research

Zentity download website http://research.microsoft.com/en-us/downloads/ 48e60ac1-a95a-4163-a23d-28a914007743

Object Reuse and Exchange (OAI-ORE)  Michael Witt

Library Technology Reports  www.alatechsource.org  May/June 2010

freely shared and modified. Zentity can be freely downloaded from Microsoft’s website. There is also a discussion board for Zentity that is hosted by the Microsoft Research Community.

Microsoft Research Community discussion board for Zentity http://community.research.microsoft.com/forums/90.aspx

Article Authoring Add-In for Microsoft Word

At this point, most ORE implementations focus on platforms that store data or relate existing data to each other. ORE can also be tremendously useful when it is integrated into tools that create new data, such as the Article Authoring Add-In for Microsoft Word. The add-in was originally developed to help authors use Word to write articles in a format required by the National Library of Medicine. It enables more metadata to be captured and stored at the authoring stage and enables semantic information to be preserved through the publishing process, which is essential for enabling search and semantic analysis once the articles are archived within repositories.15 The author can also submit the article to PubMed Central or another repository directly from within Word, using its SWORD functionality.16

In the past, librarians have put too much emphasis on the container: the book, the journal, the article. And by doing so, we have pigeonholed our collections. By throwing away the container and embracing approaches like ORE and Linked Data, it opens up our data to a wider field of discovery and use for much richer applications. ORE broadens the impact of data by making it machine-accessible. —Alex Wade, Microsoft External Research

A demonstration of the add-in and a description of its ties to OpenXML can be found on YouTube, as well as a more detailed account of how it uses ORE. In a nutshell, the add-in attempts to make it easier for researchers to write articles. Authors can insert properly formatted bibliographic citations by directly querying PubMed Central from Word, and the add-in can automatically populate metadata (e.g., grant information, author affiliation) that the author used to have to enter into a web form before submitting. As authors insert data into their articles, the add-in records some of its semantics. For example, an author may embed a data set, workflow, or image that has its own URI into a document. When the Word file is saved, a Resource Map describing these Aggregated Resources is serialized as RDF/XML and embedded into the article’s .docx file. In this way, a downstream ORE application can later extract the Resource Map and handle the article as an Aggregation.17

YouTube video: Scientific & Technical Article Authoring Add-in Tour http://www.youtube.com/openxml#p/u/11/EuhAokemuH8


YouTube video: Article Authoring Add-in Tool for Word 2007 and Object Reuse and Exchange http://www.youtube.com/watch?v=ITSTGPbpA2A
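The embedding step described for the add-in, in which a Resource Map is serialized as RDF/XML and stored inside the article's .docx file for a downstream application to extract, can be illustrated with a minimal Python sketch. It relies only on two facts: a .docx file is a ZIP package, and the ORE vocabulary provides the `ore:describes` and `ore:aggregates` terms. The part name `customXml/resourceMap.xml`, the example.org URIs, and all function names are assumptions made for illustration; where the add-in actually stores the Resource Map is not stated here. The Resource Map is deliberately minimal and omits the metadata (for example, creator and modification time) that a complete one carries, and a real OpenXML package would also need content-type and relationship entries for the new part.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ORE = "http://www.openarchives.org/ore/terms/"

# Hypothetical part name; the add-in's real storage location is not given here.
REM_PART = "customXml/resourceMap.xml"


def build_resource_map(rem_uri, aggregation_uri, resource_uris):
    """Serialize a minimal Resource Map as RDF/XML bytes."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("ore", ORE)
    root = ET.Element(f"{{{RDF}}}RDF")
    # The Resource Map describes the Aggregation ...
    rem = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": rem_uri})
    ET.SubElement(rem, f"{{{ORE}}}describes", {f"{{{RDF}}}resource": aggregation_uri})
    # ... and the Aggregation aggregates each embedded resource.
    agg = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": aggregation_uri})
    for uri in resource_uris:
        ET.SubElement(agg, f"{{{ORE}}}aggregates", {f"{{{RDF}}}resource": uri})
    return ET.tostring(root, encoding="utf-8")


def embed(package_bytes, rem_bytes):
    """Return a copy of the ZIP package with the Resource Map added as a part."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            dst.writestr(item, src.read(item.filename))
        dst.writestr(REM_PART, rem_bytes)
    return out.getvalue()


def extract_aggregated(package_bytes):
    """Scan every XML part for ore:aggregates and return the aggregated URIs."""
    found = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as pkg:
        for name in pkg.namelist():
            if not name.endswith(".xml"):
                continue
            try:
                root = ET.fromstring(pkg.read(name))
            except ET.ParseError:
                continue  # not well-formed XML; skip it
            for el in root.iter(f"{{{ORE}}}aggregates"):
                found.append(el.get(f"{{{RDF}}}resource"))
    return found


def make_stub_docx():
    """Build a tiny in-memory ZIP standing in for a real .docx file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


DATA_URIS = ["http://example.org/data/set1", "http://example.org/img/fig1.png"]
rem = build_resource_map(
    "http://example.org/article/rem",
    "http://example.org/article/aggregation",
    DATA_URIS,
)
docx = embed(make_stub_docx(), rem)
print(extract_aggregated(docx))
```

Scanning every XML part, rather than opening one fixed path, keeps the extraction side independent of where a particular authoring tool chooses to place the Resource Map.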

The Article Authoring Add-In for Microsoft Word is still being enhanced, but the current version can be freely downloaded from Microsoft’s website.

Article Authoring Add-in for Word 97 Beta 2 Preview for download http://research.microsoft.com/en-us/downloads/b844bcfa-2d27-4d96-9fe7-2bd16a54e4b4

U. S. National Virtual Observatory


The goal of the U. S. National Virtual Observatory (NVO) is to “enable a new way of doing astronomy” by making it possible for researchers to find, retrieve, and analyze astronomical data from ground- and space-based telescopes worldwide.18 The NVO is sponsored by the National Science Foundation and is based at Johns Hopkins University (JHU). Librarians at the university’s Digital Research and Curation Center were early collaborators with astronomers in building solutions for submitting, publishing, and curating data sets for the NVO community, applying many of the principles of library science to the management of large astronomical data collections. Tim DiLauro, Digital Library Architect at the JHU Sheridan Libraries, describes their project:

The overall goal of the project is to capture the data that is associated with publications, deposit them into a data archive, and enable services over the data in the archive. One of the most fundamental aspects of scientific scholarly communication is the ability to access and examine cited data. Without this ability, the very essence of the scientific method, with its requirement of validating results, becomes compromised. The NVO is playing a leadership role in building services for the astronomy community to access and analyze astronomical data. However, thus far the scope of the NVO has deliberately not included long-term data curation, focusing instead on data location and data access standards and protocols. One of the goals of our project, which is a collaboration of astronomers, a scholarly society, its publishing production partner, and research libraries, is to capture data that is related to a journal article when it is submitted [and archive it]. The challenges are several:

• To gather more metadata and datasets from authors without significantly increasing their workload,
• To simplify deposit process for authors and publishers, and
• To enable linking between articles and datasets without significant impact on publisher systems.

To accomplish these goals, we chose ORE as an enabling technology.19

Figure 13 ORE data publishing workflow for the U. S. National Virtual Observatory

The project uses ORE to describe the relationships between data and an article. For example, an article may include images, tables, and graphs that are embedded in it. The article is treated as an Aggregation, and when it is submitted a Resource Map is generated that identifies and links the article and the data behind these embedded objects. If the article was written using Microsoft Word for Windows, the Resource Map can be created by the Article Authoring Add-In. JHU has also created a web-based application that can generate a Resource Map for other formats that are common in astronomy, such as LaTeX. In either case, the document is submitted to the publishers with its Resource Map using SWORD.20

Unlike the other ORE implementations described in this chapter, JHU does not maintain the Resource Maps after they are generated. They are used by the publisher to link and ingest the article and its data when they are submitted. The publishers’ systems can then track the relationships in their own way.21

JHU considered other options, such as METS, and specifically structMaps, but decided that ORE was a better fit because it was designed to express relationships among resources and to support the expression of complex objects. They also anticipate that the tools that will be developed for ORE will align more closely with their needs in the future.22

Along with DiLauro, who served on the ORE Technical Committee, Sayeed Choudhury from JHU’s Sheridan Libraries helped develop the ORE specification by serving on the ORE Advisory Committee. The NVO is currently being operationalized by NASA as the U. S. Virtual Astronomical Observatory. JHU’s implementation of ORE will be rolled into the Data Conservancy,23 one of two DataNet projects that were funded by the NSF in 2009.

Notes
1. “About the Texas Digital Library,” Texas Digital Library website, http://www.tdl.org/about-tdl (accessed March 6, 2010).
2. Scott Phillips, Cody Green, Alexey Maslov, Adam Mikeal, and John Leggett, “Manakin: A New Face for DSpace,” D-Lib Magazine 13, no. 11/12 (Nov./Dec. 2007), http://www.dlib.org/dlib/november07/phillips/11phillips.html (accessed March 15, 2010).
3. MacKenzie Smith, Lecture Notes on Computer Science, vol. 2458, 2002, 543–549 http://dspace.mit.edu/handle/1721.1/26706 (accessed March 11, 2010).
4. Adam Mikeal, James Creel, Alexey Maslov, Scott Phillips, and John Leggett, “Large-Scale ETD Repositories: A

LORE: A Compound Object Authoring and Publishing Tool for Literary Scholars

The Australian Literature Resource (also known as AustLit) is a collaboration between the National Library of Australia and twelve Australian universities to index and provide authoritative information on more than 100,000 Australian authors, going back to 1780.24 Literature Object Re-use and Exchange (LORE) was created in this context as a lightweight tool to enable researchers to create and publish ORE-compliant literary objects that encapsulate their digital resources and bibliographic metadata.25 LORE runs as a plug-in to the Mozilla Firefox web browser. It provides a graphical tool for drawing and labeling typed relationships between objects using terms from a bibliographic ontology. Metadata can be attached to the object, which can then be published as an RDF graph to a repository, where it can be searched, downloaded, edited, and reused by others.26 It stores and queries Named Graphs that represent literary compound objects using web services on a Sesame 2 (RDF triple-store) or Fedora repository. The types for relationships among or between Aggregations and metadata that describe the Aggregated Resources are specified by an OWL ontology that was developed after examining the topic types and relationships present in AustLit. The ontology is based on FRBR but was extended to support additional relationships.

The LORE authoring interface displays a graphical visualization of the Resource Maps, with their Aggregated Resources represented as nodes and the arrows between them representing their typed relationships. A node presents a preview of its resource such as a thumbnail image, making it easy to locate and identify resources. Clicking on an identifier will load the resource in the web browser window. Along with the visualization, the Resource Maps are displayed as RDF/XML in a text window. New resources can be added as they are browsed in the browser window. A toolbar allows objects to be saved and loaded from the repository, and another panel enables the user to search and browse Resource Maps.
Finally, metadata can be added or edited in the Properties panel.27 LORE was created as a part of the Aus-e-Lit project by the eResearch group at the University of Queensland28 under the direction of Jane Hunter, who also served on the ORE Advisory Committee.
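The idea behind a LORE compound object, an Aggregation whose Aggregated Resources are linked by typed relationships and stored as a Named Graph, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LIT` namespace and the relationship name `isManifestationOf` are hypothetical stand-ins for the AustLit-derived OWL ontology, the example.org URIs are invented, and N-Quads is used simply as a convenient plain-text way to attach a graph name to each triple. It is not LORE's own code or storage format.

```python
from dataclasses import dataclass, field

ORE = "http://www.openarchives.org/ore/terms/"
# Hypothetical namespace standing in for the AustLit-derived OWL ontology.
LIT = "http://example.org/literary-ontology#"


@dataclass
class CompoundObject:
    """An Aggregation plus typed relationships between its Aggregated Resources."""

    aggregation: str
    triples: list = field(default_factory=list)

    def add_resource(self, uri):
        """Record that the Aggregation aggregates the given resource."""
        self.triples.append((self.aggregation, ORE + "aggregates", uri))

    def relate(self, subject, relation, obj):
        """Add a typed relationship; `relation` is a local name in the LIT namespace."""
        self.triples.append((subject, LIT + relation, obj))

    def to_nquads(self, graph_uri):
        """Serialize every triple as one N-Quads line in a single named graph."""
        return "\n".join(
            f"<{s}> <{p}> <{o}> <{graph_uri}> ." for s, p, o in self.triples
        )


poem = "http://example.org/works/poem"
edition = "http://example.org/editions/poem-1895"

obj = CompoundObject("http://example.org/objects/1#aggregation")
for uri in (poem, edition):
    obj.add_resource(uri)
obj.relate(edition, "isManifestationOf", poem)

print(obj.to_nquads("http://example.org/objects/1"))
```

Keeping the graph name as a parameter of the serialization step mirrors the Named Graph approach described above: the same set of relationships can be published, queried, or replaced as a unit by referring to that one name.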


Interview with Patrick Hochstenbach, Ghent University

Please describe your overall project. In what context are you using ORE?

I am working with Ghent University’s Academic Bibliography and institutional repository, which together we refer to as Biblio.29 Biblio is based on three subprojects:

1. Ghent University Library was asked by the University Department of Research Affairs to create a bibliography of all publications written by our university. The main purpose of the bibliography is to create bibliometric calculations. Faculty and department funding as well as doctoral promotions are based on the number of publications available in our bibliography. The bibliography needs to be complete, and the metadata quality needs to be very high to perform accurate bibliometrics. Ghent University researchers are required to register all their publications in the Biblio application in order to receive funding. Library staff enriches these registered publications by adding and enhancing their metadata.


2. The Biblio application is also an institutional repository. Our library is very open-access-minded. We are involved in many national and international projects to promote open access publishing and the archiving of scientific research output.


3. Ghent University is a member of the European DRIVER project, which is a European Commission Seventh Framework program to optimize access to research data. Our group was involved in setting up a Belgian portal for institutional repositories, and we theorized about possible extensions to the OAI-PMH–based grid to allow for easier integration with new networks and exchange of complex objects.30

The Biblio project started four years ago as a DSpace implementation. However, new requirements became too difficult to implement in DSpace, and so we needed to migrate to a new solution. Two years ago, in a joint project between Lund University and Ghent University, we developed a new generation of the Biblio software from scratch that now acts as the combined bibliography and institutional repository of Ghent University.

Can you describe your particular implementation of ORE? How are you using it?

In our Driver Technology Watch Report, we theorized about several possible techniques to disseminate complex objects

from institutional repositories. To test the theory in practice, we implemented several of these ideas in the Biblio application: ORE, MPEG-21/DIDL, METS, Sitemaps, microformats and simple CSV/Excel exports.31 The Biblio application uses a template system to generate webpages. Exporting ORE or any other format required adding a new template that creates RDF webpages instead of HTML webpages. We changed our code a bit to allow for HTTP Content Negotiation so that RDF web harvesters would be able to get easier access to the ORE exports.
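The combination described in this answer, one template per export format plus HTTP content negotiation to choose between them, can be sketched in Python as follows. This is a simplified illustration and not the Biblio code: the template file names are hypothetical, only the `q` parameter of the `Accept` header is interpreted, and wildcards other than `*/*` are ignored.

```python
# Hypothetical template names; the Biblio application's real templates are not described.
TEMPLATES = {
    "text/html": "record.html.tmpl",
    "application/rdf+xml": "record.rdf.tmpl",
}
DEFAULT_TYPE = "text/html"


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    choices = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a media range without a q parameter has full weight
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        choices.append((media_type, q))
    # sorted() is stable, so entries with equal q keep the client's order.
    return sorted(choices, key=lambda c: c[1], reverse=True)


def choose_template(accept_header):
    """Pick the media type and template that best match the client's Accept header."""
    for media_type, q in parse_accept(accept_header or ""):
        if q <= 0:
            continue  # q=0 marks a type the client refuses
        if media_type in TEMPLATES:
            return media_type, TEMPLATES[media_type]
        if media_type == "*/*":
            break  # the client accepts anything: use the default
    # Simplification: fall back to HTML when nothing matches.
    return DEFAULT_TYPE, TEMPLATES[DEFAULT_TYPE]


# A browser-style header gets HTML; an RDF harvester gets the RDF export.
print(choose_template("text/html,application/xhtml+xml,*/*;q=0.8"))
print(choose_template("application/rdf+xml"))
```

Because both representations live at the same URI, a harvester only needs to send a different `Accept` header to receive the RDF export. A stricter server could answer 406 Not Acceptable instead of falling back to HTML when no listed type matches.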

What makes ORE well suited for this purpose? Did you consider any other options?

Yes, we considered all the options from our DRIVER report. All the options are very well suited in some way or another. Our partners inside the university tend to go for the simplest export possible: CSV/Excel. They have direct contact with our development team and can easily contact us to ask for new features. In national and international projects where this kind of direct interaction is not possible, ORE provides a very good way to join the Linked Data network32 with all the available tools and semantics. However, big search engines like Google, Yahoo, and MSN tend to favor their own semantics, and here microformats would be better suited. So, it depends a bit on which project or network partners you have and in which way (or better, “if”) you can negotiate on what standards and protocols to use.

Do you make your aggregations publicly accessible, and if so, how? What do you envision people doing with them?

At the moment we export all the formats mentioned above, but we don’t use them to import data (this is still on our to-do list). Our main task is to provide several alternative interfaces to our datasets. We encourage external developers to use these exports to provide extended services to our repository. We aren’t in direct contact with developers using our ORE datasets, but we see many crawlers accessing our site in search of these RDF exports. Our experiments can give valuable input for our DRIVER partners to decide which format to promote in Europe. METS is popular in the UK, the Netherlands is using MPEG-21/DIDL, and ORE is getting more and more popular, linked also to the Linked Data bandwagon. Microformats are hugely popular outside the library world (big names supporting them) but don’t get much traction in the digital repository world.

What do you think librarians and libraries need to know about ORE? How do you think ORE can or could impact libraries, especially digital library collections?

Librarians and libraries should learn about the new opportunities that ORE and, more generally, Linked Data will offer. Likewise they should know what kind of problems all these standards are trying to solve, rather than digging into the technicalities of the protocols and metadata formats. Take the early 90s, for example. I would have suggested that librarians should learn about the new world of hyperlinks and brainstorm about new possible applications for it, rather than discussing TCP/IP, DNS, and HTTP. This said, librarians and libraries should know enough about ORE to promote the great work that is being done in our library world on creating very powerful technical standards to describe library content. Big companies and external actors tend to reinvent the wheel when launching new products. Due to their size and power, they can often force their own formats and protocols onto us when there are already long-standing, established standards. The impact of ORE is tied to Linked Data. Two main discussions I foresee are:

1. In what ways will libraries open their digital collections to the world with easy licenses, so that external service providers can harvest your content and create new applications not under your local control?

2. And, in what way will libraries archive their content and provide perpetual access to their content?

In your opinion, what other projects could be considered exemplars for ORE that would be interesting to librarians?

The entire Linked Data world. ORE is a vocabulary added to the Linked Data semantics to better describe aggregations of information resources. There is not an “ORE-world” like there was an “OAI-PMH world,” where only OAI-PMH-capable applications could interact with the datasets. Everything in the Linked Data cloud is an example: DBPedia, US Census data, Geonames, etc. The list goes on, and it will only grow longer. ORE is part of that cloud. All of these collections can be available for use with the same tools. Now we need to invent new applications for libraries and librarians.

Case Study of a Digital Library Application,” in JCDL ’09: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, 135–144 (New York: Association for Computing Machinery, 2009), http://doi.acm.org/10.1145/1555400.1555423 (accessed March 6, 2010).
5. Alexey Maslov, Adam Mikeal, Scott Phillips, John Leggett, and Mark McFarland, “Adding OAI-ORE Support to Repository Platforms,” (presentation, Open Repositories Conference, Atlanta, GA, May 18–21, 2009), http://hdl.handle.net/1969.1/86479 (accessed March 6, 2010).
6. Mark McFarland, interview by the author, January 15, 2010.
7. Robert Sanderson, interview by the author, January 12, 2010.
8. Ibid.
9. Ibid.
10. Alex Wade, interview by the author, January 15, 2010.
11. Microsoft External Research Scholarly Communications program website, http://www.microsoft.com/mscorp/tc/scholarly_communication.mspx (accessed March 6, 2010).
12. Julie Allinson, Sebastien François and Stuart Lewis, “SWORD: Simple Web-Service Offering Repository Deposit,” Ariadne, issue 54 (January 2008), http://www.ariadne.ac.uk/issue54/allinson-et-al (accessed March 6, 2010).
13. Microsoft External Research Scholarly Communications program website.
14. Wade interview.
15. “Article Authoring Add-in for Word 2007,” Microsoft Research website, http://research.microsoft.com/authoring (accessed March 6, 2010).
16. Wade interview.
17. Wade interview.
18. “What Is the NVO?” U. S. National Virtual Observatory website, http://www.us-vo.org/what.cfm (accessed March 6, 2010).
19. Tim DiLauro, interview by the author, January 20, 2010.
20. Ibid.
21. Ibid.
22. Ibid.
23. Ibid.
24. “About AustLit,” AustLit website, http://www.austlit.edu.au/about (accessed March 6, 2010).
25. Anna Gerber and Jane Hunter, “LORE: A Compound Object Authoring and Publishing Tool for Literary Scholars Based on the FRBR,” (presentation, Open Repositories Conference, Atlanta, GA, May 18–21, 2009), http://hdl.handle.net/1853/28466 (accessed March 6, 2010).
26. Ibid.
27. Anna Gerber and Jane Hunter, “LORE: A Compound Object Authoring and Publishing Tool for the Australian Literature Studies Community,” in Digital Libraries: Universal and Ubiquitous Access to Information: 11th International Conference on Asian Digital Libraries, ICADL 2008, Bali, Indonesia, December 2008, Proceedings, ed. George Buchanan, Masood Masoodian, and Sally Jo Cunningham, 246–255 (Berlin: Springer-Verlag, 2008), DOI: 10.1007/978-3-540-89533-6_25.
28. “Compound Object Authoring and Publishing,” University of Queensland eResearch, http://www.itee.uq.edu.au/~eresearch/projects/aus-e-lit/#compoundobjects (accessed March 6, 2010).
29. Ghent University, Academic Bibliography and Institutional Repository of Ghent University, http://biblio.ugent.be (accessed March 15, 2010).
30. Patrick Hochstenbach, Karen Van Godtsenhoven, Maurice Vanderfeesten, Rosemary Russell, Gerd Schmelz Pedersen, and Mikael Karstens Elbaek, Driver Technology Watch Report (Driver Project, 2008), http://hdl.handle.net/1854/LU-723558 (accessed March 15, 2010).
31. See also Patrick Hochstenbach, “Linked-Data in the Academic Bibliography,” TekTok—Digital Library Technology Blog, Oct. 7, 2009, http://lib.ugent.be/tektok/2009/10/test.html (accessed March 15, 2010).
32. Linked Data website, http://linkeddata.org (accessed March 15, 2010).

Chapter 5

Resources

Selected ORE Implementations, Demonstrations, and Tools

Australian Partnership for Sustainable Repositories
A demonstration of ORE utilizing Resource Maps to integrate the Open Journal System with a DSpace repository for the submission and subsequent rendering of Aggregations to end-users.
Using OAI-ORE in an e-Journal-to-Repository Workflow http://www.apsr.edu.au/ore

Ben O’Steen, David Tarrant, Tim Brody
A powerful demonstration at the Open Repositories 2008 conference where a group of repository developers used ORE to transfer the entire contents of an e-Prints repository into a Fedora repository and back again.
The CRIG Repository Developer Challenge http://www.ariadne.ac.uk/issue57/rumsey-osteen

Data Archiving and Networked Services (DANS)
Demonstrations of “enhanced” publications that are constructed using ORE and displayed as webpages using client-side (DRIVER-II) and server-side (JALC) XSLT transformations. DRIVER-II also provides a Java applet for visualizing the relationships between the resources.
Enhanced Publications http://driver2.dans.knaw.nl http://develop01.dans.knaw.nl/jalc

European Commission
A portal, API, and collection of materials from cultural institutions all over Europe that uses ORE as a modeling tool to present complex digital objects.
Europeana http://www.europeana.eu

German National Library of Economics (ZBW)
A collection of more than six million digitized newspaper clippings between the 19th century and 2005 that are exposed as ORE Aggregations with Resource Maps serialized using RDFa.
20th Century Newspaper Archives http://zbw.eu/pm20

Ghent University
A custom-made institutional repository and faculty citation management system that includes support for ORE.
Biblio http://biblio.ugent.be

Japanese Digital Mathematics Library
Utilizes ORE for aggregating resources that relate to mathematics from distributed digital libraries, serializing Resource Maps as Atom XML.
Japanese Digital Mathematics Library http://dmljp.math.sci.hokudai.ac.jp


Jewish American Archive
A module for the Drupal content management system that allows you to build and present media presentations based on OAI-ORE.
Nodequeue OAI-ORE http://drupal.org/project/nodequeue_oaiore

Johns Hopkins University
Librarians collaborating with astronomers to publish, archive, and cross-link documents and research data sets.
National Virtual Observatory http://www.us-vo.org

Library of Congress and National Endowment for the Humanities
A long-term program to digitize a selection of newspapers published in the United States between 1836 and 1922 to preserve and improve access to them.
National Digital Newspaper Program http://www.loc.gov/ndnp

Los Alamos National Labs, Old Dominion University
A project that utilizes HTTP content negotiation and ORE concepts to retrieve archived versions of Web resources from the past.
Memento http://www.mementoweb.org

Michael Giarlo
A plug-in for the Wordpress blog platform that generates Resource Maps for blog pages and posts that are exposed using Atom.
Wordpress Plug-in for ORE http://lackoftalent.org/michael/blog/ore-wordpress-plug-in

Microsoft External Research
A word processor plug-in for Microsoft Word 2007 that adds support for SWORD and the generation and embedding of Resource Maps into .docx files.
Article Authoring Add-in for Microsoft Word http://research.microsoft.com/authoring

An institutional repository built on top of Microsoft SQL Server and the .Net framework that supports ORE natively.
Zentity http://research.microsoft.com/en-us/projects/zentity

Mellon Foundation
A broad collaboration to facilitate standards and implementations for leveraging annotations across clients, servers, and collections.
Open Annotation Collaboration http://www.openannotation.org

oreChem
A broad collaboration sponsored by Microsoft that builds on the work of ORE by defining a core model, ontology, and extensions for chemical entities, populating and exposing data sources using the model, and developing a set of demonstration applications in scholarly communication and research in chemistry.
oreChem http://www.openarchives.org/oreChem

Oskar Grenholm, National Library of Sweden
An open-source ORE implementation for the Fedora repository.
oreProvider http://oreprovider.sourceforge.net

University of Cambridge
A JavaScript application that presents a quick way to view and navigate through the resources in Aggregations in a pop-up pane that uses preloading and ORE autodiscovery.
peekORE http://blip.tv/file/1157218

University of Liverpool
A project that has produced code libraries for implementing ORE in Java and Python as well as an extension for Mozilla Firefox called explORE that can visually characterize Aggregations, which was originally developed for JSTOR and with support from HP Labs.
Foresite http://foresite.cheshire3.org


The winner of the “ORE Challenge” at RepoCamp 2008 that provides a visual interface for navigating nested Aggregations.
OREsome http://www.vimeo.com/1413467

University of Queensland
A Mozilla Firefox extension that enables researchers to create and publish ORE-compliant literary objects that encapsulate their digital resources and bibliographic metadata and view them as Aggregations.
LORE http://www.itee.uq.edu.au/~eresearch/projects/aus-e-lit

A scientific authoring, publishing, and editing environment that generates ORE-compliant digital objects.
SCOPE: A Scientific Compound Object Publishing and Editing System http://www.ijdc.net/index.php/ijdc/article/view/84

University of Southampton, University of Manchester
A social networking platform for sharing formalized scientific workflows that can be exposed as ORE Aggregations.
MyExperiment.org http://myexperiment.org

University of Southern Queensland, University of Cambridge
A demonstration of ORE in various applications in a production environment including integrating repositories and the production of electronic theses and dissertations.
ICE Theorem http://ice.usq.edu.au/introduction/ice_theorem.htm

Texas Digital Library
An implementation of ORE for the DSpace repository platform that supports a digital library of electronic theses and dissertations.
Vireo http://etd.tdl.org

Official ORE Documentation

• Primer, http://www.openarchives.org/ore/1.0/primer.html
• Tools and Additional Resources, http://www.openarchives.org/ore/1.0/tools.html

Specification
• Abstract Data Model, http://www.openarchives.org/ore/1.0/datamodel.html
• Vocabulary, http://www.openarchives.org/ore/1.0/vocabulary.html

User Guides
• Resource Map Implementation in Atom, http://www.openarchives.org/ore/1.0/atom.html
• Resource Map Implementation in RDF/XML, http://www.openarchives.org/ore/1.0/rdfxml.html
• Resource Map Implementation in RDFa, http://www.openarchives.org/ore/1.0/rdfa.html
• HTTP Implementation, http://www.openarchives.org/ore/1.0/http.html
• Resource Map Discovery, http://www.openarchives.org/ore/1.0/discovery.html

Selected References
• The Architecture of the World Wide Web, Volume 1, http://www.w3.org/TR/webarch
• Cool URIs for the Semantic Web, http://www.w3.org/TR/cooluris
• DCMI Metadata Terms, http://dublincore.org/documents/dcmi-terms
• Expressing Dublin Core metadata using the Resource Description Framework (RDF), http://dublincore.org/documents/dc-rdf
• Hypertext Transfer Protocol: HTTP/1.1, http://tools.ietf.org/html/rfc2616
• The “info” URI Scheme: Resource Pages, http://www.loc.gov/standards/uri/info.html
• Linked Data: Guides and Tutorials, http://linkeddata.org/guides-and-tutorials
• OWL Web Ontology Language Guide, http://www.w3.org/TR/owl-guide
• Resource Description Framework: Concepts and Abstract Syntax, http://www.w3.org/TR/rdf-concepts
• RDF Semantics, http://www.w3.org/TR/rdf-mt
• RDF Vocabulary Description Language 1.0: RDF Schema, http://www.w3.org/TR/rdf-schema
• The Shape of the Scientific Article in The Developing Cyberinfrastructure, C. Lynch, CTWatch Quarterly, August 2007, http://www.ctwatch.org/quarterly/print.php%3Fp=79.html
• Uniform Resource Identifier (URI): Generic Syntax, http://www.ietf.org/rfc/rfc3986.txt
• HTML+RDFa: A Mechanism for Embedding RDF in HTML, http://www.w3.org/TR/rdfa-in-html/

OAI-ORE Community

ORE Executive Committee
Carl Lagoze, Cornell University
Herbert Van de Sompel, Los Alamos National Laboratory Research Library

ORE Advisory Committee
Sayeed Choudhury, Johns Hopkins University
Gregory Crane, Tufts University
Lorcan Dempsey, OCLC
Mark Doyle, The American Physical Society
John Erickson, Hewlett-Packard Laboratories
Steve Griffin, National Science Foundation
Robert Hanisch, Space Telescope Science Institute
Jane Hunter, The University of Queensland
Clifford Lynch, Coalition for Networked Information
Liz Lyon, UKOLN
Peter Murray Rust, University of Cambridge
Jim Ostell, National Center for Biotechnology Information
Sandy Payette, Cornell University
Robby Robson, Eduworks
MacKenzie Smith, MIT Libraries
Leo Waaijers, SURF Platform ICT and Research

ORE Technical Committee
Chris Bizer, Free University of Berlin
Les Carr, University of Southampton
Tim DiLauro, Johns Hopkins University
Leigh Dodds, Ingenta
David Fulker, UCAR
Tony Hammond, Nature Publishing Group
Pete Johnston, Eduserv Foundation
Richard Jones, Imperial College
Peter Murray, OhioLINK
Michael Nelson, Old Dominion University
Ray Plante, NCSA and National Virtual Observatory
Rob Sanderson, University of Liverpool
Simeon Warner, Cornell University
Jeff Young, OCLC

ORE Liaison Group
Leonardo Candela, ISTI-CNRI and DRIVER
Tim Cole, DLF Aquifer and UIUC Library
Julie Allinson, University of York and JISC
Jane Hunter, DEST
Savas Parastatidis, Microsoft
Sandy Payette, Fedora Commons
Thomas Place, DARE and University of Tilburg
Andy Powell, DCMI
Robert Tansley, Google, Inc. and DSpace

OAI-ORE Community on Google Groups http://groups.google.com/group/oai-ore

Appendix Resource Map for a Batch Aggregation Serialized as RDF/XML

2010-02-14T15:29:04-04:00 2010-02-14T15:29:04-04:00
2009-12-03T01:58:18-04:00 batch_mohi_carver_ver01
1906/1932
2010-02-14T15:30:09-04:00
2010-02-14T15:30:09-04:00


Library Technology Reports
Respond to Your Library’s Digital Dilemmas

Eight times per year, Library Technology Reports (LTR) provides library professionals with insightful elucidation, covering the technology and technological issues the library world grapples with on a daily basis in the information age.

Library Technology Reports 2010, Vol. 46

46:1 (January): “Understanding the Semantic Web: Bibliographic Data and Metadata” by Karen Coyle, Digital Library Consultant
46:2 (February/March): “RDA Vocabularies for a 21st-Century Data Environment” by Karen Coyle, Digital Library Consultant
46:3 (April): “Gadgets & Gizmos: Personal Electronics at Your Library” by Jason Griffey, Head of Library Information Technology, University of Tennessee at Chattanooga
46:4 (May/June): “Object Reuse and Exchange (OAI-ORE)” by Michael Witt, Interdisciplinary Research Librarian & Assistant Professor of Library Science, Purdue University Libraries
46:5 (July): “Web-Based Voice and Video: Investigating Library Applications and Challenges” by Char Booth, E-Learning Librarian, University of California, Berkeley
46:6 (August/September): “Understanding Electronic Resources Usage: A Review of the State of the Art” by Jill E. Grogg, E-Resources Librarian, University of Alabama Libraries, and Rachel Fleming-May, Assistant Professor, School of Information Sciences at the University of Tennessee
46:7 (October): “Open URL” by Cindi Trainor, Coordinator for Library Technology & Data Services at Eastern Kentucky University, and Jason Price, E-resource Package Analyst, Statewide California Electronic Library Consortium
46:8 (November/December): “Privacy and Freedom of Information in 21st Century Libraries” by the ALA Office of Information Freedom, Chicago, IL

www.alatechsource.org ALA TechSource, a unit of the publishing department of the American Library Association
