Subject Access: Preparing for the Future 9783110234442, 9783110234435

This volume contains the proceedings of a special conference held in Florence, August 2009. The theoretical and methodol


English Pages 188 [196] Year 2011


Table of contents :
Contents
Introduction
Session 1 Systems, Tools and Standards in Subject Indexing
Focusing on User Needs: New Ways of Subject Access in Czechia
Subject Analysis and Indexing: An “Italian Version” of the Analytico-Synthetic Model
Subject Search in Italian OPACs: An Opportunity in Waiting?
Semiautomatic Merging of Two Universal Thesauri: The Case of Estonia
Session 2 Retrieval in Multilingual, Multicultural Environments
20 Years SWD – German Subject Authority Data Prepared for the Future
Mixed Translations of the DDC: Design, Usability, and Implications for Knowledge Organization in Multilingual Environments
Animals Belonging to the Emperor: Enabling Viewpoint Warrant in Classification
Dewey in Sweden: Leaving SAB after 87 Years
Enhancing Information Services Using Machine-to-Machine Terminology Services
Session 3 Web Indexing and Social Indexing
Social Bookmarking and Subject Indexing
Social Indexing at the Stockholm Public Library
The Nuovo Soggettario Thesaurus: Structural Features and Web Application Projects
Język Haseł Przedmiotowych Biblioteki Narodowej (National Library of Poland Subject Headings) – From Card Catalogs to Digital Library: Some Questions About the Future of a Local Subject Headings System in the Changing World of Information Retrieval
FAST Headings as Tags for WorldCat
Contributors

International Federation of Library Associations and Institutions Fédération Internationale des Associations de Bibliothécaires et des Bibliothèques Internationaler Verband der bibliothekarischen Vereine und Institutionen Международная Федерация Библиотечных Ассоциаций и Учреждений Federación Internacional de Asociaciones de Bibliotecarios y Bibliotecas

国际图书馆协会与机构联合会

الاتحاد الدولي لجمعيات ومؤسسات المكتبات

About IFLA

www.ifla.org

IFLA (The International Federation of Library Associations and Institutions) is the leading international body representing the interests of library and information services and their users. It is the global voice of the library and information profession.

IFLA provides information specialists throughout the world with a forum for exchanging ideas and promoting international cooperation, research, and development in all fields of library activity and information service. IFLA is one of the means through which libraries, information centres, and information professionals worldwide can formulate their goals, exert their influence as a group, protect their interests, and find solutions to global problems.

IFLA’s aims, objectives, and professional programme can only be fulfilled with the cooperation and active involvement of its members and affiliates. Currently, approximately 1,600 associations, institutions and individuals, from widely divergent cultural backgrounds, are working together to further the goals of the Federation and to promote librarianship on a global level. Through its formal membership, IFLA directly or indirectly represents some 500,000 library and information professionals worldwide.

IFLA pursues its aims through a variety of channels, including the publication of a major journal, as well as guidelines, reports and monographs on a wide range of topics. IFLA organizes workshops and seminars around the world to enhance professional practice and increase awareness of the growing importance of libraries in the digital age. All this is done in collaboration with a number of other non-governmental organizations, funding bodies and international agencies such as UNESCO and WIPO. IFLANET, the Federation’s website, is a prime source of information about IFLA, its policies and activities: www.ifla.org

Library and information professionals gather annually at the IFLA World Library and Information Congress, held in August each year in cities around the world.
IFLA was founded in Edinburgh, Scotland, in 1927 at an international conference of national library directors. IFLA was registered in the Netherlands in 1971. The Koninklijke Bibliotheek (Royal Library), the national library of the Netherlands, in The Hague, generously provides the facilities for our headquarters. Regional offices are located in Rio de Janeiro, Brazil; Pretoria, South Africa; and Singapore.

IFLA Series on Bibliographic Control Vol 42

Subject Access: Preparing for the Future

Edited by Patrice Landry, Leda Bultrini, Edward T. O’Neill and Sandra K. Roe

De Gruyter Saur

IFLA Series on Bibliographic Control edited by Sjoerd Koopman

ISBN 978-3-11-023443-5 e-ISBN 978-3-11-023444-2 ISSN 1868-8438 Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at http://dnb.d-nb.de. Walter de Gruyter GmbH & Co. KG, Berlin/ Boston © 2011 by International Federation of Library Associations and Institutions, The Hague, The Netherlands ∞ Printed on permanent paper The paper used in this publication meets the minimum requirements of American National Standard – Permanence of Paper for Publications and Documents in Libraries and Archives ANSI/NISO Z39.48-1992 (R1997) Typesetting: Michael Peschke, Berlin Printing and binding: Strauss GmbH, Mörlenbach Printed in Germany www.degruyter.com

Contents

Patrice Landry
Introduction

Session 1: Systems, Tools and Standards in Subject Indexing

Marie Balíková
Focusing on User Needs: New Ways of Subject Access in Czechia

Giuseppe Buizza
Subject Analysis and Indexing: An “Italian Version” of the Analytico-Synthetic Model

Emanuela Casson, Andrea Fabbrizzi, Aida Slavic
Subject Search in Italian OPACs: An Opportunity in Waiting?

Sirje Nilbe
Semiautomatic Merging of Two Universal Thesauri: The Case of Estonia

Session 2: Retrieval in Multilingual, Multicultural Environments

Yvonne Jahns
20 Years SWD – German Subject Authority Data Prepared for the Future

Joan Mitchell, Ingebjørg Rype, Magdalena Svanberg
Mixed Translations of the DDC: Design, Usability, and Implications for Knowledge Organization in Multilingual Environments

Claudio Gnoli
Animals Belonging to the Emperor: Enabling Viewpoint Warrant in Classification

Magdalena Svanberg
Dewey in Sweden: Leaving SAB after 87 Years

Gordon Dunsire
Enhancing Information Services Using Machine-to-Machine Terminology Services

Session 3: Web Indexing and Social Indexing

Lois Mai Chan
Social Bookmarking and Subject Indexing

Harriet Aagaard
Social Indexing at the Stockholm Public Library

Luciana Franci, Anna Lucarelli, Marta Motta, Massimo Rolle
The Nuovo Soggettario Thesaurus: Structural Features and Web Application Projects

Wanda Klenczon
Język Haseł Przedmiotowych Biblioteki Narodowej (National Library of Poland Subject Headings) – From Card Catalogs to Digital Library: Some Questions About the Future of a Local Subject Headings System in the Changing World of Information Retrieval

Diane Vizine-Goetz
FAST Headings as Tags for WorldCat

Contributors

Introduction

On August 20-21, 2009 in Florence, the IFLA Classification and Indexing Section sponsored an IFLA satellite conference entitled “Looking at the Past and Preparing for the Future”. With the assistance of the Biblioteca Nazionale Centrale di Firenze and AIB Gris, and supported by the Fondazione Rinascimento Digitale, who set up the secretariat, the Conference was a great success in the quality of its content and organization and in the number of participants. This conference marked the first time since 2001 that the Classification and Indexing Section had organized an IFLA satellite conference. It offered a great opportunity to look back on what had happened since 2001 and beyond, and also to take stock of ongoing projects and current initiatives in subject indexing.

In preparing the Florence Conference, the organizers were keenly aware that subject librarians have been responding actively over the last decade to the ever-changing world of information by upgrading existing indexing tools as well as developing new ones and adapting them to new information discovery tools. As I.C. McIlwaine remarked in the introduction to the 2001 Satellite Conference proceedings, “the principal difference that faces us today is the speed with which information is amassed and the quantities of it, together with constant economic pressures”1. She could also have added that in this new information environment, users have new expectations and needs. In the new bibliographic environment, which includes ever more digital materials (digital surrogates as well as “born digital” materials), libraries must offer access to a greater variety of documents and formats. Information seekers and researchers need to locate digital material not only through libraries’ online catalogues but also in web repositories and web services.
The challenge facing the library community has been to devise ways to optimize existing indexing tools, such as subject headings lists and classification schemes, in order to create enhanced and expanded access to resources wherever they may be held. The fundamental challenge is to create quality metadata through standard controlled vocabularies.

The need to look at the state of subject access tools was a timely issue in 2009, as many of the tools used today were created or went through major development in the latter part of the 20th century. In the 1980s, many national libraries, notably in France, Germany, Spain and Portugal, set up new subject headings lists, while others adapted their lists to the new automated environment, for example the Library of Congress’s automation of the LCSH file. In the 1990s and early 2000s, this trend continued, predominantly in Eastern Europe and in Latin America. This “golden age” of the development of subject headings lists was supported by the work of the IFLA Classification and Indexing Section, notably with the publication of the Principles Underlying Subject Heading Languages (SHLs) (1999). The principles were developed from the practices of existing lists and supported by standard elements of authority control.

1 Subject Retrieval in a Networked Environment (2003), p. vii

During the same period, major international classification schemes also underwent renewal. Both the Dewey Decimal Classification (DDC) and the Universal Decimal Classification (UDC) invested greatly in the maintenance and development of their content and made up-to-date modifications available to their respective user groups through new online tools. During that time, both broadened their international perspective to accommodate their diverse groups of users and to respond to new bibliographic information needs. Other classification schemes, such as the Library of Congress Classification and the German RVK (Regensburger Verbundklassifikation), have also made significant changes to adapt to new search environments.

The Conference focused on subject access tools through three threads: how subject tools are developed, how they ensure efficient retrieval and how they are used in social and web indexing. In the first session, “Systems, Tools and Standards in Subject Indexing”, the authors presented recent developments of subject tools aimed at creating a new subject heading language (Italy), merging two tools (Estonia) and adapting an existing one to new user needs (Czech Republic). In the four papers presented in this session, there is an underlying focus on efficient access and users’ needs.

The second session, “Retrieval in Multilingual, Multicultural Environments”, brought together papers on the extension of subject indexing tools, both subject headings lists and classification schemes, which proposed solutions to multilingual subject access in a multicultural environment. The five papers selected were good examples of initiatives and projects aiming at expanded search and retrieval tools in the web environment and constituted a fitting follow-up to the 2001 Satellite Conference, which had focused on this theme.
The third session, “Web Indexing and Social Indexing”, looked at how subject indexing is adapting to new user expectations and needs in the Web 2.0 environment. The five papers addressed many sides of this issue: adapting existing traditional tools to allow social tagging, the development of new community tools such as folksonomies, and the use of subject headings as tags.

During the two-day conference, there was an excellent interchange of ideas and discussions between the participants and the authors. The papers presented here are a great testimony to this sharing of ideas, views and perspectives in this very rich area of subject indexing in the 21st century. It is my hope that this Satellite Conference will soon be followed by another one to look more closely at new perspectives in this area.

In closing, I would like to express two series of thanks. Firstly, to the people involved in making the satellite conference such a success, and in particular to Leda Bultrini, the co-chair of the Florence satellite conference. My deepest gratitude also goes to the three session chairs, Anne-Céline Lambotte, Leda Bultrini and Ed O’Neill, for their work with the authors and for managing their sessions so smoothly during the conference. Thanks also to Marco Rufino and Sara Piccolo of Fondazione Rinascimento Digitale for the logistic support, and to Antonia Fontana and Federica Paradisi from Biblioteca Nazionale Centrale di Firenze and Massimo Rolle from AIB Gris for the organizational support. It was truly a pleasure to have worked with all of them. And finally, special thanks to the speakers for agreeing to be part of the conference and to the participants for their support and fruitful discussions.


Secondly, I would like to express my thanks to my three co-editors for their dedication and work in making this publication happen. Special thanks to Sandra K. Roe, who agreed to join our team as a replacement for Anne-Céline Lambotte, who was unable to participate in this publication. And finally, thanks to Sjoerd Koopman from IFLA and Claudia Heyer from De Gruyter Saur for their support, advice and guidance.

Patrice Landry
March 2011

References

Subject Retrieval in a Networked Environment: Proceedings of the IFLA Satellite Meeting held in Dublin, OH, 14-16 August 2001 and sponsored by the IFLA Classification and Indexing Section, the IFLA Information Technology Section and OCLC. 2003. Edited by I.C. McIlwaine. München: K.G. Saur, 193 p.

Working Group on Principles Underlying Subject Heading Languages. 1999. Principles Underlying Subject Heading Languages (SHLs). Edited by Maria Inês Lopes, Julianne Beall. Approved by the Standing Committee of the IFLA Section on Classification and Indexing. München: K.G. Saur, 183 p.

Session 1 Systems, Tools and Standards in Subject Indexing

Focusing on User Needs: New Ways of Subject Access in Czechia

Marie Balíková

Abstract

The article gives a short overview of subject access in Czechia over the last two decades. The impact of new information technologies is highlighted, and practical examples of the "new way" solutions adopted by the National Library of the Czech Republic are mentioned. The role of controlled subject vocabularies and thesauri in the retrieval process is stressed.

1. Introduction

Within the last two decades, the development and application of sophisticated information technology has brought about rapid changes in libraries, above all in the area of subject analysis. Internet technology has facilitated global access to heterogeneous information resources: databases, journals, e-books and digital online resources, which now include photographs, slides and video clips as well as texts. There has also been a development of sophisticated online public access catalogues, information gateways, portals and systems based on semantic technologies (e.g. question-answering systems) and, most recently, social networking and wide-scale use of Web 2.0 technologies in libraries.

2. Focusing on Users' Needs

The explosion of internet access and digital online resources, together with the increased availability of information services and advanced search engines, has produced a new type of library user: one who has high expectations, demands remote access, expects more precise information and is very impatient. In addition, within Web 2.0, users are both content consumers and content providers. Users' requirements have changed significantly, especially as a result of the impact of the Web; consequently, the services providing subject access need to adapt to these changes. Information centres must provide subject access to relevant, current and valid information in a user-friendly and highly valuable manner, meeting the search criteria of users of all types.


3. The Role of Controlled Retrieval Systems

For decades, ensuring effective and efficient subject access to information has been one of the primary goals of libraries and other memory institutions, which have continuously developed subject access tools and means enabling users to find and discover information that meets their search criteria.

In the era of book-type and card catalogues, this tool was a system of subject heading strings created and maintained according to commonly accepted rules. The user could locate a "subject card" with the terms he or she was looking for only by browsing. The main problem of the book-type catalogue was the insertion of entries for new acquisitions. In card catalogues, each entry has its own card and each card contains only one entry, so any new entry can be filed between any two existing entries and the catalogue can be kept completely up to date. The main drawbacks of card services, namely delays in production and the labour of filing the cards when they arrive, were overcome by the computerized catalogue, which offers a more practical approach because of its storage capacity and operating speed.

Conclusion: Card catalogues should be replaced with computerized systems, ultimately Online Public Access Catalogues (OPACs).

National Library "new way" solution: to discontinue card catalogues in 1996.

Figure 1. Example of a handwritten catalogue card.

Computerized and online catalogues (OPACs) are more user-friendly; they offer two ways of searching: browsing subject headings, or searching by keywords, either controlled terms or title keywords. A comparison was made by professionals (subject librarians) and non-professionals (users) between the subject controlled terms and the title keywords of 550 records. It was found that 55% of the records were considerably enhanced by controlled terms. It should be noted that the title of a publication does not always offer sufficient information for title keyword searching, and may even contain misleading information. A second and even more important aspect of the controlled subject terms applied in OPACs is that they offer synonym and homograph control, which is a key factor for successful searching in OPACs.

Conclusion: Title keyword searching does not produce relevant results; a controlled vocabulary should be used.

National Library "new way" solution: to create a subject authority file enabling users to perform search queries by both controlled and uncontrolled keywords.


Figure 2. Example of Subject Authority index containing UDC class marks with verbal expressions in Czech and English languages.
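The synonym and homograph control mentioned above can be illustrated with a minimal sketch. It is not the National Library's implementation: the tiny authority file, its headings and the function names `build_index` and `resolve` are all invented for illustration. The idea shown is only that variant terms lead to one preferred heading (synonym control), while an ambiguous term leads to several qualified headings the user can choose from (homograph control).

```python
from collections import defaultdict

# Hypothetical miniature subject authority file: each preferred heading
# lists its variant (non-preferred) terms. All entries are invented examples.
AUTHORITY = {
    "Libraries": ["library", "libraries"],
    "Mercury (planet)": ["mercury"],
    "Mercury (element)": ["mercury", "quicksilver"],
}


def build_index(authority):
    """Map each lower-cased preferred or variant term to its headings."""
    index = defaultdict(set)
    for heading, variants in authority.items():
        index[heading.lower()].add(heading)
        for variant in variants:
            index[variant.lower()].add(heading)
    return index


def resolve(term, index):
    """Return the sorted preferred headings for a user's term (empty if unknown)."""
    return sorted(index.get(term.strip().lower(), ()))


INDEX = build_index(AUTHORITY)
print(resolve("quicksilver", INDEX))  # synonym control: one preferred heading
print(resolve("Mercury", INDEX))      # homograph control: two qualified headings
```

A term that is not in the index returns an empty list, which a catalogue could treat as an uncontrolled keyword and pass to ordinary title keyword searching.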

Due to the enormous amount of information available on the web, which should be accessible to users in a user-friendly manner, information-seeking behaviour is now changing rapidly. Many libraries intend to develop "new generation" or "next generation" catalogues that would support simpler access to information, as Google and other Web services do. "Next generation" catalogues provide an easier and friendlier search option by making full use of subject categorization and classification in faceted browsing and clustering. Faceted browsing based on controlled subject terms delivers more precise results while offering a range of refining options.

Conclusion: A "next generation" library catalogue should be implemented.

National Library "new way" solution: to implement a "next generation" catalogue with more sophisticated faceting and clustering options.

Figure 3. Example of facets in Uniform Information Gateway.
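As a rough illustration of how faceted browsing over controlled terms works, the following sketch counts the values of one facet across a result set and then refines the set by selected facet values. The records, field names and function names are assumptions made for this example and do not describe the Uniform Information Gateway.

```python
from collections import Counter

# Invented sample records; the field names are assumptions for illustration.
RECORDS = [
    {"title": "A", "topic": "Libraries", "place": "Czechia", "period": "20th century"},
    {"title": "B", "topic": "Libraries", "place": "Italy", "period": "21st century"},
    {"title": "C", "topic": "Indexing", "place": "Czechia", "period": "21st century"},
]


def facet_counts(records, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(record[facet] for record in records if facet in record)


def refine(records, **selected):
    """Keep only records matching every selected facet value."""
    return [
        record
        for record in records
        if all(record.get(facet) == value for facet, value in selected.items())
    ]


print(facet_counts(RECORDS, "place"))
print([r["title"] for r in refine(RECORDS, place="Czechia", period="21st century")])
```

Because the facet values come from a controlled vocabulary, each value groups all records on that subject under one label, which is what makes the counts and the refinement meaningful.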

There is another way to categorize online information resources and digital objects and to involve users in the process of accessing information: user-added tagging. "A tag is a non-hierarchical keyword or term assigned to a piece of information (such as an internet bookmark, digital image, or computer file). This kind of metadata helps to describe an item and allows it to be found again by browsing or searching. Tags are chosen informally and personally by the item's creator or by its viewer, depending on the system. Tagging was popularized by websites associated with Web 2.0 and is an important feature of many Web 2.0 services". (Wikipedia: The Free Encyclopedia. 2011)

How can tags be used in the subject analysis process? They can be used as synonyms for controlled subject terms; they can enrich the vocabulary of controlled systems by making it more up to date and more user-friendly. Moreover, the idea of using user-added tags (tag clouds) and the terminology of large library (MARC) databases to generate word clusters associated with controlled vocabulary terms and classifications seems very promising.

Conclusion: To allow users to add reasonable public tags (not so-called private tags dependent on individual context, e.g. "read", "unread", "tbr") to metadata records created by professionals.

National Library "new way" solution: to enable users to add public tags to library news since 2010, inspired by the practice of the Ann Arbor District Library.

Figure 4. Example of tagging in Ann Arbor District Library.
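The distinction between public and private tags, and the use of tags as synonyms for controlled terms, can be sketched as follows. This is a hypothetical illustration only: the private tags are the three examples named in the text, while the tag-to-heading mapping and the function names are invented.

```python
# Private tags named in the text as examples of context-dependent tags.
PRIVATE_TAGS = {"read", "unread", "tbr"}

# Hypothetical mapping from user tags to controlled headings (invented).
TAG_TO_HEADING = {"sci-fi": "Science fiction", "scifi": "Science fiction"}


def public_tags(tags):
    """Normalize tags, drop private ones and duplicates, keep original order."""
    seen, result = set(), []
    for tag in tags:
        norm = tag.strip().lower()
        if norm and norm not in PRIVATE_TAGS and norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result


def suggest_headings(tags):
    """Suggest controlled headings for which a public tag acts as a synonym."""
    return sorted({TAG_TO_HEADING[t] for t in public_tags(tags) if t in TAG_TO_HEADING})


user_tags = ["Sci-Fi", "tbr", "scifi", "READ", "space"]
print(public_tags(user_tags))
print(suggest_headings(user_tags))
```

Tags with no mapping, such as "space" here, remain available as uncontrolled access points and could be reviewed as candidate terms for the controlled vocabulary.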

4. Other Searching Techniques

Often, however, the user wants not whole documents but brief answers to specific questions: How old is the President? When did Jan Hus die? Answering short questions thus becomes a problem of finding the best combination of word-level information retrieval (IR) and syntactic/semantic-level natural language processing (NLP) techniques. Automated question answering, the ability of a machine to answer questions, simple or complex, posed in ordinary human language, is one of today’s most exciting technological developments. Such systems are based on semantic technologies and are intended, in the future, to displace existing search methods and establish new standards for user-centered access to information.


Conclusion: To support the development of automated question answering systems.

National Library "new way" solution: to participate in a project dealing with user-centered access to information: the M-CAST project (2004-2006).

M-CAST is a multilingual query-answering system that allows the user to ask a question in one language and find an exact answer in digitized resources in different languages. It can serve as a monolingual query-answering system as well. The M-CAST QA system could be applied in both digital and hybrid libraries, because it enables the user to pose queries using either a set of search terms or natural language questions.

Figure 5. Example of M-CAST search.

5. Subject Authority Control

Effective subject access in library catalogues cannot exist without standardized access points, i.e. without bibliographic and authority control. Controlled subject access to information objects in the library environment deals with order, logic, objectivity, precise denotation and consistency. A subject term should always have the same denotation, whether expressed by a natural language term or an artificial language term, e.g. a classification notation. Authority control is achieved through authority records linked to all bibliographic records to which they pertain. (Gorman 2004)

The main goals of the subject authority control procedure are:

• to standardize the terminology used in metadata records
• to guide a user from general, broad terminology to more specific, narrower headings, because subject headings are part of an overall scheme and need to fit within a general hierarchy. A fully employed hierarchy must be available to the user, to assist in the retrieval of pertinent materials.
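Guiding a user from broad to narrower headings amounts to walking a hierarchy of broader/narrower links. The sketch below assumes a simple dictionary of such links; the headings and the function name `narrower_terms` are invented examples, not entries from any actual authority file.

```python
# Hypothetical broader -> narrower links; the headings are invented examples.
NARROWER = {
    "Science": ["Biology", "Physics"],
    "Biology": ["Botany", "Zoology"],
}


def narrower_terms(heading, narrower=NARROWER):
    """Return every heading below `heading`, depth-first, in listed order."""
    result = []
    for child in narrower.get(heading, []):
        result.append(child)
        result.extend(narrower_terms(child, narrower))
    return result


print(narrower_terms("Science"))
```

A catalogue interface could display this list under the broad heading so that the user can move step by step to the more specific heading that matches the need.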

6. Who Benefits from Standardization of Subject Access Points

"Possible subject authority record data user groups include a) information professionals who create metadata, b) reference and public services librarians and other information professionals who are searching for information as intermediaries, c) controlled vocabulary creators, such as cataloguers, thesaurus and ontology creators, and d) end-users using information retrieval systems to fulfil their information needs." (Zumer, Maja, Athena Salaba and Marcia Lei Zeng 2007)

Subject authority data user tasks were formulated as follows:

• Find: to find a subject entity or set of entities corresponding to stated criteria
• Identify: to identify a subject entity based on certain attributes/characteristics
• Select: to select a subject entity
• Obtain: to obtain additional information about the subject entity and/or to obtain bibliographic records or resources about this subject entity
• Explore: to explore relationships between subject entities, correlations to other subject vocabularies and the structure of a subject domain

7. Subject Access in Czechia

Subject access at the National Library of the Czech Republic (hereafter NLCR) has a long tradition. The use of subject headings goes back to the first half of the 19th century. Systematic and truly comprehensive subject access, represented by subject heading strings, was launched at the beginning of the 1950s. From the 1960s to the 1980s, complex UDC classification notations were implemented along with subject heading strings. In the 1990s, subject heading strings, free keywords and top-level UDC numbers were used. Since 2000, a new controlled vocabulary, the subject authority file entitled CZENAS, has been developed and applied to material accessed since 1995. A new categorization scheme used in the Conspectus method has also been applied.

In principle, all types of collections are indexed and classified, except for: special parts of historical collections (manuscripts and incunabula), which are not indexed or classified at all; poetry and fiction, which are provided with genre/form headings and classified only; and printed music, which is indexed only. A uniform indexing and classification system is applied. The same principle is applied to the national bibliography, whose records are fully compatible with those of library catalogues.


8. Subject Access in the Pre-Automated Era

Research conducted in the early 1980s showed that subject access was still one of the most dominant approaches in Czech library catalogues. The study indicated that 45% of searches conducted in library catalogues were subject searches. In the era of card catalogues, complex subject heading strings were applied.

9. Indexing Languages and Classification Schemes Used in Czechia

A survey of indexing languages and classification schemes used in Czechia, conducted in 1997 by the National Library of the Czech Republic, showed that:

• 24% of libraries use subject heading strings
  – 7% of libraries use subject heading strings only
  – 17% of libraries use subject heading strings as variant access to their holdings
• 14% of libraries use thesauri terms
• 79% of libraries use free keywords (uncontrolled or semi-controlled terms)
• 64% of libraries use the UDC classification system

Card catalogues were still in use in libraries in Czechia.

10. Subject Access since the 1990s

Since the 1990s, the traditional library practice of assigning one or two subject headings, one or two descriptors and one notation number per book has become inadequate for users’ requirements. In the previous period, users were obliged to search for general subjects, while more specific information was unavailable to them. More detailed subject access to documents has become a vital need in the online environment (OPACs, internet).

Integrating the controlled vocabulary and thesaurus apparatus more efficiently into the indexing and retrieval process requires solving such tasks as the harmonization of several information languages, the creation of integrated tools (a subject authority file), etc. From the end user’s point of view, the functionality of online library catalogues, bibliographic databases, union catalogues and information gateways depends above all on well-established subject access to library collections or to the documents themselves. It is well known that merging metadata and bibliographic records from many external databases into one database gives rise to discrepancies between the index terms (lexical units), application syntax and hierarchical structure of the original indexing systems. Therefore, special attention to standardization in the field of subject access among libraries and information centres at the national level is needed. In the 1990s there was a lack of guidelines for subject indexing and classification in Czechia, and a subject authority file for general use did not exist.


11. LCSH in Czechia

"Library of Congress Subject Headings (LCSH), a system originally designed as a tool for subject access to the Library's own collection in the late nineteenth century, has become, in the course of the last century, the main subject retrieval tool in library catalogues throughout the United States and in many other countries. It is one of the largest non-specialized controlled vocabularies in the world. As LCSH enters a new century, it faces an information environment that has undergone vast changes from what had prevailed when LCSH began, or, indeed, from its state in the early days of the online age. In order to continue its mission and to be useful in spheres outside library catalogues as well, LCSH must adapt to the multifarious environment. One possible approach is to adopt a series of scalable and flexible syntax and application rules to meet the needs of different user communities." (Chan and Hodges 2000)

Since subject access depends on national languages (subject headings, preferred terms and descriptors are expressed in national languages), it was difficult to find and apply any "international" recipe. After much debate, the LCSH system was finally chosen. It was, however, also necessary to meet local needs and requirements (note that most libraries in Czechia used descriptors or uncontrolled or semi-controlled terms, that is, isolated lexical units, not complex subject heading strings), so some modifications of the LCSH scheme were formulated. In order to improve interoperability between the National Library Subject Headings (NLSH), based on LCSH, and other retrieval systems used in Czechia, and to facilitate parallel and federated searching, faceting and clustering in subject information gateways, portals and next-generation catalogues, it was decided:

• to prefer post-coordination. We were aware that pre-coordination in LCSH was needed for "disambiguation, suggestibility and precision" and helped to give context to terms. We also took into account that a post-coordinated language can provide the same precision as a pre-coordinated one when the equivalent notation of a classification scheme is included in the retrieval process, since the notation of a classification scheme gives context to the verbal search term;
• to apply pre-coordination in so-called minimal subject heading strings to express facets such as place, time and topic under topical or geographical headings, if needed;
• to simplify application syntax, that is, to use minimal subject heading strings:
   - to define a standard order of facets/subdivisions:
        Topic – Place – Time
        Topic – Topic
        Place – Topic – Time
   - to reduce the number of facets/subdivisions entered in a single subject heading; specificity may be achieved by assigning additional headings that bring out specific aspects of a topic.
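The standard orders of facets listed above can be illustrated with a minimal sketch. The helper name `build_heading`, the facet labels, the ` - ` separator and the rule that subdivisions may be omitted but not reordered are all assumptions made for illustration; the sample heading is hypothetical and is not taken from the NLSH.

```python
# Hypothetical sketch: validate and assemble a minimal subject heading string.
# Facet labels, the separator and the "omit but do not reorder" rule are assumptions.
ALLOWED_ORDERS = [
    ("topic", "place", "time"),
    ("topic", "topic"),
    ("place", "topic", "time"),
]


def _follows(kinds, order):
    """True if kinds starts like order and keeps its sequence (facets may be omitted)."""
    if kinds[0] != order[0]:
        return False
    remaining = iter(order)
    # "kind in remaining" consumes the iterator, so the relative order is enforced
    return all(kind in remaining for kind in kinds)


def build_heading(facets, separator=" - "):
    """facets: list of (facet_type, value) pairs, e.g. [("topic", "Libraries")]."""
    if not facets:
        raise ValueError("a heading needs at least a main heading")
    kinds = tuple(kind for kind, _ in facets)
    if not any(_follows(kinds, order) for order in ALLOWED_ORDERS):
        raise ValueError(f"facet order {kinds} is not one of the standard orders")
    return separator.join(value for _, value in facets)


print(build_heading([("topic", "Libraries"), ("place", "Czechia"), ("time", "1990-2000")]))
```

A string that needs more facets than one of these orders allows would, under this reading, be split into additional headings rather than lengthened.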

Focusing on User Needs: New Ways of Subject Access in Czechia


12. Main Principles of Applying Subject Heading Strings

Name entities (personal, corporate body and meeting names, titles and author/title combinations) used as subjects are not accompanied by any facet/subdivision. Any additional thematic, geographic, or chronological information has been taken out of name subject heading strings and placed in “class of persons and types of corporate bodies” subject heading strings or in additional main entries:

• to support easier automatic subject authority control and maintenance of name entities
• to add more access points to bibliographic records
• to allow links by name entities (persons; corporate bodies) between subject headings and the corresponding UDC notations

Geographic names used as subjects and as geographical subdivisions are treated as follows:

• the direct form of geographical subdivision has been preferred in subject headings of the NLSH: i.e., geographic names immediately follow the main heading or the main heading/topical subdivision combination
• names of states or large regions are entered as geographical subdivisions; geographical names of smaller localities (cities, villages) are entered as additional main entries in field 651 (MARC 21 format)

Form subdivisions have been taken out of subject heading strings and assigned to a specific field (655, MARC 21 format). The current practice is shown in the next example (Fig. 6). Additional geographic, chronological and formal information is added in special "Class of persons" headings or in the MARC fields in which this kind of information is entered.

Table 1. Example of subject access points in BIB record.

Control no.   fd133970
Heading       české pohádky
UDC           821.162.3-34
English       Czech fairy tales

13. Czech National Subject Authority File - CZENAS

We started to work on the subject authority file in 2000. Our intention was to create a standardized system of controlled terms which could serve the needs of professionals (cataloguers, indexers) as well as non-professionals, e.g. web content creators. We tried to offer them an organizing tool not only to retrieve material, but also to tag it. Research studies conducted in 1998-1999 showed that non-professionals (students, technicians, scientists, and teachers) would like to use a standardized indexing and retrieval tool, but one that is simple in structure (like the Dublin Core format) and in syntax (a descriptor-type system), and that has up-to-date terminology.


With this in mind, we decided to study very carefully (almost) all studies concerning this topic, above all the recommendations of the ALCTS/SAC/Subcommittee (1999), which identified and specified the requirements for a subject control scheme used for accessing information resources in libraries and on the web. Such a scheme should:

• be simple and easy to apply and to comprehend,
• be intuitive so that sophisticated training in subject indexing and classification, while highly desirable, is not required in order to implement,
• be logical so that it requires the least effort to understand and implement,
• be scalable for implementation from the simplest to the most sophisticated

The functions of effective subject access tools were identified by Chan:

• to assist searchers in identifying the most efficient paths for subject retrieval
• to help users focus their searches
• to enable optimal recall
• to enable optimal precision
• to assist searchers in developing alternative search strategies
• to provide all of the above in the most efficient, effective, and economical manner

The subject authority file of the NL CR is an integrated indexing and retrieval tool in which verbal (controlled) terms are linked to their equivalent UDC notations. In creating the subject authority file we respect the IFLA recommendations:

• to formulate Guidelines for subject authority records and for their interrelationships within subject authority files
• to consider possible relationships between subject authority records and classification

Thesauri and similar vocabulary tools can complement full-text access in the online environment (OPACs, internet) by helping users to focus their searches:

• unitary terms (isolated lexical units) provide much greater flexibility in searching than subject heading strings, with their complexities of subdivisions and inversions
• most databases today are indexed with thesaurus-type terms rather than with complex subject headings
• the quality of an indexing and retrieval system depends on the terms used in the indexing process and on its capability to express semantic relationships among them

We found that an integrated retrieval tool of verbal expressions based on a classification scheme seems to be the best solution, as the classification scheme supports:

• the browsing and retrieval capability of the system;
• the creation of a hierarchical structure;
• easier identification of concepts;
• the display of subject relationships between terms;
• language switching in a multilingual environment

The verbal controlled scheme supports above all:

• synonym and homograph control;
• the use of current and expressive captions

14. Mapping between Indexing Terms, UDC Numbers and English Equivalents

The controlled vocabulary structure is tied to a classification scheme so that relationships between indexing terms can be expressed more definitely. The mapping between Czech verbal expressions and UDC numbers is done intellectually. Candidate controlled terms are chosen with the document in hand (from the bottom up) in order to suggest terms as specific as needed (not as specific as possible). Single or complex (pre-combined) UDC numbers are linked, and English equivalents of preferred terms, mostly LCSH terms, are chosen. Sometimes we do not succeed in finding LCSH equivalents because the LC terms are too broad; in this case, reference sources such as LC title and subtitle files, web pages, full-text databases, language vocabularies, encyclopaedias and various manuals are consulted. The proposed preferred terms, linked to the UDC class numbers and English equivalents, are sent to a special working group of senior cataloguers for approval; approved authority records are entered into the authority database via a special programme procedure.
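The three-way link between a preferred Czech term, its UDC number and its English equivalent can be pictured as a small record structure. This is only a sketch: the class name `AuthorityRecord`, its field names and the helper `index_by` are assumptions, and the single sample record reuses the values shown in Table 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorityRecord:
    """Hypothetical shape of one mapped authority record (field names are assumptions)."""
    control_no: str
    heading: str                   # preferred Czech term
    udc: str                       # single or pre-combined UDC number
    english: Optional[str] = None  # English equivalent; may be missing when none is found


def index_by(records, field):
    """Build a lookup table on one field, skipping records where that field is empty."""
    return {getattr(r, field): r for r in records if getattr(r, field) is not None}


records = [
    AuthorityRecord("fd133970", "české pohádky", "821.162.3-34", "Czech fairy tales"),
]

by_english = index_by(records, "english")
by_heading = index_by(records, "heading")
print(by_english["Czech fairy tales"].udc)
print(by_heading["české pohádky"].english)
```

Keeping the English equivalent optional mirrors the case described above in which no suitable LCSH term is found at first.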

15. Searching Techniques Applied in CZENAS

The subject authority file can be browsed by controlled terms, English equivalents, UDC notations, geographic names, formal descriptors, and chronological terms. At present, the authority file is searchable by subject terms only. (In future, we intend to make it searchable by UDC class numbers as well, using the first element of the notation and right truncation.)
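Right truncation on a UDC notation amounts to a prefix match. The following sketch assumes records held as simple dictionaries; the function name `search_by_udc` is hypothetical, and apart from the notation taken from Table 1 the sample records are illustrative rather than actual CZENAS data.

```python
def search_by_udc(records, query):
    """Right truncation: return every record whose UDC notation starts with the query."""
    return [r for r in records if r["udc"].startswith(query)]


# Illustrative sample only; just the first notation appears in the source text.
records = [
    {"heading": "české pohádky", "udc": "821.162.3-34"},
    {"heading": "česká literatura", "udc": "821.162.3"},
    {"heading": "anglická literatura", "udc": "821.111"},
]

for hit in search_by_udc(records, "821.162.3"):
    print(hit["heading"], hit["udc"])
```

A shorter query such as `821` widens the result set, which is the behaviour one would expect from truncating the notation further to the left.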

16. National Subject Authority File Consists of Four Specific Authority Files

• topical authority file
• geographic authority file
• genre/form authority file (formal descriptors)
• chronological authority file

17. Topical Authority File

The topical authority file contains authorized subject headings (descriptor-type) assigned to bibliographic records. The file includes topical headings as well as names of special entities representing the subject of an information resource. Topical terms representing concepts are expressed by nouns (in the singular or plural) or by phrases in natural order. The form of index terms is intended to help cataloguers and researchers to select the term(s) most appropriate for indexing and retrieval. The preferred and variant forms of terms are in alphabetical order, followed by scope notes, UDC class marks and English equivalents. Associations between terms are indicated by the convention of broader, narrower, related, and “used for” relationships. The topical controlled vocabulary is designed to provide terms for subject access to information resources. There are two kinds of authority records in the topical subject authority file: full and minimal level records. Full records contain all references as needed; minimal records (music headings) present only the main caption, the variant form as a SEE reference, the UDC notation, and the English equivalent.

Figure 6. Example of full and minimal level topical records.
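The broader, narrower, related and “used for” conventions described in this section can be sketched as a small in-memory structure. The class `Thesaurus` and its method names are assumptions, and the English sample terms are hypothetical; they are not taken from the topical authority file.

```python
from collections import defaultdict


class Thesaurus:
    """Hypothetical sketch of reciprocal thesaurus relationships."""

    def __init__(self):
        self.broader = defaultdict(set)   # term -> its broader terms
        self.narrower = defaultdict(set)  # term -> its narrower terms
        self.related = defaultdict(set)   # term -> its related terms
        self.use = {}                     # variant form -> preferred term (SEE reference)

    def add_hierarchy(self, narrower_term, broader_term):
        # Store both directions so the relationship stays reciprocal
        self.broader[narrower_term].add(broader_term)
        self.narrower[broader_term].add(narrower_term)

    def add_related(self, term_a, term_b):
        self.related[term_a].add(term_b)
        self.related[term_b].add(term_a)

    def add_variant(self, variant, preferred):
        self.use[variant] = preferred

    def used_for(self, preferred):
        """All variant forms that point to the given preferred term."""
        return sorted(v for v, p in self.use.items() if p == preferred)

    def resolve(self, term):
        """Follow a SEE reference; preferred terms resolve to themselves."""
        return self.use.get(term, term)


thesaurus = Thesaurus()
thesaurus.add_hierarchy("fairy tales", "folk literature")
thesaurus.add_related("fairy tales", "legends")
thesaurus.add_variant("fairytales", "fairy tales")
print(thesaurus.resolve("fairytales"), thesaurus.used_for("fairy tales"))
```

In these terms, a minimal level record would carry only the entries of `use` for its heading, while a full record would also populate the other three relationships.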

18. Geographic Authority File

The authority file of geographic names is a structured vocabulary including the names of geographical entities relevant to subject cataloguing in Czechia. The Czech authority file of geographic names is intended to be an official repository of the names of national geographic features. It contains expressions referring to any object or place which has a geographic name, i.e. a proper name consisting of one or more words, used to designate an individual geographic entity. It contains information about physical and cultural geographic features in Czechia and associated areas, both current and historical, including the names of natural features (nature parks, trails), populated places, civil divisions, areas and regions, as well as cultural features such as roads, streets, highways, bridges, etc. Geographic terms are linked to the geographic area code and to UDC area class marks.

Geolink project: The Geographic Coordinates in Authority Records of Geographic Names and Features

In 2008, we conducted a research study and found it appropriate to include coordinates in authority records of geographic names and features, because this data brings important information about the entity described in the authority record. We decided to enhance the authority records of geographic names with the specific field 034 (rather than recording the data in textual form in the note field 670) and to provide supplementary information about the entity on a map using the 856 field.

Figure 7. Example of the application of geographic coordinates in authority record for places.
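As a rough illustration of coded coordinates, the sketch below converts decimal degrees into the `hdddmmss` form that MARC 21 permits in field 034 and fills subfields $d, $e, $f and $g (westernmost and easternmost longitude, northernmost and southernmost latitude). Which coordinate format the Geolink project actually records is not stated in the text, so the choice of `hdddmmss`, the function names and the sample point are all assumptions.

```python
def to_hdddmmss(value, is_latitude):
    """Convert decimal degrees to hemisphere + degrees (3) + minutes (2) + seconds (2)."""
    if is_latitude:
        hemisphere = "N" if value >= 0 else "S"
    else:
        hemisphere = "E" if value >= 0 else "W"
    total_seconds = round(abs(value) * 3600)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hemisphere}{degrees:03d}{minutes:02d}{seconds:02d}"


def point_034(latitude, longitude):
    """For a single point the bounding values coincide: $d = $e and $f = $g."""
    lon = to_hdddmmss(longitude, is_latitude=False)
    lat = to_hdddmmss(latitude, is_latitude=True)
    return {"d": lon, "e": lon, "f": lat, "g": lat}


# Hypothetical point, not the coordinates of any place in the authority file
print(point_034(50.25, 14.5))
```

Recording the values in a coded field rather than a free-text note is what makes them usable for linking a record to a map display.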

19. Genre/Form authority file

The genre/form authority file is a structured vocabulary containing terms indicating the form, genre and/or physical characteristics of the materials being described. Genre/form terms indicate what a document is, according to its form of expression, rather than what the document is about. Genre terms, for instance, designate the style or technique with which the intellectual content of the document is presented, e.g. biographies, essays, hymns, reviews, historical novels. Form terms designating physical characteristics specify documents with respect to their function or presentation format. Terms are distinguished by an examination of a document's presentation format, by the order of information within the document, or by the subject of the intellectual content, e.g. dictionaries, diaries, directories, journals, questionnaires.

Table 2. Example of a genre/form authority record.

Control no.   fd133970
Heading       české pohádky
UDC           821.162.3-34
English       Czech fairy tales

20. Czech Subject Authorities in Conspectus Categorization Scheme

The Conspectus method is an international standard whose primary aim is the coordinated building of library holdings and access to them through content characteristics. The Conspectus method was developed for use in libraries organized according to either the Library of Congress (LC) classification or the Dewey Decimal Classification (DDC). It was necessary to adapt the Conspectus scheme to local practice by defining the concordances between DDC and UDC numbers and the corresponding subject terminology in English and Czech.

Figure 8. Concordance table UDC - DDC.

21. Where is the Conspectus Categorization Scheme Applied

The categorising scheme for the needs of Conspectus comprises three hierarchical levels, namely:

• 24 basic groups, the so-called Subject Categories
• 584 subordinate subcategories, the so-called Conspectus groups/categories
• thematic terms of the subject authority file (CZENAS)

The most important characteristic of the Conspectus scheme - access to information resources through content characteristics - was used when creating and developing the subject-oriented universal gateway (UIG – Uniform Information Gateway), subject information gateways (KIV /Library and Information Science/, MUS, TECH, ART) and Topic Map of the Czech National Library Collection.

22. Uniform Information Gateway (UIG)

The Uniform Information Gateway is a nationwide portal which unifies access to online library services in the Czech Republic. For international users, the catalogues of major Czech libraries are available for searching; for patrons of Czech libraries, licensed full-text/abstract databases and document delivery services are available in addition. The highest hierarchical level of the Conspectus scheme (24 subject categories along with further related categories) serves as a subject crossroads in the Uniform Information Gateway.


Figure 9. Conspectus scheme as a crossroads in Uniform Information and Subject Gateways.

23. Subject Information Gateways

Subject gateways are usually characterised in the professional literature as specific internet services offering qualitatively evaluated (intellectually selected and assessed) information resources and systematically supporting information retrieval and research in a given field or semantic domain. They focus on a certain group of users, offering adequate research approaches and methods. The information resources made available are organised into categories based on various criteria (typological, formal, thematic), with the thematic criteria usually being considered the most important. Since users are inundated with an abundance of miscellaneous documents, it is understandable that they choose more and more often to access information resources on the basis of their content characteristics, i.e. subject data. For this reason, it is necessary, when creating subject gateways, to pay ever greater attention to the tools used for subject access and to use a controlled subject vocabulary, which will increase the success of parallel searching as well as of browsing the heterogeneous information resources made accessible through the gateway. The important role of the subject authority file is evident.

24. Topic Map of Library Collection

The Conspectus categorization scheme, based on UDC class marks displayed with class descriptions, is used to create the NL CR collection topic map, which in turn is used for subject access to the NL CR collections. The topic map of library collections can serve as a user-friendly means of subject access for inexperienced library users and for those who prefer to get information on the location of documents directly, without using the library catalogue. The main objective is to provide high-quality services and improve information sharing. The topic map presents subject information in a systematic way and helps patrons to use the portal through a simple interface. To achieve these navigation and search functions, information resources are connected to the topics through the Conspectus category data entered in bibliographic records.

Figure 10. Topic map of library collection.

25. Czech National Library Catalogue Enhancements

Since October 2006, the National Library of the Czech Republic has offered its customers additional content information on individual publications via hyperlinks in the library catalogue. These enhancements include tables of contents and cover images. The tables of contents of all new Czech monographs (except fiction) are scanned. The user can browse the contents of a book online by clicking on the Content icon. These added features help the user to determine whether a book holds the information wanted: he or she can see whether it covers the topic of interest before requesting the book. According to a study conducted by the Czech Subject team, the addition of tables of contents increased the usage of new library materials by 30%. Adding the table of contents (TOC) to the catalogue record has brought significant search and retrieval benefits to library patrons, especially students.


Figure 11. Enrichment of BIB records by cover image and table of contents.

26. Conclusion

Discovering, identifying and creating new ways of subject access has been a very demanding task in the past, is so at present, and will remain so in the future. The first step is to examine users' needs and to create subject access tools which are able to meet their criteria. The second is to explore fully the potential flexibility of the current systems of subject access, the potential of information technologies in the area of subject analysis, and potential funding strategies and partnerships. The third precondition is to form collaborative teams of information professionals, IT professionals and non-professionals to develop and test new organising and retrieval tools. The most important achievement, in our view, has been the launch of the subject authority file as a controlled vocabulary. It serves as a base platform for creating other user-friendly organising tools such as the Conspectus categorisation scheme, the scheme for the Uniform Information Gateway, subject information gateways, and thematic maps of the library collection. It could also support the development of ontologies (which are very important for the semantic web) and of user-centred retrieval systems based on semantic technologies (M-CAST). Currently, our greatest challenge is to explore the potential enrichment of the terminology of the controlled vocabulary with terms originating from folksonomies. Our intention is to support the use of up-to-date terminology by enabling users to add their own access terms, at least as variant forms, i.e. see references. In this way, we would like to extend access to heterogeneous information resources, above all digitised or born-digital ones, and make it more user-friendly for all types and levels of user communities.


References

Chan, Lois Mai and Theodora Hodges. 2000. “Entering the Millennium: A New Century for LCSH.” Cataloging and Classification Quarterly 29, no. 1/2: 225-234.

Gorman, Michael. 2004. “Authority Control in the Context of Bibliographic Control in the Electronic Environment.” Cataloging and Classification Quarterly 38, no. 3/4: 11-22.

Wikipedia: The Free Encyclopedia. 2011. “Tag (Metadata).” San Francisco: Wikimedia Foundation. Available at http://en.wikipedia.org/wiki/Tag_(metadata). Accessed May 12.

Zumer, Maja, Athena Salaba and Marcia Lei Zeng. 2007. “Functional Requirements for Subject Authority Records (FRSAR): A Conceptual Model of Aboutness.” In Asian Digital Libraries: Looking Back 10 Years and Forging New Frontiers. 10th International Conference on Asian Digital Libraries, ICADL, Hanoi, Vietnam, December 10-13, 2007. Proceedings, edited by Dion Hoe-Lian Goh, Tru Hoang Cao and Ingeborg Torvik Solvberg, p. 489. New York: Springer.

Subject Analysis and Indexing: An “Italian Version” of the Analytico-Synthetic Model

Giuseppe Buizza

Abstract

The paper presents the theoretical foundation of the Italian indexing system. A consistent integration of vocabulary control through a thesaurus (semantics) and of role analysis to construct subject strings (syntax) makes it possible to represent the full theme of a work, even a complex one, in a single string. The conceptual model produces a binary scheme: each aspect (entities, relationships, etc.) consists of a pair of elements, tracing the two lines of semantics and syntax. The meaning of concept and theme is analysed, also in comparison with the FRBR and FRSAD models, and an enriched model is proposed. A double existence of concepts is suggested: document-independent and document-dependent.

1. Introduction

The tools and the tasks of indexing, and of searching and organizing knowledge, are changing quickly. A crucial point is to maintain, or to discover, the essential features of a sound and efficient indexing system, the common characteristics of any shareable or at least communicable system. Italian research in this domain pays particular attention to theoretical aspects, in an autonomous way that does not conflict with other current systems. Indeed, the future we are preparing is rooted in the shared past of common foundations: the analytico-synthetic model devised by S. R. Ranganathan for faceted classification, and the outcomes of the Classification Research Group in the fifties and sixties of the last century, leading to a verbal subject indexing system in PRECIS and to the ISO norms for subject analysis and thesaurus construction. We prefer to say “Italian version” of the analytico-synthetic model, instead of “Italian model”, because the model is the same and only its inflection is original. It is not a local choice, but a contribution to the rethinking and rediscovering of well-known principles and practices. To quote the title of our satellite meeting: the past we are looking at is not a shore we are leaving to sail towards a future promised land (a ‘lost’ past, whether nostalgically or gladly); it is a treasure we have inherited and wish to use, confident that it will bear fruit (a ‘living’ past). The particular theoretical and methodological form that the analytico-synthetic model of subject analysis and indexing has assumed in the Italian geographic and linguistic area is well represented in two recent tools:

• Guida all’indicizzazione per soggetto (Guidelines for subject indexing, 1996, 2001, in short Guida GRIS)1, drawn up by GRIS, Gruppo di ricerca sull’indicizzazione per soggetto (Research Group on Subject Indexing, of Associazione italiana biblioteche, the Italian Library Association),
• Nuovo soggettario (2006)2 by the National Library of Florence, superseding and deeply renewing the old Soggettario (Subject headings, 1956)3, which is currently used by the majority of Italian libraries.

Guida GRIS develops the model by combining principles and methods from different settings into a general organic vision and into a consistent set of rules, applicable to any kind of general pre-coordinated indexing system. Nuovo soggettario is an indexing system equally rooted in the analytico-synthetic model and consisting of (a) a set of rules, (b) a controlled and structured vocabulary (thesaurus), with full semantic relationships and with syntactic notes enclosed, and (c) the file of subject strings. Only a few notes about these tools are given here, because the focus of the paper is on the underlying principles and the overall framework of Italian indexing, in the conviction that correct and strong foundations are necessary to achieve good results.

2. The Analytico-Synthetic Approach

Analytico-synthetic principles, criteria and rules are adopted in full in Guida GRIS and in Nuovo soggettario. The method is presented briefly, step by step.

• Conceptual analysis. In defining the aboutness of the work, typical linguistic operations are adopted, such as deletion, generalization, selection and construction of concepts, in order to reach the base theme, the unifying intentional centre of all the particular themes involved in the discourse. The logical functions of each concept included in the theme are analysed, following ISO 5963:1985. The output is the subject statement, a phrase in natural language.
• Syntactic analysis and synthesis. The choice of co-extensiveness (from E. J. Coates) aims at representing the full theme in one string, no matter how complex it may be, supplying clear and complete information about the exact subject content of the work. The analysis of logical roles, together with a scheme of syntactic roles (as suggested earlier by D. Austin and in PRECIS), forms the basis for the choice of the key concept and the citation order of the others. The output is the subject string.
• Vocabulary control. The choice of the preferred form for terms, the factoring of compound terms, the semantic relationships and the method for thesaurus construction are established to achieve and maintain consistency (according to ISO 2788:1986 and BS 8723-2:2005). The principle of place of unique definition (derived from J. Farradane), the analysis of the semantic category and facet analysis (from Ranganathan, CRG, BBC and ISO and BS standards again) are adopted to define the concepts semantically and to build the hierarchies of the thesaurus.

1 Associazione italiana biblioteche-GRIS-Gruppo di ricerca sull’indicizzazione per soggetto, Guida all’indicizzazione per soggetto. Roma, AIB, rist. 2001, http://www.aib.it/aib/gris/gris.htm. [All Web resources have been last accessed on 2011-02-13]
2 Biblioteca nazionale centrale di Firenze, Nuovo soggettario. Guida al sistema italiano di indicizzazione per soggetto. Prototipo del Thesaurus. Milano, Bibliografica, ©2006 (stampa 2007). The thesaurus and the newly added Manuale applicativo (2010) are now available online, http://thes.bncf.firenze.sbn.it/index.html.
3 Soggettario per i cataloghi delle biblioteche italiane, a cura della Biblioteca nazionale centrale di Firenze. Firenze, Stamperia Il cenacolo, 1956.

To explain the method through an example, let us examine a work about “the appreciation of Victor Hugo’s works in Germany between 1870 and 1914”. It is presented as an instance of the class (entity) concept in FRBROO4. In our definition this is not a concept, but the base theme of the work. We find five elements in it (considering the two dates as one period of time), not simply dealt with side by side, but syntactically related. The analysis, regardless of whether they are general concepts or named entities, looks for the logical role played by each of them, starting from the existence of a concept of action (“appreciation”), with an object (“works”) and its owner (“Victor Hugo”), and a place (“Germany”) and a time (“1870-1914”) where and when it happened.

Figure 1. The analytico-synthetic approach in Guida GRIS and Nuovo soggettario.

To connect these elements in a meaningful and useful way, the rules for the citation order (syntactic rules) state the first position for the key concept, based on role analysis: here, the object of a transitive action. It is followed by the related transitive element (the action), and by the extra-core concepts (place and time). “Hugo” and his “works” are analysed, in a possessive relationship, as the possessor and his property, the latter being a dependent role, thus following the independent one; as the key concept, they keep the first place together, before the action.

4 FRBR object-oriented definition and mapping to FRBRER (version 0.9 draft), International Working Group on FRBR and CIDOC CRM Harmonisation, supported by Delos NoE, editors Chryssoula Bekiari, Martin Doerr, Patrick Le Boeuf, http://www.ifla.org/files/cataloguing/frbrrg/frbr-oov9.1_pr.pdf, p. 34


Figure 2. The analytico-synthetic approach ... 2.

The work is about the appreciation, but per se this concept is less interesting; Hugo’s works are not studied per se, but only with regard to the appreciation they received in Germany. Nevertheless, they remain the focus of the work, with the limiting specification of the particular point of view considered here, the German judgement. From a morphological and semantic point of view, each concept is represented by a preferred term, selected from natural language according to recommended procedures for vocabulary control. The accompanying construction of a thesaurus places every concept in its semantic field and supports exact searching as well as the discovery of related access points of interest for similar inquiries. The result of the synthetic process and vocabulary control is a subject string:

Hugo, Victor - Works - Appreciation - Germany - 1870-1914

that is:

• co-extensive with the subject content of the work, because it includes every concept considered,
• easily understandable, because it describes the complete theme in a logical way,
• suitable for arrangement and browsing, because the terms are syntactically ordered, and
• suitable for searching, because the terms are individually identified and uniformly selected.
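The citation order applied to the Hugo example can be sketched in a few lines. The role labels and the fixed order below are a simplification covering only this example, not the full role scheme of Guida GRIS or Nuovo soggettario; the function name `build_subject_string` and the ` - ` separator handling are assumptions.

```python
# Simplified, assumed citation order: the key concept (possessor, then the
# dependent object), then the transitive action, then the extra-core concepts.
CITATION_ORDER = ["possessor", "object", "action", "place", "time"]


def build_subject_string(concepts, separator=" - "):
    """concepts: (role, preferred term) pairs in any order, one term per role."""
    position = {role: index for index, role in enumerate(CITATION_ORDER)}
    ordered = sorted(concepts, key=lambda concept: position[concept[0]])
    return separator.join(term for _, term in ordered)


# Concepts as they come out of the conceptual analysis of the subject statement
analysed = [
    ("action", "Appreciation"),
    ("object", "Works"),
    ("possessor", "Hugo, Victor"),
    ("place", "Germany"),
    ("time", "1870-1914"),
]

print(build_subject_string(analysed))
```

The point of the sketch is that the order of the string follows from the roles assigned in the analysis, not from the order in which the concepts happen to appear in the subject statement.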

3. A Binary Scheme

After this simple description, the focus moves to the conceptual model underlying the Italian system, a model that is also valid for other indexing systems based on the same principles. We can recognize a binary structure, where each aspect is made up of a pair of fundamental elements:

• two kinds of entity: concept and theme
• two kinds of relationship: semantic and syntactic
• two kinds of language: vocabulary and subject strings
• two kinds of operations: vocabulary construction and subject strings construction
• two steps in searching (“two-steps search”): by terms and by strings
• two main kinds of users’ interest: wide survey and exact theme
• two kinds of quality ratio: recall and precision.

All these elements are logically connected on both the horizontal plane (within each pair: concept–theme, and so on) and the vertical plane, along both semantic and syntactic sequences, as we see in the diagram:

Table 1. Subject.

entity          concept                  theme
relationships   semantic relationships   syntactic relationships
language        vocabulary               subject strings
operations      vocabulary control       subject strings construction
searching       by terms                 by strings
users’ tasks    wide survey              exact theme
quality ratio   recall                   precision
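The two quality ratios in the last row of the table have standard definitions: recall is the share of relevant documents that are retrieved, and precision is the share of retrieved documents that are relevant. The sketch below applies them to a wide search by terms and an exact search by strings; the document identifiers and result sets are invented purely for illustration.

```python
def recall(retrieved, relevant):
    """Share of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def precision(retrieved, relevant):
    """Share of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


# Invented identifiers: four documents are relevant to the user's question
relevant = {1, 2, 3, 4}
by_terms = {1, 2, 3, 4, 5, 6, 7, 8}   # wide survey: everything sharing a term
by_strings = {1, 2}                   # exact theme: only matching subject strings

print(recall(by_terms, relevant), precision(by_terms, relevant))
print(recall(by_strings, relevant), precision(by_strings, relevant))
```

With these invented sets the search by terms favours recall and the search by strings favours precision, which is the pairing the table expresses.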

The horizontal lines link together the pairs of elements, defining each aspect of the conceptual universe (e.g. entities, relationships, language, etc.). The vertical columns link all the aspects of the conceptual universe to each element and vice versa (e.g. a wide survey is linked to the term(s) used in searching, to the vocabulary and its semantic relationships representing concepts, and to recall as the principle of evaluation). Without examining each element in detail, let us explain and amplify some of them for the sake of clarity.

First, the pair of entities. In an analytico-synthetic approach we are immediately aware that, even if a subject should be represented in a unitary way in relation to the work considered, when we want to build a subject catalogue each complex subject should, on the contrary, be examined by distinguishing every constituent element in it. The analytic phase leads to the discovery of the concepts used in the discourse developed in the text. They are interlaced components, but each is single in itself, and from this singularity and the variety of possible combinations derives the very possibility of any communication and dialogue, of comparing discourses, of recognizing known and shared notions, as well as new and available knowledge. Therefore, concept is assumed in the precise meaning fixed by the standard ISO 5963, “a unit of thought”, expressed by a single “indexing term”.

To express the overall meaning of the subject of a work, what it is about, the term theme has been chosen. It is not simply a term chosen to avoid the semantic heritage of other long-used words, like “subject” or “topic”. The notion of theme is particularly important in this model, as it is the core concept in the definition of subject. According to
ISO 5963, subject is “any concept or combination of concepts representing a theme in a document” (but already in Cutter’s Rules for a dictionary catalog a subject was “the theme or themes of the book, whether stated in the title or not”). Guida GRIS, on the basis of deep reflection, has defined the base theme as “the unitary object of knowledge to which any particular theme discussed in the document may be referred, and to which all information intentionally given by the author is correlated in the text, since the fundamental aim of the intellectual production of the whole document is just the will of communicating direct and specific notions about that object of knowledge”. Furthermore, the base theme “is a necessary property of any text perceptible as a consistent unit. The text, as a whole, may be considered as the answer to a single question, whose content is coincident with the base theme of the document”5.

Following this line of reasoning, we discover that the notion of theme is essential not only with regard to the definition of subject, but also for the conceptual organization of a document and the communication process. Looking at the following three pairs in the diagram above, let us follow the three steps down each column separately, semantic and syntactic. The distinction between syntactic and semantic relationships is well known and is expressed in standards as follows:

• semantic (or paradigmatic, a priori) relationships are those “between terms assigned to documents and other terms which, because they form part of common and shared frames of reference, are present by implication” (ISO 2788:1986), or “that are valid in almost all contexts, especially when they are inherent in the definitions of the concepts which the terms represent” (BS 8723-2:2005)6.
• syntactic (or syntagmatic, a posteriori) relationships are those “between the terms which together summarize the subject of a document” (ISO 2788:1986), or “which exist only because the terms are used together in the context of a particular document” (BS 8723-2:2005).

Analysing what a work is about, we find the core concepts it deals with. Each concept that we meet and isolate in indexing is nevertheless linked in our mind and in common knowledge, and surely in the work too, to other concepts, broader or narrower or belonging to another category but strictly linked to it (like the typical action of an agent, e.g. teaching and teacher). Generally we can define a concept by using some of these other concepts, for instance qualifying a thing as a particular kind of a more generic class of things, or explaining an action with the tool used to do it. Somehow all the concepts implied in the concept we are dealing with are worthy of mention together with it, so it is convenient to enrich every concept we use in indexing with a constellation of related concepts, by means of hierarchic and associative relationships. Meanwhile we know or discover that the same concept may be expressed by more than one term. So we need to choose one term as the preferred one among all, and to establish relationships between equivalent terms (preferred and not preferred), completing the semantic relationships of each term considered. In doing so, independently of the particular discourse a work is developing, we fix semantic relationships, uniform terms and access vocabulary, and build a structured vocabulary that, in the case of Nuovo soggettario, takes the shape of a thesaurus.

5 Guida all’indicizzazione per soggetto, p. 13.
6 BS 8723-2:2005 - British Standards Institution. Structured vocabularies for information retrieval. Part 2: Thesauri. London, British Standards Institution.

Analysing the concepts forming the theme of a work, we also need to analyse the relationships connecting them to each other in that particular setting. Through role analysis, these syntactic relationships are shown, the core or key concept is chosen on the basis of a logical construction of the subject statement, and the other concepts follow it in a logical order, led by syntactic rules. The resulting subject string is readable and expressive of the full theme.

Vocabulary control is independent of subject string construction. Vocabulary (i.e. semantic) relationships do not appear in strings. String construction depends upon vocabulary control only for the form of terms, not for their order or relationships.

Searching is also seen as a pair of elements: search (a) by subject strings (typically browsing through a list) to find directly the exact theme of works, or (b) by terms, finding all the occurrences of a term, that is to say all the works where a concept is relevantly treated and, thanks to vocabulary control, without the inclusion of other concepts in cases of polysemy or homography, and without the loss of occurrences that happens when one concept is expressed by equivalent words. But the best achievement of the binary scheme is the possibility of doing the “two-step search”: searching for one or more terms and finding first all the strings containing the desired term(s), with the opportunity to choose those relevant to the search, avoiding examining the themes that are less or not at all interesting.

A clear distinction between the two kinds of search corresponds to two main kinds of interest in users:

• a wide survey of all the works that deal with one concept (as the first moment of a comprehensive research, for instance), including the exploration of its domain, by means of semantic relationships, and
• a need for information about an exact theme of a work or a particular association of concepts in it (as a means to identify or select a useful document, for instance), avoiding non-pertinent or meaningless co-occurrences of the terms searched.

Finally, the outcome of searching is also clearly distinct along the two lines of semantics and syntax: searching by terms satisfies the requirement of the best recall, while searching by strings satisfies precision. The two columns are autonomously conceivable (e.g. the semantic column, from identifying a concept, to the choice of a preferred term, to the net of relationships, can stand without any involvement in a particular theme developed in a work). At the same time there are many points of contact and, generally speaking, they are reciprocally necessary (e.g. well-constructed strings are useless without control of terms).
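The interplay of vocabulary control, term search and the “two-step search” described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the tiny access vocabulary (including “reception” as a non-preferred equivalent of “appreciation”), the subject strings and the work identifiers are invented and do not come from Nuovo soggettario or from any real catalogue.

```python
from dataclasses import dataclass

# Hypothetical access vocabulary: entry term -> preferred term.
# "reception" as a non-preferred equivalent is an assumption for illustration.
PREFERRED = {
    "appreciation": "appreciation",
    "reception": "appreciation",
}


@dataclass(frozen=True)
class SubjectString:
    terms: tuple  # preferred terms, in citation order
    works: tuple  # identifiers of the works indexed under this string


# Invented catalogue data; none of it comes from a real catalogue.
STRINGS = [
    SubjectString(("hugo, victor", "works", "appreciation", "germany"), ("w1",)),
    SubjectString(("hugo, victor", "works", "appreciation", "france"), ("w2", "w3")),
    SubjectString(("germany", "history"), ("w4",)),
]


def normalise(term):
    """Vocabulary control: map an entry term to its preferred form."""
    key = term.strip().lower()
    return PREFERRED.get(key, key)


def find_strings(*query_terms):
    """Step 1: list the subject strings that contain every searched term."""
    wanted = {normalise(t) for t in query_terms}
    return [s for s in STRINGS if wanted <= set(s.terms)]


def works_for(chosen):
    """Step 2: retrieve the works indexed under the string the user selected."""
    return list(chosen.works)


def search_by_term(term):
    """Plain term search: every work under any string containing the term."""
    return sorted({w for s in find_strings(term) for w in s.works})


if __name__ == "__main__":
    candidates = find_strings("Reception")  # step 1: browse the matching strings
    for s in candidates:
        print(" - ".join(s.terms))
    print(works_for(candidates[0]))         # step 2: works for the chosen string
    print(search_by_term("Germany"))        # wide survey of one concept
```

In this sketch the term search gathers every work in which a concept occurs, which favours recall, while choosing one string in the second step narrows the result to a single exact theme, which favours precision.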

4. Subject Indexing in Conceptual Models

Looking at the Italian system in a more abstract way, and including it in the wider context of the bibliographic universe, we can propose an enriched version of the FRBR model. In the FRBR final report the entities traditionally coded in lists of proper names are considered for the subject relationships. They are distinguished by tangibility (concept and object) and by the dimension in which they exist, in space or in time (place and event). The entities of groups one and two (work, expression, manifestation, item; person, corporate body) are added as they are already present in the model. There is no consideration for the classes collecting these individual entities (e.g. battles, to collect The Battle of Hastings with other great battles when treated together in one work) and consequently for relationships between entities as subjects (semantic relationships, e.g. between battles and wars). There is no consideration for the relationships between the entities that are the subject of a work and are connected to one another (syntactic relationships). The entity concept, in the broader meaning assumed in FRBR, could function as the collector of all the subjects more complex than an individual entity, whether or not they include named entities that should stand by themselves, as in the example used above. In it “Victor Hugo” and “Germany” appear as parts of one concept (the full theme), but they are a person and a place respectively. Thus, according to the model, three relations may be provided:

Work ==> has as subject ==> concept: the appreciation of Victor Hugo’s works in Germany between 1870 and 1914
Work ==> has as subject ==> person: Victor Hugo
Work ==> has as subject ==> place: Germany

To overcome this inconsistency or incompleteness within an abstract model, it is worth considering the deep structure of subject indexing, instead of starting from the evidence of current indexing systems (i.e. the surface of indexing). From this point of view, we consider the aboutness of a work, particularly as the theme around which a discourse is organised, the base theme, independently of the different theories “about aboutness”. This is not an alternative, but a complement to the model of fragmented relationships towards individual concepts, objects, etc., provided that we put together the two levels:

• the level of the structured discourses combined in the theme and taken as a whole, and
• the level of the modular plurality of the concepts serving to build discourses.

So, the model underlying the Italian system, and suitable for any other indexing system, considers only two entities for the subject relationship in group three: theme and concept, and new relationships between them and with the other entities. Instead of four juxtaposed entities plus pre-considered entities from groups one and two, i.e. a sort of categorisation that is not exhaustive (concept also serves as a residual class) and partly overlapping (an item is an object), the nature of aboutness requires an approach similar to that of group one entities. The “products of intellectual or artistic endeavour” are analysed in the inner relationships between the different aspects of the same thing (a book or a disc is seen as a work, as an expression, as a manifestation, as an item). In a similar way the aboutness of a work should be analysed in the inner relationships of the overall theme and of the concepts contributing to its making. The syntactic relationships grant this kind of connection between the two levels. The attribute of this relationship is the logical function of the concept in the theme:

theme ==> has as logical function ==> concept
concept ==> is a logical function in ==> theme

It specifies the logical roles played by each component concept in the specific theme through their values (e.g. “The appreciation of Victor Hugo’s works in Germany between 1870 and 1914” has as action “Appreciation”, and “Appreciation” is the action in “The appreciation of Victor Hugo’s works in Germany between 1870 and 1914”). A diagram shows the proposed model applied to our example:

Figure 3. Subject relationship according to FRBR, enriched.

The syntactic relationships allow any concept to be part of many themes, i.e. in subject relationship with many works, where it may play different logical functions or roles, always keeping its own unity and uniqueness (the same happens to a work in group one that may be realised through many expressions and embodied in many manifestations, but still remains the same work). The component concepts, freed from accidental belonging to a specific theme and playing a specific role, are related to other concepts by implication or by other kinds of a priori relationship and should be connected to them by semantic relationships, as it happens in every indexing language, thus contributing to the complete mapping of the domain covered. But semantic relationships, like the definition of categories and the choice of terms, are a task of indexing languages. Therefore, they should be considered in the model only as the necessary expansion into the field of the implementations and of the distinct solutions supplied by different systems. In the diagram above it is shown on the right side, together with the work of authority control, which is done on the names for the concepts and on the subject strings for the themes.
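As a rough illustration of how one concept can take part in several themes while keeping its identity, the two directions of the theme/concept relationship can be sketched in Python. This is only a sketch under stated assumptions: the class name `RoleIndex`, its methods, the second theme and the role labels “place” and “key concept” are hypothetical; only the value “action” for “Appreciation” is taken from the example in the text.

```python
class RoleIndex:
    """Keeps both directions of the theme/concept relationship."""

    def __init__(self):
        self._by_theme = {}    # theme -> list of (role, concept)
        self._by_concept = {}  # concept -> list of (role, theme)

    def add(self, theme, role, concept):
        """Record 'theme has as <role> concept' and its inverse."""
        self._by_theme.setdefault(theme, []).append((role, concept))
        self._by_concept.setdefault(concept, []).append((role, theme))

    def functions_in(self, theme):
        """Concepts of a theme with the logical function each one plays."""
        return list(self._by_theme.get(theme, []))

    def themes_of(self, concept):
        """Themes in which a concept occurs, with its function in each."""
        return list(self._by_concept.get(concept, []))


HUGO = "The appreciation of Victor Hugo's works in Germany between 1870 and 1914"
OTHER = "The history of Germany"  # hypothetical second theme

index = RoleIndex()
index.add(HUGO, "action", "Appreciation")   # value given in the text
index.add(HUGO, "place", "Germany")         # role label assumed
index.add(OTHER, "key concept", "Germany")  # role label assumed

if __name__ == "__main__":
    for role, theme in index.themes_of("Germany"):
        print(f"Germany is {role} in: {theme}")
```

The concept “Germany” is stored once and is reachable from both themes, each time with a different logical function, which mirrors the unity and uniqueness of the concept across themes.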

What is said as regards subject headings languages is valid, to a large extent, also for classifications, where a notation, no matter how expressive it may be, stands for a complex subject. The component concepts are considered in the analysis and, somehow, in the scheme, even if they are neither represented on their own nor by the level of low specificity of the classification.

The conceptual model for Functional Requirements for Subject Authority Data (FRSAD) has just been published in its 2nd draft (2009-06-10) for worldwide review7. Despite the focus on authority data, the first goal of the study was to build a conceptual model of group three entities within the FRBR framework. Two new entities have been pointed out by the FRSAR Working Group: thema and nomen, and two relationships: work has as subject thema, and thema has appellation nomen. Thema refers to “anything that can be subject of a work”8 and includes any FRBR entity, like a super-class, or a generic term without a meaning on its own. Despite the almost coincident word, thema looks like our concept, representing separate ideas (see the example for A history of time), not like our theme. Complex themata (or themas) are admitted without addressing their inner relationships, which are pushed to the representation side and to the differences among indexing systems. But this is at the core of the work-thema relationship, which is marked many-to-many, and a clear distinction should be drawn between the aboutness of a work, as resulting from its conceptual analysis, and the indexing policy of the agencies. Complexity in aboutness lies in the variety of concepts treated and in their relations. Concepts may be (a) juxtaposed as unconnected themes, or (b) related to one another to form one complex thema. The latter case requires, from a logical point of view, the two levels of theme and concept of the binary model in order to represent both the whole aboutness of the work and the presence of singular concepts.
A different point is the indexing policy chosen by the agency and strictly linked to the adopted system. A system may provide that all concepts are summarised in one subject, or that any distinct concept treated in a work is indexed, like the index of a book listing all the noteworthy things mentioned in the text. But the latter choice does not cancel the existence, in any structured text, of one unitary and intentional complex theme where the concepts converge. Therefore, the binary scheme is useful to exchange information among systems even when two or more themes coexist or when the overall level is not used in the implementation. Moreover, this is the way:

• to move categories out of the tangle of general and additional attributes, towards the clear attribution of categories to concepts, as a device to build semantic relationships, and
• to include syntactic relationships, as required for the completeness of the model and in the goals of the study.

The latter new entity in FRSAD, nomen, has a parallel in the entity name in FRAD. Here again the binary scheme of the analytico-synthetic model offers a better base than a mono-linear one for a model devoted to authority data. It allows a distinct and more accurate control both on terms for single concepts (including their categorisation) and on strings, classification numbers or equivalent representations of themes composed of related concepts, not only formally (e.g., the consistency of citation order) but also by applying the syntactic analysis of logical roles. Nomina (or nomens) representing simple concepts need vocabulary control and semantic relationships, and do not admit role analysis and syntactic relationships; the opposite holds for nomina representing themes. Therefore, it is difficult to imagine an authority system that does not distinguish the two levels.

7 http://nkos.slis.kent.edu/FRSAR/report090623.pdf. The final report, dated June 2010, is now available, http://www.ifla.org/files/classification-and-indexing/functional-requirements-for-subject-authority-data/frsad-final-report.pdf.
8 In the final report: “any entity used as a subject of a work”.

5. Which Kind of Ontology?

The analytico-synthetic model raises a question as a logical consequence of its adoption: which kind of “ontology” is needed for the world of information? That is to say, in our example: do simple concepts like “appreciation”, “works”, “Victor Hugo”, “Germany” and their categorization and relationships have the same nature as the complex concept that we have called theme, “the appreciation of Victor Hugo’s works in Germany”, or not? The solution suggested is the “double existence” of concepts:

• a document-independent existence, typical of simple concepts, of their categories, of their semantic relationships;
• a document-dependent existence, typical of complex concepts and of their syntactic relationships.

In this way, we have a double “inventory of the world”. In the former, the concepts exist per se, and we call them properly “concepts”. In the latter, they exist as they are treated together with others in particular documents, and we call these associations “themes”.

Figure 4. "Ontology" according to the analytico-synthetic model.

A concept keeps its identity in its double existence, while the nature of concepts and themes is different in that:

• a concept exists in any context, it forms part of common and shared frames of reference and belongs to an ontological category;
• a theme exists only because it has been conceived in the context of a peculiar work, where its component concepts are associated by peculiar relations, and it cannot belong to any ontological category, as it is formed by a combination of categories.

In the structure of indexing languages, concepts are treated in semantics and their representation is subject to authority control on vocabulary, while themes are formed in a consistent manner according to syntactic rules and their representation is subject to authority control on subject strings, class numbers, etc. In library catalogues the themes propose the aboutness of works, while the concepts are the most immediate nuclei of mental approach for searchers, who approach them independently of the associations in which they might occur and are unaware of which themes were actually treated in catalogued documents. Preserving the independent value of singular concepts in the context of the themes where they appear is the condition for finding known and unknown themes when searching concepts: exactly and without noise.

The binary interpretation supplied by the analytico-synthetic model allows users to satisfy any different kind of task, as it is founded on both the particles and the aggregations of the meaning of works (of what is meant in works). The clear and consistent distinction between semantics and syntax, and their intersection in the “double existence” of concepts, allows and supports both separate and combined searching at will. Moreover, it makes information exchange easier in multilingual and multicultural contexts, as it does not rest on the signifiers of languages and makes it possible to reduce some differences in structure between very distant languages by means of analysis, that is to say factoring complex syntagms into simple concepts, translating and recombining them in the other language. The same holds between different indexing systems, between alphabetical indexing and classifications, for instance, discovering the deep common origin of strings and notations in the same sets of concepts combined in different ways.
The analytico-synthetic model of subject indexing, restored and developed in the Italian renewal by Guida GRIS and Nuovo soggettario, shows a wide range of efficient functions and suggests consistent improvements on IFLA’s conceptual models. It supplies the right premises even for effective searching on the web and may serve as the basis for designing and implementing high quality automatic searching.

The author would like to thank Alberto Cheti: the paper has been written in close collaboration with him and his contribution has been decisive.

Subject Search in Italian OPACs: An Opportunity in Waiting?

Emanuela Casson, Andrea Fabbrizzi, Aida Slavic

Abstract

Subject access to bibliographic data supported by knowledge organization systems, such as subject headings and classification, plays an important role in ensuring the quality of library catalogues. It is generally acknowledged that users have a strong affinity for subject browsing and searching and are inclined to follow meaningful links between resources. Research studies, however, show that library OPACs are not designed to support or make good use of subject indexes and their underlying semantic structure. A project entitled OPAC semantici was initiated in 2003 by a number of Italian subject specialists and by the Gruppo di ricerca sull’indicizzazione per soggetto (Research Group on Subject Indexing) (GRIS) with the aim of analysing and evaluating subject access in Italian library catalogues through a survey of 150 OPACs. Applying the same methodology, a follow-up survey to assess whether any improvement had taken place was conducted five years later, in spring 2008. The analysis of these two surveys indicated that there was a slight improvement. The authors discuss the results of the two surveys and analyse the problems in subject searching in OPACs. Using the example of Italian OPACs, the authors make some specific considerations on the two-step subject search recommended by the GRIS and explain how this can be achieved through authority control.

1. Introduction to OPAC Subject Searching and Difficulties in Catalogue Use

Subject data can help libraries communicate the true value of large and well organized library collections. Subject indexes represent semantic knowledge maps of library collections and create added value in information discovery by enabling navigation through a knowledge space. It is not surprising, therefore, as numerous studies have shown, that subject searching represents the majority of OPAC searches and that, when supported, subject access is favored by users (Maltby & Duxbury, 1972; Besant, 1982; Markey, 1983; Hancock, 1987; Hildreth, 1990; Gödert & Horny 1990; Cousins, 1992). Various studies of user information behaviour using TLA (transaction log analysis), questionnaires, interviews, focus groups and direct observation have shown, however, not only that a high proportion of searches are subject searches but also that a high proportion of searches fail to find pertinent information (Hildreth, 1990; Larson, 1991, 1991a; Markey & Weller, 1996; Yu & Young, 2004). Many studies of OPAC searching have demonstrated that users experience problems
in finding the correct search term, increasing recall when results are too few and increasing precision when too many items are found. Frequently when using OPACs users have information searching requirements that library catalogues do not meet (cf. Online catalogs: what users and librarians want, 2009). This is often the case when search terms used by patrons do not match indexing terms supported by subject searching in catalogues and it reveals the weakness of a subject indexing language which should, obviously, be updated and extended with a greater number of synonyms i.e. equivalent terms. It is also often the case that patrons look for a term that the cataloguers have not used and searching produces zero results which is accompanied by an on-screen explanation 'no documents found matching the query', while the actual meaning is ‘you did not use the same term as the cataloguer’. It appears that users have objective difficulties when looking for documents about a certain subject and formulating their information problem. At the same time subject data contained in bibliographic records are a readily-available remedy for this problem and many research projects have demonstrated how the use of classification and subject headings can improve information discovery in OPACs (Piascik, 1993; Hildreth, 1995; Markey, 1986, 1990, 1996, 1996a; Gödert, 2003). The OPAC studies show, however, that catalogues are making very poor use of subject data. In some cases the situation was described as being so poor that it made the professional community question the very purpose of subject indexing where catalogues were designed to be deficient in supporting subject access (Markey, 1986; Hancock, 1987; Crawford, Thom & Powles, 1993; Calhoun, 2006). Italian subject specialists initiated a project in 2003 entitled Opac semantici, in order to learn more about the problems of subject searching in OPACs and hopefully contribute to their improvement. 
One of the aims of the project was to assess and evaluate the use of subject indexing (subject headings, thesauri, and classifications) in 150 Italian online library catalogues. Aligning the modern OPAC technology with findings from some of the previous research (Poll & Boekhorst, 1996; Gödert & Horny, 1990; Guerrini, 2000), the Opac semantici project created their own checklist method for observing and collecting data on subject searching. In order to evaluate the extent to which subject access had evolved, if at all, two surveys based on a checklist of functionalities were conducted over a five-year period.

2. OPAC Semantici Project

With library automation and the transition from card to online catalogues (OPACs), priority was given to retrospective conversion and the focus was mainly on formal cataloguing and the capture of document identification data. As a result of the apparent neglect of content description and subject access to the collection, Italian online library catalogues have become impoverished versions of their card predecessors. Catalogues have lost much of the semantically rich content information and thus one of their fundamental functions, finding information 'about something', was diminished significantly. When searching library catalogues, users should normally be able to perform three types of searches corresponding to three types of information needs:

• finding known documents through searching for authors, titles, publishers
• finding unknown documents about specific subjects through searching assigned subject terms or expressions from a controlled vocabulary
• finding unknown documents pertaining to a certain scientific discipline or field of knowledge - through browsing or searching of knowledge classification

It is obvious that seeking content-based information through an alphabetical indexing language or classification corresponds to most common information needs. Understandably, the recently published Statement of International Cataloguing Principles (2009) stresses that controlled subject terms and/or classification numbers are essential access points in bibliographic records. This implies that these access points should always be present in a bibliographic record. When we try to use and navigate OPACs, however, we become aware that although subject data may be present in the catalogue they are found only through secondary or advanced options in searching. The IFLA Guidelines for Online Public Access Catalogue (OPAC) displays (2005) acknowledges this weakness and suggests that this interface problem be resolved through introducing more appropriate subject searching and navigation. Current developments in the information technology domain place even more emphasis on subject information and move the library tradition in information and knowledge organization into a different context. The terms 'knowledge systems' and 'semantic' are currently frequently used in relation to the development of the Semantic Web, semantic search engines and social tagging within Web 2.0 applications. In this ‘semantic’ landscape, the aim of the Opac semantici project was to look into how such 'semantic' information discovery can be supported by Italian OPACs in which one can observe, evaluate and exploit library expertise and techniques in document indexing. The project was initiated following a spontaneous get-together by Claudio Gnoli, Riccardo Ridi and Giulia Visintin who then took on the rôle of project coordinators and established collaboration with the Gruppo di ricerca sull’indicizzazione per soggetto (Research Group on Subject Indexing) (GRIS). The project was articulated in several phases. 
The first phase was the development of the OPAC evaluation checklist with a coding system to facilitate data collection during the OPAC survey. The checklist contains functionalities and features that may be present or absent in an OPAC and consists of around fifty questions organized into the following wider groups:

• information about subject access (in the Opac semantici project the term used for subject access was 'accessi semantici', hence this checklist area was called 'informazioni sugli accessi semantici')
• subject access through subject headings
• subject access through classmark
• visualisation of subject browsing
• visualization in record display

Data collection based on the checklist was coded in such a way that more positive responses corresponded to more functionalities being supported, indicating better subject access and more information being semantically related. Based on this simple method it was possible to define an index of 'semanticity' whereby an OPAC scoring 100 points would represent an ideal 'semantic OPAC'.

The survey was conducted in autumn 2003 and included a selection of 150 catalogues chosen from a total of 600 Italian OPACs (the list of Italian OPACs is available on the Italian Library Association website). Around fifteen researchers were included in the study, the majority of whom were students participating in the Electronic Publishing course from the Ca' Foscari University of Venice. The remainder were volunteering professionals coordinated via electronic mail. All researchers were instructed to record objectively the range of functionalities supported and to observe OPACs from the point of view of a library user. In order to assess whether subject access had evolved, the same survey was repeated in spring 2008, five years after the initial research. The same checklist was applied to the same selection of OPACs. The researchers involved in the second survey were library and information professionals.
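The idea of a 'semanticity' index built from checklist answers can be made concrete with a small Python sketch. It rests on a strong assumption that is not taken from the project: every question is given the same weight, so the score is simply the percentage of positive answers. The project's actual coding scheme is not reproduced here, and the question labels below are invented.

```python
def semanticity_index(responses):
    """Return a 0-100 score from a checklist of yes/no answers.

    Assumption: every question carries the same weight; the project's
    actual coding scheme is not reproduced here.
    """
    if not responses:
        raise ValueError("checklist is empty")
    positives = sum(1 for answer in responses.values() if answer)
    return round(100 * positives / len(responses), 1)


# Hypothetical answers for one OPAC; the question labels are invented.
EXAMPLE_OPAC = {
    "states_which_subject_heading_system_is_used": False,
    "subject_headings_in_simple_search": True,
    "classmark_in_index_browsing": False,
    "two_step_search_offered": False,
}

if __name__ == "__main__":
    print(semanticity_index(EXAMPLE_OPAC))
```

Under this equal-weight assumption an OPAC answering positively to every question scores 100, the ideal 'semantic OPAC' of the text, and each additional supported functionality raises the score by the same amount.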

2.1 Comparison of Surveys

The results of the first survey (Gnoli, Ridi & Visintin, 2004) showed an average ‘semanticity’ score of 26.5 and indicated a generally low level of subject access support. Five years later, the research reported an average score of 30.5. Although this showed a measurable improvement, it did not move much closer to the ideal score of 100. The complete results of both surveys are published and available on the project website. In this paper we would like to outline and compare some of the more interesting findings within the five major groups of checklist functionalities observed.

2.1.A Information on Subject Access

2003: Although it does not require much effort to make information about indexing systems available and there are no specific technical requirements, this is an area in which catalogues proved to be most deficient. Of 152 OPACs surveyed, 136 (90.7%) did not provide any indication of the subject heading systems or classification used to index the collection. There was no information found on thesaurus use. All of the catalogues surveyed offered search fields for subject heading systems and classification as if these were the only options available. More importantly, even for these two most frequent indexing systems, OPAC interfaces lacked information or suggestions on how to search.

2008: Of 147 OPACs, 130 (88.4%) do not supply any information on the identity or nature of the subject indexing systems in use.

2.1.B. Subject Access through Subject Headings

2003: Subject headings were not equally present in all searching options: they were offered in 75% of the index browsing options, in 85% of simple search-by-field options and in 33.3% of advanced search-by-field options. Whenever subject headings were offered through an advanced search-by-field option, they were also present in a simple search-by-field option. The results also showed that searching for terms in a subject field would retrieve all records containing the search terms irrespective of their position within the field. It should be noted that no OPAC offered two-step searching whereby a user, after launching the initial search, would be presented not with an endless list of records, but rather with an option to refine the search through relevant and meaningfully organized subject expressions (Fabbrizzi, 2000). This two-step approach, strongly recommended by GRIS, will be discussed further in this paper.

2008: The survey repeated five years later showed little change in the rate of subject headings use. Statistics on two-step searching, however, showed that under the influence of the report from the first phase of the Opac semantici project, a provincial union catalogue attempted an implementation of such an approach. Moreover, an informal investigation outside of the main survey showed that ten OPACs using the same library system software also offered the two-step searching option.

2.1.C. Subject Access through Classmark

2003: In general, the option to search by classmark is offered less frequently than searching by subject heading. It is present in 54% of the index browsing options, in 51% of simple field searching and in 23.2% of advanced field searching. Unlike subject headings, classmark searching offered in the advanced field search is not always offered in the simple field search as well: this is so in only 20 out of 35 cases.

2008: In the period following the first survey, access to classmarks in an index browsing option improved slightly and was reported to be 57.5%, compared to 54% in 2003. The remaining functions stayed largely unchanged.

2.1.D Visualisation of Subject Browsing

2003: The possibility to browse a list of single words taken from the subject field is offered in 41 catalogues (27%). In addition, 112 OPACs (73.7%) would offer subject browsing only if the term searched was in the initial position of the subject string. Only 25.7% of catalogues offer subject browsing even if the term occupies other positions in the subject string. When it comes to classification browsing, it is significant that catalogues very rarely display a verbal description of the classmark, i.e. a caption (only 16%). In spite of their importance for subject searching, syntactic and semantic relationships are very rarely shown or visualized in the subject browsing lists: only 6 catalogues link the word sought to subject heading strings containing it, 11 display semantic relationships as links (synonymous terms, broader/narrower or related concepts) and 4 provide cross-references between subject strings.

2008: The option to view and browse the single word subject list has improved and from being present in 27% of OPACs in 2003 it was found in 37.9% five years later. This is so, however, only for subject list browsing in which sought terms are in initial positions of subject expressions. The number of OPACs offering subject term browsing by words positioned elsewhere in subject expressions decreased to 14.6%. With respect to semantic and syntactic relationships, visualized through hyperlinks, there were no significant changes. The only noticeable improvement was in subject string to subject string relationships which is an option that has become present in 11.6%, as opposed to 2.2% in 2003.

2.1.E. Visualization in Record Display

2003: Subject data does not appear frequently in the summary result display following a catalogue search. A summary result display, i.e. the list of hits, often contains parts of the formal document description (author, title, date). Subject data can usually be found in a more detailed or full record display. Subject heading strings were present in 78.9% of full record displays and classmarks in 73%; the verbal equivalent of the classmark was present in only 28.5% of the catalogues surveyed. More importantly, not every subject heading, thesaurus term or classmark displayed in the catalogue records was hyperlinked so as to allow users to look up and retrieve all other documents in the collection indexed by the same term.

2008: The follow-up survey showed that the presence of subject data in the summary result display decreased even further. At the same time, the presence of subject data in the full record display increased: for subject headings from 78.9% to 81.5%, for classmarks from 73% to 74.5%, and for the class description next to the classmark from 28.5% to 38.6%. With respect to navigability, it should be noted that all types of subject terms have improved (subject headings, thesaurus descriptors and classmarks). The improvement was most significant for verbal equivalents of the classmark, for which the presence of hyperlinks increased from 0.7% to 13.9%.

2.2 Discussion

We found that the data collected in the two surveys are both interesting and useful if we want to analyze and gain more knowledge about OPAC functions, in order to improve this important service in the future. Hence our findings are not only of concern to librarians, but more importantly they can be used as a means of collaboration between information system architects and subject specialists in addressing the challenges of satisfying the information needs of OPAC users. In comparing the two OPAC surveys it could be said that, in spite of a slight improvement, Italian OPACs still leave a lot to be desired with respect to subject searching. The research shows that over a span of five years, many Italian OPACs have moved to a new interface, either because they migrated to a new library system or because they became part of a library network and made their records available through union catalogues. These changes, motivated by strategic or administrative reasons, may have a direct impact on the OPAC functions provided. In some cases they led to a slight improvement; in others they caused further loss of subject data and weakening of subject access. Similar studies conducted in other European (Crawford, Thom & Powles, 1993; Ihadjadene, 1998) and North American countries (Piascik, 1993) indicated the same problems with subject access identified by the Opac semantici research in Italy, and show that the problems in subject searching are widespread.

With respect to the Opac semantici research, we would like to highlight some findings that we find very indicative for the area of subject access. It appeared that information on subject access, which scored extremely poorly across OPACs in both surveys, actually improved slightly over the five-year period. There is no reason or justification for the absence of this information, and this is not an OPAC issue specifically but rather an issue to be resolved as individual library policy on user services. Another anomaly that does not require much effort to correct is the absence of subject data from summary result displays. Upon launching a search, users may be confronted with a large set of hits, i.e. a summary record display, that can span several pages. The lack of subject data in this display creates problems because: a) records cannot be re-ordered by subject on the display screen; b) a selection of relevant records has to be made based on the title or author's name and not on the actual content of documents. Unfortunately, the presence of subject data in summary result displays further decreased from 2003 to 2008.

Overall, as could have been expected, subject headings scored better than classmarks, although a very slight improvement in classmark presence was noticed in 2008. The fact that subject headings are offered frequently in both simple and advanced search-by-field options and in index browsing indicates that subject headings are considered to be more important.
Term index viewing, browsing and navigation through relationships (hierarchical, associative) were particularly poorly supported across all vocabularies and throughout the period covered by the two surveys. It should be stressed that the problems noticed in searching, browsing and displaying subject heading systems and classmarks are much more serious, as their resolution resides not so much in the library catalogue service policy or even in the OPAC interface technology as in the appropriate management of subject data through authority control. One such problem is the fact that terms that were not in the initial position of a subject string would be excluded from browsing or searching, which affects both recall and precision in information retrieval. Another problem is the inability of OPACs to fully support two-step searching, which would help users to disambiguate, re-direct, narrow or expand their initial query through the selection of relevant controlled indexing terms. These are some of the prominent issues noticed by the Opac semantici research, and in the following section we would like to explain them in more detail and put forward some arguments and suggestions for their resolution.
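The effect of restricting browsing to the initial position of a subject string can be illustrated with a minimal Python sketch. The subject strings and function names below are hypothetical illustrations, not data or software from the survey; recall is computed here simply as the share of relevant subject strings that a browse actually returns.

```python
# Hypothetical subject strings, each recorded as a tuple of controlled terms.
STRINGS = [
    ("Politiche sociali", "Italia"),
    ("Natalità", "Effetti", "Politiche sociali"),
    ("Donne", "Letteratura"),
]


def browse_initial(term, strings):
    """Return only the strings whose first term is the sought term."""
    return [s for s in strings if s[0] == term]


def browse_any(term, strings):
    """Return the strings containing the sought term in any position."""
    return [s for s in strings if term in s]


def recall(retrieved, relevant):
    """Share of the relevant strings that were actually retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)


relevant = browse_any("Politiche sociali", STRINGS)
initial_only = browse_initial("Politiche sociali", STRINGS)
print(len(relevant), len(initial_only), recall(initial_only, relevant))
```

In this toy data set, a browse limited to the initial position finds one of the two strings containing the term, so half of the relevant strings remain invisible to the user.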

3. Two Steps into the Catalogue

We would like to make some specific considerations on the two-step subject search previously mentioned, because it can be useful as an example in defining a general evaluation criterion for OPACs. This searching method is also a fundamental element of the binary structure underlying the "Italian version" of the analytico-synthetic model, proposed by Pino Buizza (2009). The Guidelines for Online Public Access Catalogue (OPAC) Displays, an IFLA document which in our opinion should be given more consideration by those to whom it is addressed (“librarians charged with customizing OPAC software and vendors and producers of this software”), at least in Italy, specify that “OPAC displays must be designed to serve the functions of the catalogue, and, ultimately, to address the information needs of library users”. Therefore, when evaluating an OPAC, one has to focus on the functions of the catalogue. According to the Statement of International Cataloguing Principles, the catalogue should be an instrument that enables a user to find, in a collection, a single resource or sets of resources that correspond to the user’s stated criteria; to identify an entity; to select a bibliographic resource that meets the user’s needs; to acquire or obtain access to an item described; to navigate within a catalogue and beyond. In order to find a single resource or, more importantly, to collocate the bibliographic records for sets of resources that share common characteristics that correspond to the user’s stated criteria, controlled access points should be provided, to obtain the consistency needed to find sets of resources. At the first step in the two-step subject search, the user launches a freely chosen search term and receives a first response from the retrieval system: a list of controlled indexing terms that may match the query. At the second step, the user selects the more relevant terms from the list of controlled terms and, for each of these terms, obtains a display with the list of all subject strings in which that term is included.
According to the above-cited procedure, the two entities of pre-coordinated indexing are pointed out as they are represented, namely:

• at the first step, the indexing language basic units, the controlled indexing terms, either representing a concept or referring to an individually-named entity with uniqueness and unity;
• at the second step, the subject strings, representing the base themes of works in indexed documents.

In the two-step search the relationship between the two entities is also represented, that is to say that subject strings are composed of indexing terms that are syntactically coordinated in a linguistic expression that globally represents the base theme of one or more indexed documents. Now the user can choose the most relevant theme, or themes, according to their interests with a high degree of precision, in order to arrive at a display of the list of bibliographic records for documents indexed with the selected string. In this way, the subject relationship between themes and works is clearly expressed. At every step of the search, the entities and the relationships of subject indexing, expressed to users, fulfill the functions of the catalogue, insofar as these are a representation of user needs.
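The two steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used by any of the surveyed systems: the catalogue content, the record identifiers and the function names are all hypothetical, and free-text matching is reduced to a case-insensitive substring test.

```python
from collections import defaultdict

# Hypothetical catalogue: each subject string (a tuple of controlled
# indexing terms) is mapped to the ids of the records indexed with it.
SUBJECT_STRINGS = {
    ("Natalità", "Effetti", "Politiche sociali"): ["rec1"],
    ("Politiche sociali", "Italia"): ["rec2", "rec3"],
    ("Donne", "Letteratura"): ["rec4"],
}


def build_term_index(subject_strings):
    """Map every controlled term to the subject strings that contain it."""
    index = defaultdict(list)
    for string in subject_strings:
        for term in string:
            index[term].append(string)
    return index


def step_one(query, term_index):
    """Step 1: list the controlled terms that may match a free-text query."""
    q = query.casefold()
    return sorted(t for t in term_index if q in t.casefold())


def step_two(term, term_index):
    """Step 2: list all subject strings that include the chosen term."""
    return sorted(term_index.get(term, []))


def records_for(string, subject_strings):
    """Final display: the records indexed with the selected subject string."""
    return subject_strings.get(string, [])


TERM_INDEX = build_term_index(SUBJECT_STRINGS)
```

A user typing "politiche" would first be shown the controlled term Politiche sociali via `step_one`, then both subject strings containing it via `step_two`, and only after choosing one of them the corresponding records via `records_for`. Keeping terms, strings and records as three separate layers is what makes the entities and their relationships visible at each step.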

In particular:

• at the first step, the indexing terms, which express the user’s search expression in the controlled language of the catalogue, fulfill the find function, and they are particularly necessary for the collocative function of the catalogue;
• at the second step, the display of subject strings including the term chosen at the first step supports the select function, because it allows the user to choose the most appropriate subject string in the list, so as to achieve access to the related bibliographic records. The pre-coordinated subject strings in the GRIS method and the Nuovo soggettario system are particularly effective for this function, because they are complex but understandable expressions of the entire subject content of the work (Guida all’indicizzazione per soggetto, 1996).

It is important to stress that the relevance of the two-step search, as applied to the conceptual model underlying the Italian system, consists in making subject indexing entities and relationships explicit to users, rather than in the actual modes of implementation, for which advanced software solutions can also be used. The two-step subject search is not a novelty: in Italy it was recommended for the catalogue of the Servizio Bibliotecario Nazionale (National Library Service) in 1985 (Bilancio di un lavoro di ricerca, 1985). An equivalent mode has long been used in printed permuted subject indexes, in which a subject string is entered in the index under each indexing term that makes up the string itself. An example of a printed subject string index can be found in the bibliographic review Rassegna bibliografica infanzia e adolescenza, 2000-2007. Our previous considerations on the two-step subject search can be generalized to define a standard for all searches of entire sets of resources that correspond to the user’s stated criteria in OPACs, therefore not only for subject searching. Entities and relationships, which appear in a specific form of indexing, must always be presented to users in a visible and understandable way. In this case, the result is that the characteristics, fundamental elements, and therefore all the potential of that indexing can be expressed in the process of use. On the contrary, if the entities and their relationships are not clearly presented or if they are lacking in the planning of displays and searches, cataloguing loses effectiveness and the work of cataloguers is at least partially wasted, which is bad both for the library economy and for the users.
The inadequacy of the catalogue interface can easily result in a hindrance to the catalogue’s ability to communicate the language of the library: on one hand the catalogue will not be able to support users in querying and on the other it will prevent the library from giving precise answers.
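The printed permuted subject index mentioned in this section, in which a subject string is entered under each of its indexing terms, can be sketched in a few lines of Python. The function name and the input structure (a display string paired with its controlled terms) are assumptions made for illustration; the example string is the one discussed in this paper.

```python
def permuted_index(strings):
    """Enter every subject string under each of its controlled terms.

    `strings` is an iterable of (display_string, controlled_terms) pairs.
    The result is a list of (entry_term, display_string) pairs sorted by
    entry term; plain code-point sorting is used here, whereas a real
    index would need language-aware collation.
    """
    entries = [
        (term, display)
        for display, terms in strings
        for term in terms
    ]
    return sorted(entries)


EXAMPLE = [
    (
        "Natalità – Effetti delle Politiche sociali",
        ["Natalità", "Effetti", "Politiche sociali"],
    ),
]

for entry_term, display in permuted_index(EXAMPLE):
    print(f"{entry_term}: {display}")
```

Each controlled term becomes an access point to the full string, which is the printed counterpart of the second step of the two-step search.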

4. HiFi Data Recordings

The two-step subject search was introduced into the OPAC evaluation checklist of the Opac semantici project because it is perfectly consistent with the characteristics of the new indexing method and system developed in Italy: for this reason, it is the best way to verify whether these characteristics are expressed in Italian OPACs.

As mentioned previously (cf. section 2.1.B, Subject Access through Subject Headings), according to the 2003 survey there were no OPACs supporting this type of search; in the 2008 survey, it was supported by only one library system software, which was found in a provincial union catalogue. What are the reasons for this gap between the characteristics and requirements of subject indexing and the situation of online catalogue interfaces? A first answer is that, in Italian libraries, the implementation of the new indexing method is largely linked to the Nuovo soggettario system, developed only recently by the Biblioteca nazionale centrale di Firenze (Central National Library of Florence) (Nuovo soggettario, 2006). With the spread of the new indexing method, the number of software applications available for two-step searching may well increase. But here our focus is on two further conditions for effective application of the Nuovo soggettario system in Italian catalogues: correct data recordings and authority control. In order to offer users subject searches that express the entities involved in indexing, the linguistic expressions which represent those entities must be properly recorded in bibliographic and authority records: the basic units of the language (controlled indexing terms) must be duly identified in the database. The new indexing language envisages additional elements in subject strings, namely the connectives: prepositions, locutions and conjunctions, which may be necessary in the subject string to link one indexing term to the following one. For example, in the subject string Natalità – Effetti delle Politiche sociali (in English: Birth rate – Effects of Social policy) the connective ‘delle’ between the terms Effetti and Politiche sociali explains the logical relationship between the two concepts.
Therefore, to be able to search and retrieve indexing terms correctly, two elements must be identified in the subject strings for data recording: the controlled indexing terms and the connectives, which are non-controlled elements. However, at present, the data recording options offered by bibliographic formats do not generally allow proper recording. This is due to the fact that the data recording formats and the cataloguing software are designed to support the traditional subject heading structure, which consists of the main subject heading and its sub-headings. In this way, the various parts of a traditional subject heading may correspond to indexing terms, for instance a simple term such as Città (Towns), or a compound term like Commercio internazionale (International trade), but often this is not the case. For example, in a subject heading such as Donna nella letteratura (Women in literature) two concepts are represented and syntactically coordinated, but not individually recorded, whereas the Nuovo soggettario envisages an individual recording of each concept by means of the distinct indexing terms Donne and Letteratura. If an OPAC allows the two-step subject search, but the indexing terms are not correctly recorded, the effectiveness of the indexing language is significantly reduced, because the user at the first step will not have an exclusive list of controlled terms but rather a list of compound expressions, which in some cases do not match the indexing terms. In such a list a concept may be represented more than once in different contexts. For this reason, at the second step, the retrieval system will not be able to present to the user all the pertinent subject strings containing a particular concept. Correct data recording is the necessary requirement for effective authority control, which should introduce into the catalogue environment all the equivalence, hierarchical and associative relationships of a vocabulary, as proposed by the GRIS method (cf. a case study regarding TECA, an application for cataloguing data based on CDS/ISIS software, Fabbrizzi, 2000) and the Nuovo soggettario system. The opportunity to display the established relationships among indexing terms at the first step increases the effectiveness of the two-step search to a large extent. In this respect, the backwardness of Italian catalogues becomes clear; it dates back to the early automation period, when it was in fact decided, in a rather short-sighted way, that cross references, which were regularly recorded in card catalogues, could be discarded. With authority control becoming a common practice, we are now in a better position to improve subject vocabulary representation and control, and to raise OPAC performance to a more desirable level.
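The distinction between controlled indexing terms and connectives can be made concrete with a short Python sketch. It does not reproduce any bibliographic format; the element kinds, the rendering rule (a dash between two adjacent terms, a space around a connective) and the function names are assumptions chosen only to reproduce the examples discussed in this section.

```python
# Each element of a subject string is recorded as (kind, text), where kind
# is "term" for a controlled indexing term or "connective" for a
# non-controlled linking element. This structure is hypothetical.
NEW_STYLE = [
    ("term", "Natalità"),
    ("term", "Effetti"),
    ("connective", "delle"),
    ("term", "Politiche sociali"),
]

# A traditional heading recorded as one undivided unit.
TRADITIONAL = [("term", "Donna nella letteratura")]


def render(elements):
    """Build the display form: a dash between adjacent terms, otherwise a space."""
    parts = []
    previous_kind = None
    for kind, text in elements:
        if previous_kind is None:
            parts.append(text)
        elif kind == "term" and previous_kind == "term":
            parts.append(" – " + text)
        else:
            parts.append(" " + text)
        previous_kind = kind
    return "".join(parts)


def controlled_terms(elements):
    """Extract the units that can be offered at the first search step."""
    return [text for kind, text in elements if kind == "term"]


print(render(NEW_STYLE))
print(controlled_terms(NEW_STYLE))
print(controlled_terms(TRADITIONAL))
```

With the elements recorded separately, the display string and the list of controlled terms are both derivable from the same data. The traditional heading, recorded as a single unit, yields only the compound expression, so the terms Donne and Letteratura cannot be offered to the user at the first step.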

5. Conclusion

In spite of the Internet's advanced information searching solutions such as Google, a library catalogue with its syndetic structure and systematic organization represents an important and high quality information discovery tool, the role and potential of which in information organization cannot be disputed. And in spite of the fact that some people are of the opinion that a catalogue is an obsolete instrument, and that there are even suggestions that subject indexing should be abandoned, the central position of subject searching in catalogues appears to be very strong and impossible to ignore. In Italy, the idea of giving up subject access was never seriously taken into consideration. On the contrary, there were many positive attitudes towards subject indexing. This has been particularly true, recently, following the publication of the Italian edition of Dewey Decimal Classification Edition 22 (Dewey, 2009), but even more so when, after a lengthy period of preparation, the Nuovo soggettario was made available. Concerning user needs, the Opac semantici project has demonstrated that our catalogues are, metaphorically speaking, aphasic. Because of interface design shortcomings, OPACs are either able to give an answer but cannot 'understand' the query or, although they are 'able to talk', cannot actually express and communicate information on their content. The inadequacy of the interface can easily prevent a catalogue from communicating with its structured and controlled library language. On the one hand, an inadequate interface is not able to support users in querying the collection; on the other, it prevents a library from giving accurate and precise answers about its holdings. In our considerations on subject searching, we focused on the two-step approach.
Two-step subject search satisfies a general criterion in the catalogue interface design to make OPACs more responsive to user needs, as it can make subject indexing entities and relationships explicit to users. Nevertheless, if an OPAC supports two-step searching, but the indexing terms are not entered appropriately into the system, the efficiency of the indexing language will be hindered at the very first stage of searching. Therefore, it is important to stress that exact and detailed recording of the indexing terminology is a prerequisite for true authority control, which requires a translation into machine readable format of the catalogue of all the equivalence, hierarchical, and associative relationships that exist in the controlled vocabulary.

At present, the greatest weakness of the subject indexes in catalogues lies in the field of authority control. It is important to pay more attention and put more effort into the very first stage of vocabulary data entry as "an authority system is an environment in which an infection carries a serious risk of becoming an epidemic" (Buizza, 2008). The current trend in the development of bibliographic models and standards, as well as the inevitable technology-driven evolution of the OPAC interface, shows that library catalogues are about to enter a new phase. This will lead to a greater understanding of the usability of semantic linking of the bibliographic data contained in the catalogues. It is evident that library catalogues need to be improved. From the work undertaken by the GRIS and the Opac semantici project it transpires that there is a clear understanding of what needs to be done and that Italian librarians have the required expertise to embrace the opportunity to improve their catalogues.

References

Besant, Larry (1982) Early Survey findings: users of public online catalogs want sophisticated subject access. American Libraries, 13 (3), 160.
Bilancio di un lavoro di ricerca (1985). Gruppo di ricerca indicizzazione per soggetto-SBN. In: Il recupero dell’informazione: atti del Convegno-Esposizione bibliografica “Indicizzazione per soggetto e automazione”, Trieste, 21-22 ottobre 1985. Eds. Adriano Dugulin, Antonia Ida Fontana, Annamaria Zecchia.
Buizza, Pino (2008) Gli opac: funzionalità e limiti nel mondo del web, Bibliotime, XI, (1) (accessed 14/2/2011).
Buizza, Pino (2009) Subject analysis and indexing: an “Italian version” of the analytico-synthetic model. Paper presented at the IFLA satellite preconference "Looking at the Past and Preparing for the Future", Florence, 20-21 August 2009 (accessed 14/2/2011).
Calhoun, Karen (2006) The changing nature of the catalog and its integration with other discovery tools: final report prepared for the Library of Congress. 17 March 2006 (accessed 14/2/2011).
Cousins, Shirley Anne (1992) Enhancing subject access to opacs: controlled vocabulary vs natural language. Journal of Documentation, 48 (3), 291-309 (accessed 14/2/2011).
Crawford, J. C.; Thom, L. C.; Powles, J. A. (1993) A survey of subject access to academic library catalogues in Great Britain. Journal of Librarianship and Information Science, 25 (2), 85-93.
Dewey, Melvil (2009). Classificazione decimale Dewey e Indice relativo. Ed. 22., ed. italiana, a cura della Biblioteca nazionale centrale di Firenze. Roma: Associazione italiana biblioteche, 2009. 4 v.
Fabbrizzi, Andrea (2000) L’applicazione delle norme GRIS in CDS-ISIS TECA. In: L’indicizzazione per soggetto della sezione locale: una applicazione delle norme GRIS, Eds. Massimo Fedi and Raffaella Marconi, with collaboration of Andrea Fabbrizzi, Marta Gori, Paolo Panizza, 89-109 (accessed 14/2/2011).
Gnoli, Claudio; Ridi, Riccardo; Visintin, Giulia (2004) Di che parla questo catalogo? Biblioteche Oggi, 8, 23-29 (accessed 14/2/2011).
Gödert, Winfried (2003) Navigation und Konzepte für ein interaktives Retrieval im OPAC oder Von der Informationserschließung zur Wissenserkundung. In: AKMB-Veranstaltung Allegro und mehr Bibliothekskataloge : Gestaltung und Mehrwertdienste, Wolfenbüttel 20-21 Nov, 2003 (accessed 14/2/2011).
Gödert, Winfried; Horny, S. (1990) The design of subject access elements in online public access catalogue. International Classification, 17, (2), 66-76.
Gruppo di ricerca sull’indicizzazione per soggetto (GRIS) (accessed 14/2/2011).
Guerrini, Mauro (2000) Il catalogo di qualità. Oltre gli indicatori quantitativi: dieci criteri di analisi qualitativa. Biblioteche Oggi, 5, 6-17 (accessed 14/2/2011).
Guida all’indicizzazione per soggetto (1996). GRIS, Gruppo di ricerca sull’indicizzazione per soggetto. Roma: Associazione italiana biblioteche, 1996. Reprint with corrections 2001 (accessed 14/2/2011).
Guidelines for Online Public Access Catalogue (OPAC) displays (2005). Final report (May 2005). München: K. G. Saur, 2005. (IFLA Series on Bibliographic Control 27)
Hancock, Micheline (1987) Subject searching behaviour at the library catalogue and at the shelves: implications for online interactive catalogues. Journal of Documentation, 43 (4), 303-321 (accessed 14/2/2011).
Hildreth, Charles R. (1990) End users and structured searching of online catalogues : recent research finding. In: Tools for knowledge organization and the human interface : proceedings 1st International ISKO Conference, Darmstadt, 14-17 August, 1990 : vol. 2. Ed. R. Fugmann. Frankfurt/Main : Indeks Verlag, 1991. (Advances in Knowledge Organization 2). 9-19.
Hildreth, Charles R. (1995) Online catalog design models : are we moving in the right direction?: a report commissioned by The Council on Library Resources. August 1995 (accessed 14/2/2011).
Ihadjadene, Majid (1998) L'accès sujet dans les catalogues en ligne. Le cas des bibliothèques universitaires en France. Bulletin des Bibliothèques de France, 4, p. 104-109 (accessed 14/2/2011).
Larson, Ray R. (1991) Between Scylla and Charybdis: subject searching in the online catalog, Advances in Librarianship, 15, 175-236.
Larson, Ray R. (1991a) The decline of subject searching: long-term trends and patterns of index use in an online catalog. Journal of the American Society for Information Science and Technology, 42 (3), 197-215.
Maltby, A.; Duxbury, A. (1972) Description and annotation in catalogues: reader requirements. New Library World, 7 (862), 260-262, 273.
Markey, Karen (1983) Thus spake the OPAC user. Information Technology and Libraries, 2 (4), 381-388.
Markey, Karen (1986) Users and the online catalog : subject access problems. In: The impact of online catalogs. Ed. J. R. Matthews. New York; London : Neal-Schuman Publishers, 1986. 35-70.
Markey, Karen (1990) Experiences with online catalogs in the USA using classification system as a subject searching tool. In: Tools for knowledge organization and the human interface : proceedings of the 1st International ISKO Conference, Darmstadt, 14-17 August 1990. Ed. R. Fugmann. Frankfurt/Main : Indeks Verlag. (Advances in knowledge organization 1). 35-46.
Markey, Karen (1996) Classification to the rescue: handling the problems of too many and too few retrievals. In: Knowledge organization and change : proceedings of the Fourth International ISKO Conference, Washington, DC, 15-18 July 1996. Ed. R. Green. Frankfurt/Main : Indeks Verlag, 1996a. (Advances in Knowledge Organization 5). 107-136.
Markey, Karen (1996a) Enhancing a new design for subject access to online catalogs. Library Hi Tech, 14 (1), 87-108 (accessed 14/2/2011).
Markey, K.; Weller, M. S. (1996) Failure analysis of subject searches in a test of new design for subject access to online catalogs. Journal of the American Society for Information Science, 47 (7), 519-537.

Nuovo soggettario (2006). Guida al sistema italiano di indicizzazione per soggetto, prototipo del thesaurus. Biblioteca nazionale centrale di Firenze. Milano: Bibliografica, 2006 (accessed 14/2/2011).
Online catalogs: what users and librarians want: an OCLC report (2009). Principal contributors: Karen Calhoun, Joanne Cantrell, Peggy Gallagher and Janet Hawk. Dublin, OH: OCLC Online Computer Library Center, Inc. (accessed 14/2/2011).
OPAC semantici : indagine sugli accessi semantici nei cataloghi in rete italiani (accessed 14/2/2011).
Piascik, J. M. (1993) Enhanced subject access in Ohio public libraries. Cataloging & Classification Quarterly, 16 (4), 77-91.
Poll, Roswita; Boekhorst, Peter te (2007) Measuring Quality: performance measurement in libraries. 2nd revised ed. Munich: K. G. Saur, 2007. (IFLA Publications: 127).
Rassegna bibliografica infanzia e adolescenza (2000-2007) Centro nazionale di documentazione ed analisi per l’infanzia e l’adolescenza; Regione Toscana, Centro di documentazione per l’infanzia e l’adolescenza; Istituto degli Innocenti. Firenze: Istituto degli Innocenti, 2000-2007. Quarterly (accessed 14/2/2011).
Statement of International Cataloguing Principles (2009). IFLA, Cataloguing Section (February 2009), (accessed 14/2/2011).
Yu, Holly; Young, Margo (2004) The Impact of Web Search Engines on Subject Searching in OPAC. Information Technology and Libraries, 23 (4), 168-180.

Semiautomatic Merging of Two Universal Thesauri: The Case of Estonia

Sirje Nilbe

Abstract

The paper deals with a project carried out in Estonia from 2007 to 2009 with the aim of merging two subject indexing tools into one thesaurus in order to facilitate subject search in union catalogues and bibliographic databases and to organize subject indexing and authority control work more economically. The term records of the two thesauri were merged automatically, loaded into a database and the resulting compilation underwent quick human editing. The project was financed by the ELNET Consortium (Consortium of Estonian Libraries Network) and involved two participating libraries – the National Library of Estonia and the University of Tartu Library. The management of the new thesaurus will be the responsibility of those three institutions.

1. Two Estonian-Language Universal Thesauri

In the 1990s, two universal thesauri were developed in Estonia – the thesaurus of the University of Tartu Library and the Estonian Universal Thesaurus, which was managed by the National Library of Estonia. The reason for developing two universal, largely overlapping thesauri was that at the beginning of the 1990s, when the first e-catalogues and bibliographic databases were created in Estonia, libraries had no clear vision of the future trends in information technology development and its impact on library automation. Experience in subject cataloguing and managing controlled vocabularies was rather limited, as large libraries had traditionally used classified card catalogues built up according to a classification system (mostly UDC).

The thesaurus of the University of Tartu Library, also called the INGRID Thesaurus, was initiated together with the library’s home-made e-catalogue in 1994, and terms were added to it in the course of current indexing. Only those terms were included in the thesaurus which were needed for subject indexing the documents acquired by the university library. In the library system, the thesaurus module was directly connected with the catalogue module. In 1996 the INGRID was provided with a separate web interface that allowed the thesaurus to be both browsed and searched. The University of Tartu Library never published the thesaurus as a separate publication and it has not been available to other libraries for indexing. The maintenance of the INGRID was the joint responsibility of the whole classification and subject indexing team.


During the same period the National Library of Estonia developed the Estonian Universal Thesaurus (Eesti üldine märksõnastik, EÜM), following the example of the Finnish General Thesaurus and the UNESCO SPINES Thesaurus. The EÜM was meant mostly for the National Library (which also acts as a parliamentary library in Estonia) but also for the public libraries network and other libraries. The EÜM was developed with thesaurus management software based on FoxPro, but the thesaurus was nevertheless planned to be issued as a printed publication. The EÜM was published at the beginning of 1999; the web version was launched in 2006.

2. Necessity and Preconditions of Merging the Thesauri

After the establishment of the ELNET Consortium and the implementation of INNOPAC (the shared integrated library system of major Estonian libraries), both thesauri were taken into use within this system. INNOPAC was launched at full capacity at the beginning of 1999. Most member libraries started to index using the EÜM, which was brand new but not yet tested in practice. The University of Tartu Library continued to use its own thesaurus, as it was familiar to the users and trusted as reliable by the indexers, and the library had already indexed over 30 000 bibliographic records on the basis of the INGRID.

The union catalogue of the ELNET Consortium member libraries consists of two databases – ESTER Tallinn and ESTER Tartu. The differences between the two thesauri have disrupted information search, subject indexing and authority control primarily in the Tartu database, but the Tallinn database was also affected due to the copying of bibliographic records. In addition, the two libraries have duplicated a lot of work in compiling the thesauri, which we definitely cannot afford given the constant lack of qualified staff.

The discussion of the necessity of a shared controlled vocabulary and the possible methods for creating it dates back to the year 2000, initially arising in the Classification and Indexing Working Group of the ELNET Consortium. Negotiations were also held between the owners of the two thesauri – the National Library of Estonia and the University of Tartu Library. It was finally decided that the most appropriate organization for carrying out the project was the ELNET Consortium, and that the fastest method was the merging of the two thesauri by a computer program, followed by human editing of the resulting compilation. Several preconditions existed for carrying out this complex project:
• the typological and structural similarity of both thesauri;
• good cooperation between the editorial teams of both thesauri;
• the possibility to involve a software designer with extensive experience in creating thesaurus management software;
• the interest of all member libraries of the Consortium in the project and their consent to cover the project costs from the Consortium’s budget.


3. Feasibility Study

In the autumn of 2007 the Consortium carried out a feasibility study on the merging of the two thesauri. The aims of the feasibility study were the following:
• to identify the compatibility of the data structures and to find the best option for merging the data;
• to identify the overlap of terms and relationships between them;
• to identify the approximate amount of logical mistakes evolving in the merging process in order to evaluate the manpower needed for human editing.

During the test of automatic merging, the most important data (preferred terms, nonpreferred terms and relationships between terms) were merged for approximately one-third of the records of the two thesauri, namely those under the productive Estonian initial letters A, K and T. It turned out that 32% of the terms overlapped, while two-thirds occurred in only one of the thesauri. The test merging showed 30% of overlapping relationships, 68% of relationships occurred only in one of the thesauri, and 2% of the relationships were conflicting. A conflicting relationship means that one and the same term is related to another term in two different ways, e.g. in both a hierarchical and an associative relationship.

On the one hand, the one-third overlap was a surprise, because everyday work had given the impression that the similarity of the thesauri was more extensive. On the other hand, the reasons for the differences are clear – the INGRID has been designed for the specific needs of a multidisciplinary library of a classical university, while the development of the EÜM has mostly been influenced by the Estonian publishing output and the acquisition policy of the National Library of Estonia as a research library for the humanities and social sciences. The indexing vocabulary requirements of public libraries are more similar to those of the National Library than to those of the University of Tartu Library.

In addition, all terms and relationships of the subject field Computer Science were merged during the test. The overlap here was even smaller, including 17% of terms and 15% of relationships. A more exhaustive analysis revealed that one important reason for this difference was the large number of names in computer science, e.g. the names of computer programmes and programming languages. These names had often been formulated into thesaurus terms according to different rules, resulting in a lot of names re-occurring in different forms. Another reason for the small overlap was the fact that the content covered by the tested subject field in the EÜM was more extensive than that in the INGRID, including also automatic control.
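The classification used in the test merging can be pictured with a small sketch. It is only an illustration under stated assumptions, not the project's actual program: the function names, the tuple layout of a relationship and the relationship codes are hypothetical, and a real merge would also have to normalise term forms and relationship direction.

```python
from collections import Counter


def classify_terms(eum_terms, ingrid_terms):
    """Split two term lists into overlapping and single-source groups."""
    eum, ingrid = set(eum_terms), set(ingrid_terms)
    return {
        "overlapping": eum & ingrid,
        "eum_only": eum - ingrid,
        "ingrid_only": ingrid - eum,
    }


def classify_relationships(eum_rels, ingrid_rels):
    """Count overlapping, single-source and conflicting relationships.

    Each relationship is a tuple (term_a, kind, term_b) with kind in
    {"BT", "NT", "RT"}. Assumption: both inputs state a pair in the same
    direction and at most once.
    """
    eum = {(a, b): kind for a, kind, b in eum_rels}
    ingrid = {(a, b): kind for a, kind, b in ingrid_rels}
    counts = Counter()
    for pair in eum.keys() | ingrid.keys():
        if pair in eum and pair in ingrid:
            # Same pair in both thesauri: identical kind overlaps,
            # different kind (e.g. BT versus RT) is a conflict.
            same = eum[pair] == ingrid[pair]
            counts["overlapping" if same else "conflicting"] += 1
        else:
            counts["only_in_one"] += 1
    return counts


def share(part, total):
    """Percentage of the total, rounded to a whole number."""
    return round(100 * part / total)
```

Applied to the counts reported in Table 1, `share(4446, 13767)` reproduces the 32% term overlap for the letters A, K and T.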


Table 1. Results of the test merging of data

                   AKT               Computer Science
Terms              13 767            1476
  Overlapping        4446   32%       253   17%
  EÜM                6450   47%       943   64%
  INGRID             2871   21%       280   19%
Relationships      50 827            4015
  Overlapping      15 087   30%       609   15%
  Only in one      34 648   68%      3321   83%
  Conflicting        1092    2%        85    2%

The feasibility study showed that:
• the semiautomatic merging of the data of both thesauri is possible;
• the merging must be preceded by the harmonisation of the list of subject fields and extensive manual correction of the subject field specifiers of terms;
• the merged data compilation contains about 3500 logical mistakes;
• the editing cannot be confined to the correction of logical mistakes; it should also involve the merging of synonymous terms;
• the editing required amounts to approximately 2400 work hours, plus the time needed for programming and computing.

4. Main Project

The time initially planned for software design, data merging and human editing was one year. Accordingly, use of the new controlled vocabulary was scheduled to start at the beginning of 2009. However, the development of the project was not as smooth as desired. The preparation of the documentation and the thesauri, data merging and the realization of the web design of the new thesaurus were more time-consuming than expected. On the other hand, the creation of editing software and the editing process itself followed the initial timescale. The new thesaurus, called the Estonian Subject Thesaurus (Eesti märksõnastik, EMS), was opened for public use on 14 May 2009.

The project group involved two computer specialists, one of them responsible for data merging, coding and analysis, and the other for preparing the editing software and realizing the final user interface and web design. This enabled them to work in parallel, which was an advantage as their involvement in this project was an addition to their everyday job. The harmonizing of subject fields was completed at the beginning of September 2008. On 8 September the last amendments and corrections were made in the EÜM and the INGRID; after that both were “frozen”. The merging of data and the loading


of the merged data into the database of editing software were carried out from September to December 2008.

Table 2. Comparison of the data elements of term records

Data element                EÜM                                  INGRID                             EMS
Preferred term
  Term                      +                                    +                                  +
  Subject field specifier   repeatable field, numbers and words  repeatable field, words            repeatable field, numbers and words
  English equivalent        repeatable field                     all in one field, separated by ;   repeatable field
  Scope note                +                                    +                                  +
  Editor’s note             +                                                                       +
  UF                        +                                    +                                  +
  BT                        +                                    +                                  +
  NT                        +                                    +                                  +
  RT                        +                                    +                                  +
Nonpreferred term
  Term
  Subject field specifier   repeatable field, numbers and words  repeatable field, words            repeatable field, numbers and words
  English equivalent        repeatable field                     all in one field, separated by ;   repeatable field
  Editor’s note             +                                                                       +
  USE                       +                                    +                                  +

The statistical analysis carried out after the data merging of the two thesauri indicated that the overlap of the terms of both thesauri was actually even less than one-third.

Table 3. The proportional division of the origin of the terms in the merged thesaurus

Merged     12 223     25,84 %
EÜM        24 399     51,59 %
INGRID     10 674     22,57 %
Total      47 296    100 %

5. Editing of the New Merged Thesaurus

The editing team consisted of 8 persons, all of them previously involved in the management of the initial two thesauri. The whole editing team thus had the necessary competence and experience. The editing period lasted for three months, from the beginning of January until the end of March 2009. It was additional work for all editors, and all of them were paid extra for it. Access to the database was possible via a regular browser, which fortunately also enabled them to do the editing at home. At the first stage the workload was divided by subject fields, trying to take into account the knowledge and practical work experience of each editor. At the second


stage the thesaurus was divided alphabetically between the editors and reviewed. A specially compiled editing guide could not solve all problems, and the editors had to make a lot of independent substantive decisions. The software created for editing was based on the software of the EÜM, with several improvements which facilitated editing. The origin and nature of the terms were indicated by colour coding – overlapping terms were black, those occurring only in the INGRID were blue, and the terms occurring only in the EÜM were green. Red denoted the terms with conflicting relationships and the terms acting as preferred terms in one thesaurus and as nonpreferred terms in the other. Lilac designated the terms with a merged scope note. The database presented only the logical mistakes that arose from the merge; the creation of new mistakes was blocked. When a logical mistake was corrected, the term turned from red to black. The editors could merge term records if they considered the corresponding terms to be synonymous. In this case the relationships of the two terms were merged automatically and the other term could be left as a nonpreferred term.
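The colour coding and the merging of synonymous term records described above can be sketched roughly as follows. This is a hedged illustration only: the `TermRecord` class, its field names and both functions are hypothetical, and the precedence of red over lilac is an assumption, since the text does not say which colour wins when both conditions hold.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    term: str
    in_eum: bool = False
    in_ingrid: bool = False
    conflicting: bool = False        # conflicting relationship or status clash
    merged_scope_note: bool = False
    used_for: set = field(default_factory=set)   # nonpreferred terms (UF)
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT


def colour(rec):
    """Return the display colour of a term record in the editing view."""
    if rec.conflicting:
        return "red"      # assumed to take precedence over every other colour
    if rec.merged_scope_note:
        return "lilac"
    if rec.in_eum and rec.in_ingrid:
        return "black"
    return "blue" if rec.in_ingrid else "green"


def merge_as_synonyms(keep, drop):
    """Fold `drop` into `keep`; `drop` survives as a nonpreferred term."""
    keep.used_for |= drop.used_for | {drop.term}
    keep.broader |= drop.broader
    keep.narrower |= drop.narrower
    keep.related |= drop.related
    # A record must not end up related to itself after the merge.
    for links in (keep.used_for, keep.broader, keep.narrower, keep.related):
        links.discard(keep.term)
    keep.in_eum = keep.in_eum or drop.in_eum
    keep.in_ingrid = keep.in_ingrid or drop.in_ingrid
    return keep
```

Merging a record found only in the EÜM with one found only in the INGRID yields a record that `colour` reports as black, which mirrors the way merged terms are described as overlapping.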

6. The Main Characteristics of the Estonian Subject Thesaurus (EMS)

The EMS includes about 36 000 preferred terms and 15 000 nonpreferred terms – altogether 51 000 terms, divided into 48 subject fields. It is a universal controlled vocabulary for indexing and searching Estonian library materials. The database of the thesaurus enables the users:
• to browse subject terms by subject fields;
• to search terms by the beginning or part of a word, or by exact match;
• to search terms by English equivalent;
• to view search results as word lists or as full records;
• to search by every term in the online catalogue ESTER, in the database of Estonian articles ISE or in Google;
• to print, e-mail or save to a file selected word lists or full records;
• to subscribe to a current awareness service for new, changed and deleted subject terms.

The thesaurus is jointly managed by the ELNET Consortium, the National Library of Estonia and the University of Tartu Library. It is used as a standard in the Consortium’s union catalogue ESTER (tallinn.ester.ee; tartu.ester.ee) and in the database of articles ISE (ise.elnet.ee). As the EMS replaces the EÜM, all libraries previously using the EÜM will continue to index with the new EMS. The Estonian Subject Thesaurus is freely accessible on the web (ems.elnet.ee).


7. Conclusion

The project of merging two thesauri can be considered successful. In a relatively short period a shared thesaurus was created which is suitable for indexing and information search. The programmatic merging of the thesauri, followed by the inevitable editing, is the fastest method for creating a shared controlled vocabulary. It is also the best way to preserve the compliance between the authority data and the existing subject indexing, as only a small number of subject terms are changed or deleted. In the current e-environment it is more practical to use shared indexing languages which are universal, multifunctional and flexible than to face the compatibility problems of different indexing languages and the task of achieving interoperability later. This principle is particularly sensible in a small country with a small language community like Estonia.

Session 2 Retrieval in Multilingual, Multicultural Environments

20 Years SWD – German Subject Authority Data Prepared for the Future

Yvonne Jahns

Abstract

The German subject headings authority file – Schlagwortnormdatei (SWD) – provides a terminologically controlled vocabulary, covering all fields of knowledge. The subject headings are determined by the German Rules for the Subject Catalogue. The authority file is produced and updated daily by participating libraries from around Germany, Austria and Switzerland. Over the last twenty years, it has grown into an online-accessible database with about 550,000 headings. They are linked to other thesauri, to French and English equivalents and to notations of the Dewey Decimal Classification (DDC). Thus, SWD allows multilingual access and searching in dispersed, heterogeneously indexed catalogues. The vocabulary is used not only for cataloguing library materials, but also for web resources and objects in archives and museums.

1. Introduction

As subject indexers we look mostly into the future: how can we enhance our indexing languages, how can we respond to requests for information when knowledge itself is constantly changing, how can we handle millions of new publications to make them quickly available to information seekers, how can we improve our indexing rules, finding automatic methods to save human resources, how can we find solutions for the semantic web, etc. “Librarians at the forefront of organizing knowledge” – the last IFLA indexing and classification section’s satellite meeting on subject retrieval, held in Dublin (Ohio) in 2001, was conceived in this light (McIlwaine et al. 2003). Looking at the past and surveying the last 20 years, we realize that we are well prepared for the future. German librarians have created a rich indexing language. This vocabulary is called Schlagwortnormdatei (SWD), meaning nothing other than “subject headings authority file”.

Controlled vocabularies are essential for information retrieval tools because of the complexity and the ambiguity of natural language through which information or knowledge is expressed. Seeking subject access is a daily phenomenon for scientists, students, every user of search engines, online catalogues and web directories, and even for people doing translations or creating instruction manuals. All of them need a special combination of linguistic and content information. In order to provide people with the topics they want, we describe these with controlled subject headings.


For a better understanding of such artificial languages and, of course, to encourage international cooperation and data exchange, the principles underlying our subject heading language are described below.

2. History and Partners of the SWD

Our SWD authority file is a terminologically controlled vocabulary, although we often refer to it as a thesaurus. Whether SWD is, strictly speaking, a thesaurus depends on the definition of thesauri in ISO 2788. The word "thesaurus" is derived from 16th-century Latin, coming from ancient Greek thesauros, meaning a collection of things which are of high importance or value (Wikipedia 2010). Since such definitions are always a controversial issue (our vocabulary lacks certain hierarchical conceptual relationships), the term thesaurus will be avoided here. Undoubtedly SWD is a kind of knowledge repository.

SWD started as a list of about 120,000 standardised subject headings from some libraries – the German National Library (DNB) and a few Bavarian libraries – in the 1980s, resulting from their subject (card) cataloguing practice. At the same time librarians approved the first German Rules for the Subject Catalogue (abbreviated in German as RSWK). There was a long tradition of verbal indexing in German libraries, but no common rules until this time. The beginning of IT in libraries in the 1970s allowed data exchange and required harmonized rules and, of course, authority files. Lists were no longer efficient. In 1986 DNB started cataloguing with RSWK and SWD, and others followed. Since then, SWD subject headings have been determined by these subject indexing rules, which guarantee consistency of indexing. During this time, efforts in developing subject heading languages were made all over Europe. Interoperability projects and mergers of existing vocabularies were on the agenda (Landry 2008, 249). Today our rules are applied by a vast number of academic and public libraries, many of them participating in producing and maintaining subject headings. The German National Library cooperates with all major libraries and also library networks or service centres in Germany, Austria and Switzerland, among them special institutions like art libraries or church libraries. In fact, the participants and users come not only from the library sector, but also from museums, archives, broadcast stations and other institutions from the media sector. The subject cataloguing rules are now widely accepted and the authority file is a harmonized tool and information source for subject retrieval. At the same time, our rules have come under criticism for not being web-oriented enough. We consider these arguments carefully.


3. The Database

3.1. Content and Structure of Authority Records

SWD covers all fields of knowledge; it is a universal vocabulary, very similar to LCSH or RAMEAU. There are now 550,000 headings and even more references, providing efficient information retrieval. The file consists of different categories; the main entities are: geographical names, topical headings (concepts and objects), terms for time and form, names of institutions (corporate bodies), titles of works. (Names of persons are stored in an extra authority file, PND, used also for cataloguing authors’ names.)

Table 1. Content of the SWD

SWD 2009: 550,000 headings
160,000    topical headings (concepts), also events, time periods and forms
190,000    geographical names
130,000    names of corporate bodies
 70,000    titles of works

As the table shows, the greater part of the terms are individual terms – names of places, events or titles of works and so on. The distribution of the categories of headings has been changing from 1986 until today; the number of topical terms now increases more gradually. In other words, SWD has a hybrid nature: on the one hand there is a list of hierarchically structured terms as semantic units for conveying concepts, and on the other hand a name authority file. Subject headings can be single words or complex phrases, usually nouns, since nouns are the most concrete part of speech. Commonly known designations are preferred, usually names and terms in German. Encyclopaedias, dictionaries or other thesauri have to be consulted before creating a heading to ascertain the most commonly used name or term. This source is given as a reference. Unlike a dictionary, an authority record does not define words; definitions can be found in the referenced sources. An example of an authority record is shown in figure 1.


Figure 1. Example of an authority record.

What kinds of references and relations do we use? These are the well-known categories of thesauri, divided into three types:
• equivalent,
• hierarchical (broader/narrower),
• associative relations.

Synonyms, alternative spellings, variant forms of names, former names, broader and narrower terms, related terms, etc. are indicated as you can see in the example above. We of course have a close look at the IFLA-Functional Requirements for Authority and Subject Authority Data and check how we can upgrade our vocabulary to guarantee international data exchange. Besides the main entities we distinguish between 40 sub-entities, indicated by a special code, e.g. the letters snz indicating the sub-entity for “biological or chemical nomenclature” or sip meaning the sub-entity “authority names for individual product names and brand marks”. Thus we can cluster the vocabulary.


Authority records are additionally structured by ISO country codes, ISO language codes and subject groups. This systematic structure allows us to get partial views of the vocabulary. Subject headings are highly specific, in accordance with a major rule of RSWK which prescribes always choosing the narrowest term that still covers the subject. Compared to other well-known thesauri like MeSH or AGROVOC, SWD is well developed and meets the needs of specialised institutions despite being a universal vocabulary. The vocabulary is highly scientific, too. The conceptual dynamics of all disciplines are represented. Most of the topical headings belong to the disciplines of technology, biology, chemistry, medicine and economics, reflecting the enormous literary warrant in these fields compared, for instance, to fields such as library science or handicrafts. Yet we also have terms like knitting, papier-mâché or origami, since German librarians index all kinds of documents with SWD and the German National Library collects and catalogues a huge number of non-scientific publications.

3.2. Maintenance Procedure and Construction Principles

Our vocabulary is not at all static. New publications need new terms to describe contents adequately. The SWD database is updated daily. Besides the construction of new headings, existing terms have to be maintained. Habitual language use has to be monitored. Constant updates represent the current development of research and science. We follow innovative terms and language usage just as the editorial staff of an encyclopaedia watches the latest news. In my opinion, it is important that headings are created not only by people who are responsible for developing the vocabulary, but also by indexers, who apply the terms. These are the major construction principles of our vocabulary:1
• Terminology is controlled. Uniform headings are built with synonyms. Sometimes definitions are formulated. Each term is placed in its context, allowing a user to distinguish between homonyms like "bureau" - the office and "bureau" - the furniture.
• Each heading is stored in an authority record with a (permanent and unique) ID number.
• Terms follow the literary warrant.
• Terms are audience (user) oriented.
• Complex and compound subjects are expressed by linking component parts of subject headings via syntagmatic relationships (syntax principle).

The German National Library hosts a central authority database; maintenance is cooperative. All participants are able to send data (different editorial levels are set up, depending on the editorial experience of the participant and the completeness of the authority record regarding the cataloguing rules) and communicate these online.

1 More details have already been published in the country report of the seminal IFLA study done by the Classification and Indexing Section in the 1990s (Lopes and Beall 1999).


The controlled vocabulary is not created by two or three editors on behalf of other cataloguers; dozens of indexers and editors in Germany, Switzerland and Austria create new records in the disciplines they are familiar with, reflecting the development of current research. In what respect is this different from the Wikipedia principle? In fact, it is much the same procedure, generating synergy among many experts. Only a few editors make final revisions. In case there are discrepancies, all editors can send an email to the others, with a direct link to the authority record itself. A group of about ten editors from all participating consortia meets twice a year for personal discussion.

Figure 2. SWD editorial procedure.

One of our future plans is to be more open to new heading entries generated by users themselves. If they fail in subject retrieval, they can send the editorial staff a proposal for a term to be entered. We also learn from social tagging systems which terms users are searching for and which kinds of search behaviour exist. Furthermore, linguistic components should be integrated into our catalogues or search engines to avoid “no results” due to the use of a plural search term; German plurals are quite often non-trivial modifications of the noun (Kindergarten / Kindergärten). Integrating such a dictionary feature directly into the authority file would save retrieval systems from implementing such features individually. If we do so, we will also have more success using subject headings in automatic indexing processes.
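A minimal sketch of such a dictionary feature might look as follows, assuming that inflected forms are stored alongside each heading in the authority file. The record layout and the function names are hypothetical; only the Kindergarten / Kindergärten pair comes from the text.

```python
import unicodedata


def normalise(word):
    """Case-insensitive, Unicode-normalised form used as the lookup key."""
    return unicodedata.normalize("NFC", word).casefold()


def build_index(records):
    """Map every stored form (heading or variant) to its preferred heading.

    `records` maps a preferred heading to an iterable of variant forms,
    e.g. plurals recorded in the authority file (an assumed layout).
    """
    index = {}
    for heading, variants in records.items():
        for form in (heading, *variants):
            index[normalise(form)] = heading
    return index


def lookup(index, query):
    """Return the preferred heading for a search term, or None if unknown."""
    return index.get(normalise(query))


# Hypothetical authority data with the plural stored as a variant form.
index = build_index({"Kindergarten": ["Kindergärten"]})
```

With this index, a search for the plural resolves to the singular heading, so the catalogue does not return "no results" merely because of the word form.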

3.3. Data Formats

The standard communication formats used for storage and exchange of subject headings are the same as for bibliographic records:
• MAB2, the library exchange format in Germany.
• MABxml, using an XML structure, suitable for XML-based technologies like OAI, SRW/U and XSLT.
• Pica+, basically as an internal format in some libraries using OCLC Pica library systems.
• MARC21 authority2

Converting subject headings to appropriate semantic web formats like SKOS will be a great challenge. Expressing the structure and content of our heading language with RDF Schema or OWL allows the construction of machine-processable statements (Svensson 2008, 241). Within the European project TELplus (http://www.theeuropeanlibrary.org/telplus/) a first RDF version of the SWD was built in 2008.
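To give an impression of what a SKOS rendering involves, the following sketch serialises one heading as Turtle using standard SKOS properties. It is an illustration under assumptions, not the TELplus conversion: the record layout, the identifiers, the labels and the base URI `http://example.org/swd/` are all hypothetical.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n\n"


def turtle_literal(text, lang):
    """Quote a label as a language-tagged Turtle string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def to_skos_turtle(record, base="http://example.org/swd/"):
    """Serialise one authority record (a plain dict) as a skos:Concept."""
    lines = [f"<{base}{record['id']}> a skos:Concept"]
    lines.append(f"    skos:prefLabel {turtle_literal(record['pref'], 'de')}")
    for alt in record.get("alt", []):
        lines.append(f"    skos:altLabel {turtle_literal(alt, 'de')}")
    relations = (
        ("broader", "skos:broader"),
        ("narrower", "skos:narrower"),
        ("related", "skos:related"),
    )
    for key, prop in relations:
        for target in record.get(key, []):
            lines.append(f"    {prop} <{base}{target}>")
    # Predicate-object pairs are separated by ';' and the block ends with '.'
    return SKOS_PREFIX + " ;\n".join(lines) + " .\n"


# Hypothetical record; identifiers and labels are invented for illustration.
example = {"id": "c1", "pref": "Kindergarten", "alt": ["Kita"], "broader": ["c2"]}
```

Calling `to_skos_turtle(example)` produces a small Turtle document in which equivalence references become `skos:altLabel` and hierarchical references become `skos:broader`; SWD features such as sub-entity codes have no direct counterpart in this sketch.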

3.4. Access Tools

SWD is available via a Z39.50 gateway and also on DVD-ROM along with the Name Authority File (PND) and the Authority File of Corporate Bodies (GKD), and furthermore on other storage media via the bibliographic services of DNB. Headings are of course searchable via the DNB online catalogue, https://portal.dnb.de/. There are plans to make our subject terminology available as a public domain data set under a Creative Commons license, http://creativecommons.org/licenses/by-sa/3.0/deed.de. DNB also offers a web tool with a feature for comfortably navigating related terms, http://melvil.d-nb.de/swd.

2 For SWD-MARC authority see Deutsche Nationalbibliothek (2008).


Figure 3. An example from DNB web tool for navigating related terms.

Other library networks offer similar web access, e.g. the German South-West Library Network offers SWD in a structured display, freely accessible at http://swb.bszbw.de/DB=2.104/


Figure 4. An example of web access to SWD by German South-West Library Network.

4. Use and Applications

There are four main traditional applications of SWD:
• having a source vocabulary for subject indexing,
• ensuring consistency of cataloguing among various libraries and library networks,
• having topical search keys (access points),
• being a navigation tool for information retrieval.

The first and second aims have already been discussed. We now come to our indexing practice and the access points it provides. Adding subject headings one by one to a string means building a kind of very short abstract of the content of a publication. The German National Library does this for nearly all new titles of the publishers’ book trade, following the rules of syntactic indexing of RSWK. In contrast to Anglo-American cataloguing practice, German bibliographic records do not contain subject headings as text strings, but links to authority records via ID numbers. If an authority record changes, title records are updated automatically. The linking mechanism also allows synonyms to be searched comfortably. Furthermore, German subject headings are not as rigorously pre-combined as in LCSH, so terms can be freely combined. SWD is a multi-faceted, post-coordinated vocabulary. Searching such single-concept terms is very effective. SWD contains only the source vocabulary. Cataloguers can select terms, combine them and form heading strings as they wish.
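The effect of linking title records to authority records by ID can be illustrated with a small sketch. The data layout, the IDs and the titles are invented for illustration and do not reproduce any real record format.

```python
# Hypothetical authority file: ID -> preferred heading plus variant forms.
authority = {
    "A1": {"pref": "Kindergarten", "variants": ["Kindergärten"]},
    "A2": {"pref": "Origami", "variants": []},
}

# Title records hold only authority IDs, never the heading text itself.
titles = [
    {"title": "Title one", "subjects": ["A1"]},
    {"title": "Title two", "subjects": ["A1", "A2"]},
]


def display_subjects(title, authority):
    """Resolve the linked IDs to the current preferred headings."""
    return [authority[i]["pref"] for i in title["subjects"]]


def search(query, titles, authority):
    """Find titles whose linked headings match the query or a variant form."""
    wanted = query.casefold()
    ids = {
        ident
        for ident, rec in authority.items()
        if wanted in {form.casefold() for form in (rec["pref"], *rec["variants"])}
    }
    return [t["title"] for t in titles if ids & set(t["subjects"])]
```

Because the heading text lives only in the authority record, changing `authority["A2"]["pref"]` changes what `display_subjects` shows for every linked title without touching the title records, and a search for a variant form finds the same titles as a search for the preferred heading.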


Figure 5. A German bibliographic record.

Single subject headings or single words are searchable, as are their synonyms. A few German library online catalogues also allow searching and browsing complete subject heading phrases. We will now have a short look at the last goal – to navigate in online catalogues via subject headings. As we saw at the beginning, the main goal of our indexing work is to help information seekers find relevant information. Most library catalogues today integrate subject headings into the basic search index in order to enhance the search vocabulary. The system automatically searches for all synonyms of the user’s search term to find more hits. Beyond this, only a few online catalogues offer navigation via term structures, which means clicking on word nets or clouds, asking for associative terms (and documents).3 This is a great challenge for the design of user interfaces. Libraries should agree on standards in this area as well.

Investments in subject authorities should not remain an end in themselves. User orientation should be maximized, and we should look for new applications of authority files. There is, for example, a whole market in the field of business knowledge organization. Generally, authority records, particularly name authorities, could be applied beyond traditional cataloguing as practised over the last twenty years. New databases, digitization, open access to data networks and all the possibilities of the web have produced exponentially growing amounts of data. This volume and growth of digital data make subject access, retrieval precision and selection more necessary than ever before. As demonstrated by several projects, authorities offer the right infrastructure. They allow linking information and resources, using the same documentation vocabulary and the same identifier for names and topics. This is the advantage of our documentary languages over the indexing realised in today’s search engines. Networked authority data could be much more profitable if the vocabularies provided worldwide were mapped to one another.

3 Queens Library does it with word clouds showing related terms, spelling variations and translations, http://aqua.queenslibrary.org/

5. Mappings and Links

Most libraries have resources indexed by different vocabularies and classification schemes. Therefore query links for searching heterogeneously indexed collections are needed. High-quality (precise) results of these crosswalks are based on (manual or automatic) mappings between vocabularies. Much research has been done in recent years in this field. Web services or so-called terminology services provide data to facilitate subject interoperability.4 The vision of a virtual international authority file (VIAF) was born in the VIAF project of OCLC, Library of Congress, Bibliothèque nationale de France and DNB, started in 2003 (OCLC 2009). A first prototype was established with names of persons, but VIAF now also includes geographic entities. This extension covers various names of both states and natural geographic units such as rivers, mountains etc. (Hengel 2008, 269).

5.1. Crosswalks to Other Thesauri

German librarians and documentalists were very creative in building subject heading languages (Kunz 2002). Often invented only for local application, they wanted to meet special user needs, resulting in different controlled vocabularies for different disciplines. That is why mapping projects started to support cross searching, mushrooming simultaneously with all the virtual libraries which have been founded in the last ten years.5 SWD was part of several mapping initiatives, the last one being the project KoMoHe (GESIS 2010). There cross-concordances were produced by manually creating links that determine equivalence, hierarchy, and associative relations between terms from SWD and other controlled vocabulary. Favoured expressions of any indexing language can be related to those of another either by 1:1 equivalence or 1:many equivalence, which includes “AND” or “OR” relationships.

SHL 1 soziale Entwicklung = SHL 2 Gesellschaft “AND” Entwicklung

Vocabularies were analyzed in terms of topical and syntactical overlap before the mapping started. All mappings were created by researchers or terminology experts. Essential for a successful mapping is an understanding of the meaning and semantics of the terms and the internal relations within the vocabularies of concern. The vocabularies mapped to SWD or those SWD was mapped to, are the following:

• AGROVOC Thesaurus (AGROVOC): A multilingual vocabulary in the agricultural domain (39,000 terms).
• INFODATA Thesaurus (INFODATA): A vocabulary in the information science domain (1,000 terms).
• Medical Subject Headings (MeSH): A vocabulary in the medicine domain (23,000 terms).
• Psyndex Terms (PSYNDEX): A vocabulary in the psychological domain (5,400 terms).
• Standard Thesaurus Wirtschaft (STW): A vocabulary in the economics domain (5,700 terms).
• Thesaurus Bildung (BILDUNG): A vocabulary in the pedagogic domain (50,000 terms).
• Thesaurus Sozialwissenschaften (THESOZ): A vocabulary in the social science domain (7,700 terms).

4 See Nicholson (2008) for a good overview.
5 The German Ministry for Education and Research alone funded 64 crosswalks with more than 500,000 established relations.
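How a cross-concordance of this kind might drive query translation can be sketched as follows. This is an illustration under assumptions, not the KoMoHe implementation: the `Mapping` structure, the relation labels and the `translate` function are hypothetical; the one data row is the example given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One cross-concordance link from a source term to target terms."""
    source_term: str
    relation: str          # assumed labels: "equivalence", "hierarchy", "association"
    targets: tuple         # one target (1:1) or several (1:many)
    operator: str = "AND"  # how several targets combine: "AND" or "OR"


def translate(term, mappings, relations=("equivalence",)):
    """Rewrite a query term into the target vocabulary as a Boolean string."""
    parts = []
    for m in mappings:
        if m.source_term == term and m.relation in relations:
            joined = f" {m.operator} ".join(m.targets)
            parts.append(f"({joined})" if len(m.targets) > 1 else joined)
    # Unmapped terms are passed through unchanged.
    return " OR ".join(parts) if parts else term


# The 1:many example from the text: SHL 1 -> SHL 2.
mappings = [
    Mapping("soziale Entwicklung", "equivalence", ("Gesellschaft", "Entwicklung"), "AND"),
]
```

Restricting `relations` to equivalence keeps the translated query precise; admitting hierarchical or associative links would broaden recall at the cost of precision, which is the trade-off the evaluation studies mentioned below address.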

Although the usefulness of terminology mappings is generally acknowledged, the actual effectiveness is rarely evaluated. In an information portal with many different databases, the question becomes crucial whether cross-concordances can enable a distributed search. A few studies have shown that they work successfully in bridging language gaps (Mayr and Petras 2008). DNB wants to use all these existing mappings to enhance SWD search vocabulary, to provide seamless searches across SWD-indexed catalogues and other databases in the future. One benefit of such mappings that should not be underestimated is quality control, resulting in systematic inspection of the vocabulary.

5.2. Multilingual Linkage (LCSH, Rameau)

As we have seen, the corresponding English and French authority files are the Library of Congress Subject Headings (LCSH) and the Répertoire d’autorité-matière encyclopédique et alphabétique unifié (RAMEAU), respectively. To overcome linguistic barriers and to finally provide a truly multilingual search, a project called MACS (Multilingual Access to Subjects) was started in 1998, initiated by CENL. A working group of four libraries was set up, the British Library, the Bibliothèque Nationale de France, DNB and the National Library of Switzerland. Equivalence links are established between the three subject headings languages, LCSH, RAMEAU and SWD. The first study took a subset of headings from the three authority files in the fields of Sports and Theatre (Clavel-Merrin 1999). Different mapping methods were explored and the prototype of a linking tool was developed. This database called LMI (Link Management Interface) still exists. It is a web application, independent of any partner library system.

20 Years SWD

Table 2. An example of equivalence links among LCSH, RAMEAU and SWD

LCSH                  RAMEAU              SWD
Women                 Femmes              Frau
Women – Employment    Femmes – Travail    Frau “AND” Berufstätigkeit
Women authors         Femmes écrivains    Schriftstellerin

Because all the tests were successful, the MACS organization was refined and linking mechanisms were improved. In 2009 the Swiss and German National Libraries will finish the linking work, which is done manually by subject experts, for the 50,000 most-used headings. Then, we will finally be able to share our resources with other libraries, using LCSH, RAMEAU or SWD and moreover to improve access to our collections in the language of the user’s choice. Furthermore, users will be able to search several catalogues simultaneously. Queries will be translated into the other subject headings. This is particularly important for European search environments like Europeana or TEL (http://search.theeuropeanlibrary.org/portal/de/index.html).
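The query translation described here can be pictured with the three link clusters of Table 2. The sketch below is only an illustration: the data structure and the function name are assumptions and do not describe the LMI database.

```python
# Each cluster links equivalent headings; a heading is a tuple of terms,
# so that a combined SWD heading can be expressed with "AND".
LINK_CLUSTERS = [
    {"LCSH": ("Women",), "RAMEAU": ("Femmes",), "SWD": ("Frau",)},
    {
        "LCSH": ("Women – Employment",),
        "RAMEAU": ("Femmes – Travail",),
        "SWD": ("Frau", "Berufstätigkeit"),
    },
    {"LCSH": ("Women authors",), "RAMEAU": ("Femmes écrivains",), "SWD": ("Schriftstellerin",)},
]


def translate_heading(heading, source, target, clusters=LINK_CLUSTERS):
    """Return the linked heading in the target language, or None if unlinked."""
    for cluster in clusters:
        if cluster[source] == (heading,):
            return " AND ".join(cluster[target])
    return None
```

With these clusters, an LCSH query for "Women authors" becomes the SWD query "Schriftstellerin", and the second LCSH heading becomes "Frau AND Berufstätigkeit". Looking up a combined SWD heading as the source is deliberately left out of this sketch.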

5.3. Linking Subject Headings with DDC

Mapping subject headings to notations of the DDC is the aim of the CrissCross project, a cooperation of the German National Library and the Cologne University of Applied Sciences (http://linux2.fbi.fh-koeln.de/crisscross/index.html). In doing so, it faces the challenge of finding a crosswalk between two different knowledge organization systems. SWD works as initial vocabulary, DDC is the target language. Appropriate Dewey numbers are added directly to the SWD records.

Table 3. DDC notation in a SWD record

descriptor       Kürbis
synonyms         Cucurbita pepo
                 Gartenkürbis
DDC notations    583.63 (Botanik - Cucurbitales)
                 635.62 (Gartengemüse - Pâtissonkürbisse und Kürbisse)
                 641.3562 (Lebensmittel – Zucchini,…)
                 641.6562 (Kochen mit Kürbissen)

Mappings are carried out as specifically as possible. Due to the polysemy of many subject headings, a term can be assigned to several Dewey classes. The degree of correspondence or determination between a term and a class is specified by a number from 4 to 1. This measure of determinacy is to be used for ranking mechanisms in future retrieval. The project is underway and nearly 70,000 subject headings have been mapped to the classification so far. With this, we have an additional structural element within the SWD as well as an enhanced access vocabulary for the DDC. Both can be used in various retrieval scenarios.
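A ranking by determinacy could look like the following sketch. The notations are those of Table 3, but the determinacy values attached to them are invented for illustration, and the reading that 4 is the strongest correspondence is an assumption; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DdcLink:
    """A link from one SWD heading to one DDC class."""
    notation: str
    determinacy: int  # 4 to 1; 4 is assumed here to be the strongest


def ranked(links):
    """Order links by descending determinacy, then by notation."""
    return sorted(links, key=lambda link: (-link.determinacy, link.notation))


# Notations from Table 3 (Kürbis); the determinacy values are hypothetical.
kuerbis_links = [
    DdcLink("641.6562", 2),
    DdcLink("583.63", 4),
    DdcLink("641.3562", 2),
    DdcLink("635.62", 3),
]
```

A retrieval system could use such an ordering to show documents classed under the most strongly determined notation first.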

5.4. Cooperation with Online-Encyclopaedias

Broadly accepted authority files can expand their area of application by linking to and from other online sources. We started referencing name authorities to Wikipedia articles and we want to go on with this for topical terms. Thus, indexers and users can look up definitions of terms and Wikipedia users receive a list of related publications (via our library catalogue). For the same purpose DNB started cooperating with Germany’s most important encyclopaedia and SWD framework Brockhaus (http://www.brockhaus.de/), but the publisher went out of business and its online encyclopaedia will not be maintained. We have to face a general crisis of printed encyclopaedias, raising challenges also for future authority work.

6. Future Plans - A Comprehensive German Authority File for Persons, Institutions and Subjects: GND

Besides SWD, two other major German authority files exist, one for names of persons, Personennormdatei (PND), and one for names of corporate bodies, Gemeinsame Körperschaftsdatei (GKD). The first one is used for alphabetic and subject cataloguing, that is, for describing whether a publication is by or about a person. The second one is mostly used for formal descriptions, in case a publication is published by an institution. GKD was the first authority file project, going back to the 1970s, when computerized cataloguing and also RAK (the German rules for alphabetic cataloguing) started. PND and GKD follow those RAK rules; therefore their rules for names differ from those of the SWD. Turning to RDA and a closer cooperation of alphabetic and subject cataloguing, our aim is to bring together heading construction rules and merge our authority files. This merged file will be called Gemeinsame Normdatei (GND), a general, comprehensive authority file, and will hopefully be completed in 2011. The current step is creating or adopting a data format closely related to MARC21 authority.

7. Closing Remarks

Organizing knowledge retrieval in a controlled way is a multidisciplinary task requiring more than building and applying a controlled vocabulary. It is an exciting task for information professionals, even more so in today’s web environment, where people can look up information from any place in the world. Overcoming language barriers will be one of the most important tasks, especially for our European initiatives like the search engine Europeana or gateways like TEL. Authority data can be used today either in standardised library cataloguing or indexing databases or websites. Libraries should realise the new potential of linking multimedia web resources of different providers, bringing together books, films and other documentations.

In spite of its advantages, authority work remains expensive and coordination-intensive, but there is no doubt that the benefits outweigh the effort.

References

Clavel-Merrin, Genevieve. 1999. “The Need for Co-operation in Creating and Maintaining Multilingual Subject Authority Files.” Paper presented at the World Library and Information Congress (65th IFLA General Conference and Council), 20-28 August 1999, Bangkok. Available at http://archive.ifla.org/IV/ifla65/papers/080-155e.htm

Deutsche Nationalbibliothek. 2008. “Konkordanz MAB2 – MARC 21. Teil 2: Konkordanz MABPND, -GKD, -SWD – MARC Authority.” Available at http://www.d-nb.de/standardisierung/pdf/konkordanz_2.pdf

GESIS - Leibniz-Institut für Sozialwissenschaften. 2010. “Kompetenzzentrum Modellbildung und Heterogenitätsbehandlung (KoMoHe).” Available at http://www.gesis.org/forschung-lehre/drittmittelprojekte/projektuebersicht-drittmittel/komohe/ Last updated December 11, 2010, accessed January 15, 2011

Hengel, Christel. 2008. “The Virtual International Authority File (VIAF).” In New Perspectives on Subject Indexing and Classification: Essays in Honour of Magda Heiner-Freiling. Leipzig, Frankfurt am Main, Berlin: Deutsche Nationalbibliothek.

Kunz, Martin. 2002. “Subject Retrieval in Distributed Resources: a Short Review of Recent Developments.” Paper presented at the World Library and Information Congress (68th IFLA General Conference and Council), 18-24 August 2002, Glasgow. Available at http://archive.ifla.org/IV/ifla68/papers/007-122e.pdf

Landry, Patrice. 2008. “The Evolution of Subject Heading Languages in Europe and their Impact on Subject Access Interoperability.” In New Perspectives on Subject Indexing and Classification: Essays in Honour of Magda Heiner-Freiling. Leipzig, Frankfurt am Main, Berlin: Deutsche Nationalbibliothek.

Lopes, Maria Inês, and Julianne Beall, eds. 1999. Principles Underlying Subject Heading Languages (SHLs). München: Saur.

Mayr, Philip, and Vivian Petras. 2008. “Cross-Concordances: Terminology Mapping and its Effectiveness for Information Retrieval.” Paper presented at the World Library and Information Congress (74th IFLA General Conference and Council), 10-14 August 2008, Québec. Available at http://archive.ifla.org/IV/ifla74/papers/129-Mayr_Petras-en.pdf

McIlwaine, Ia. C., et al. eds. 2003. Subject Retrieval in a Networked Environment: Proceedings of the IFLA Satellite Meeting Held in Dublin, OH, 14 - 16 August 2001 and Sponsored by the IFLA Classification and Indexing Section, the IFLA Information Technology Section and OCLC. München: Saur.

Nicholson, Dennis. 2008. “A Common Research and Development Agenda for Subject Interoperability Services?” Available at http://pro.tsv.fi/stks/signum/200805/4.pdf Accessed November 23, 2010

OCLC (Online Computer Library Center). 2009. “VIAF (The Virtual International Authority File).” Available at http://www.oclc.org/research/activities/viaf/default.htm

Svensson, Lars G. 2008. “Unified Access: A Semantic Web Based Model for Multilingual Navigation in Heterogeneous Data Sources.” In New Perspectives on Subject Indexing and Classification: Essays in Honour of Magda Heiner-Freiling. Leipzig, Frankfurt am Main, Berlin: Deutsche Nationalbibliothek.

Wikipedia: The Free Encyclopedia. 2010. s.v. “Thesaurus”, San Francisco: Wikimedia Foundation. Available at http://en.wikipedia.org/wiki/Thesaurus Accessed November 23

Mixed Translations of the DDC: Design, Usability, and Implications for Knowledge Organization in Multilingual Environments1

Joan S. Mitchell, Ingebjørg Rype, Magdalena Svanberg

Abstract

This paper reports on an ongoing investigation of mixed translation models for the Dewey Decimal Classification (DDC) system to support classification and access. A mixed translation uses DDC classes in the vernacular to form the basic framework of the mixed edition; English-language records are ingested directly to complete hierarchies where needed. Separate indexes of available terminology in the vernacular and English are provided. Specific Norwegian and Swedish mixed models are described, along with testing results of the Norwegian model. General implications of mixed translation models for knowledge organization in multilingual environments are considered.

Introduction

This paper reports on an ongoing investigation of mixed translation models for the Dewey Decimal Classification (DDC) system to support classification and access. A mixed translation uses DDC classes in the vernacular to form the basic framework of the mixed edition; English-language records are ingested directly to complete hierarchies where needed. Separate indexes of available terminology in the vernacular and English are provided. A mixed translation could speed the translation process and make the translation easier to maintain. The majority of updates to the DDC occur in classes subordinate to those found in the English-language abridged edition; therefore, it might be easier to keep a mixed translation up-to-date by ingesting English-language records directly at deeper levels. Possible productivity gains in the development/maintenance of a mixed translation must be weighed against its usability as a classifier’s tool and in end-user facing applications.

Investigation of a mixed translation was first suggested as an outcome of a 2006 study by the National Library of Sweden to explore a Swedish translation of the DDC (Svanberg 2006a, 2006b). The study looked at three approaches to translation: a Swedish translation of the abridged edition, a Swedish translation of the full edition, or a Swedish customized abridgment similar to the Norwegian edition of the DDC. The abridged edition was rejected as too brief and the full edition as too detailed. With respect to the third approach, a customized abridgment, concerns were raised related to interoperability and the cost of development and maintenance. A mixed Swedish-English translation arose as a possible solution.

Svanberg’s presentation (2006b) on the Swedish study during the Dewey Translators Meeting at the World Library and Information Congress in 2006 spurred interest on the part of the National Library of Norway to investigate a mixed translation as a possible approach to a new Norwegian edition of the DDC. In late 2006, the authors initiated a joint study to explore models for mixed translations, and to test mixed versions based on those models for usability as a classifier’s tool and in end-user facing applications.

We began our investigation by proposing a basic design for mixed translations, and then developing specific models to address the Norwegian and Swedish contexts (Mitchell, Rype, and Svanberg 2008a). Using the Norwegian mixed model, several mixed Norwegian-English schedules were built and tested with users in Norway. Parallel to this work, Svanberg continued to refine the initial Swedish mixed model.

After a brief description of the basic mixed translation model, the paper reviews the Norwegian mixed model and testing results, followed by a discussion of the current version of the Swedish mixed model. We close with some general observations and questions about the role of mixed translations as knowledge organization tools in multilingual environments.

1 The authors are grateful for the advice of Karen Nisja Domaas and Marianne Troldmyr (National Library of Norway) on Norwegian-English translation issues, and for the technical assistance of Rebecca Green (OCLC) on converting English-language instructions to Norwegian. DDC, Dewey, and Dewey Decimal Classification are registered trademarks of OCLC Online Computer Library Center, Inc.

Basic Design

The current version of the basic model features available DDC data in the vernacular as the framework, updated to match the corresponding classes in the English-language full edition. English-language classes from the current full edition are added to the vernacular framework to complete the hierarchies. In hierarchies where interoperable expansions are available in the vernacular, the vernacular framework will be at a deeper level than its English-language equivalent.2 The auxiliary tables (Tables 1-6) will be translated in full with the exception of the geographic table (Table 2). Table 2 will feature interoperable expansions for geographic areas of interest in the vernacular; the records for some areas not likely to be needed at the level of detail provided in the English-language edition will be ingested directly into the mixed edition without translation (e.g., U.S. counties will not be translated in Table 2 in the Swedish mixed edition). The standard terminology for instructions in a class record will be in the language of the record, e.g., “Inkluderer” for classes in Norwegian, “Including” for classes in English. Separate indexes featuring the terminology available in each language will be included. The Introduction and Glossary will be translated in full and made available in both languages; most of the Manual (with the exception of Manual notes that refer only to classes in English) will be translated, and will also be made available in both languages.

2 See Beall (2003) for a detailed discussion plus examples of interoperable expansions.
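The basic design can be pictured as a simple merge of two sets of class records. The sketch below is a minimal illustration under assumptions: the function names, the language codes and the placeholder captions are hypothetical; the notations are those of the 370.152 example discussed later, and the two instruction labels are the ones quoted above.

```python
# Instruction labels follow the language of the class record.
INSTRUCTION_LABEL = {"no": "Inkluderer", "en": "Including"}


def build_mixed_schedule(vernacular, english, vernacular_lang="no"):
    """Merge class records; the vernacular record wins where one exists."""
    mixed = {}
    for notation in sorted(set(vernacular) | set(english)):
        if notation in vernacular:
            mixed[notation] = (vernacular_lang, vernacular[notation])
        else:
            # English-language record ingested directly to complete the hierarchy.
            mixed[notation] = ("en", english[notation])
    return mixed


def build_indexes(mixed, index_terms):
    """Build one index per language from (language, term, notation) triples."""
    indexes = {}
    for lang, term, notation in index_terms:
        if notation in mixed:
            indexes.setdefault(lang, {}).setdefault(term, []).append(notation)
    return indexes


# Placeholder captions: real captions are not reproduced here.
vernacular = {"370.152": "(Norwegian caption)"}
english = {
    "370.152": "(English caption)",
    "370.1523": "(English caption)",
    "370.1528": "(English caption)",
}
mixed = build_mixed_schedule(vernacular, english)
indexes = build_indexes(mixed, [("no", "(Norwegian index term)", "370.1523")])
```

Note that a vernacular index term may point to a class whose record is in English, which is the situation some pilot study respondents found confusing.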

Norwegian Mixed Model

The basic mixed translation design was customized to meet one Norwegian-specific requirement: the need to continue to provide an abridged edition (or abridgment instructions) based on the level of notation found in the current Norwegian edition of the DDC. DDK 5, the 5th edition of Deweys Desimalklassifikasjon (Dewey 2002), is a customized abridgment of DDC 21 based on the literary warrant in Norwegian libraries, and includes several adaptations to address the Norwegian cultural/political situation. We used the level of notation in DDK 5 as the guide for the vernacular framework of the mixed Norwegian-English version. In each of the sample mixed schedules, we updated the Norwegian classes to match the equivalent classes in DDC 22, and ingested English-language classes to complete the hierarchies. We also imported the existing Norwegian index terms. When indexable topics were dropped from Norwegian-language classes in the mixed edition because they appeared in subordinate English-language classes, we added them to the Norwegian index if not already represented there. We explored a number of different approaches to meet the requirement to provide instructions for abridgment.

Pilot Studies in Norway

For the first pilot study, we built a mixed translation of classes 370-372 in 370 Education. We followed the basic design using an updated version of DDK 5 classes as the notational framework, and accompanied the mixed version with separate Norwegian and English indexes. Figure 1 shows an excerpt from the initial 370-372 mixed schedule. In that version, the abridgment requirement was addressed by using a slash (/) to mark the end of DDK 5-equivalent notation in notes (e.g., classes 370.152/8 and 370.152/3 are abridged to 370.152 in DDK 5).

Figure 1. Mixed Norwegian-English translation of 370 Education (excerpt).
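The slash convention lends itself to a very small helper. The following sketch only illustrates the convention as described in the text; the function names are hypothetical.

```python
def abridge(marked_notation: str) -> str:
    """Return the DDK 5-level number: everything before the slash."""
    return marked_notation.split("/", 1)[0].rstrip(".")


def full_notation(marked_notation: str) -> str:
    """Return the full number with the abridgment mark removed."""
    return marked_notation.replace("/", "", 1)
```

For the example in the text, `abridge("370.152/8")` gives "370.152" and `full_notation("370.152/8")` gives "370.1528"; a number without a slash is returned unchanged by both functions.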

In June 2008, we tested the mixed 370-372 schedule with a group of nineteen Norwegian librarians recruited from a variety of library types. Study participants were asked to classify a set of twenty titles (ten in Norwegian, ten in English) using the 370-372 schedule and Relative Index from DDK 5, the mixed edition, and DDC 22. Participants were asked to complete an online questionnaire probing the usefulness of the mixed translation as a classifier’s tool. Follow-up online interviews using open-ended questions were conducted with participants who completed the survey. A brief summary of the study and key findings follows; a fuller discussion can be found in Mitchell, Rype, and Svanberg (2008b) and Rype and Svanberg (2008).

Twelve of those recruited completed the study; two national library participants answered jointly and were counted as a single respondent for a total of eleven responses. All respondents were current users of DDK 5. Three also used DDC 22 (for one of the university libraries, DDC 22 was the primary tool and DDK 5 the secondary tool). One used WebDewey, two used older English-language editions (DDC 21 and DDC 20, respectively). The study had several limitations: DDK 5 itself was not fully updated to reflect DDC 22; some interim updates to DDK 5 classes were not included in the mixed edition; and only two respondents were from public and county libraries, a key user group of DDK 5.

Survey participants showed openness to using a mixed edition, using DDK 5 as the guide for the level of notation in such an edition, and including Norwegian index terms for English-language classes. There was less interest in having English-language index terms associated with classes in Norwegian.

In follow-up interviews with nine participants (again, two national library participants answered jointly for a total of eight respondents), we were able to probe likes and dislikes more deeply. Respondents liked the Norwegian framework for the mixed version, the addition of more terms to the Norwegian index, and the depth/context provided by having the English-language classes close at hand. Some found the mix of languages confusing, and thought more attention should be paid to the basic design in terms of color, font, etc. While numbers in notes included a slash mark to show abridgment to the DDK 5 level, class numbers in the number column and index did not include abridgment marks. Some found the association of Norwegian index terms with English-language classes confusing. One respondent raised a concern about the mastery of English among Norwegian librarians. Several commented on the need for a more comprehensive Norwegian index, one with more terms and with additional aspects of subjects.

One key concern among respondents was the loss of information in the Norwegian classes in the mixed edition. For example, figure 2 shows class 370.153 as it appears in DDK 5; figure 1 shows the same class in the mixed edition. The DDK 5 version was developed by abridging the contents of the corresponding subdivisions of 370.153 in DDC 21; that abridgment is reflected in the contents of some of the notes under 370.153 in DDK 5. In the mixed version of 370.153 (fig. 1), there is no longer an abridged summary of the class in Norwegian and the subdivisions are explicitly listed in English. Even though most of the terminology from the DDK 5 version of 370.153 still appears in the Norwegian index associated with the mixed edition, many of the terms now point to English-language classes.

Figure 2. Class 370.153 from DDK 5.

Because of the limited participation of public librarians in the original study, a second study was launched in November 2008 with all large public libraries in Norway plus a 10% sample of small and medium public libraries (fifty-six participants in total). An updated version of the 370-372 schedule was prepared that addressed some typographical errors and omissions in the version used in the initial study. Unfortunately, only three libraries responded to the second study, and none completed it. In early 2009, we prepared mixed Norwegian-English versions of two additional schedules, 006 Special computer methods and part of 616 Diseases (616-616.1). The computer science schedule was chosen because it represented a fast-changing area, and the medicine schedule was chosen because it featured a complicated add table for which a special instruction had to be devised to handle different application instructions for abridged users versus full mixed edition users (see fig. 3). The note under notation 0023 and 00284 prefaced by “DDK 5” instructs users of notation at the DDK 5 level to class the topic represented by the notation in the number for the disease (“sykdommen”) without adding the notation.

Figure 3. Add table under 616.1-616.9 in mixed Norwegian-English edition (excerpt).

The original plan was to present the materials to participants in a special workshop on the future of the Norwegian translation scheduled for 4 February 2009 in Oslo. Instead, a general conversation was held with participants in that meeting on the future of the Norwegian edition. The discussion at the workshop centered around two questions: What level is needed for a new translation of Dewey? Is it possible to create a Norwegian edition of Dewey that could be used by all libraries? The workshop participants recommended the development of a full translation in Norwegian, in which abridgment instructions based on the DDK 5 level of notation would be provided for smaller libraries. The reasons behind the recommendation included the importance of Norwegian terminology, and consistency in application to support exchange of classification data. Norwegian terminology is important in order for classifiers to apply the DDC correctly, and as the basis for subject access for librarians and users (there is no national subject heading system in Norway). Participants felt that if all Norwegian libraries were using the same edition of the DDC, it would be easier to maintain consistency in classification.

Following the February workshop, we prepared three new versions of 006 Special computer methods: a mixed version that included abridgment marks for class numbers in the number column and index plus those already in the notes (addressing an earlier criticism by respondents in the original pilot study), and two Norwegian-only abridged versions derived from the mixed edition. Figure 4 shows class 006.33 in DDK 5, and figure 5 shows class 006.33 plus its subdivisions in the mixed version of 006.

Figure 4. Class 006.33 from DDK 5.

Figure 5. Class 006.33 from Norwegian-English mixed edition.

We also explored two approaches to deriving a Norwegian-only abridged edition from the mixed edition. Figure 6 shows an abridged version of class 006.33 that was derived using Norwegian index terms mapped to English-language classes in the mixed edition. The abridged version of class 006.33 in figure 7 was derived using data from classes one level down from the established DDK 5 notational framework according to rules for automatic abridgment under study by Green and Mitchell (2009). The abridgment in figure 7 is a fuller representation of 006.33 than found in DDK 5, but it also required additional translation of topics not selected for inclusion in DDK 5. If a machine-assisted abridgment of a mixed edition requires additional translation in order to produce an abridged edition in the vernacular, that could be a hidden cost in a mixed model for which a vernacular abridgment is an additional requirement.

Figure 6. Abridged class 006.33 derived from Norwegian index terms associated with subordinate classes in Norwegian-English mixed edition.

Figure 7. Abridged class 006.33 derived from subordinate classes in Norwegian-English mixed edition.

Status of the Mixed Model in Norway

In April 2009, the Norwegian Committee on Classification and Indexing (NKKI) recommended to the National Library of Norway that the library proceed with a full translation of the DDC into Norwegian, and accompany the translation with abridgment instructions. The National Library of Norway agreed in principle with NKKI’s recommendation, but has postponed a final decision until a full review of the costs associated with a full translation can be completed. Does this mean that the mixed model does not have a future in Norway? The answer is, probably not as an end, but perhaps as a means to an end. In the last section of this paper, we discuss the use of the mixed model as a way of exposing a translation early in the process to users in areas where English enjoys wide usage.

Swedish Mixed Model

The idea of the mixed translation originally arose in Sweden. In Sweden, the mixed model still seems like a good way to produce a DDC translation within a fixed time limit and with limited resources. Also, the Swedish situation differs from the Norwegian situation: there is no previous edition of Dewey in Swedish, nor is there a requirement to produce a Swedish abridged view of the mixed edition.

The Swedish edition will follow the mixed model described in the beginning of this paper. The initial guide for building the vernacular framework will be the level of specificity found in the classification scheme used by most Swedish libraries, Klassifikationssystem för svenska bibliotek (Viktorsson 2006), generally known as SAB. SAB will guide the initial translation of class records into Swedish. SAB represents literary warrant in Swedish libraries, and provides a level of specificity that has proven to be usable in Sweden. There is a conversion table between DDC and SAB that will form the basis for the decisions on which classes to translate (Gustavsson 2000). The printed conversion table maps classes from DDC 21 to classes in SAB 7. There is also a web version that has been updated to reflect the classes in DDC 22 and SAB 8, the latest SAB edition published in 2006.3 Revising and expanding the conversion table is also part of the Swedish DDC project; further details on the Swedish DDC project are reported by Svanberg (2009).

All classes superordinate to a translated class will be translated, even if certain DDC classes in the upward hierarchy do not have a corresponding class in SAB. The mix of languages in a mixed edition can be confusing, and the language shift in a hierarchy should occur only once. Figure 8 shows an example from DDC class 616.04. In the conversion table, five numbers in 616.04 are mapped to SAB:

Figure 8. DDC numbers in 616.04 in the Conversion table DDC-SAB.

This means that those classes will be translated into Swedish, while the rest remain in English. Class 616.04 will be translated into Swedish since it is a superordinate class to 616.042, 616.043, and 616.047.

[3] Konverteringstabell mellan Dewey och SAB, the web version of the table, is available at http://export.libris.kb.se/DS/. See Svanberg (2008) for more information about mapping DDC and SAB.


Figure 9. Class 616.04 in the Swedish-English mixed edition.

An additional source of entry vocabulary will be the Swedish subject headings, Svenska ämnesord. Svenska ämnesord includes links to equivalent LC subject headings, and LC subject headings have been mapped to many DDC classes. In addition to providing access to vernacular content, the Swedish headings can be a source of additional access to classes that have not been translated into Swedish. The mapping of the Swedish headings to the DDC is described in more detail in Svanberg (2009).

SAB and DDC share many similarities, but there are also significant differences between the two systems. Some DDC classes at a broad level in the hierarchy do not have equivalents in SAB. To avoid a situation where some DDC sections do not even have the top level in the hierarchy translated into Swedish, it was decided to use the level in the English-language abridged edition as a second source to decide what to translate. For example, DDC class 363.4 Controversies related to public morals has no counterpart in SAB, but will still be translated into Swedish, since it appears in the abridged edition.

In order not to spend too much time on choosing what to translate instead of translating, the Swedish team developed a set of simple rules:

1. All DDC classes with equivalents in SAB are translated. The conversion table SAB-DDC is used to decide which classes are equivalents.
2. All superordinate classes of translated classes are translated.
3. All classes from the abridged DDC edition are translated.
4. When there is an obvious need for Swedish terminology, other classes can be translated as well. This should be done restrictively.

Since the translation process has not started yet, the rules have not been tested in practice; modifications or additional rules may be added once the translation is under way. The mixed model has only been tested in pilot studies thus far, and the Swedish version of the mixed model has not been tested at all.
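The four selection rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Swedish team's tooling: the function names are hypothetical, the schedule is a toy fragment, the hierarchy is approximated by truncating the notation (which does not always match the real DDC hierarchy), and the SAB-mapped set contains only the three subordinate numbers named in the text for 616.04.

```python
from typing import Iterable, List, Set


def format_ddc(digits: str) -> str:
    """Turn a bare digit string into DDC notation: pad to three digits, dot after the third."""
    digits = digits.ljust(3, "0")
    return digits if len(digits) == 3 else f"{digits[:3]}.{digits[3:]}"


def superordinates(number: str, schedule: Set[str]) -> List[str]:
    """Ancestors of `number`, found by truncating its notation (a simplifying assumption)."""
    digits = number.replace(".", "")
    ancestors: List[str] = []
    for length in range(len(digits) - 1, 0, -1):
        candidate = format_ddc(digits[:length])
        if candidate != number and candidate in schedule and candidate not in ancestors:
            ancestors.append(candidate)
    return ancestors


def select_for_translation(
    schedule: Set[str],
    sab_mapped: Iterable[str],
    abridged: Iterable[str],
    extra: Iterable[str] = (),
) -> Set[str]:
    """Apply the rules: SAB equivalents (1), abridged classes (3), ad hoc additions (4),
    then every superordinate class of anything already selected (2)."""
    selected = set(sab_mapped) | set(abridged) | set(extra)
    for number in list(selected):
        selected.update(superordinates(number, schedule))
    return selected & schedule


# Toy fragment of the schedule (hypothetical selection of class numbers).
SCHEDULE = {
    "300", "360", "363", "363.4",
    "600", "610", "616", "616.04", "616.042", "616.043", "616.047", "616.8521",
}
SAB_MAPPED = {"616.042", "616.043", "616.047"}  # numbers named in the text
ABRIDGED = {"363.4"}                            # example given in the text

to_translate = select_for_translation(SCHEDULE, SAB_MAPPED, ABRIDGED)
stays_english = SCHEDULE - to_translate
```

In this toy run, 616.04 is selected only because it is superordinate to mapped classes, 363.4 is selected through the abridged edition, and 616.8521 remains in English.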
The Swedish team came to the conclusion that the mixed model is a model that makes it possible for Sweden to reach a DDC translation within limited time and with limited resources. There are, however, important questions to consider: Are there important classes in DDC that will be missed? Are there weaknesses of SAB that might be brought into the Swedish translation of DDC? Here are just a few examples from 616 with more than 50 hits in LIBRIS, the Swedish Union Catalogue, that will not be translated:

616.8521     Traumatic neuroses
616.89142    Behavior therapy (Behavior modification therapy)
616.89156    Family psychotherapy

Even if these classes will not be translated initially, the Swedish team plans to continue the translation work after the Swedish DDC is made available. Using data on frequency in LIBRIS will be one way to decide what to translate once the initial translation is done. Another consideration is if the Swedish team might need to adjust the model to avoid spending time translating unimportant classes that are mapped to the DDC for completeness. Most of the mappings in the conversion table are made on the level of the least specific system, often SAB. However, there are also mappings on a more specific level that are made to handle differences between the systems. Examples of this can be found in DDC class 616.2 Diseases of respiratory organs. This DDC class is mapped to SAB class Vei Respiratory organs. Some of the subdivisions of 616.2 are mapped to other SAB classes. The DDC classes 616.201 Croup, 616.203 Influenza and 616.204 Whooping cough are mapped to SAB class Veba Immunologic diseases. The main equivalent in DDC for Veba is 616.97 Diseases of immune system. The mappings of DDC classes 616.201, 616.203, and 616.204 have been introduced to show that works on croup, influenza, and whooping cough are put in a different class in SAB than general works on diseases of respiratory organs (DDC class 616.2). They do not show an existing level in SAB. It is not yet known how frequently this kind of mapping occurs in the conversion table, or how easy such mappings are to identify for the translator.

Implications for Multilingual Representations

The literature contains discussions of theoretical issues in bilingual and multilingual representations of knowledge organization systems, but mixed translation models of classification systems do not appear to have been explored previously. A mixed translation is not a bilingual edition in the sense of parallel classes in both languages. The mixed model features a vernacular framework in which English-language classes supplement classes in the vernacular, plus separate language-specific indexes that may contain varying levels of terminology. Nonetheless, some literature focused on multilingual thesauri can inform the mixed translation work. In a discussion of multilingual thesauri, Hudon (1997) argues for acceptance of nonidentical and nonsymmetrical structures, and recommends that the number of descriptors in each linguistic version should be permitted to vary. In the mixed translation model, the terminological content of the vernacular index may be shallower in some areas than its English-language equivalent, and deeper in other areas, matching some of the elasticity in the vernacular framework. In Guidelines for multilingual thesauri (IFLA 2009), the building of nonsymmetrical thesauri is noted as an important development in multilingual thesauri construction.

What have we learned to date from building and testing models for mixed translations, and what are the implications of this work for knowledge organization in multilingual environments? We have previously discussed the loss of descriptive content in the Norwegian-English mixed edition when topics from subdivisions of a class in the full edition are moved from notes in the abridged edition to the index. If the Norwegians decide to proceed with a full edition, this will no longer be a consideration.

Developing a DDC edition using a mixed model can serve as a vehicle to expose embedded assumptions in vernacular and English-language versions of the DDC. Work on the mixed Norwegian-English translation made it possible for one of the authors (Mitchell) to view the general DDC framework from within a Norwegian perspective, a different view from that of a mere reviewer of a translation or a developer of the general English-language DDC framework. Editorial work was already under way on improvements to the 370 Education schedule, but the process of creating the mixed version of 370 exposed some deep-seated differences in the levels of education and the primary school curriculum from a Norwegian perspective. For example, in Norway, primary education covers grades one through ten, is equivalent to compulsory education, and has a terminal degree associated with it. In the general DDC framework, primary education covers grades one through six, and compulsory education is handled as a policy issue rather than a cohesive level of education with a terminal degree. We are studying how to improve the general DDC framework to accommodate both views while retaining interoperability.
While building the mixed version of another part of the education schedule, we came to the realization that every language edition of Dewey implicitly defines the class 372.6 Language arts (Communication skills) as “language arts and communication skills in the language of this translation.” Work is currently under way to make the implicit explicit here, and to support interoperability by viewing the class in the context of the source edition.

Perhaps our most important finding is a reconsideration of the mixed translation model as simply the framework for a certain type of language edition. Might the mixed translation be a vehicle for exposing a translation to users at an early point in the translation process? If yes, at what point does one start exposing the framework? We expect to test the point of exposure in the process of developing the Swedish-English mixed edition. Once the working translation is exposed, what role could users play in the translation process? Certainly feedback on terminology is one possible role, but might users also play a more active role in the process? Could a mixed translation serve as a social-networking environment to develop the translation further (and perhaps as a vehicle for crowd-sourcing recommendations for basic improvements to the DDC)? Are there differences in the levels of a mixed translation that can be exposed as a classifier’s tool versus in end-user-facing applications? We plan to continue our investigation of mixed translation models in this broader context.

References

Beall, Julianne. 2003. “Approaches to Expansions: Case Studies from the German and Vietnamese Translations.” Paper presented at the World Library and Information Congress (69th IFLA General Conference and Council), 1-9 August 2003, Berlin. Available at http://archive.ifla.org/IV/ifla69/papers/123e-Beall.pdf

Dewey, Melvil. 2002. Deweys Desimalklassifikasjon. 5th ed. Edited by Isabella Kubosch. Oslo: Nasjonalbiblioteket.

Green, Rebecca, and Joan S. Mitchell. 2009. “Rethinking the Abridged Edition.” EPC Exhibit 13136.1. Paper presented at Meeting 131 of the Decimal Classification Editorial Policy Committee (EPC), Dublin, Ohio, 10-12 June 2009.

Gustavsson, Bodil. 2000. Konverteringstabell mellan Dewey Decimal Classification (21. ed.) och Klassifikationssystem för Svenska Bibliotek (7. uppl.). Lund: Bibliotekstjänst.

Hudon, Michèle. 1997. “Multilingual Thesaurus Construction: Integrating the Views of Different Cultures in One Gateway to Knowledge and Concepts.” Knowledge Organization 24, no. 2: 84-91.

IFLA Section on Classification and Indexing. Working Group on Guidelines for Multilingual Thesauri. 2009. Guidelines for Multilingual Thesauri. IFLA Professional Reports 115. The Hague: International Federation of Library Associations and Institutions.

Konverteringstabell mellan Dewey och SAB. http://export.libris.kb.se/DS/

Mitchell, Joan S., Ingebjørg Rype, and Magdalena Svanberg. 2008a. “Mixed Translation Models for the Dewey Decimal Classification (DDC) System.” In Culture and Identity in Knowledge Organization: Proceedings of the Tenth International ISKO Conference, 5-8 August 2008, Montréal, Canada, edited by Clément Arsenault and Joseph T. Tennis, 98-104. Würzburg: Ergon.

―. 2008b. “Mixed Translation Models for the DDC.” Presentation at Tenth International ISKO Conference, 5-8 August 2008, Montréal, Canada; also presented at Dewey Translators Meeting, World Library and Information Congress (74th IFLA General Conference and Council), Québec City, 12 August 2008. http://www.oclc.org/dewey/news/conferences/isko_11_mitchell_et_al.ppt

Rype, Ingebjørg, and Magdalena Svanberg. 2008. “Blandet Utgave av Dewey. Presentasjon av en Pilotundersøkelse.” Infotrend 63, no. 4: 88-95. Available at http://www.sfis.nu/sites/default/files/dokument/infotrend/2008/blandet408.pdf

Svanberg, Magdalena. 2006a. “Övergång till Dewey Decimal Classification. Vad Skulle det Innebära? Delstudie 3 i Katalogutredningen.” [Stockholm]: Kungl.biblioteket. Available at http://www.kb.se/Dokument/Om/projekt/avslutade/katalogutredning/delst3_slutrapport.pdf

―. 2006b. “Swedish Switch to DDC.” Paper presented at the Dewey Translators Meeting, World Library and Information Congress (72nd IFLA General Conference and Council), 23 August 2006, Seoul, Korea. Available at http://www.oclc.org/dewey/news/conferences/swedish_switch_ifla_2006.doc.

―. 2008. “Mapping two Classification Schemes – DDC and SAB.” In New Perspectives on Subject Indexing and Classification in an International Context: Essays in Honour of Magda Heiner-Freiling, 41-51. Leipzig, Frankfurt am Main, Berlin: Deutsche Nationalbibliothek.

―. 2009. “Dewey in Sweden: Leaving SAB after 87 Years.” Paper presented at Looking at the Past and Preparing for the Future, an IFLA satellite preconference sponsored by the Classification and Indexing Section, 20-21 August 2009, Florence, Italy.

Viktorsson, Elisabet, ed. 2006. Klassifikationssystem för Svenska Bibliotek. 8th ed. Lund: Btj förlag.

Animals Belonging to the Emperor: Enabling Viewpoint Warrant in Classification [1]

Claudio Gnoli

Abstract

Recent research in knowledge organization has emphasized the need for representing different local perspectives, synthesized in Beghtol's principle of viewpoint warrant. A typical case is the taxonomy of animals: folk or non-Western taxonomies differ from those of academic biology, an extreme example being Borges's paradoxical “Chinese” classification. On the other hand, global services require interoperability between different viewpoints. The Integrative Levels Classification (ILC) project is working on the basic structure of a general, interdisciplinary, freely-faceted system. Among its features is a set of special classes (deictics) that acquire different meanings according to the local context, thus allowing for interoperability between different local extensions of the scheme. Examples of their application to the classification of animals are shown.

Introduction

The theory and practice of intellectual tools like classifications, thesauri, subject heading lists, taxonomies, and ontologies is collectively known today as knowledge organization. Research on knowledge organization addresses the principles and techniques by which knowledge items can be ordered. In recent years, much knowledge organization research has focused on a critical examination of these principles and techniques. While having developed from practical needs, like the arrangement of books on library shelves, or of bibliographic records in directories and catalogues, knowledge organization systems (KOS) have brought various theoretical biases with them. Indeed, a KOS is an expression not only of the structure of the real world (ontological dimension), and of our means of perceiving it (epistemological dimension), but also of the cultural milieu and pragmatic purposes providing the context for its development (sociological dimension) (Hjørland and Hartel 2003).

Classical KOSs, like the Dewey Decimal Classification, the Universal Decimal Classification or the Bliss Bibliographic Classification, have adopted a universal perspective, basically expressing the first two dimensions, ontological and epistemological, which can be assumed as common to any individual human user. However, the third dimension, sociological, seems to be another unavoidable component of any KOS. Some even think that this component is the main one, making any attempt at universality problematic (Maniez 1997; Hjørland 2004). This view can be too pessimistic (Szostak 2008), leading to a relativistic way of thinking, in which the only possible task of knowledge organization research would be a sociological analysis of how communities working in given domains produce their own KOS. In principle this would even imply the impossibility of building such interoperability tools as multilingual thesauri or top-level ontologies, although in practice these are produced and used in some form.

[1] I am grateful to Clare Beghtol, Thomas M. Dousa, and Riccardo Ridi for providing useful suggestions and references.

Viewpoint Problems

In any case, contemporary authors generally agree that the perspective of a KOS, including its philosophical assumptions, its cultural origins, and its pragmatic purposes, should be made explicit, rather than remain implicit and hence potentially misleading for its users. This idea has been taken up in formulating the theme for the next conference of the International Society for Knowledge Organization: “Paradigms and conceptual systems in knowledge organization” (Gnoli and Mazzocchi 2010). A wide discussion of bias in classification, both scientific and bibliographical, has been opened by Bowker and Star (1999), who showed the effects of classifying e.g. diseases in one way or another. This theme has been taken up by Ridi (2010) to warn information users about the importance and the implications of using classifications and taxonomies, both in bibliographic searches and in everyday life. The terminology used to label subjects can itself be biased towards culturally dominant groups, like middle-class white males: various social prejudices can thus be hidden in such a widely spread KOS as the Library of Congress Subject Headings (Olson 2002). This can make its use problematic for different perspectives, like women's studies and feminism (Kublik et al. 2004). Another possible kind of bias is political. In the thesauri of international organizations, the term development is defined only in its economic meaning, suggesting that developing countries should “develop” in a capitalistic sense, but not in social, educational, artistic, or spiritual senses (Severino 2005). The Library of Congress Classification is remarkable for treating military sciences and naval sciences as two main classes, where another KOS could represent them as subclasses or facets of political sciences. Naval sciences are particularly irrelevant for countries without any coast in their territory.
Also evident are biases in the Soviet and Chinese library classifications, adopting Marxism, Leninism, Maoism etc. as their first main classes, like a presupposition for any other form of knowledge. Cultural differences in classification are also reported on small scale. While studying knowledge of potato varieties in the traditional agriculture of Liguria (Italy), Angelini (2005) observes that “different local names can be referred to the same variety, but also, on the contrary, different varieties are called by the same name. To Giacumin from Vobbia, the same quarantina potatoes that are cultivated in Croce, Pentema, and Montoggio, villages just few kilometers away, are completely different varieties that he calls by different names, and I don't know how but he is able to tell them apart!”


Probably the most basic kind of bias in knowledge organization is that produced by profound differences between cultures that developed separately, like the Western vs. Eastern ones. Relevant cases have been studied by Kwaśnik and Chun (2004), reporting how the Korean version of the Dewey Decimal Classification required a new subclass of 700 “arts” for calligraphy, as in Far-Eastern culture this is listed among the major arts. Kinship structures also need to be represented in different ways according to the cultural context (Kwaśnik and Rubin 2004). In the second edition of the Bliss Classification, a radical choice has been made to organize Eastern philosophy by different facets than Western philosophy (Biagetti 2009); still, the separation of philosophy from religion, built in the main classes of all Western classifications, seems itself to be an unnatural representation of Eastern wisdom, and the same could be said for Medieval Western wisdom. The ultimate example of unexpected categories in an exotic classification is that described by Borges (1964, 101-106), claiming that it comes from a Chinese encyclopedia. Actually this is presented to serve as a summa of all kinds of inconsistency that can be found in real classifications (see fig.1).

Figure 1. Animal classes according to Borges.

This classification is often quoted to discuss problems of inconsistency. Indeed, each class appears to be the result of applying a different characteristic of division, a different facet, a different perspective. The whole scheme is thus extremely idiosyncratic, suggesting that any other classificationist, or even the same classificationist in another moment, could produce a different scheme. This can seem quite discouraging when we are looking for an optimal KOS to be shared on the global scale.

Matching Local and Global KOSs

Despite these problems, scholars dealing seriously with a given phenomenon usually agree on many parts of its classification, leaving aside those aspects that are not yet clear at the current stage of research. No zoologist starts her classification with main classes like those of Borges: she will rather mention such general groupings as molluscs, annelids, arthropods, chordates. Is this arbitrary? Is it only the expression of a particular academic community, having imposed its KOS over those of the minorities for some accidental or tendentious reason? Probably most zoologists would agree that the standard taxonomy is rooted in real relationships between animals that can be understood in some reliable way by our means of knowledge, though conceding that many details could be corrected and developed in future. In other words, the ontological dimension seems to play a major role in determining scientific taxonomy, as compared with the epistemological and the sociological.

A method to check this assumption is to compare the standard scientific taxonomies with those used in cultures that had little or no contact with the modern West, so that they have presumably not been influenced by it. Diamond (1966) reports that the Fore people in New Guinea used 110 specific names to identify birds: these largely corresponded to the 120 species identified by zoologists, with 93 one-to-one correspondences and most of the remaining names referring to closely related species or to male and female forms of species with a great sexual dimorphism. Even stronger correspondence was found by Mayr with bird names used in the Arfak mountains, also in New Guinea, although the same people made no distinction between the many species of ants identified by biologists in their region (Wilson 1992). Berlin et al. (1966) compared 200 plant names in the Tzeltal language spoken by a community in Chiapas (Mexico) with the respective species in the standard botanical nomenclature: 82 names proved to be underdifferentiated as compared to the botanical species, 68 (including 40 introduced after the Spanish conquest) exactly matched them, and 50 were overdifferentiated.
Such results generally suggest that both the taxonomical units identified by biologists and those identified by native people do have a natural foundation; on the other hand, natives are less specific or precise for organisms that lack any practical interest for them, like in the case of ants. Taxonomies of organisms thus have both ontological and pragmatic bases, which should be reflected in knowledge organization in some way. There is a need both for a general way to refer to concepts as objectively as possible, and for ways to represent local uses and perspectives. The latter requirement is considered by Beghtol (1998) as a kind of warrant: in the same sense as the traditional classification principle of literary warrant recommends that classes reflect the occurrence of topics in actual documents, viewpoint warrant should ensure that they reflect the occurrence of concepts, and the relationships between them, in actual cultures. This is also an ethical principle (Beghtol 2002), as in the global information context no culture should be privileged by knowledge organization, rather everyone should find its own perspective represented. How can viewpoint warrant be enabled in practice? One simple way is to develop systems explicitly reflecting the particular perspective of a community of knowledge users, as is recommended in the domain analytic approach. This, however, conflicts with the other need that information can be shared on a global basis, that is, with the requirements of interoperability. In order to establish connections between different classifications, indeed, one needs some way to refer each couple of classes coming from different schemes to a common frame, independently from their specific viewpoints, domains, contexts. In other words, for mapping two KOSs that adopt special viewpoints, a third “neutral” KOS should exist, at least in the minimal form of concept identifiers to


which concepts of each KOS can be referred (Coates 1970). Although complete neutrality can be viewed as utopian, some intendedly neutral scheme is needed for technical purposes. In order to minimize its biases, such a switching scheme should adopt a maximally general and objective viewpoint. This will then be distinguished from explicitly biased KOSs, aiming at reflecting knowledge from particular viewpoints. Indeed, Beghtol (1998) suggests that a system enabling viewpoint warrant should “be able to support multiple perspectives in a looser structure”; it thus “would presumably have the advantage of providing infinite hospitality for adding any viewpoint – cultural, multidisciplinary, disciplinary, or sub-disciplinary – that might arise in future”. A similar structure was attempted already by Wåhlin (1974), who worked at an “AR-Complex”, that is “a coherent complex of classification systems” composed of a Reference unit (R) to which many different Adapted systems (A) could be attached: he created two adapted systems, one for products and another for building trade documentation. Parsons (1996; 2002) looked in the same direction for his MIMIC system, to manage “multiple views” in data modeling. Discussion about an “international comprehensive KOS” is still current in the context of digital information sharing (Boteram 2009).

The ILC Project

Integrative Levels Classification (ILC) is an international research project aimed at developing the basic structure of a general, interdisciplinary, freely-faceted KOS, able to serve as a reference scheme for organizing any kind of information collection. ILC follows the principle, recently expressed in the León Manifesto (ISKO Italia 2007), of representing the objects treated (phenomenon), the perspective under which they are treated (aspect), and the information medium (carrier) as three separate dimensions. Viewpoint refers to the aspect dimension, which includes communicative function, modality, application, discipline, theory, method, place and epoch of the recorded knowledge. The basic structure of ILC is a tree of phenomenon classes, expressed by lowercase letters. These generally follow the standard knowledge of phenomena held in contemporary sciences. Thus, animals are a subclass of organisms, and in turn have various subclasses (only the main ones shown here):


Figure 2. Animal classes in ILC.

Viewpoint in ILC

Among the features of ILC is a set of special classes that acquire different meanings according to the local context. In linguistic terms they are deictics, that is, expressions that change their meaning according to the present situation, as words such as “you”, “here”, or “tomorrow” do. Deictics are represented in ILC by capital letters. Therefore, while mq always means “animals”, F can mean anything, depending on how it has been defined. In other words, any scheme adopting a particular viewpoint can potentially be represented by ILC classes A, B, C, D..., their subclasses AA, AB..., etc. Deictics can also occur as subclasses of standard classes: mqA, mqB etc. will mean animals of some type according to a local context.

If a KOS is maintained as a database, as is the case with ILC, separate tables can be used for the general reference schedule and for any local schedule expressing particular viewpoints. For any class containing deictics, like A, a special field will be filled with the equivalent class in terms of the standard scheme (written in square brackets in schedule display). Such an equivalent class can be just a simple class, e.g. mqvo “birds”, that is stated to be equivalent to A for practical convenience, like having a more manageable notation, and for local relevance: as in the ASCII character set capital letters are ordered before small ones, computers will list them before the standard classes, as the “favoured host classes” (Ranganathan 1967, Section DG3435) of the present information system. The equivalent class can also be a compound, defined as the syntactical combination of various facets, that under the local viewpoint takes the status of a single whole. In the extreme case of Borges's taxonomy, the first subclass of “animals” is


defined by the combination of relationships “belonging to the emperor [of China]”. The relationships of these concepts would be represented in the standard scheme by quite complex combinations of facets:

t55a2kv29p    governments, with chief, in China, empire
u8mq5i        economies, of animals, by private owner

from which the combination mq98u(5i(955(a)t(2kv29p))) “animals, being a good, by private owner, being chief, of government, in China, empire” can be constructed. Clearly, this compound notation is not very practical for users adopting the viewpoint of the Chinese encyclopedia. It can then be defined as equivalent to the first subclass of animals in this viewpoint, mqA. In the same way it would be possible to define the other subclasses, so as to produce a representation of this classification as mqA, mqB, mqC, etc.:

mqA [mq98u(5i(955(a)t(2kv29p)))]    animals belonging to the emperor
mqB [...]                           embalmed animals
mqC [...]                           animals that are trained
...

In the current version of ILC, letter A is used to mean the favoured host class (or subclass) according to the present viewpoint; letters B to T for other favoured classes; and letters U to Z for other special meanings also depending on the local context:

mqU     the typical animals
mqX     some animals
mqXX    what animals?
mqY     the individual animal (e.g. Laika)
mqZ     the mentioned animal (anaphoric or cataphoric)

The deictic U is of special interest in the present discussion, as it makes it possible to express the general viewpoint of mankind, rather than that of a specific community. It can be described as the anthropocentric favoured class. For example, the phenomenon “stars” in ILC is hk. A general, neutral taxonomy of stars would list them according to some general astronomical principles. In this general perspective, our Sun is just one star among an immense number of others, therefore it would get a very specific notation, say hkxxxxxx. But human documents deal with the Sun much more often than with any other star, and consider it as by far the most relevant star. The shortened notation hkU then makes it possible to represent it more briefly, and to list it before all other stars, except any one of specific interest to a local context (a research centre focused on the Dog Star could represent it as hkA, preceding hkU in ordered displays). This is still a biased viewpoint, as the Sun in itself has no special character as compared to all the other stars; still, as the bias is the same for all human users, it does not need to be changed according to specific information contexts. Similar cases have already been identified in ILC for expressing concepts like “air” as a special gas mixture, “water” as a special chemical compound, “the Earth”, “the continents of contemporary Earth”, “currently prominent languages” such as English and Spanish, “contemporary countries” (the viewpoint of the last examples is actually dependent on time, but only on a big scale).
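The mechanism described in this section, a local table whose deictic classes each carry an equivalent in the standard scheme, and a display order in which capitals file before lowercase letters, can be illustrated with a small Python sketch. It is not the ILC database: the record layout and function names are assumptions made for illustration, hkxxxxxx is the placeholder notation used in the text, and equivalents that the text leaves undefined are stored as None.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LocalClass:
    notation: str
    caption: str
    equivalent: Optional[str]  # standard-scheme notation; None if not (yet) defined


# Hypothetical local schedule, kept apart from the general reference schedule.
LOCAL_SCHEDULE: Dict[str, LocalClass] = {
    c.notation: c
    for c in [
        LocalClass("mqA", "animals belonging to the emperor", "mq98u(5i(955(a)t(2kv29p)))"),
        LocalClass("mqB", "embalmed animals", None),
        LocalClass("hkA", "the Dog Star (local favoured class)", None),
        LocalClass("hkU", "the Sun", "hkxxxxxx"),
    ]
}


def is_deictic(notation: str) -> bool:
    """Deictics are written with capital letters; standard classes use lowercase."""
    return any(ch.isupper() for ch in notation)


def resolve(notation: str, local: Dict[str, LocalClass]) -> Optional[str]:
    """Return the standard-scheme equivalent, or None when the local table has none."""
    if not is_deictic(notation):
        return notation  # already a standard class
    entry = local.get(notation)
    return entry.equivalent if entry else None


def display_order(notations: List[str]) -> List[str]:
    """Plain code-point sort: capitals precede lowercase, so favoured classes file first."""
    return sorted(notations)


stars = display_order(["hkxxxxxx", "hkU", "hkA"])
```

Two systems with different local tables can exchange records by resolving deictics to their standard equivalents first; a class such as mqB, whose equivalent is still missing, cannot be exchanged until the general scheme can express it.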


Testing and Development The ILC system is currently being tested in indexing on-line bibliographies and other Web resources in different domains, including human geography, bioacoustics, chemistry, and facet analysis. In particular, the system of deictics has been used until now in a bibliography on traditional culture and geography of a mountainous area in North-Western Italy, where deictics A to T stand for valleys in the Apennine range, which would have a much more complicate notation if had to be represented by their standard notation for landforms in the whole Earth (Gnoli 2008). The citation order of compound classes including deictics follows the inversion principle of analytico-synthetic schemes, prescribing that classes listed before in the schedules be cited last within a single compound class. A document concerning the Curone valley H and its animals mq will thus be notated mq H, rather than H mq. This appears to be an effective solution, as deictic classes expressing the local valleys are less discriminating, in the context of a bibliography on that very region, than classes expressing other phenomena. Documents concerning animals will thus be primarily listed all together, and only subsequently differentiated according to any specific valley (although digital search will also allow to extract all records containing H, that is, concerning the Curone valley in any way). This experience of indexing is providing some idea of how viewpoint can actually be represented and managed in a classification system. Still the case described is quite simple. While the technique assumes that any concept in a special KOS can be translated into some combination of facets of the general KOS, cases of problematic translation could be encountered in the actual work, and should then be analyzed in detail. 
Also, the complete mapping of a local scheme with a greater number of classes, including complex compounds like those of Borges's taxonomy, requires that the general reference scheme be developed to a further stage. For example, class mqB “embalmed animals” cannot really be defined as long as the general scheme lacks a class for the rather specific meaning “embalming”. This is a good example of how many components of the system, like the general scheme of phenomena, the way to express local meanings, and the syntax of faceted compounds, are all connected, and hence need to be developed gradually and together. Although basic principles for representing viewpoints have been described in this paper, more work is required in order to develop the system in a complete way.
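The inversion principle used for compounds with deictics can be sketched as follows. The sketch assumes, for illustration only, that schedule order is "deictics (capitals) first, then ordinary classes alphabetically" and that components are joined by a space, as in mq H; the function names `schedule_key` and `compound` are hypothetical and not part of ILC.

```python
def schedule_key(cls):
    """Position of a class in the schedules: deictics (capitals) are listed first."""
    return (0 if cls[0].isupper() else 1, cls)

def compound(*classes):
    """Cite classes in reverse schedule order (inversion principle)."""
    return " ".join(sorted(classes, key=schedule_key, reverse=True))

# Curone valley (deictic H) and animals (mq): the deictic is cited last.
notation = compound("H", "mq")
print(notation)
```

Because the deictic always ends up last, an ordered display of such compounds groups documents by phenomenon first and by valley second, which is the behaviour described for the bibliography.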

References
Angelini, Massimo. 2005. “Varietà tradizionali, prodotti locali ed esperienze”. L'Ecologist italiano 1, no. 3: 230-275.
Beghtol, Clare. 1998. “General Classification Systems: Structural Principles for Multidisciplinary Specification”. In Structures and Relations in Knowledge Organization: Proceedings of the 5th International ISKO Conference, Lille, 25-29 August 1998, edited by Widad Mustafa el Hadi, Jacques Maniez, A. Steven Pollitt, 89-96. Advances in Knowledge Organization 6. Würzburg: Ergon.
—. 2002. “A Proposed Ethical Warrant for Global Knowledge Representation and Organization Systems”. Journal of Documentation 58, no. 5: 507-532.


Berlin, Brent, Dennis E. Breedlove and Peter H. Raven. 1966. “Folk Taxonomies and Biological Classification”. Science 154: 273-275.
Biagetti, Maria Teresa. 2009. “Philosophy in Bibliographic Classification Systems”. In The Philosophy of Classifying Philosophy, edited by Cristiana Bettella and Massimiliano Carrara. Knowledge Organization 36, no. 2-3: 92-102.
Borges, Jorge Luis. 1964. Other Inquisitions, 1937-1952, translated by Ruth L. C. Simms. University of Texas Press.
Boteram, Felix. 2009. “Semantic Interoperability in an International Comprehensive Knowledge Organisation System”. In Content Architecture: Proceedings ISKO UK Conference, London, 22-23 June 2009. Aslib Proceedings, in prep.; also available from ISKO UK at http://www.iskouk.org/conf2009/proceedings.htm.
Bowker, Geoffrey C., and Susan Leigh Star. 1999. Sorting Things Out: Classification and its Consequences. Cambridge and London: MIT Press.
Coates, Eric J. 1970. “Switching Languages for Indexing”. Journal of Documentation 26, no. 2: 102-110.
Diamond, Jared M. 1966. “Zoological Classification System of a Primitive People”. Science 151: 1102-1104.
Gnoli, Claudio. 2008. “Potential of Freely Faceted Classification for Knowledge Retrieval and Browsing”. Paper presented at the 7th NKOS Workshop, Aarhus, 19 September 2008. Available at http://www.comp.glam.ac.uk/pages/research/hypermedia/nkos/nkos2008/.
Gnoli, Claudio and Fulvio Mazzocchi, eds. 2010. Paradigms and Conceptual Systems in Knowledge Organization: Proceedings of the 11th ISKO Conference, Rome, 23-26 February 2010. Advances in Knowledge Organization 12. Würzburg: Ergon.
Hjørland, Birger. 2004. “Theory of Knowledge Organization and the Feasibility of Universal Solutions”. Paper presented at the 8th ISKO Conference, London, 13-16 July 2004, DLIST. Available at http://dlist.sir.arizona.edu/arizona/handle/10150/105303.
Hjørland, Birger and Jenna Hartel. 2003. “Afterword: Ontological, Epistemological and Sociological Dimensions of Domains”. Knowledge Organization 30, no. 3-4: 239-245.
ISKO (International Society for Knowledge Organization) Italia. 2007. “The León Manifesto”. Available at http://www.iskoi.org/ilc/leon.htm. Republ. in Knowledge Organization 34, 2007, no. 1: 6-8.
Kublik, Angela, Virginia Clevette, Dennis Ward and Hope A. Olson. 2004. “Adapting Dominant Classifications to Particular Contexts”. In Knowledge Organization and Classification in International Information Retrieval, edited by Nancy J. Williamson and Clare Beghtol, 13-32. Binghamton: Haworth.
Kwaśnik, Barbara Hanna and You-Lee Chun. 2004. “Translation of Classifications: Issues and Solutions as Exemplified in the Korean Decimal Classification”. In Knowledge Organization and the Global Information Society: Proceedings of the 8th ISKO Conference, London, 13-16 July 2004, edited by Ia C. McIlwaine, 193-198. Advances in Knowledge Organization 9. Würzburg: Ergon.
Kwaśnik, Barbara Hanna and Victoria Rubin. 2004. “Stretching Conceptual Structures in Classifications across Languages and Cultures”. In Knowledge Organization and Classification in International Information Retrieval, edited by Nancy J. Williamson and Clare Beghtol, 33-48. Binghamton: Haworth.
Maniez, Jacques. 1997. “Database Merging and the Compatibility of Indexing Languages”. Knowledge Organization 24, no. 4: 213-224.
Olson, Hope A. 2002. The Power to Name: Locating the Limits of Subject Representation in Libraries. Dordrecht: Kluwer.
Parsons, Jeffrey. 1996. “On the Relevance of Classification Theory to Database Design”. In Advances in Classification Research: Proceedings 5th ASIS SIG/CR Classification Research Workshop, edited by Raya Fidel, Barbara Hanna Kwaśnik, Clare Beghtol and P. J. Smith, 131-140. Medford: Information Today.

—. 2002. “Effects of Local Versus Global Schema Diagrams on Verification and Communication in Conceptual Data Modeling”. Journal of Management Information Systems 19, no. 3: 155-183.
Ranganathan, S.R. 1967. Prolegomena to Library Classification. 3rd ed., with assistance of M.A. Gopinath. Bangalore: SRELS. Also in DLIST, available at http://dlist.sir.arizona.edu/arizona/handle/10150/106370.
Ridi, Riccardo. 2010. Il mondo dei documenti: Cosa sono, come valutarli e organizzarli. Roma-Bari: Laterza.
Severino, Francesca. 2007. “The Term Development in the Thesauri of International Organizations”. The European Journal of Development Research 19, no. 2: 327-351.
Szostak, Rick. 2008. “Classification, Interdisciplinarity, and the Study of Science”. Journal of Documentation 64, no. 3: 319-332.
Wåhlin, Ejnar. 1974. “The AR-Complex: Adapted Systems Used in Combination with a Common Reference System”. In Conceptual Basis of the Classification of Knowledge, edited by Jerzy A. Wojciechowski, 416-449. Pullach bei München: Verlag Dokumentation.
Wilson, Edward O. 1992. The Diversity of Life. London: Penguin, and Cambridge (Mass.): Belknap Press.

Dewey in Sweden: Leaving SAB after 87 Years

Magdalena Svanberg

Abstract The National Library of Sweden has decided to switch from SAB, the Swedish classification system, to Dewey Decimal Classification (DDC). Several research libraries have decided to follow and a switch is also discussed among the public libraries. This paper explains the reasons for the decision, gives the background and reports on the current Swedish DDC project, where mappings between SAB and DDC and between the Swedish subject heading language Svenska ämnesord and DDC are important parts.

Decision to Switch to DDC In late 2008, the National Library of Sweden decided to switch from SAB, the Swedish classification system, to Dewey Decimal Classification (DDC). So far, about 30 research libraries have decided to follow the National Library in the switch, and more decisions are expected. The libraries that have made a formal decision are all big general university libraries, and it seems likely that a joint switch to DDC is on its way in Sweden, at least among the research libraries. This is a unique situation, with the libraries of a country together abandoning a national scheme, used and maintained solely within their country, and adopting an international one. As for the public libraries, none has made a formal decision yet, but some big libraries have expressed interest in a switch. The Swedish Library Association has stressed the importance of using the same classification system throughout Sweden and, as a consequence of the decision at the National Library, has recommended that Swedish libraries in general consider adopting the DDC. A proposal at the annual meeting of the association called for a more exhaustive analysis of the consequences of a switch for the public libraries, with special consideration given to the children’s perspective. The analysis was published in 2010 (”En svensk övergång till DDK. Vad innebär det för folk- och skolbibliotek?” 2010). Bibliotekstjänst, the company supplying most public libraries in Sweden with bibliographic records, has stated that it will add DDC numbers to the records if there is a demand for it among its customers. It has also stated that it plans to continue offering SAB codes and to use both schemes in parallel for a long period. This paper gives the background for the decision and reports on future DDC plans. It is about how we are looking at the past and preparing for the future.


Creating SAB In the beginning of the 20th century, the question of national and international classification schemes was on the agenda in many countries. In Scandinavia, Finland and Denmark decided to develop their own national classification schemes, based on UDC (Universal Decimal Classification) and DDC respectively. Norway chose the most international path and started to use DDC in an adapted Norwegian version. Sweden took the most national approach and decided to develop a classification scheme of its own. The main reason was the possibility to get a scheme well adapted to Sweden and Swedish phenomena. The notation was another reason. The DDC numbers were considered long and difficult to use, and the choice to develop a new scheme made it possible to base the notation on letters instead and try to keep the length of the codes down (Hansson 1997). The Swedish classification scheme Klassifikationssystem för svenska bibliotek, generally known as SAB, was first published in 1921 and came to be used by all public libraries and in the national bibliography already in the 1920s. By the 1960s it had also been adopted by the national library and most research libraries. According to numbers in a Catalogue survey conducted in 2006, SAB codes are found in 67 % of the records in LIBRIS, the Swedish union catalogue.

Maintenance of SAB Five editions of SAB were published between 1921 and 1963, but only some of them included major revisions. The situation was analysed in a report in 1981 (Delegationen för vetenskaplig och teknisk informationsförsörjning 1981). Due to lack of maintenance and revisions, SAB had become outdated. If SAB should have a future, there was a need for a thorough revision. Another alternative to consider was a switch to an international classification scheme: DDC, UDC (Universal Decimal Classification) or LCC (Library of Congress Classification). A switch to an international scheme would eliminate the need for maintenance of a national classification scheme. In this report, the possibility to use the classification codes in foreign bibliographic records was also mentioned as an advantage. The negative effects of a switch would be problems with the shelving and with keeping two classified catalogues with different classification schemes. This was considered to be primarily a problem for the public libraries, since the research libraries would probably handle the difference between the catalogue and the shelving more easily. Even though the report listed several advantages in using an international scheme, the conclusion was that too much effort was needed to make a switch to an international classification scheme possible, and that a thorough revision of SAB was the best way to go. To avoid similar situations in the future, the importance of regular revisions of the scheme was stressed, with a new SAB edition expected to be published every ten years and minor changes in between. The sixth edition was published in 1984 and the latest edition, the 8th, in 2006 (Viktorsson 2006).


Cooperation - Nationally and Internationally Cooperation on bibliographic control has a long history in Sweden, but when it comes to subject access, there has been a lack of interest among Swedish libraries during some periods. Every Swedish library applied subject headings and classified using its own principles, stressing the specific needs of its users. A bibliographic record could contain several slightly different SAB codes and several slightly different subject headings, using different synonyms or constructed in different ways. A lot of changes in the area of bibliographic control took place in Sweden around the turn of the millennium. The format in LIBRIS changed from a Swedish adaptation of UK MARC, called LIBRISMARC, to MARC 21, and a new library system, Endeavor Voyager, replaced an old solution. International principles for authority work were adopted. The new library system made it necessary for the libraries to collaborate on subject access. This, together with efforts made for example by the Section on Bibliographic Cooperation and Development at the National Library, resulted in a different approach: every work is classified once and indexed once, using the subject heading language Svenska ämnesord (SAO). It turned out that the differences between users of different libraries were not so big after all. The work on SAO also proved that national and international cooperation was possible, saved resources and gave better access. More on the development of SAO, a Swedish subject heading language based on international principles and Library of Congress Subject Headings (LCSH), is reported in Berg and Leth (2004; 2008). All these changes were important steps towards international cooperation, since they made the exchange of records easier and showed the value of using international standards.

Why DDC? In the last decades there has been a growing international interest in DDC once again. DDC is now used in more than 135 countries and has been translated into a variety of languages, including Norwegian, Vietnamese and German, to mention just a few. This international interest spread to Sweden, as many Swedish libraries were no longer satisfied with a national classification scheme that made international cooperation impossible. There has also been a growing discontent with SAB, especially among research libraries. SAB is not updated frequently enough, and the level of specificity provided is not satisfactory for libraries holding big collections. In 2005, the National Library of Sweden invited Joan Mitchell of OCLC and Unni Knutsen from Norway to hold a seminar and a week-long course. This increased the interest in DDC, and placed the question firmly on the agenda. A feasibility study of a Swedish switch to DDC was carried out as part of the Catalogue Survey, conducted by the National Library of Sweden in 2006 (Svanberg 2006). The study came to the conclusion that there is a lot to be said for a Swedish switch to DDC. The most important argument is that a switch will make international cooperation easier. When Sweden starts using an international classification scheme we can use classification data in foreign bibliographic records. This will save a lot of resources. More than 80% of the foreign literature bought by Swedish research libraries is published in countries where the national bibliography uses DDC. When Sweden switches to DDC there will be no need for us to classify those works. The need to spend resources on maintenance of the classification scheme will also be reduced. DDC is widely used in bibliographic records, and it is also widely used in mappings between different knowledge organisation systems. DDC and LCSH, together with the French subject heading language RAMEAU, the German subject heading language SWD and several others, form a network that encompasses a great part of the world’s literature and other documents. With a Swedish translation of DDC, and mappings between SAB and DDC and between SAO and DDC, Sweden and the Swedish literature will also be part of this network. By making the different knowledge organisation systems and the library data available on the Semantic web there is also the potential for an even bigger network, not restricted to libraries but with wider usage and relationships. Both SAO and LIBRIS are available on the Semantic web, and so are LCSH and RAMEAU.

Giving up SAB One of the pros of a national classification scheme is that it is tailor-made to suit the needs of the libraries in the country. In many cases, SAB works very well for Swedish books about Swedish phenomena. There is a class in SAB for the winter sport bandy, traditionally an important sport in Sweden. There is a class for reindeer herding, the traditional lifestyle of the Sami people in the north of Scandinavia and Russia. However, luckily enough not every book in Swedish libraries is Swedish and not every Swedish book is about Swedish phenomena. Sweden is a part of the world. Cricket, the typical Anglo-American game, has no class of its own in SAB, since the game has not been commonly played in Sweden. However, in LIBRIS, the union catalogue of the Swedish research libraries, the number of hits for cricket approximates the number of hits for bandy. A specific library can also have needs that are not well met by SAB. The Nordic Africa Institute Library does not hold a single book on reindeer herding, but numerous books on camel husbandry. Camels as domestic animals do not fit well into the SAB structure but are lumped together with silkworms, homing pigeons and yaks under the caption Miscellaneous domestic animals. In DDC, with its international usage, camels as domestic animals have their own number close to llamas and bison. As a result of proposals from the Swedish Dewey project there are now Dewey numbers for reindeer herding and bandy. Switching to DDC does not necessarily make it difficult to provide for Swedish needs. A classification system may be international and at the same time adapted to phenomena specific to a certain culture. Reindeer herding and bandy are both straightforward examples, and naturally there are other areas where it is much more difficult to find solutions that are applicable all over the world. Divisions like 370 Education and 340 Law need to be general enough to fit a variety of educational and jurisdictional systems.
DDC users all over the world will have to work hard to achieve a classification scheme that is consistent as well as hospitable to different points of view. A good example of such collaboration is the European DDC Users’ Group (EDUG).


SAB is in a way a very flexible classification scheme, since there are few strict rules and no limitations on the possible number of SAB codes in a bibliographic record. This has some advantages, since it makes it possible to express different aspects of a work in the classification. It does however also have disadvantages. The opportunity to add several SAB codes has been used to such an extent that the added SAB codes add very little value. It has been used as a replacement for subject headings, even in cases where subject headings would form much better access points. SAB has, finally, often been used in a very inconsistent way. This means that we welcome the stricter rules of DDC, and that we plan to keep largely to the old rule of one DDC number per record, even though new developments in the MARC formats allow the use of additional DDC numbers.

The Swedish DDC Project The Swedish DDC project started in June 2009. The major parts of the project are:
• Translating DDC to Swedish using the mixed model
• Mapping the Swedish subject headings in Svenska ämnesord (SAO) to DDC
• Updating and revising the existing conversion table SAB-DDC
• Training of librarians
• Developing tools that make it possible for the end user to benefit from the classification data for subject access

The mixed model is explained in detail in Rype and Svanberg (2008) and Mitchell, Rype and Svanberg (2009). The development of tools is still at the planning stage and the training has only just started, so neither is discussed further here. The rest of this paper concentrates on the mapping parts of the Swedish DDC project.

Mapping Svenska ämnesord (SAO) to DDC The project includes mapping of the Swedish subject headings SAO to DDC. The mappings will serve several purposes:
• serve as an entry vocabulary to DDC for both librarians and end users
• be one of the sources for the terminology in the Swedish translation of DDC, both for the content in the schedules and for the index
• be used behind the scenes in retrieval systems to improve subject retrieval

There are about 30 000 main headings in SAO, and another 10 000 authorized precoordinated combinations of main headings and subdivisions. The focus of the mapping part of the project will be to map the most used of these subject headings, together with other common combinations of main heading and subdivision. There are plans to use data from the Swedish union catalogue LIBRIS to show the frequency of a certain subject heading. Existing mappings could also be used to generate automatic mappings to be revised manually. Almost all main headings in SAO are mapped to SAB, and about 75 % of the headings are mapped to LCSH. One possible path would be from SAO to LCSH and then from LCSH to DDC. The other would be from SAO to SAB and then from SAB to DDC. The DDC numbers will be added to the authority records of the subject headings. In this work we will follow the guidelines set up by the Dewey editors for DDC numbers in LCSH authority records. In short, these guidelines mean that a topic has to be explicitly mentioned in DDC to be added to the authority record. There are a few exceptions; it is for example also allowed to add a DDC number for a geographic entity with an implicit relation to the DDC class. The idea is that the relationship between the term and DDC is recorded in the DDC, and can be looked up there (Mitchell 2006). We are however also planning to use the authority records for DDC numbers that do not fit these criteria, which in that paper are called mappings. We do not totally agree on the importance of keeping them separate, and we would like to make all the mappings available both as part of WebDewey and as part of SAO in its different forms (on the web, in the authority file of the union catalogue, in the semantic web presentation). For these looser mappings we are trying to find out whether it is a good idea to record the relationship between DDC and the heading using the Simple Knowledge Organization System (SKOS) (W3C 2009). SKOS is a standard way to represent knowledge organization systems, such as thesauri, classification schemes, subject heading systems and taxonomies. One of the uses of SKOS is to represent the relationship between entities in different knowledge organization systems. The SKOS properties narrowMatch and broadMatch are the most commonly used in the SAO-DDC mapping. The property narrowMatch means that the DDC class is narrower than the SAO term, and broadMatch that it is broader. The SKOS property is recorded in subfield c of field 083 in the MARC record.
083 $a 958.1045 $c BroadMatch
150 $a Afghanistankriget 1979-1989

958.1045 is a broad match to Afghanistankriget 1979-1989 (in English Afghanistan war 1979-1989) and there are works in the class that are not about the Afghanistan war. Subfield c is also used to specify the different aspects when more than one DDC number is added to the record. Interdisciplinary and comprehensive numbers are left without specification in subfield c. Three different DDC numbers have been added to the authority record for Aborter (in English Abortion). Two of them have specifications in subfield c and one of them does not, since it is the interdisciplinary number. There are also separate authority records with DDC numbers for some precoordinated subject headings with Aborter as the main heading, for example Aborter religiösa aspekter (in English Abortion - Religious aspects).

083 $a 362.19888
083 $a 618.88 $c Medicine
083 $a 618.29 $c Nonsurgical methods - medicine
150 $a Aborter
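As an illustration of how a client program might read such authority records, the following sketch parses the simplified textual form used in the examples above. It is not a MARC parser: the one-field-per-line layout, the function names `parse_field` and `mappings`, and the rule that a subfield c value counts as a SKOS property only if it matches one of the four property names are all assumptions made for the example.

```python
SKOS_PROPERTIES = {"exactmatch", "narrowmatch", "broadmatch", "relatedmatch"}

def parse_field(line):
    """Split a line such as '083 $a 958.1045 $c BroadMatch' into (tag, subfields)."""
    tag, _, rest = line.strip().partition(" ")
    subfields = {}
    for chunk in rest.split("$")[1:]:
        code, _, value = chunk.partition(" ")
        subfields[code] = value.strip()
    return tag, subfields

def mappings(record_lines):
    """Return (heading, ddc_number, skos_property, aspect) tuples for one record."""
    fields = [parse_field(line) for line in record_lines]
    heading = next(sf["a"] for tag, sf in fields if tag == "150")
    result = []
    for tag, sf in fields:
        if tag != "083":
            continue
        note = sf.get("c")
        # Subfield c holds either a SKOS property or an aspect label.
        is_skos = note is not None and note.lower() in SKOS_PROPERTIES
        result.append((heading, sf["a"],
                       note if is_skos else None,
                       None if is_skos else note))
    return result

record = [
    "083 $a 958.1045 $c BroadMatch",
    "150 $a Afghanistankriget 1979-1989",
]
print(mappings(record))
```

An 083 field without subfield c, such as the interdisciplinary number for Aborter, yields a tuple in which both the property and the aspect are `None`.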


Mapping SAB to DDC Another important part of the DDC project is the mappings between SAB and DDC. No library is planning to reclassify its collections manually, which means that there is a great need for automatic methods to handle the change of classification scheme. Initially DDC numbers and UDC numbers were added next to SAB classes directly in the classification scheme. Those numbers disappeared in the 6th edition of SAB, and a separate conversion table was published (Hansson 1987). The second, and latest, printed edition, between SAB 7 and DDC 21, was published in 2000 (Gustavsson 2000). There is an updated version on the web, mapping DDC 22 and the 8th edition of SAB (Konverteringstabell mellan Dewey och SAB). The conversion table is a very important tool in the Swedish switch to DDC and it needs to be revised and updated to be of high quality. For this work, data on the co-occurrence of the codes in the Swedish union catalogue LIBRIS is used. The mappings will serve several purposes:
• Form the basis for a joint search interface for SAB and DDC classed material
• Add DDC numbers to bibliographic records including SAB codes, and possibly vice versa
• In the initial phase, help librarians understand DDC and its differences and similarities compared to SAB

These new usages of the conversion tables need to be taken into account in the revision process. Before a Swedish switch to DDC was under way, the conversion table mainly helped cataloguers find a suitable SAB code for a work with a DDC number, or vice versa. The cataloguer could look numbers up and intellectually decide whether the given mappings were appropriate or not. Using the mappings to add notation to the bibliographic records automatically is another matter. No human who can evaluate the number is involved, and this puts higher demands on the data itself. In the revision process, some mappings are removed because they would not work in an automatic setting. In cases where there are several mappings to a DDC number or SAB code, one of them is marked as the primary. This will be the notation added to records when automatic methods are used. The relationship between the DDC and the SAB classes is recorded using four different SKOS properties: exactMatch, narrowMatch, broadMatch, relatedMatch.
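The automatic use of primary mappings can be sketched as follows. The table excerpt is illustrative only: the SAB and DDC pairs and their SKOS properties are those mentioned in this paper, but the primary flags, the extra non-primary entry, the record layout and the function name `add_ddc` are assumptions made for the example, not the project's actual data or software.

```python
# Illustrative excerpt: SAB code -> list of (DDC number, SKOS property, is_primary).
CONVERSION = {
    "Dodb": [("152.1", "broadMatch", True)],
    "Dodbc": [("152.14", "exactMatch", True)],
    "Dodbö": [("152.18", "narrowMatch", True),
              ("152.1", "relatedMatch", False)],  # invented non-primary mapping
}

def add_ddc(record, table):
    """Return a copy of the record with the primary DDC number for each SAB code added."""
    ddc = list(record.get("ddc", []))
    for sab in record.get("sab", []):
        for number, _relation, primary in table.get(sab, []):
            # Only the mapping marked as primary is applied without human review.
            if primary and number not in ddc:
                ddc.append(number)
    return {**record, "ddc": ddc}

enriched = add_ddc({"title": "Example", "sab": ["Dodbö"]}, CONVERSION)
print(enriched["ddc"])
```

Records whose SAB codes are missing from the table are returned unchanged, which mirrors the decision to remove mappings that would not work in an automatic setting rather than guess.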


Figure 1. Search results for SAB code Dodb in the Conversion table.

In the current display of the conversion table, the SKOS properties are shown as symbols. ExactMatch (=) is used when the classes are equivalent or very close to equivalent. DDC class 152.14 and Dodbc are, for example, equivalents. BroadMatch is shown as the mathematical symbol for “superset of”. SAB class Dodb is a BroadMatch to DDC class 152.1. That means that Dodb includes elements that are not included in 152.1. NarrowMatch is shown as the mathematical symbol for “subset of”. SAB class Dodbö is a NarrowMatch to 152.18. This means that 152.18 includes elements that are not included in Dodbö. When there are several mappings to a class, one of them is marked as the primary. This is shown as an arrow, either pointing in both directions, or in one of the directions. 152.1 and Dodb are for example a primary match, valid in both directions. Sometimes the relationship between the classes is easy to analyse and express with SKOS. Sometimes it is more complicated. In Svanberg (2008) some of the problems in the mapping of SAB and DDC are described and discussed.
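The co-occurrence data from LIBRIS mentioned earlier in this section could, for instance, be tallied as in the sketch below. The sample records are invented and the record layout and function name `cooccurrence` are hypothetical; only the SAB codes and DDC numbers themselves are taken from the examples above.

```python
from collections import Counter
from itertools import product

# Invented sample records: each lists the SAB codes and DDC numbers it carries.
records = [
    {"sab": ["Dodb"], "ddc": ["152.1"]},
    {"sab": ["Dodb"], "ddc": ["152.1"]},
    {"sab": ["Dodb", "Dodbc"], "ddc": ["152.14"]},
    {"sab": ["Dodbc"], "ddc": ["152.14"]},
]

def cooccurrence(records):
    """Count how often each (SAB code, DDC number) pair occurs in the same record."""
    counts = Counter()
    for rec in records:
        for pair in product(set(rec["sab"]), set(rec["ddc"])):
            counts[pair] += 1
    return counts

counts = cooccurrence(records)
for (sab, ddc), n in counts.most_common():
    print(sab, ddc, n)
```

Frequent pairs would be candidates for confirming or marking as primary, while rare pairs would point to mappings that need manual review; the counts are evidence for the reviser, not a replacement for intellectual analysis.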

Conclusion Adopting DDC is another step to facilitate international cooperation and become part of a big network of interlinked knowledge organisation systems and, behind those, categorised resources. It is time to abandon SAB after 87 years.

References
Delegationen för vetenskaplig och teknisk informationsförsörjning. Arbetsgruppen för klassifikationsfrågor. 1981. Klassifikationsfrågor: Rapport från DFI:s arbetsgrupp. DFI-publikationer 1981:5. Stockholm: DFI.
Dewey, Melvil. 2003. Dewey Decimal Classification and Relative Index. 22nd ed. Dublin, Ohio: OCLC Online Computer Library Center.
”En svensk övergång till DDK. Vad innebär det för folk- och skolbibliotek?”. 2010. Stockholm: Svensk biblioteksförening. Available at http://www.biblioteksforeningen.org/organisation/dokument/pdf/RapportDDK_20101124
Gustavsson, Bodil. 2000. Konverteringstabell mellan Dewey Decimal Classification (21. ed.) och Klassifikationssystem för svenska bibliotek (7. uppl.). Lund: Bibliotekstjänst.
Hansson, Joacim. 1997. “Why Public Libraries in Sweden Did Not Choose Dewey”. Knowledge Organization 24, no. 3: 145-153.
Hansson, Lars-Olof. 1987. Konverteringstabell mellan Dewey Decimal Classification (19. ed.) och Klassifikationssystem för svenska bibliotek (6. uppl.). Lund: Bibliotekstjänst.
Konverteringstabell mellan Dewey och SAB. Database available at http://export.libris.kb.se/DS/
Leth, Pia, and Ingrid Berg. 2004. “Subject Indexing in Sweden: the Creating of a National System Based on International Standards in a Country that Often Wanted to Go Its Own Way.” Paper presented at the World Library and Information Congress (70th IFLA General Conference and Council), 22-27 August 2004, Buenos Aires. Available at http://archive.ifla.org/IV/ifla70/papers/041e-Leth_Berg.pdf
—. 2008. “Subject Indexing in Sweden”. In New Perspectives on Subject Indexing and Classification: Essays in Honour of Magda Heiner-Freiling. 179-183. Leipzig: Deutsche Nationalbibliothek.
Mitchell, Joan S. 2006. “Dewey Numbers in Authority Files.” Discussion paper presented at Meeting 126 of the Dewey Classification Editorial Policy Committee (EPC), Washington, October 11-13, 2006. Available at http://www.oclc.org/dewey/discussion/papers/epc_126-35.doc
Mitchell, Joan S., Ingebjørg Rype, and Magdalena Svanberg. 2009. “Mixed Translations of the DDC: Design, Usability, and Implications for Knowledge Organization in Multilingual Environments.” Paper presented at Looking at the Past and Preparing for the Future, an IFLA satellite preconference sponsored by the Classification and Indexing Section, 20-21 August 2009, Florence, Italy.
Rype, Ingebjørg, and Magdalena Svanberg. 2008. “Blandet Utgave av Dewey. Presentsjon av en Pilotundersøkelse”. Infotrend 63, no. 4: 88-95. Available at http://www.sfis.nu/sites/default/files/dokument/infotrend/2008/blandet408.pdf
Svanberg, Magdalena. 2006. “Övergång till Dewey Decimal Classification. Vad Skulle det Innebära? Delstudie 3 i Katalogutredningen.” [Stockholm]: Kungl. biblioteket. Available at http://www.kb.se/Dokument/Om/projekt/avslutade/katalogutredning/delst3_slutrapport.pdf
—. 2008. “Mapping Two Classification Schemes - DDC and SAB.” In New Perspectives on Subject Indexing and Classification: Essays in Honour of Magda Heiner-Freiling. 41-51. Leipzig, Frankfurt am Main, Berlin: Deutsche Nationalbibliothek.
Viktorsson, Elisabet, ed. 2006. Klassifikationssystem för Svenska Bibliotek. 8th ed. Lund: Btj förlag.
W3C (World Wide Web Consortium). 2009. “SKOS Simple Knowledge Organization System. Reference. W3C Proposed Recommendation 15 June 2009.” Available at http://www.w3.org/TR/2009/PR-skos-reference-20090615/

Enhancing Information Services Using Machine-to-Machine Terminology Services

Gordon Dunsire

Abstract This paper describes the basic concepts of terminology services and their role in information retrieval interfaces. Terminology services are consumed by other software applications using machine-to-machine protocols, rather than directly by end-users. An example of a terminology service is the pilot developed by the High Level Thesaurus (HILT) project which has successfully demonstrated its potential for enhancing subject retrieval in operational services. Examples of enhancements in three such services are given. The paper discusses the future development of terminology services in relation to the Semantic Web.

Terminology Services

A terminology server is defined in Wikipedia (2011c) as “... software providing a range of terminology-related software services through an Applications Programming Interface to its client applications.” The services are not intended for end-users. Instead, they are to be used by computer programmers to improve client applications; that is, specific end-user services such as subject-based information retrieval interfaces. A client application will typically submit data to the terminology server along with a request for them to be processed in a specified way and the results returned to the application. The application may then further process the results before displaying them to, or otherwise interacting with, an end-user. The application software is run on the client computer, which is not the same machine as the terminology server computer. The interaction between the two sets of hardware and software is known as machine-to-machine (m2m) processing.

Terminology services have been defined as “Web services involving various types of knowledge organization resources [vocabularies], including authority files, subject heading systems, thesauri, Web taxonomies, and classification schemes … Web services are modular, Web-based, machine-to-machine applications that can be combined in various ways.” (Vizine-Goetz, D. et al. 2004). An example service is given as mapping from a term in one vocabulary to one or more terms in another vocabulary.

The OCLC Terminology Services project (OCLC, n.d.) has developed a set of simple services involving various English subject heading systems including Library of Congress Subject Headings (LCSH) (Library of Congress 2009), Medical Subject Headings (MeSH) (United States National Library of Medicine 2009), and Thesaurus for Graphic Materials (TGM) (Library of Congress 2007), although it does not currently include any mappings between these vocabularies. The services accept a client term (or its identifier) and return data about matching terms in a vocabulary specified by the client. Related terms from the vocabulary are included in the process. The client also specifies the format of the returned data, chosen to suit the needs of the client software; one of the available formats is Simple Knowledge Organization System (SKOS) (W3C 2009b), a component of the Semantic Web.

The previously cited Wikipedia article suggests several categories of m2m terminology service which can be expressed in terms of application functions as:

• Matching user-defined text with lexical resources, including dictionaries, authority files, and thesauri.
• Translations from one language to another, using controlled vocabularies and semantic mappings.
• Semantic relationships within specific vocabularies used in knowledge organization systems (KOS).
• Semantic relationships between specific vocabularies using ontology mappings.

These functions can be used in client applications to improve subject information retrieval interfaces for end-users. Work with the Scottish Collections Network (SCONE 2011) and CAIRNS (2011) has suggested examples of enhancements that would benefit users. One is spell-checking user input, to trap typing errors or match spelling variants. This might be done transparently, or with feedback to the user as a “Did you mean ... ?” message. Another example is clarifying a user’s search term when it is ambiguous relative to the one or more KOS involved: does the user intend “tree” to refer to forest or family? This process has been referred to as “disambiguation”; it perhaps comes as no surprise to see that Wikipedia (2011a) has to disambiguate it with the entry “Disambiguation (disambiguation)” although the default definition of “Word sense disambiguation” is the basis of its usage in relation to KOS. A further example is switching an uncontrolled user term to a controlled vocabulary term; as before, this may be achieved automatically, without reference to the user, or with intervention by means of a “Use: ...” message display. An important enhancement for union catalogues such as CAIRNS is the ability to match a user-supplied subject term to the equivalent term in each of the different vocabularies used for subject access in the different library catalogues in scope. This helps control the precision of the subsequent “one-stop” search across multiple heterogeneous subject headings.
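
As an illustration only, the “Did you mean ... ?” enhancement described above might be sketched in Python as follows. The toy vocabulary, the function name did_you_mean and the similarity cutoff are assumptions made for this sketch; the matching relies on the approximate string comparison in Python's standard difflib module, not on any method used by HILT or the services discussed here.

```python
from difflib import get_close_matches

# Toy vocabulary; a real client would obtain candidate terms from a terminology server.
VOCABULARY = ["teeth", "trees", "genealogy", "forestry", "dentistry"]


def did_you_mean(user_term, vocabulary=VOCABULARY, limit=5):
    """Return the term itself if it is known, else close spellings from the vocabulary."""
    term = user_term.strip().lower()
    if term in vocabulary:
        return [term]
    # cutoff is an arbitrary similarity threshold chosen for this sketch
    return get_close_matches(term, vocabulary, n=limit, cutoff=0.75)


if __name__ == "__main__":
    for typed in ["teeth", "forrestry", "geneology"]:
        print(typed, "->", did_you_mean(typed))
```

A client could apply the first suggestion transparently or present the whole list to the user for confirmation; which is preferable depends on the interface.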

HILT: High-Level Thesaurus Project

The High-Level Thesaurus (HILT) project (HILT 2009) started in 2000; its fourth phase was completed in May 2009. The project was funded by the UK’s Joint Information Systems Committee and supported by OCLC. Its overall scope was to provide subject interoperability in a multi-scheme environment via inter-scheme mapping, with an additional goal of identifying a generic approach that could be developed through distributed collaborative action. The main objectives of the fourth phase were to research and develop pilot solutions for problems in cross-searching multi-subject schemes. A terminologies server was implemented using the Dewey Decimal Classification (DDC) (OCLC 2011a) as a “switching language” between different Anglophone subject schemes and other vocabularies, including the DDC captions and relative index, Art and Architecture Thesaurus (Getty Research Institute, n.d.), UNESCO thesaurus (UNESCO 2003), LCSH, MeSH, and several others. Most of the mappings are partial, created for test purposes. Some non-English terms are also mapped for similar purposes. Several m2m protocols are used by the server; in particular, its output is made available in SKOS format. The project also developed pilot embedding of some of the terminology services in the user interfaces of several operational information services. These were SCONE, Intute (2009) and The Depot (OpenDepot.org, n.d.).

HILT Case 1: SCONE

SCONE is a service which uses metadata for collections located in Scotland, from all heritage domains such as archives, libraries, and museums. The interface allows users to identify and locate Scottish collections, and access finding-aids such as catalogues which describe the items held within them. Collections with a specific subject focus are classified with DDC and assigned LCSH entries to allow subject retrieval; multiple DDC notations as well as LCSH entries are used if necessary. The collection-level descriptions include metadata about the subject scheme used by a collection’s finding-aids. An early experiment in the use of HILT showed that it was possible to direct different, but semantically-equivalent, terms from MeSH and LCSH to corresponding CAIRNS catalogues. The result was an improvement in recall, rather than precision, because in all cases both vocabularies were consolidated into a single search index within target catalogues. That is, the local subject index combined MeSH and LCSH terms in a single list. The experiment was not investigated further because these were the only vocabularies used by CAIRNS catalogues and available from the HILT server.

One enhancement developed for SCONE as a pilot during the HILT project accepts a subject term input by the user, and then displays the DDC caption hierarchies which match the term. The match is primary if the term is present in the caption, or secondary if the term is found in another vocabulary mapped to DDC. Figure 1 is a partial screen-shot where the user has entered the term “teeth”, and is presented with a set of captions giving the different hierarchical contexts of the term in the DDC. Note that the last two captions displayed do not contain the user’s term, so they are secondary matches. Note also that two distinct high-level contexts are given for the term, so this example of disambiguation is significant.

Figure 1: Partial screen-shot of search term disambiguation in the SCONE HILT pilot.

The user can now select a highlighted term from one of the captions and use it to identify collections matching the term. The software achieves this by using the DDC notation for the selected term and matching it against those assigned to the collection descriptions. If no match is found, the DDC notation is shortened by one digit and the process is repeated. This is equivalent to broadening the semantics of the notation: because the notation is decimal, a shorter notation usually implies a broader concept. The process is repeated until several collections have been found, or the top of the notation hierarchy is reached. For example, the DDC notation for the fourth caption hierarchy in Figure 1 is 612.311. There are no collections in SCONE classified with this notation, so the notation is truncated by its last digit to give 612.31, and the search repeated. This is done successively through 612.3, 612, and 610, at which point several collections are matched and displayed, as in Figure 2.
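
The truncation procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SCONE implementation: the function names, the dictionary used as a collection index and the min_hits threshold (standing in for “several collections”) are all hypothetical, and the sketch assumes that zeros within the first three digits of a notation are padding.

```python
def format_ddc(digits):
    """Pad to three digits and place the decimal point after the third digit."""
    padded = digits.ljust(3, "0")
    return padded if len(padded) == 3 else f"{padded[:3]}.{padded[3:]}"


def broaden(notation):
    """Yield the notation itself, then successively shorter (broader) notations."""
    digits = notation.replace(".", "")
    if len(digits) <= 3:
        digits = digits.rstrip("0")  # assumption: zeros in the first three places are padding
    while digits:
        yield format_ddc(digits)
        digits = digits[:-1]


def find_collections(notation, index, min_hits=2):
    """Return the first notation with at least min_hits collections, and those collections."""
    for candidate in broaden(notation):
        hits = index.get(candidate, [])
        if len(hits) >= min_hits:
            return candidate, hits
    return None, []


if __name__ == "__main__":
    # Hypothetical index from DDC notation to collection names
    toy_index = {"610": ["Collection A", "Collection B"]}
    print(list(broaden("612.311")))
    print(find_collections("612.311", toy_index))
```

With the toy index shown, the search for 612.311 stops at 610, mirroring the example in the text.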

Figure 2: Partial screen-shot of SCONE collections matched by truncating the DDC notation 612.311.

HILT Case 2: Intute

Intute is an online finding-aid for web resources supporting study and research. The resources are selected by academics. Metadata is created by subject-focused component catalogue services, each of which uses its own subject scheme. High-level subject retrieval is supported by a scheme of 19 categories. The Intute pilot enhancement using HILT accepts a subject term input by the user and displays up to 10 related terms which can be used for another subject search. The service also displays any results from a search on the input term. The user can examine the results and select one of the related terms to redo the search if required. If no results are obtained, the pilot displays up to five terms with spellings related to the input term. Any of these can be selected by the user to carry out another search. The interface also displays DDC notations and captions related to the subject term, to demonstrate the potential of using HILT to identify appropriate terms for searching the different subject schemes used in the Intute component catalogues. This has not been developed further, and the links displayed are inactive.

Figure 3 is a partial screen-shot where “tree” has been entered as a search term. The related terms displayed indicate at least two distinct subject contexts, genealogy and forestry. If the current results, displayed at the bottom of the screen, are not in the expected subject domain, the user can click on one of the related terms, for example “genealogy”, and carry out a search on that topic.

Figure 3: Partial screen-shot of the Intute HILT pilot showing disambiguation and alternate term suggestions.

HILT Case 3: The Depot

The Depot (now OpenDepot.org) is an e-prints repository service aimed at researchers who do not have access to an institutional repository to deposit their papers. It relies on self-deposit, and the depositor is expected to generate metadata as part of the process. In particular, the user is required to assign one or more subject terms taken from the Joint Academic Coding System (JACS) scheme (UCAS. JACS 3, n.d.). The pilot enhancement developed using HILT services helps the user to identify the relevant JACS captions. It accepts a term input by the user and displays all JACS captions containing the term. If the input term is not found in the JACS vocabulary, the pilot searches for it in the DDC captions and displays JACS captions mapped to the corresponding DDC notations. The user is then asked to select one or more of the displayed JACS captions as the subject metadata for the deposit.

Figure 4 shows the pilot enhancement inserted into the user-generated metadata workflow at the point where subject classifications are to be added. The depositor has input “teeth” as a search term to identify appropriate JACS captions and notations. Figure 5 shows the results. The term has been matched to the JACS headings by finding it in DDC captions mapped to JACS. The user can further identify context by exposing the code hierarchies, or simply check the box against as many of the captions as deemed relevant.
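
The two-stage lookup used in The Depot pilot might be sketched as below. All codes, captions and mappings are invented toy data and do not come from JACS, DDC or HILT; the function name suggest_jacs is likewise hypothetical. The sketch shows only the fallback logic: direct caption matches first, then matches reached through DDC captions and their mappings.

```python
# Toy data, invented for illustration; not real JACS or DDC codes or captions.
JACS_CAPTIONS = {"J1": "Clinical dentistry", "J2": "Oral anatomy", "J3": "Forestry"}
DDC_CAPTIONS = {"D1": "Teeth", "D2": "Trees"}
DDC_TO_JACS = {"D1": ["J1", "J2"], "D2": ["J3"]}


def suggest_jacs(term):
    """Return (code, caption) pairs: direct caption matches, else matches via DDC mappings."""
    needle = term.lower()
    direct = [(code, cap) for code, cap in JACS_CAPTIONS.items() if needle in cap.lower()]
    if direct:
        return direct
    via_ddc = []
    for ddc_code, caption in DDC_CAPTIONS.items():
        if needle in caption.lower():
            for jacs_code in DDC_TO_JACS.get(ddc_code, []):
                pair = (jacs_code, JACS_CAPTIONS[jacs_code])
                if pair not in via_ddc:  # avoid listing a caption twice
                    via_ddc.append(pair)
    return via_ddc


if __name__ == "__main__":
    print(suggest_jacs("teeth"))     # no JACS caption contains the term, so DDC is consulted
    print(suggest_jacs("forestry"))  # found directly in a JACS caption
```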

Figure 4: Partial screen-shot of The Depot HILT pilot for helping user assignment of subject terms.

Figure 5: Partial screen-shot of JACS captions matched to user input via mappings to DDC.

Beyond HILT

The HILT approach is too expensive to scale across all subject schemes, despite the efficiency gains of using the hub-and-spoke architecture of a switching language. In such an architecture direct mappings between pairs of vocabularies are avoided by using the indirect mapping from spoke to spoke via the hub. This was recognized from the beginning of the project and is reflected in the “High level” part of its name. Hub-and-spoke mapping architectures are themselves less expensive to scale than direct, one-to-one, mappings. With direct mappings, each new vocabulary requires a new set of mappings to every existing vocabulary, rather than a single set of mappings to the hub or switching vocabulary. General, large-scale terminology services are therefore likely to employ hybrid architectures which complement a basic hub-and-spoke core by adding one-to-one mappings as cross-spoke links, where such one-to-one mappings are available. In some instances terms from two spokes may be linked directly by such a one-to-one mapping and also indirectly via the hub. In other instances, terms in one spoke may map only to the other spoke and not directly to the hub.
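
The scaling argument can be made concrete with a little arithmetic, sketched here in Python. The sketch assumes that one set of mappings serves both directions between a pair of vocabularies, and that the hub is counted as one of the n vocabularies; the function names are illustrative only.

```python
def direct_mapping_sets(n):
    """Mapping sets needed to link every pair of n vocabularies directly."""
    return n * (n - 1) // 2


def hub_mapping_sets(n):
    """Mapping sets needed when one of the n vocabularies acts as the hub."""
    return max(n - 1, 0)


if __name__ == "__main__":
    for n in (5, 6, 20):
        print(n, direct_mapping_sets(n), hub_mapping_sets(n))
```

Under these assumptions, adding a sixth vocabulary to five existing ones requires five new sets of direct mappings but only one new set of mappings to the hub.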

Figure 6: Hybrid architecture of hub-and-spoke and direct mappings between vocabularies.

In the example of a hybrid architecture given in Figure 6, KOS A is a hub for the vocabularies of the KOS B, C, D, and E spokes. There is also a direct mapping between KOS B and KOS E. KOS F is mapped directly to KOS C, and therefore indirectly to the KOS A hub. But KOS F is itself a spoke, along with KOS H and KOS J, to another hub KOS G. Mappings between vocabulary terms are usually created with human intervention to ensure that nuances of meaning within and between different languages are preserved. This is the major component of the cost of developing and maintaining mappings, which could be significantly reduced with the use of machine-processing. Statistical analysis of associations of terms from different vocabularies used to index the subjects of the same resource can be used to determine strong correlations between terms, as in the mapping between DDC and LCSH found in OCLC's (2011b) WebDewey service. Such analyses require a critical mass of test data, and become more accurate as the amount of data increases. OCLC's (2010) Classify service shows that consensus about the correct DDC number for a resource emerges from analyzing sufficiently large numbers of separate records for the resource. It might therefore be expected that the number of machine-generated mappings between terms in different vocabularies will increase as more and more metadata records are brought together in union catalogues and digital library aggregations. Another source of mappings may lie in user-generated metadata. Users are encouraged in many social networking websites such as the photograph-sharing service Flickr (2010) to "tag" information resources with their own terms describing what the resource is "about". Again, statistical clustering techniques can be used to ignore terms used very few times and arrive at a group consensus. 
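
A very simple form of the statistical approach is to count how often terms from two vocabularies are assigned to the same resource. The Python sketch below does only that; the records are invented, the min_support threshold is an arbitrary assumption, and the method is not claimed to be the one used in WebDewey or Classify.

```python
from collections import Counter
from itertools import product

# Invented records: each lists the terms assigned from two vocabularies to one resource.
RECORDS = [
    {"ddc": ["617.6"], "lcsh": ["Dentistry"]},
    {"ddc": ["617.6"], "lcsh": ["Dentistry", "Teeth"]},
    {"ddc": ["617.6"], "lcsh": ["Dentistry"]},
    {"ddc": ["634.9"], "lcsh": ["Forests and forestry"]},
    {"ddc": ["634.9"], "lcsh": ["Forests and forestry"]},
]


def candidate_mappings(records, min_support=2):
    """Count co-assigned (DDC, LCSH) pairs and keep those seen at least min_support times."""
    pairs = Counter()
    for record in records:
        pairs.update(product(record["ddc"], record["lcsh"]))
    return {pair: count for pair, count in pairs.items() if count >= min_support}


if __name__ == "__main__":
    print(candidate_mappings(RECORDS))
```

Pairs seen only once are discarded, which is the same idea as ignoring tags used very few times; with more records the surviving pairs become more reliable candidates for human review.
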
It is not difficult to imagine that the hybrid architecture of Figure 6 will involve hundreds (or more) of sets of mappings between controlled and uncontrolled vocabularies in an effective terminology service covering a general range of subjects.
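
The hybrid architecture described for Figure 6 can be treated as a graph in which each set of mappings is a link between two KOS. The following Python sketch uses a breadth-first search to find the shortest chain of mappings between two vocabularies. It assumes that mappings can be followed in either direction and chained together, which in practice may lose precision at each step; the function name mapping_path is hypothetical.

```python
from collections import deque

# Mapping links as described for Figure 6, treated as undirected for this sketch.
LINKS = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "E"),
         ("C", "F"), ("G", "F"), ("G", "H"), ("G", "J")]


def mapping_path(source, target, links=LINKS):
    """Breadth-first search for the shortest chain of mappings between two KOS."""
    neighbours = {}
    for left, right in links:
        neighbours.setdefault(left, []).append(right)
        neighbours.setdefault(right, []).append(left)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of mappings connects the two vocabularies


if __name__ == "__main__":
    print(mapping_path("B", "E"))  # linked directly as well as via the hub
    print(mapping_path("H", "B"))  # crosses from one hub to the other
```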

Semantic Web

The Semantic Web is "a group of methods and technologies to allow machines to understand the meaning … of information on the World Wide Web" (Wikipedia 2011b). It therefore relies on machine-processing of metadata, or data about data, as a source of "meaning" or aboutness. Machine-processing requires that the metadata is marked-up and identified for programmes supporting semantic-based services. The Semantic Web uses Resource Description Framework (RDF) (W3C 2004b) as a metadata model for the most basic possible type of metadata statement: something has-some-property (with a value of) something else. This three-part statement is known as a triple. Triples can be chained together by using special types of identifiers for each part, to create webs of so-called linked data. The markup of metadata into simple triples is essentially conceptual. The property forming the central part of a triple can be given a human-readable label, definition and scope note to ensure that cataloguers and retrieval system developers apply it in the correct semantic context. These meta-properties of label, definition and scope note are also essential to KOS and are available as RDF properties in SKOS.

SKOS was primarily designed for RDF representation of terms in thesauri, classification schemes, subject heading lists and taxonomies. As its name suggests, it can model simple relationships between terms, such as equivalence and hierarchy, but it does not provide capabilities for advanced structures such as faceted classification and subject heading schemes. These, however, can be marked-up using other RDF applications such as RDF Schema (RDFS) (W3C 2004a) and Web Ontology Language (OWL) (W3C 2009a). The Semantic Web therefore offers a number of features of use to terminology services:

• An environment optimized for machine-processing.
• An underlying framework (RDF) that can be scaled from single to multiple controlled vocabularies.
• A model for representing simple structures within and between vocabularies (SKOS).
• A means of representing more complex structures within and between vocabularies (OWL).
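
To make the idea of a triple concrete, the sketch below models a handful of statements as plain Python tuples and follows the hierarchy between them. It is a simplification, not an RDF implementation: the "ex:" identifiers and the concepts are invented, and only the property names skos:prefLabel and skos:broader are taken from SKOS.

```python
# Triples as (subject, property, value); the "ex:" identifiers are invented.
TRIPLES = [
    ("ex:teeth", "skos:prefLabel", "Teeth"),
    ("ex:teeth", "skos:broader", "ex:mouth"),
    ("ex:mouth", "skos:prefLabel", "Mouth"),
    ("ex:mouth", "skos:broader", "ex:digestive-system"),
    ("ex:digestive-system", "skos:prefLabel", "Digestive system"),
]


def values(subject, prop, triples=TRIPLES):
    """Return every value stated for the given subject and property."""
    return [o for s, p, o in triples if s == subject and p == prop]


def broader_chain(concept, triples=TRIPLES):
    """Follow skos:broader links upward and return the labels met on the way."""
    labels = []
    seen = set()
    while concept and concept not in seen:  # 'seen' guards against circular hierarchies
        seen.add(concept)
        labels.extend(values(concept, "skos:prefLabel", triples))
        parents = values(concept, "skos:broader", triples)
        concept = parents[0] if parents else None
    return labels


if __name__ == "__main__":
    print(broader_chain("ex:teeth"))
```

Because the value of one triple is the subject of another, the statements chain together; this is the sense in which triples form a web of linked data.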

Terminology Maintenance Services

The contents and structure of many controlled vocabularies and subject schemas change through time, as a result of the need to accommodate new subject topics, and expand or contract the definition and scope of existing topics. The Semantic Web discourages the deletion or removal of anything that has been identified and published, to prevent the breaking of established links. Instead, terms which are no longer current should be retained, but marked as deprecated so that they are not applied in future. An important consideration for terminology services is the currency of a vocabulary; how does the programmer of an application based on the terminology service find out about old and new versions of a term, and how can use of the current version be ensured?

Version control is therefore important, and terminology services should be prepared to identify the date of last update of their constituent vocabularies within the service, and store similar information about the amendment of individual terms within a vocabulary by its maintainers. In essence, the vocabulary itself and each term it contains should be assigned a series of time-stamp properties which is available to developers and end-users of applications based on the terminologies. RDF offers a useful mechanism for maintaining translations of vocabulary terms into other languages. A machine-readable identifier for a single term can be associated with labels, definitions, and scope notes in multiple languages, using a simple auxiliary identifier for the language. This allows applications using a terminology service to switch languages by applying the auxiliary identifier without altering the underlying programmes requesting output from the service. An application displaying a list of terms in English from a particular vocabulary can easily switch to displaying the Italian translations of those terms, if such translations exist. This approach is not useful, however, for the mapping of terms from one controlled vocabulary to another controlled vocabulary in a different language, because the relationships between terms within each vocabulary must be preserved to maintain semantic integrity and cohesion. Instead, the service needs to maintain a mapping between two different terminologies as a separate component which may require amendment if a term in either of the vocabularies is changed. In other words, the relationship of a term to its equivalent in another language can be modelled intrinsically within a single vocabulary and set of identifiers, while the relationship between terms from different vocabularies, whether in the same or different languages, can be modelled extrinsically using SKOS or some other set of mapping properties. 
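
The language-switching mechanism can be illustrated with a small Python sketch in which each term identifier carries labels keyed by a language code. The identifiers are invented, and the fallback to English when no translation exists is an assumption of this sketch rather than a behaviour described above.

```python
# One identifier per term; labels keyed by language code. Identifiers are invented.
LABELS = {
    "ex:teeth": {"en": "Teeth", "it": "Denti"},
    "ex:trees": {"en": "Trees", "it": "Alberi"},
    "ex:pigs": {"en": "Pigs"},  # no Italian label recorded
}


def display_terms(term_ids, language, fallback="en"):
    """Return labels in the requested language, falling back when none exists."""
    shown = []
    for term_id in term_ids:
        labels = LABELS[term_id]
        shown.append(labels.get(language, labels[fallback]))
    return shown


if __name__ == "__main__":
    ids = ["ex:teeth", "ex:trees", "ex:pigs"]
    print(display_terms(ids, "en"))
    print(display_terms(ids, "it"))
```

The calling code is identical for both languages; only the auxiliary language identifier changes, which is the point made in the text.
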
Another important consideration in the maintenance of terminology services is quality assurance. This is partially met by version control, but information about the source of a vocabulary is also an indicator of quality. Vocabularies maintained by large professional organizations such as the Library of Congress are likely to be of higher quality than those from small amateur organizations, and application developers may want to be able to identify and prefer or avoid some vocabularies in favour of others. This requirement is likely to increase if machine-generated and user-sourced vocabularies are part of the service. This does not imply that such vocabularies will always be of lower quality, but applications must have sufficient information to allow the appropriate vocabulary to be chosen to meet their functional requirements. The Semantic Web environment itself provides no level of quality control or indication. RDF is not designed to ascertain the truth of a triple: the simple statement "pigs" is a narrower term of "flying animals" can be expressed as a valid RDF triple. A semantic reasoning application would detect a conflict with the statement "Pigs cannot fly", but by itself would not be able to determine which is true and which is false. A human programmer would need to know that, say, the first triple came from a user-generated vocabulary about cartoons and the second statement from the Pig Breeders' Association before accepting or rejecting the metadata for the application.

Conclusion

Terminology and vocabulary services have an important role to play in computer-assisted information retrieval systems. They effectively bridge the semantic gap between humans and machines by encoding intellectual concepts and their organization into machine-processable representations that human programmers can use to build subject-based applications for end-users. In particular, the Semantic Web requires such services to develop utility from large numbers of basic metadata statements about terms and the relationships between them. Terminology services can provide complex building-block functions for interfaces matching user input to metadata about information resources, including disambiguation and monolingual and multilingual translation between specific vocabularies on a global scale. General terminology services themselves require access to as many vocabularies as possible, including fully-controlled terminologies and mappings from professional organizations, semi-controlled or uncontrolled terminologies from amateur, end-user sources, and machine-generated mappings from critical masses of metadata. These vocabularies are best represented in RDF in order to exploit and contribute to the utility of the Semantic Web. Open-access publishing of vocabularies and schemas in Semantic Web formats is likely to encourage uptake and development of terminology services. Several important vocabularies in widespread use in legacy metadata records are already available as part of the linked open data environment, as shown in the Linking open data cloud (Cyganiak and Jentzsch 2010), including LCSH, Rameau subject headings in French, and the Schlagwortnormdatei (SWD) subject headings in German, together with mappings between them.
The addition of linked data versions of other major subject heading and classification schemes in widespread use, and the development of terminology services, are essential to unlocking the world’s subject catalogues and indexes for the benefit of the Semantic Web and its users.

References

CAIRNS (Co-operative Information Retrieval Network for Scotland). 2011. [Homepage]. Available at http://www.scotlandsinformation.com/cairns/
Cyganiak, Richard and Anja Jentzsch. 2010. “Linking Open Data Cloud Diagram.” Available at http://lod-cloud.net/ Last modified September 22, 2010. Accessed January 15, 2011.
Flickr. 2010. Available at http://www.flickr.com/
Getty Research Institute. n.d. Art & Architecture Thesaurus Online. Available at http://www.getty.edu/research/tools/vocabularies/aat/ Accessed January 15, 2011.
HILT (High-Level Thesaurus Project). 2009. [Homepage]. Available at http://hilt.cdlr.strath.ac.uk/index.html
Intute. 2009. [Homepage]. Available at http://www.intute.ac.uk/
Library of Congress. 2007. “Thesaurus for Graphic Materials.” Available at http://www.loc.gov/rr/print/tgm1/
Library of Congress. 2009. “Library of Congress Authorities”. Available at http://authorities.loc.gov/
OCLC (Online Computer Library Center). 2010. “Classify: an Experimental Classification Web Service.” Available at http://classify.oclc.org/classify2/ Accessed February 2, 2011.
OCLC. 2011a. “Dewey services”. Available at http://www.oclc.org/dewey/ Accessed February 2.
OCLC. 2011b. “WebDewey”. Available at http://www.oclc.org/dewey/versions/webdewey/ Accessed February 2.
OCLC. n.d. “Terminology Services”. Available at http://www.oclc.org/research/projects/termservices/ Accessed January 15, 2011.
OpenDepot.org. n.d. [Homepage]. Available at http://opendepot.org/ Accessed January 15, 2011.
SCONE (Scottish Collections Network). 2011. [Homepage]. Available at http://www.scotlandsinformation.com/scone/
UCAS (Universities and Colleges Admissions Service). JACS (Joint Academic Coding System) 3. n.d. Available at http://www.ucas.com/he_staff/courses/jacs/jacs3 Accessed January 15, 2011.
UNESCO (United Nations Educational, Scientific and Cultural Organization). 2003. “UNESCO Thesaurus”. Available at http://www2.ulcc.ac.uk/unesco/
United States National Library of Medicine. 2009. “Medical Subject Headings”. Available at http://www.nlm.nih.gov/mesh/
Vizine-Goetz, D., Carol Hickey, Andrew Houghton and Roger Thompson. 2004. “Vocabulary Mapping for Terminology Services.” Journal of digital information 4, no. 4. Available at http://journals.tdl.org/jodi/article/viewArticle/114/113
W3C (World Wide Web Consortium). 2004a. “RDF Vocabulary Description Language 1.0: RDF Schema.” Available at http://www.w3.org/TR/rdf-schema/
W3C. 2004b. “Resource Description Framework (RDF).” Available at http://www.w3.org/RDF/
W3C. 2009a. “OWL 2 Web Ontology Language: Document Overview.” Available at http://www.w3.org/TR/owl2-overview/
W3C. 2009b. “SKOS Simple Knowledge Organization System - Home Page.” Available at http://www.w3.org/2004/02/skos/
Wikipedia: The Free Encyclopedia. 2011a. s.v. “Disambiguation”, San Francisco: Wikimedia Foundation. Available at http://en.wikipedia.org/wiki/Disambiguation_%28disambiguation%29 Accessed February 2.
Wikipedia: The Free Encyclopedia. 2011b. s.v. “Semantic web”, San Francisco: Wikimedia Foundation. Available at http://en.wikipedia.org/wiki/Semantic_web Accessed February 2.
Wikipedia: The Free Encyclopedia. 2011c. s.v. “Terminology server”, San Francisco: Wikimedia Foundation. Available at http://en.wikipedia.org/wiki/Terminology_Server Accessed February 2.

Session 3 Web Indexing and Social Indexing

Social Bookmarking and Subject Indexing

Lois Mai Chan

Abstract

The main purpose of the paper is to consider and examine social bookmarking as an activity of subject analysis and representation. It includes an introduction to social bookmarking and user-assigned tags and a comparison of social bookmarking and traditional subject cataloging/indexing based on a study of social tags and subject headings assigned to the same books. Issues explored include semantics (term selection) and syntax (tag/subject heading formulation, for example single words vs. phrases and headings with subdivisions), as well as the depth and exhaustivity in representing subject contents.

1. Introduction

Web 2.0 has spurred many web-based social networking activities, including YouTube, Facebook, MySpace, blogs, wikis, Twitter, and folksonomies. For those in the library field, none of the activities made possible by Web 2.0 has provided more challenge and opportunity than social bookmarking. In recent years, social bookmarking has become a very popular activity for Web users, including members of the general public. The rapidly growing phenomenon of social bookmarking may offer the library community, and especially the subject cataloging field, an unusual opportunity. Libraries, with their long history of collaboration and participation, would seem to provide a natural venue for implementing an optional social bookmarking operation as an adjunct to their normal subject cataloging or indexing programs. In this respect, social bookmarking can be seen as a way of allowing interested users not only to label documents of interest with terms that they believe would help themselves and others to retrieve them but also to offer their opinions of various works.

1.1 Definition of Social Bookmarking and Related Terms

For the popular topic of social bookmarking, perhaps it is appropriate to cite a popular source:

Social bookmarking is a method for Internet users to store, organize, search, and manage bookmarks of web pages on the Internet with the help of metadata. ... In a social bookmarking system, users save links to web pages that they want to remember and/or share. These bookmarks are usually public, and can be saved privately, shared only with specified people or groups, shared only inside certain networks, or another combination of public and private domains. The allowed people can usually view these bookmarks chronologically, by category or tags, or via a search engine. (Wikipedia, accessed 6/12/09)

There are many neologisms for web-based social networking activities, several of which include the term “tagging.” For “social tagging” there are many variant expressions. A recent Google search on a number of related terms resulted in the following:

Collaborative tagging    70,400
Collective tagging       4,620
Social bookmarking       47,600,000
Social indexing          19,900
Social tagging           738,000

(http://www.google.com/, accessed July 4, 2009)

The term “social bookmarking” was adopted for use in this study because it is the most frequently used term for the activity, as indicated by the statistics above. The purpose of social bookmarking is to provide subject representation and access to facilitate information retrieval for the benefit of users. This is also the purpose of subject indexing, the traditional method of providing subject access to information by way of formal subject cataloging and indexing activities, typically with the use of a controlled vocabulary such as a thesaurus or subject heading list, in libraries and abstracting and indexing services.

Automatic keyword indexing is the predominant approach to information retrieval on the Web. Controlled vocabulary indexing is the primary method for providing subject access and retrieval in library catalogs and in many abstracting and indexing services and products. In controlled vocabulary indexing, assigning controlled vocabulary terms is an intellectual process, requiring professional training and considerable mental acuity. Between automatic indexing and subject indexing is social bookmarking, which requires effort, thought, insight, and judgment but does not require knowledge of indexing principles or controlled vocabulary. Like automatic indexing, social bookmarking uses natural language; unlike automatic indexing, social bookmarking requires individual effort.

In social bookmarking, an assigned bookmark is also called a tag. The following definition of “tag” is found in delicious.com, one of the most popular social bookmarking sites:

A tag is simply a word you can use to describe a bookmark. Unlike folders, you make up tags when you need them and you can use as many as you like. The result is a better way to organize your bookmarks and a great way to discover interesting things on the Web. (http://delicious.com/, accessed June 30, 2009)

Another term that often appears in conjunction with social bookmarking is folksonomy. Vander Wal recalled the origin of the term “folksonomy” which he invented in 2004: I am a fan of the word folk when talking about regular people. ...if you took “tax” (the work portion) of taxonomy and replaced it with something anybody could do you would get a folksonomy. I knew the etymology of this word was pulling in two parts from different core sources (Germanic and Greek), but that seemed fitting looking at the early Flickr and del.icio.us. (Vander Wal 2007).

In 2007, Jakob Voss offered the following definition of “folksonomy”:

Folksonomy (also known as collaborative tagging, social classification, social indexing, social tagging, and other names) is the practice and method of collaboratively creating and managing tags to annotate and categorize content. In contrast to traditional subject indexing, metadata is not only generated by experts but also by creators and consumers of the content. Usually, freely chosen keywords are used instead of a controlled vocabulary. (Voss 2007)

Trant sums up the distinctions between folksonomy, tagging, and social tagging in the following terms: We can think of tagging as a process (with a focus on user choice of terminology); of folksonomy as the resulting collective vocabulary (with a focus on knowledge organization); and of social tagging as a socio-technical context within which tagging takes place (with a focus on social computing and networks). (Trant 2009)

1.2 Examples of Folksonomies

The popularity of social bookmarking has resulted in many folksonomies. Among the best known are: Delicious (http://www.delicious.com/) for webpages, Flickr (http://www.flickr.com/) and YouTube (http://www.youtube.com/) for multimedia, Technorati (http://technorati.com/) for weblogs, and LibraryThing (http://www.librarything.com/) for books. Some folksonomy websites include tag clouds, which are lists of tags using font type and size to indicate frequency of term use, for example, the cloud from LibraryThing (Figure 1).

1893 19th Century 2005 2006 2007 America American American history architect architecture history Audiobook Biography book club borrowed Chicago chicago Chicago World's Fair columbian exposition crime Daniel Burnham Ferris Wheel fiction historical historical fiction history Illinois library murder mystery nf non-fiction Novel own read serial killer tbr Thriller to read true crime united states unread us history world fair world's columbian exposition world's fair

Figure 1 The Devil in the White City by Erik Larson. 2004. (841)

For more examples, see Appendix A.
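The idea behind a tag cloud, where font size signals how often a tag is used, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `font_sizes`, the 10 to 32 point range, and the logarithmic scaling are choices made for this sketch and do not describe how LibraryThing renders its clouds. The sample counts are the site-wide usage figures reported in Appendix D.

```python
import math


def font_sizes(tag_counts, min_pt=10, max_pt=32):
    """Map each tag's usage count to a font size between min_pt and max_pt.

    Assumption for this sketch: sizes grow with the logarithm of the count,
    so that a handful of very popular tags do not dwarf all the others.
    Counts are expected to be positive integers.
    """
    if not tag_counts:
        return {}
    logs = {tag: math.log(count) for tag, count in tag_counts.items()}
    lowest, highest = min(logs.values()), max(logs.values())
    span = highest - lowest
    sizes = {}
    for tag, value in logs.items():
        # If every tag has the same count, all tags get the smallest size.
        scale = (value - lowest) / span if span else 0.0
        sizes[tag] = round(min_pt + scale * (max_pt - min_pt))
    return sizes


# Usage counts taken from the Top 75 Tags list in Appendix D.
counts = {
    "history": 758368,
    "non-fiction": 536284,
    "mystery": 505376,
    "biography": 336907,
    "crime": 103066,
}
sizes = font_sizes(counts)
for tag in sorted(counts, key=str.lower):
    print(f"{tag}: {sizes[tag]}pt")
```

The most used tag in the sample receives the largest size and the least used the smallest; everything else falls in between. A linear scale would also work, but with counts this skewed it would render most tags at nearly the minimum size.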

2. Methodology of the Study

This study was based on a review of earlier studies (Campbell 2006, Golder 2006, Guy and Tonkin 2006, Kipp and Campbell 2006, Quintarelli 2005, Rader 2007, Rolla 2009, Smith 2007, Spiteri 2007, Trant 2009, and Voss 2007) on social bookmarking and an examination of a group of 20 published nonfiction books (listed in Appendix B) which had been assigned bookmarks through the social bookmarking website LibraryThing (http://www.librarything.com/). LibraryThing was selected as the venue for the study not only because its tagging is specifically applied to books but also because it is one of the most popular websites for social bookmarking. Furthermore, its information and statistics are carefully documented. The study focused on the general characteristics of social bookmarking, which were then compared to those pertaining to subject indexing. The aims of the study were: (1) to examine the practice of social bookmarking and user-assigned tags, (2) to compare social bookmarking and traditional subject cataloging or indexing, and (3) to explore, in each activity, issues relating to semantics (term selection) and syntax (grammatical form of term or phrase). The study explores the following research questions: (1) How does social bookmarking differ in principle from traditional subject cataloging and indexing, particularly with regard to semantics and syntax? (2) What are the advantages and disadvantages of social bookmarking vs. controlled vocabulary indexing? (3) Are there ways of bridging the two methods of indexing in order to achieve the best of both worlds for subject access? The following information collected from the LibraryThing website was used as the basis of the study:

(1) Examples of Social Bookmarking (Appendix A)
(2) Most often tagged non-fiction (20) (Appendix B)
(3) Top 50 Long Tags (Tags with more than 20 letters) (Appendix C)
(4) Top 75 Tags (tag cloud) (Appendix D)

After a review of the social tagging literature (see References), the bookmarks assigned to the 20 nonfiction books noted above were examined in order to determine their general characteristics. Next, the Library of Congress (LC) subject headings assigned to the same books by the Library of Congress and/or OCLC member libraries were collected from OCLC WorldCat. The subject headings, too, were examined for eventual comparison with social bookmarks. Then, based on conclusions reached in previous studies by other investigators and on an examination of records from LibraryThing, the retrieval effectiveness of the two ways of indexing was compared and assessed with respect to both semantics and syntax. Conclusions were drawn on characteristics, similarities and differences, and advantages and disadvantages. Finally, responses were made to the original questions posed for the study.
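The comparison of user tags with assigned subject headings described above can be illustrated with a small Python sketch. It is not the procedure used in the study, which rested on examination of the records; the function names `normalize` and `compare`, and the choice of exact matching after case folding, are assumptions made for illustration. The sample data are a subset of the tags and the four LC subject headings (650 _0) recorded for The Tipping Point in Appendix A.

```python
def normalize(term):
    """Lower-case a term and collapse whitespace so case variants compare equal."""
    return " ".join(term.lower().split())


def compare(tags, headings):
    """Split two term lists into shared terms and terms unique to each side.

    Assumption for this sketch: two terms match only if they are identical
    after normalization. No stemming or synonym control is attempted.
    """
    tag_set = {normalize(tag) for tag in tags}
    heading_set = {normalize(heading) for heading in headings}
    return {
        "shared": sorted(tag_set & heading_set),
        "tags_only": sorted(tag_set - heading_set),
        "headings_only": sorted(heading_set - tag_set),
    }


# A subset of the tags listed for The Tipping Point in Appendix A.
tags = ["business", "causation", "contagion", "social psychology",
        "sociology", "tbr", "wishlist"]

# The LC subject headings (650 _0) listed for the same book.
headings = ["Social psychology", "Contagion (Social psychology)",
            "Causation", "Context effects (Psychology)"]

result = compare(tags, headings)
print("shared:", result["shared"])
print("tags only:", result["tags_only"])
print("headings only:", result["headings_only"])
```

Exact matching finds "causation" and "social psychology" on both sides but treats the tag "contagion" and the heading "Contagion (Social psychology)" as different terms, which is a small instance of the syntactic gap between short free-text tags and qualified subject headings.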

3. Discussion

In social bookmarking, no training in cataloging or indexing is required, thus enabling broad participation. Most assigned terms are free-text keywords, usually short expressions in the form of single words and short phrases (see Appendix D). These reflect the predominant approach to indexing and retrieval on the Web, and so mirror a pattern that is already familiar to end users. Users find it easy to assign tags because they can see what others have done. Furthermore, there is no limit to the number of terms that may be assigned per document, and there are usually no established policies or guidelines other than prohibition of inappropriate or unacceptable expressions. Thus, users are allowed almost full freedom in term choice; spontaneity is the rule of the day. Another common practice is to assign terms that are subjective expressions that provide both access points and additional information. Many bookmarkers add comments such as “wishlist” or “tbr” (for “to be read”). Some bookmarkers use coined words that do not help others with retrieval, and few seem to edit what they submit. The major characteristic of traditional subject indexing, on the other hand, is that it is based on standardized controlled vocabulary lists or thesauri that afford synonym and homonym control as well as cross references to broader, narrower, and correlative terms. Another characteristic, de facto rather than inherent, is that the number of subject headings or index terms assigned per work is relatively small. In the early days of subject cataloging in particular, the number of headings that could be assigned a given work was limited by the spatial restrictions of the three by five card and the prevailing policy that a subject heading should, as far as possible, represent “the whole book.” But, in recent years, the average number of headings assigned per item by library catalogers has been gradually increasing. A third characteristic is objectivity in assignment. It has been generally understood that catalogers or indexers should attempt to represent the subject contents of the resources objectively and not to express personal biases or judgment on what is being represented. There are thus clear differences between these two ways of assigning subject terms to information resources.
In the context of social bookmarking, the provider and the consumer of subject information are usually the same people, whereas in the library-information environment, the provider and the user are typically different individuals. In social bookmarking, there has been almost unlimited freedom; subject indexing is restricted to the use of established terms, applied following firm rules. Professional catalogers apply relatively few headings per item; social bookmarkers provide an abundance of tags. Social bookmarkers are often highly subjective in their tags and comments; catalogers, by long custom, are objective. Professional cataloging is very expensive; social bookmarking is done by volunteers. Social tags may be unpredictable; professional indexing is consistent and predictable.

3.1 Advantages and Disadvantages of Social Bookmarking

The popularity and rapid acceptance of social tagging seem to indicate dissatisfaction with the retrieval aids previously available. That social bookmarking goes some way to abating that dissatisfaction is one of its major advantages. Another advantage is that, with so few restrictions, bookmarking is quick and easy, with little elapsed time between posting or publication and a work being tagged. Also advantageous is that social bookmarking is very popular, with the result that many people are involved and therefore many books or other items can be tagged without delay. One interesting observation has been, from the evidence of terms assigned, that taggers appear to develop a sense of community, a good thing in itself and good for others because loyalty to the group increases responsibility toward the task. Moreover, an overall study of assigned tags suggests that most participants assign a large number of tags per item, providing information that may be helpful to many others. And a semantic study of assigned tags shows that they are likely to reflect up-to-date terminology. Another observation is that a large percentage of the most frequently assigned tags are form or genre labels (see Appendix D), a fact that may be of interest to controlled vocabulary and metadata designers. Another important contribution of social bookmarking is that user-assigned tags reveal their assigners’ perceptions of and approaches to subject access and information retrieval. This is important information for theorists of information retrieval. Catalogers, indexers, and creators of controlled vocabularies base their decisions about term selection in part on what is generally believed to be the “typical” user behavior in respect to information retrieval. User-assigned tags provide glimpses of “users’ real needs and language” (Quintarelli 2005). They can go a long way toward providing the data sought in earlier user studies, if social bookmarking behavior is paid due attention. Last but by no means least, because social bookmarking is done by volunteer end-users and requires no special training, basically no costs are involved. However, social bookmarking also has disadvantages. Researchers have identified the lack of vocabulary control as its greatest one. There may be multiple tags for the same concept, the same tag may have different meanings, and multiple inflections of the same word may be used. Near synonyms abound. Also, relationships among terms (broader, narrower, correlative consistency in expression) are not indicated. Searching on a given topic is difficult, and retrieval results are spotty.
A related problem arises from the tendency of some bookmarkers to use ad hoc coined words; such terms, known only to their coiners, cannot play a retrieval role, nor do they often convey much information to other users. Moreover, many bookmarkers are careless spellers. For example, the tags and expressions (identified and collected on the LibraryThing website) in Figure 2 have been found among user tags for the concept of “nonfiction.”

Tag info: non-fiction
Includes: non-fiction, *non-fiction, *sachbuch, @nonfiction, A:unfiction, Genre: non-fiction, Non Fictioin, Non Fiction, Nonfiction, Non-Fiction **, Non-Fiction., Non-Fiction;, Non-fictie, Non-fiction , Not-fiction, "non fiction", ^Nonfiction, facklitteratur, genre - non fiction, no-ficcion, nofiction, non fic, non-fcition, non-fic, non-ficion, non-ficition, non-ficiton, non-fictin, nonfictional, non-fictios, non-ficton, non-fistion, non-fitction, nonfic, nonficion, nonficition, nonficiton, nonfictin, nonfiction, nonfiction., nonficton, não-ficção, sachbuch, sakprosa (what ?)
Tag and its aliases used 1,673,568 times by 28,970 users.
(http://www.librarything.com/tag/non-fiction, accessed June 30, 2009)

Figure 2
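To make the spelling problem in Figure 2 concrete, the following Python sketch shows one possible way of folding variant spellings onto a preferred form. It is an illustration only and makes no claim about how LibraryThing builds its alias lists. The list `CANONICAL`, the helper names, and the similarity cutoff of 0.85 are assumptions; the similarity measure is the one provided by `difflib` in the Python standard library.

```python
import difflib
import re

# Hypothetical list of preferred forms; a real system would hold many more.
CANONICAL = ["nonfiction", "fiction", "biography"]


def squash(tag):
    """Lower-case a tag and drop everything except unaccented letters and digits."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


def canonical_form(tag, cutoff=0.85):
    """Return the closest preferred form for a tag, or None if nothing is close.

    The cutoff is an assumed value: higher values accept fewer variants.
    """
    matches = difflib.get_close_matches(squash(tag), CANONICAL, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Variants taken from Figure 2.
for variant in ["*non-fiction", "@nonfiction", "Non Fictioin",
                "non-ficiton", "nonfic", "sachbuch"]:
    print(f"{variant!r} -> {canonical_form(variant)}")
```

Under these assumptions, punctuation variants and simple misspellings such as "Non Fictioin" and "non-ficiton" resolve to "nonfiction", while the abbreviation "nonfic" and the German "sachbuch" do not. String similarity alone therefore catches only part of what Figure 2 lists; abbreviations and equivalents in other languages still need an explicit alias table of the kind a controlled vocabulary supplies.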

The fact that many social bookmarks are subjective has already been pointed out. Bias in indexing has always been deplored, although for those interested in search behavior, self-referential tags may provide clues that are not easily available through other means. There is also the concern about the scalability of social tags as the content of social bookmarking sites grows larger and larger:

I also wonder how well Flickr, del.icio.us, and other folksonomy-dependent sites will scale as content volume gets out of hand. Even now, for example, uploading your summer vacation photos to Flickr and tagging them "summer" will group them with over 6,000 other photos. Hard to browse now, harder when there are 60,000 photos a year from now. And it's a safe bet that no one will bother to go back and re-tag their photos with more precise terms. (Rosenfeld 2005)

Adam Mathes summarizes the pros and cons of folksonomy in the following terms: A folksonomy represents simultaneously some of the best and worst in the organization of information. Its uncontrolled nature is fundamentally chaotic, suffers from problems of imprecision and ambiguity that well developed controlled vocabularies and name authorities effectively ameliorate. Conversely, systems employing free-form tagging that are encouraging users to organize information in their own ways are supremely responsive to user needs and vocabularies, and involve the users of information actively in the organizational system. (Mathes 2004)

3.2 Advantages and Disadvantages of Subject Indexing

Traditional approaches to subject indexing also have their advantages and disadvantages. The trained personnel who follow established policies and use controlled vocabularies to perform subject analysis and indexing have been recognized as the source of the major advantages of traditional subject cataloging and indexing. In subject cataloging in particular, the complex syntax of many subject headings enables the provision of context and so renders an access point more expressive. In addition, terms in a controlled vocabulary are linked in a hidden ontology that is implied by broader-term, narrower-term, and related-term relationships, thus providing a hierarchical framework. Because of synonym control and established application policies, searchers can depend on consistency and greater predictability in retrieval results. Most of the disadvantages of current subject indexing practice lie in its high costs, costs that include training professional catalogers and indexers, who, in turn, merit high pay because they are specialists. Some of the rest lie in how much time normally passes between the publication of a work and the appearance in catalogs of its fully cataloged record. Building and maintaining a thesaurus, a subject headings list, or a classification system is a time-consuming and intellectually demanding undertaking. Such work is thus costly and unavoidably somewhat behind the times, often discouragingly so. Also, the complexity of the indexing and retrieval systems sometimes discourages searchers so that they are unable to benefit from all that the systems offer. Another shortcoming is the absence of a direct way to gauge users’ viewpoints. Nevertheless, for well over a century, the subject heading system run by the Library of Congress in the United States has provided searchers with dependable, inclusive search results with high precision and good recall.

134 Lois Mai Chan

4. Conclusion – Future Prospects and Potentials

The discussion above seems to lead us to a crucial question: should social bookmarking or subject indexing prevail? But must this question be an either-or proposition? This study has shown that each way of assigning subject terms to information resources has its strengths and its weaknesses, and each has something to offer the other. Can we not achieve the best results by combining the two approaches, thus maximizing the benefits of both? In the search for answers to these questions, writers on social bookmarking and folksonomy have proposed various measures. One is to incorporate social bookmarks into library catalogs (Rolla 2009). Another recurring proposal is to incorporate or implement controlled vocabulary into folksonomy. Mathes offers the following observation:

Overall, transforming the creation of explicit metadata for resources [for example, controlled vocabularies] from an isolated, professional activity into a shared, communicative activity by users is an important development that should be explored and considered for future systems development. (Mathes 2004)

The desirability of incorporating controlled vocabulary, or at least its features (synonym and homograph control and related terms), is echoed by other writers on folksonomy (Rosenfeld 2005). How best to bridge social bookmarking and subject indexing in order to maximize the benefits of both is a challenge for all those engaged in providing the most efficient and effective subject access tools. There are different ways of achieving this goal. Two possible approaches come to mind. From the perspective of social bookmarking operations, my suggestion is to match user-assigned tags with controlled vocabulary terms automatically. There are two ways to do this. The first is to develop a mechanism that maps existing user-assigned tags to controlled vocabulary terms. The second is to suggest controlled vocabulary terms to users during the tagging process. Ideally, a parallel operation would make it possible to include cross-references from controlled vocabularies in the search engine. Incorporating the advantages of controlled vocabulary should greatly facilitate the process of social bookmarking and at the same time enhance its value and usefulness. From the perspective of subject indexing, controlled subject access could be substantially enriched by consulting social bookmarks with three goals in mind: to gain users’ perspectives, to understand users’ searching behavior, and to enhance subject access terms. Furthermore, for those responsible for creating and maintaining controlled vocabularies, social bookmarks provide a rich source for suggesting terms (both valid terms and lead-in terms) for inclusion in the thesauri or subject heading lists. In conclusion, both social bookmarking and subject indexing have the same ultimate goal: to provide the most efficient and effective method for information storage and retrieval. These two methods represent considerably different approaches, but there is potential for user benefit in capturing the best from both.
Social bookmarking is an interesting and significant phenomenon; it behooves the library community to consider its power.
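The two mechanisms proposed in the conclusion, mapping existing tags to controlled vocabulary terms and suggesting controlled terms while a user is tagging, can be sketched as follows. This is a minimal illustration, not a description of any existing system. The three preferred headings appear among the subject headings in Appendix A, but the lead-in terms attached to them, the dictionary `VOCABULARY`, and the function names are hypothetical.

```python
# Hypothetical miniature vocabulary: preferred heading -> lead-in terms.
# The lead-in terms are invented for this sketch.
VOCABULARY = {
    "Decision making": ["decisions", "decision-making"],
    "Intuition": ["gut feeling"],
    "Social psychology": ["psychology, social"],
}


def _key(term):
    """Normalize a term for lookup: lower case, single spaces."""
    return " ".join(term.lower().split())


# Index every preferred term and every lead-in term by its normalized form.
INDEX = {}
for preferred, lead_ins in VOCABULARY.items():
    INDEX[_key(preferred)] = preferred
    for lead_in in lead_ins:
        INDEX[_key(lead_in)] = preferred


def map_tag(tag):
    """First mechanism: map an existing tag to a preferred heading, if any."""
    return INDEX.get(_key(tag))


def suggest(typed):
    """Second mechanism: offer preferred headings while the user is typing."""
    prefix = _key(typed)
    if not prefix:
        return []
    return sorted({pref for key, pref in INDEX.items() if key.startswith(prefix)})


print(map_tag("Decisions"))   # a lead-in term resolves to its heading
print(map_tag("tbr"))         # a personal tag has no controlled equivalent
print(suggest("psych"))       # matched through a lead-in term
```

Tags with no equivalent, such as "tbr", are simply left unmapped rather than discarded, which keeps the personal tags available while adding controlled access points alongside them. Because lead-in terms are indexed together with preferred terms, the same table could also supply the cross-references that the conclusion suggests adding to the search engine.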

References

Campbell, D. Grant. 2006. “A Phenomenological Framework for the Relationship between the Semantic Web and User-Centered Tagging Systems.” In Advances in Classification Research, Vol. 17: Proceedings of the 17th ASIS&T SIG/CR Classification Research Workshop (Austin, TX, November 4, 2006), edited by Jonathan Furner and Joseph T. Tennis. http://hdl.handle.net/10150/105357.

Golder, Scott A. and Bernardo A. Huberman. 2006. “Usage Patterns of Collaborative Tagging Systems.” Journal of Information Science 32(2): 198-208.

Guy, Marieke, and Emma Tonkin. 2006. “Folksonomies: Tidying up tags?” D-Lib Magazine 12(1): http://www.dlib.org/dlib/january06/guy/01guy.html.

Kipp, Margaret E.I. and D. Grant Campbell. 2006. “Patterns and Inconsistencies in Collaborative Tagging Systems: An Examination of Tagging Practices.” In Proceedings of the ASIST Annual Meeting, 2006. http://hdl.handle.net/10760/8720.

Library of Congress Working Group on the Future of Bibliographic Control. 2008. On the Record: Report of the Library of Congress Working Group on the Future of Bibliographic Control. http://www.loc.gov/bibliographic-future/news/lcwg-ontherecord-jan08-final.pdf.

Mathes, Adam. 2004. “Folksonomies – Cooperative Classification and Communication Through Shared Metadata.” Accessed January 17, 2011. http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html.

Quintarelli, Emanuele. 2005. “Folksonomies: Power to the People.” Paper presented at the ISKOUniMIB meeting, Milan, Italy, June 24. http://www.iskoi.org/doc/folksonomies.htm.

Rader, Emilee and Rick Wash. 2007. “Collaborative Tagging and Information Management: Influences on Tag Choices in del.icio.us.” In CSCW ’08 Proceedings of the 2008 ACM conference on Computer Supported Cooperative Work. http://dx.doi.org/10.1145/1460563.1460601.

Rolla, Peter J. 2009. “User Tags versus Subject Headings: Can User-Supplied Data Improve Subject Access to Library Collections?” Library Resources & Technical Services 53(3): 174-84.

Rosenfeld, Louis. 2005. “Folksonomies? How about Metadata Ecologies?” Weblog entry. LouisRosenfeld.com, January 6. http://www.louisrosenfeld.com/home/bloug_archive/000330.html.

Smith, Tiffany. 2007. “Cataloging and You: Measuring the Efficacy of a Folksonomy for Subject Analysis.” In Proceedings 18th Workshop of the American Society for Information Science and Technology Special Interest Group in Classification Research, edited by Joan Lussky. http://hdl.handle.net/10150/106434.

Social bookmarking. Wikipedia. http://en.wikipedia.org/wiki/Social_bookmarking (accessed June 12, 2009).

Spiteri, Louise F. 2007. “The Structure and Form of Folksonomy Tags: The Road to the Public Library Catalog.” Information Technology & Libraries 26(3): 13-25.

Trant, J. 2009. “Studying Social Tagging and Folksonomy: A Review and Framework.” Journal of Digital Information 10(1): 1-44. http://journals.tdl.org/jodi/article/view/269

Vander Wal, Thomas. 2007. “Folksonomy.” Weblog entry. vanderwal.net, February 2. http://www.vanderwal.net/folksonomy.html.

Voss, Jakob. 2007. “Tagging, Folksonomy & Co - Renaissance of Manual Indexing?” In Proceedings of the 10th International Symposium for Information Science. Cologne, Germany. May 30-June 1, 2007. http://hdl.handle.net/10760/9657.


Appendix A
Examples of Social Bookmarking

(1) A Short History of Nearly Everything by Bill Bryson. 2004 (957)

2003 2005 2006 2007 American astronomy Audible audio AudioBook biology bryson chemistry cosmology earth essays evolution favorite funny general science geology history history of science humor humour library natural history nature non-fiction own owned physics popular popular science read Reference science tbr to read travel universe unread wishlist world history

650 _0 Science $v Popular works.

Summary: In this book Bill Bryson explores the most intriguing and consequential questions that science seeks to answer and attempts to understand everything that has transpired from the Big Bang to the rise of civilization. To that end, Bill Bryson apprenticed himself to a host of the world's most profound scientific minds, living and dead. His challenge is to take subjects like geology, chemistry, paleontology, astronomy, and particle physics and see if there isn't some way to render them comprehensible to people, like himself, made bored (or scared) stiff of science by school. His interest is not simply to discover what we know but to find out how we know it. How do we know what is in the center of the earth, thousands of miles beneath the surface? How can we know the extent and the composition of the universe, or what a black hole is? How can we know where the continents were 600 million years ago? How did anyone ever figure these things out? On his travels through space and time, Bill Bryson encounters a splendid gallery of the most fascinating, eccentric, competitive, and foolish personalities ever to ask a hard question. In their company, he undertakes a sometimes profound, sometimes funny, and always supremely clear and entertaining adventure in the realms of human knowledge.


(2) Fast Food Nation by Eric Schlosser. 2005 (919)

agriculture america American American Culture animal rights business capitalism consumerism corporations cultural studies culture Current Affairs current events diet eating economics environment expose fast food food food industry food politics food writing health history journalism labor mcdonalds non-fiction nutrition obesity own owned political politics pop culture read restaurants science social commentary society sociology united states unread usa vegetarian

650 _0 Fast food restaurants $z United States.
650 _0 Food industry and trade $z United States.
650 _0 Convenience foods $z United States.
650 _6 Restaurants-minute $z États-Unis.
650 _6 Aliments $x Industrie et commerce $z États-Unis.
650 _6 Aliments précuisinés $z États-Unis.

Contents: American way: The founding fathers. Your trusted friends. Behind the counter. Success -- Meat and potatoes: Why the fries taste good. On the range. Cogs in the great machine. The most dangerous job. What's in the meat. Global realization -- Epilogue: Have it your way -- Afterword: The meaning of mad cow.

(3) Blink: the power of thinking without thinking by Malcolm Gladwell. 2005 (782)

2005 2006 2007 audio audiobook behavior blink brain business cognition cognitive science culture decision making decisions economics gladwell human ideas interesting Intuition Leadership library management Marketing mind non-fiction own perception philosophy pop psychology popular science psychology read science self-help social Social Psychology social science society sociology subconscious tbr thinking thought to read unread wishlist

650 _0 Decision making.
650 _0 Intuition.

No summary

(4) The Tipping Point: How Little Things Can Make a Big… by Malcolm Gladwell. 2002 (739)

2005 2006 2007 borrowed business causation change contagion cultural studies culture current affairs current events economics epidemic essays fads Gladwell ideas influence innovation leadership library management marketing networks nf non-fiction Own Philosophy psychology read science social social commentary social networks social psychology Social Science social theory society sociology tbr thinking tipping point to read trends unread wishlist

650 _0 Social psychology.
650 _0 Contagion (Social psychology)
650 _0 Causation.
650 _0 Context effects (Psychology)
650 _2 Social Behavior.
650 _2 Psychology, Social.
650 _2 Diffusion of Innovation.
650 _2 Leadership.
650 _2 Marketing.
650 _2 Group Processes.

Summary: Ideas, products, messages and behaviors "spread just like viruses do." Behavior can ripple outward until a critical mass or "tipping point" is reached, changing the world. Gladwell develops these and other concepts (such as the "stickiness" of ideas or the effect of population size on information dispersal) through simple, clear explanations and entertainingly illustrative anecdotes.

Contents: The three rules of epidemics -- The law of the few: connectors, mavens, and salesmen -- The stickiness factor: Sesame Street, Blue's Clues, and the educational virus -- The power of context (part one): Bernie Goetz and the rise and fall of New York City crime -- The power of context (part two): the magic number one hundred and fifty -- Case study: rumors, sneakers, and the power of translation -- Case study: suicide, smoking, and the search for the unsticky cigarette -- Conclusion: focus, test, and believe -- Afterword: tipping point lessons from the real world.


Appendix B
Most often tagged non-fiction (20)

Freakonomics [Revised and Expanded]: A Rogue Economist… by Steven D. Levitt (1143)
Guns, Germs and Steel by Jared Diamond (1021)
Eats, Shoots & Leaves: The Zero Tolerance Approach to… by Lynne Truss (985)
A Short History of Nearly Everything by Bill Bryson (957)
Fast Food Nation by Eric Schlosser (919)
The Devil in the White City by Erik Larson (841)
Blink: The Power of Thinking Without Thinking by Malcolm Gladwell (782)
The Tipping Point: How Little Things Can Make a Big… by Malcolm Gladwell (739)
Nickel and Dimed: On (Not) Getting By in America by Barbara Ehrenreich (734)
A Brief History of Time by Stephen Hawking (727)
In Cold Blood by Truman Capote (726)
The Diary of a Young Girl by Anne Frank (678)
The Professor and the Madman by Simon Winchester (664)
Into Thin Air: A Personal Account of the Mt. Everest… by Jon Krakauer (663)
On Writing: A Memoir of the Craft by Stephen King (663)
Into the Wild by Jon Krakauer (646)
The Elements of Style by William Strunk (642)
A Walk in the Woods by Bill Bryson (637)
Stiff: The Curious Lives of Human Cadavers by Mary Roach (619)
Reading Lolita in Tehran: A Memoir in Books by Azar Nafisi (615)

(LibraryThing, accessed June 30, 2009)


Appendix C
Top 50 Long Tags (Tags with more than 20 letters)

contemporary fiction (8856), 20th century literature (4480), contemporary romance (4276), 20th century fiction (2813), mass market paperback (2746), political philosophy (2683), Fantasy/Science Fiction (2394), Science Fiction/Fantasy (2279), childrens literature (2257), international relations (1841), children's picture book (1670), personal development (1565), contemporary literature (1564), intellectual history (1559), artificial intelligence (1425), philosophy of science (1363), British Crime Fiction (1302), software development (1280), science fiction and fantasy (1254), 19th century literature (1245), permanent collection (1243), history of philosophy (1229), contemporary fantasy (1108), latin american literature (998), children's nonfiction (974), Christian Nonfiction (957), juvenile non-fiction (951), philosophy of religion (903), dungeons and dragons (897), classical literature (891), programming languages (887), African American Fiction (857), literature in translation (852), short story collection (827), philosophy of language (826), information technology (802), comparative religion (774), biography/autobiography (765), 19th century fiction (764), african-american literature (754), historical linguistics (750), software engineering (727), children's nonfiction (726), victorian literature (715), books i want to read (708), young adult literature (695), Atlantian Reference Library (660), computer programming (630), illuminated manuscripts (615), American Contemporary (570)

(LibraryThing, accessed July 9, 2009)


Appendix D
Top 75 Tags

(1) tags with usage data

fiction (2,695,675), fantasy (905,349), history (758,368), non-fiction (536,284), mystery (505,376), read (483,593), science fiction (476,567), nonfiction (424,887), poetry (337,330), biography (336,907), unread (312,539), novel (293,044), reference (286,609), own (252,531), (250,312), romance (246,922), literature (244,753), philosophy (244,249), art (228,035), religion (217,804), short stories (209,441), humor (207,349), sf (205,037), tbr (202,983), science (201,927), historical fiction (181,676), children's (162,795), series (159,130), travel (152,040), horror (151,169), manga (150,157), children (137,791), comics (136,400), classic (135,527), music (132,880), politics (123,969), young adult (123,731), paperback (122,902), anthology (118,822), classics (117,758), memoir (116,544), 20th century (114,377), theology (107,553), crime (103,066), psychology (103,028), picture book (102,308), graphic novel (100,834), american (98,817), england (98,798), cooking (97,057), cookbook (96,893), ya (96,848), essays (96,402), thriller (91,141), drama (90,048), christianity (88,614), british (88,049), humour (87,315), adventure (85,814), historical (82,837), english (82,357), sci-fi (81,728), language (81,337), wishlist (81,031), owned (77,516), childrens (76,364), writing (76,052), animals (75,137), magic (74,779), autobiography (73,894), christian (72,705), 2008 (71,491), hardcover (69,322), photography (69,097), ebook (67,962)

(LibraryThing, accessed July 9, 2009)

(2) tag cloud

Essays Ethics etiquette Europa Europe European European history Evangelism Evolution ex-library exercise exhibition exhibition catalog existentialism exploration f fables faerie fairies fairy tale Fairy Tales Faith families Family Fantasy fantasy fiction farm farming fashion Favorite favorite author Favorites favourite female author female protagonist Feminism feminist Fic fictie fiction field guide Film Finance Finished finnish First Edition fish fishing fitness Florida flowers Folio Folio Society Folklore folk tales Food food and drink football for sale foreign language forensics Forgotten Realms fr France Freemasonry French French History French language French Literature French Revolution friends Friendship fun funny furniture future futuristic g gaiman game Games gaming garden Gardening gardens gave away Gay gay fiction gender gender studies Genealogy General General Fiction genetics Geographie Geography geology Georgia German German History German Literature Germany Geschichte geschiedenis ghost stories ghosts gift girl girls glbt globalization God gods Golf gone good Gospels Gothic government Grammar graphic graphic design Graphic Novel Graphic Novels graphics Great Britain Greece Greek Greek literature greek mythology green grief growing up Guide guidebook guitar gurps h Halloween handbook Hardback Hardcover harlequin Harlequin Presents Harry Potter have have read Hawaii hb hc healing Health Hebrew herbs hermeneutics high fantasy high school hiking Hinduism his histoire Historia Historical historical fantasy Historical fiction historical mystery historical novel Historical Romance historiography history history of science Hobbies holiday holidays Hollywood Holocaust Home homeschool homeschooling homosexuality Horror

(LibraryThing, accessed June 30, 2009)

Social Indexing at the Stockholm Public Library

Harriet Aagaard

Abstract

The Stockholm Public Library launched a new website in February 2008, Biblioteket.se (http://www.biblioteket.se), integrated with the OPAC. Biblioteket.se has web 2.0 features that allow users to set tags, and to rate and review books and films. This study reviewed all tags that had been set in the preceding two and a half years, and focused more specifically on those that reflect subjects and that had been used more than ten times. The grammatical form of these tags was compared to the grammatical form used for subject headings. Several methods, including two focus groups with teenagers, were used to understand why people tag and why they do not, and to determine whether there are differences between tagging by librarians and by non-librarian users. The article concludes with recommendations to improve the tagging experience and the use of tags set by others.

1. Introduction

The Stockholm Public Library launched a new web 2.0 website in February 2008, Biblioteket.se (http://www.biblioteket.se/), with an integrated OPAC. At this website users can set tags, give ratings, and write reviews for books, films, etc. Biblioteket.se was made available as a beta version in February 2007, but at that time only staff and invited “beta-testers” could log into the website and create tags. Recently, there has been a lot of discussion about social indexing in library catalogues, but few research reports. In this paper, I examine the tagging at Biblioteket.se.

2. Method

For each tag, I have obtained information about when the tag was first set and how many times it has been used. Information about who has set each tag can only be obtained by searching Biblioteket.se and stumbling upon the user through the review function. A user visible at Biblioteket.se has an alias and is anonymous. On the user’s page at Biblioteket.se I can see whether the user has tagged any books (Figure 1).


Figure 1. “Etiketter” is the Swedish word for tags. “Clevery Klipsks etiketter” shows the tags set by the user Clevery Klipsk.

The date set for the review tells something about the user. Prior to 2008-02-15, tags were set only by librarians.1 After that date, a tag could have been set by either a librarian or a non-librarian. I studied all tags to see whether they are private or reflect personal opinions, and also compared them to the rules for setting subject headings, but I have not compared each tag with the subject headings used in our catalogue. I studied the sixty-six tags that were used more than ten times by

• comparing the tags to subject headings for fiction and non-fiction,
• looking at the kind of book and media the tag is used for, and
• trying to determine whether each tag was set by a librarian or a non-librarian.

I also wanted to know more about why people tag. To answer this question, I set up two focus groups with teenagers, advertised at the website for views on tagging, emailed staff at the Stockholm Public Library, and asked a small group of people to tag.

1 I use “librarians” for all employees at the Stockholm Public Library, not only for trained librarians, for practical reasons. “Non-librarians” are users not working at the library.

3. Tagging at biblioteket.se

To be able to tag, users need to log in to the website. When doing this for the first time they have to agree to certain rules and create an alias. This is to avoid abuse. Other users can also report abuse. For example, bad language can cause a user to be banned from further access. So far this has been successful: we have not had any problems with abusive language, etc.

Individual tags are separated by a space and can only consist of one word. To form a multi-word tag, you have to leave out the spacing, e.g., LasVegas, or use a connecting character, e.g., Las_Vegas or Las-Vegas.

Tagging is not very visible on the start page, but most searches that reach Biblioteket.se come from Google and are for specific books or subjects. In this instance, the user comes directly to a view that encourages tagging (Figure 2).
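The one-word rule for tags described in this section can be illustrated with a minimal sketch. It assumes that the tag field is simply split on whitespace; the function name split_tags is hypothetical and the code is not taken from Biblioteket.se.

```python
def split_tags(raw: str) -> list[str]:
    """Split the text of a tag field into tags.

    Assumption for illustration: every whitespace-separated token becomes
    one tag, so a tag can never contain a space.
    """
    return raw.split()


# A town name typed with a space ends up as two separate tags ...
print(split_tags("Las Vegas"))               # ['Las', 'Vegas']
# ... while the workarounds each stay together as one tag.
print(split_tags("LasVegas Las_Vegas Las-Vegas"))
```

Under this assumption the three workarounds produce three different tags for the same town, which is why the number of unique tags grows.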

Figure 2. Search on Google for “Flyga drake recension” (the seventh hit in a search performed 2009-06-29).

Figure 3. After logging in you can start setting tags (Etiketter in Swedish).

There is information about how to tag from the “Skriv och tyck” (Write your opinion) page but the information is not easily available from the page shown in Figures 2 and 3. The tags can be searched by a special search interface or by browsing the most used tags. It is not possible to combine searches for tags with information from the bibliographic record. We do not use a tag cloud.


4. Subject Headings

The subject headings are quite hidden at Biblioteket.se. Users have to click on “More information” about a book, etc., to see the subject headings, and not all of the subject headings are clickable for further searches. Since the subject headings are so hidden, I did not compare subject headings to tags set for a specific title. I did compare tags to the subject headings, but not at the record level.

The Stockholm Public Library has used subject headings for nonfiction according to the Library of Congress Subject Headings (LCSH) since 1932. We began setting subject headings for fiction around 1990. Subject headings are also generated from the classification system for nonfiction, and additional subject headings for fiction are supplied by a vendor. This makes some records overloaded with subject headings. Helping users search by subject is an important task for a library and we plan to improve this, but so far we have not managed to do so.

5. Tags

Tagging has been available to librarians for two and a half years. From December 18, 2006 to February 15, 2008 only librarians could tag. Tagging was one of the tasks in our “23 things project” (November 2007). One hundred and fifty-two people completed the tagging task. At least 1,272 tags have been set by librarians, i.e., more than 60% of the tags. All users with a library card and a PIN code have been able to set tags since February 15, 2008.

The Umeå district library has 4,300 unique tags. They have not studied the tags, but know that one librarian has been a very active tagger.2 Umeå is the only other library in Sweden that allows users to set tags, but many libraries are planning to replace their old OPACs with new library 2.0 style websites. The Ann Arbor district library has 22,000 unique tags used on 11,000 titles and set by 537 different users. Tagging has been possible there for two and a half years. Around 5% of the records have tags and less than 1% of the users have set tags.3

The Stockholm Public Library has 625,000 bibliographic records and 200,000 active users (165,000 if children and elderly people are not included). It is not possible to find out how many of the records have been tagged, nor how many different users tag.

The tags

2006-2007                      1221
2008                           +602
2009 (Jan. 1 – June 15)        +316
Total number of unique tags    2139
Total number of taggings       5325 (note 4)

650 (32%) of the tags have been used more than once.

2 E-mail from Jenny Eklund dated 2009-05-26.
3 E-mail from Eli Neuburger dated 2009-07.
4 Information on the total number of taggings is from 2009-08-19 and Figure 4 is based on this data. Information on unique tags is from 2009-06-15.


An achievable goal for the future may be one percent of active users setting tags (1,650); I estimate that there are around 200 at the moment, or 0.1%.

Figure 4. Tag use. Column 1 shows that as of August 19, 2009, 1400 tags are unique, having been used only once. Column 11 shows the number of tags used more than 10 times.
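The counting behind a chart like Figure 4 can be sketched as follows. This is only an illustration in Python, not the tool used for the study; the function name usage_histogram and the sample taggings are hypothetical.

```python
from collections import Counter


def usage_histogram(taggings, cap=10):
    """Return a list with cap + 1 columns, as in Figure 4.

    Column n (for 1 <= n <= cap) counts the tags used exactly n times;
    the last column counts the tags used more than `cap` times.
    `taggings` holds one entry per tagging event (the tag that was set).
    """
    uses_per_tag = Counter(taggings)          # tag -> number of times used
    columns = Counter()
    for uses in uses_per_tag.values():
        columns[min(uses, cap + 1)] += 1      # everything above cap shares one column
    return [columns.get(col, 0) for col in range(1, cap + 2)]


# Hypothetical sample: one tag used 12 times, one used twice, two used once.
sample = ["fantasy"] * 12 + ["humor"] * 2 + ["origami", "krig"]
print(usage_histogram(sample))   # [2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```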

All 2,139 tags have been studied and classed as5

• Private tags, e.g., “To_read”
• Tags reflecting personal opinions, e.g., “Smashing”
• Subject tags

All tags    Private    Reflect opinions    Subjects
2139        92         128                 1919
100%        4.3%       6.0%                89.7%

Most tags reflect subjects and are not for private use or personal opinions. This is a bit surprising, but may be due to the large number of librarians among the taggers. It may also be because the users do not know how to use tags for private purposes. Among the tags set by known non-librarians, 9.9% reflected opinions, and none were for private use.

While looking at lists of the tags I found that many of them dealt with geography, genre, or prizes and awards, and I chose to study these groups.

Subject tags    Genre    Geography    Prizes and awards    Other
1919            106      152          32                   1633
100%            5.5%     10.8%        1.7%                 82%

Three hundred and thirty-four tags, or 17.3% of the subject tags, have a grammatical form that differs from the rules for subject headings. For librarians these rules are natural, but for non-librarians they are not obvious. Of these tags:

• 62.9% are nouns in the singular form, “dog” instead of “dogs”. Though librarians know that the subject heading should be “dogs”, they may choose to set “dog” to make it easier for non-librarians to search. An example of this is a librarian who has set both the singular and the plural form for an animal to enhance searches.
• 12.3% are adjectives like “våldsam” (violent).
• 24.8% have other differences: “Köpa_hus” instead of Husköp (“buy_house” instead of “House buying”), “medeltid” instead of Medeltiden (“Middle age” instead of “The Middle Ages”).

5 I have also been inspired by the articles by Spiteri (2007) and Mendes et al. (2009) in classing the tags.

It seems likely that more tags would have a form different from the subject headings if there were more non-librarian taggers. Among the non-librarians, 25.4% of the subject tags have a different grammatical form.

5.1 The Most Popular Tags

The most popular tags are:

Usage    Tag
145      Öknar - Deserts
137      klassiker
87       fantasy
65       Origami
55       Bör-läsas – Should be read
48       humor
47       kärlek – Love
46       Hugo-vinnare – A literary prize
42       provläs – Read to see if you like it
40       Krig - War
39       Relationer

I studied the 66 tags that have been used more than 10 times. The tags were set for fiction, non-fiction, films and music.

Fiction    Non-fiction    Film    Music
41         33             19      5
62%        50%            29%     7.6%

I compared the 56 tags that are not for private use and that do not reflect personal opinion to the subject headings for fiction for children and adults and the subject headings for nonfiction.

Identical    Almost identical    More specific    More common    Different
29           6                   7                11             3
51.8%        10.7%               12.5%            19.6%          5.4%


Many of the tags are identical to the subject headings. The “Almost identical” tags differ in grammatical form or use a slightly different word. Examples of “More common” words include “Queer” instead of Queertheory, “usa” instead of The United States of America, and “Progg” (a Swedish radical music movement in the 1970s) instead of Musikrörelsen. “More specific” tags are those set for literary prizes and more specific genres such as Feministic SF. The tags that are “Different” include:

• “activities_with_children” - a very useful tag for parents.
• “Children_in_danger”. The subject headings for these books are Child abuse, Bullying, Honor killings, Immigrant girls. “Children_in_danger” sets the focus on the child and is a good way of finding all books on this subject.
• “Alternative_history”.

5.2 Multi-word and Phrase Tags

Many users want to form multi-word or phrase tags, and do so with a hyphen, underscore, etc. Quite often they fail and create separate tags instead. For example, “Att bli frisk” (to become well) is set as three tags: “att”, “bli” and “frisk”. If it were possible to form multi-word tags, the number of unique tags would decrease, and it would be easier to avoid forming meaningless tags like “att” (to). The name of a town such as Las Vegas would also most likely always get the tag “Las Vegas” instead of several variations.
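To show how several variations of the same name inflate the number of unique tags, the following sketch groups tags that differ only in case and in the connecting character. The grouping rule is an assumption made for illustration (it is not a feature of Biblioteket.se), and the names variant_key and group_variants are hypothetical.

```python
from collections import defaultdict


def variant_key(tag: str) -> str:
    """Hypothetical grouping key: ignore case, underscores, hyphens and spaces."""
    return "".join(ch for ch in tag.casefold() if ch not in "_- ")


def group_variants(tags):
    """Map each grouping key to the sorted list of distinct spellings found."""
    groups = defaultdict(set)
    for tag in tags:
        groups[variant_key(tag)].add(tag)
    return {key: sorted(spellings) for key, spellings in groups.items()}


groups = group_variants(["LasVegas", "Las_Vegas", "Las-Vegas", "fantasy"])
print(len(groups))          # 2 groups instead of 4 unique tags
print(groups["lasvegas"])   # ['Las-Vegas', 'LasVegas', 'Las_Vegas']
```

Such a rule is deliberately crude: dropping hyphens could also merge tags that are genuinely different, so in practice the groups would need to be checked by a person.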

6. Users: Librarians and Non-Librarians

It has proved very difficult to determine who has set a tag. I have found only 22 non-librarian users who have set tags, and have also studied the tags set by the 5 non-librarians that I asked to tag. The 27 non-librarians studied have set more tags that reflect personal opinions: 9.9% compared to 6.0% of all tags. Their subject tags also differ more often in grammatical form: 25.4% of tags set by non-librarians compared to 17.3% of all tags. It would be very interesting to perform a larger test with non-librarians. I have tried to get users to contact me, but this has proven difficult.

7. Why People Tag and Why They Do Not

Most non-librarian users at Biblioteket.se do not tag even when they write reviews and give ratings. It is not uncommon for a user to have rated 50, 70 or even 120 books without tagging any of them. Tagging does not seem to be the most popular way to participate at Biblioteket.se, so what do people think about tags?6

I set up two focus groups consisting of teenagers who normally meet at the library.7

A) Participants in role-playing games (4 persons, 15-20 years old)
B) Participants in other games (5 persons, all 17 years old)

They all use the library, but only use the website if they really want a particular title. Both groups were sceptical of user-supplied information like reviews and ratings. They held the opinion that only persons who really liked something would be inclined to give ratings and reviews. Many reviews were not very interesting because they were short and only stated, e.g., “This is a wonderful book!” They also said that it was difficult to know whether a reviewer liked the same books as they did, and they preferred getting reading advice from friends or browsing the library shelves. Both groups preferred the professional librarian reviews to user-supplied reviews and wanted descriptions of books, not simple recommendations.

They did not search the catalogue for fiction by subject. They did search for nonfiction by subject if they needed something for school work, and could see the purpose of the subject headings. Since they rarely searched by subject, it is not surprising that they did not really see the value of tags in the library catalogue. None had tagged before.

A boy in group B thought that you needed to be highly committed in order to start writing reviews, tags and grades. If you belong to a social community on the web you would feel obliged to participate. He also pointed out that there is a difference between rating books and rating music on the web. Listening to music at Last.fm and tagging it afterwards is quite easy and does not require much time; books take much longer to read. The others agreed.
At Biblioteket.se it is possible to contact other users and to discuss books, but it is not a social community on the web. The Umeå district library started their website Mina bibliotek (My libraries – http://www.minabibliotek.se/) in 2007 with a social community, but few people used the community part of the website. Social communities vary in popularity. In Sweden Lunarstorm was very popular with teenagers for several years, but now many use Facebook instead. I think it would be a good idea to make it easier for users to communicate about books, e.g., through virtual book circles, and to find new friends at Biblioteket.se, but a full social community has little chance of becoming popular and well used.

Even though the advertisement has been on the start page of Biblioteket.se and on “Skriv och tyck” (Write your opinion) for three months, no users have contacted me in response to it.

I asked eight non-librarians, hereafter referred to as A-H, to add about 10-20 tags.8 Only five of them did. “A” reviewed books instead and told me it was impossible to set tags. She needed more words to express herself. “B” and “C” did not set tags though they were reminded several times. “B” (19 year old male) told me it was too much work and “C” (20 year old woman) told me too late that she did not understand what to do. Most of the five persons who did tag found it difficult but interesting, and none would have done it if they had not been asked to do so. “D” (22 year old male) set 30 tags and found it quite easy. “E” set only three tags (Fantasy, Drugs, and Trögläst - Slow-to-read) and told me it was too difficult. “D” thought that there are too few tags at the website and that this would discourage people from setting tags. He also wanted to be able to form expressions and not only single words. He formed phrases like “svår_att_lägga_ned” (difficult_to_put_down) and “växa_upp” (grow_up). “F” liked setting tags, but thought that it would be better if there were some instructions. She did not find the instructions available at the website.

I have also discussed tags with other non-librarians. Many of them did not want to set tags and did not understand how tags could be useful. I have come to the conclusion that not all users want to classify their reading.

Why do librarians tag? Librarians are generally very interested in searching the catalogue in various ways. Most of the librarians do not set subject headings, but setting tags may be interesting even for those who do. One librarian stated that with subject headings you have rules to follow, but you can express anything with tags - your feelings and opinions about the book, trivia about the book or the author apart from the book itself, private stuff, etc.

6 Ding et al. (2009) have studied tagging in Delicious, Flickr and YouTube and find that tagging at Flickr and YouTube is done by few users. At YouTube “the role of tagging is overshadowed by rating and commenting” (p 12). The same is true at Biblioteket.se: tagging is not as popular as giving reviews and ratings.
7 The focus groups were both set up at the branch library Medborgarplatsen on 2009-05-13. I would have preferred having more focus groups but there was not enough time to do so.

8. Öppna Bibliotek (Open Libraries)

The Stockholm Public Library is in charge of Öppna bibliotek (http://www.oppnabibliotek.se/), a project on how to share added value to the bibliographic record by creating APIs. The first part of the project makes it possible to share reviews by librarians: through Öppna bibliotek, librarian reviews from another collaborating project, Boktips.net (http://www.boktips.net/), can be used at your own website. The next part of the project, not yet released, is to make it possible to share user-supplied content, including tags. Since very few users set tags, this is something that most libraries want to share. There is a common belief that more tags will generate more interest in tagging.

The project Öppna bibliotek is also interesting because it involves cooperation among public libraries, university libraries and Libris, the Swedish union catalogue for university and scientific libraries. Today all types of libraries are interested in adopting the library 2.0 concept. Jönköping university library has implemented the APIs from Öppna bibliotek (http://julia.hj.se).

8 They were expected to set 10-20 tags and to talk to me afterwards about tagging. I did not want to tell them how to tag, but since they did not really know what to do I told them that tags could reflect the subject, give an opinion or be of private use.


9. Future Improvements

The most important improvement needed is a way to separate tags with something other than a space in order to form multi-word and phrase tags. Many users try to form phrases; not all concepts can be expressed by a single word.

Help when forming tags would also be good for consistency. If I started typing “andra värl” it would be helpful to get the pop-up suggestion “Andra världskriget” (The Second World War). This would help to avoid misspellings and other variations.

The tags and the opportunity to tag should be made more visible and more enticing at Biblioteket.se. A tag cloud looks nice, but because the majority of tags are used only once, it must be combined with other ways to browse the tags. From “My loans” at “My page” at Biblioteket.se you are encouraged to create reviews and give ratings but not to set tags, while LibraryThing equally emphasizes tags, ratings and comments.9

Making it easier for users to reuse the bibliographic records and other information for other purposes, and to share them at well-known social communities or social bookmarking sites, may also be important for using the catalogue in new ways. This may encourage tagging.10

Sharing tags with other libraries will be a good thing. Together with the Umeå district library we will have 6400 tags, and as more libraries join Öppna bibliotek there will be even more. Many of the tags will be set for the same titles, but there is also hope of getting tags for less popular books. A good example of this is the fact that “Öknar” (Deserts) is the most used tag at Biblioteket.se and it is set by a single user who takes a great interest in the subject.

A better administrative tool for studying tags is also important.
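The pop-up suggestion described in this section could work roughly as in the following sketch. It assumes simple case-insensitive prefix matching against the tags already in use; the function name suggest and the sample tag list are hypothetical, and nothing here describes an existing feature of Biblioteket.se.

```python
def suggest(prefix, known_tags, limit=5):
    """Return up to `limit` known tags that start with the typed prefix.

    Matching ignores case (assumption for illustration); results are
    returned in alphabetical order.
    """
    typed = prefix.casefold()
    matches = [tag for tag in known_tags if tag.casefold().startswith(typed)]
    return sorted(matches, key=str.casefold)[:limit]


# Hypothetical list of tags already set by other users.
known = ["Andra världskriget", "Andalusien", "Origami"]
print(suggest("andra värl", known))   # ['Andra världskriget']
```

Choosing a suggestion instead of typing the whole tag would steer users towards a spelling that already exists, which is the consistency gain the section asks for.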

9 Sinclair and Cardew-Hall (2008) have done an interesting comparison of the use of a tag cloud and a search box as a search interface.
10 This is pointed out by Gazan (2008).


10. Conclusion

This report only provides some indication of the use of tagging in our catalogue. I would like to find out more about why users tag and to do a more thorough study of the tags to learn more about the way they tag.

More than 60% of the unique tags have been set by librarians. It has proved difficult to find non-librarian taggers and so I have not been able to determine whether non-librarians would choose different words than our subject headings. I have found that 51.8% of the 56 tags among the sixty-six most popular tags that are neither private nor opinions are identical to a subject heading, and that only 17.3% of all subject tags have a grammatical form that differs from the rules for subject headings, while 25.4% of the subject tags set by non-librarian users have a different grammatical form.

Very few users set tags. They are more likely to review and rate books, etc. Discussions with users suggest that this may be because tagging is found to be more difficult. Since I have only interviewed a few users, I do not know if this is generally valid for all users; more thorough research with more users is needed. That they do not tag may also be because many users do not search for fiction by subject. They find fiction in other ways than searching a library catalogue. The teenagers in the focus groups preferred the professional subject headings to user-set tags.

I still think that the opportunity to tag is a valuable feature in library catalogues. People are different and the library should offer different ways for them to participate at the website. Even if less than 0.5% of our users tag and a few more users find other users’ tags interesting, that is enough. I did not find a problem with tags being private and not reflecting the subject of the book. More tags could be private and of use only to the person setting the tag, as long as the tags are not mixed with the subject headings. At Biblioteket.se only 10.3% of the tags are private or reflect personal opinions.
Subject headings are still important for searching by subject. Tags are not consistent enough to provide a good basis for further searches by subject, but we need to use the subject headings in better ways to help users find media by subject. Searching tags can be helpful, e.g., when searches by subject do not give results and browsing tags can help find interesting books, but most of all, tags are useful for the person setting them.

References

Ding, Ying, James Caverlee, Michael Fried, and Zhixiong Zhang. 2009. “Profiling social networks: A social tagging perspective.” D-Lib Magazine 15, no. 3/4, available at http://www.dlib.org/dlib/march09/ding/03ding.html (accessed 2009-06-26).
Gazan, Rich. 2008. “Social annotations in digital library collections.” D-Lib Magazine 14, no. 11/12, available at http://www.dlib.org/dlib/november08/gazan/11gazan.html (accessed 2009-06-26).
Mendes, Luiz H., Jennie Quiñonez-Skinner, and Danielle Skaggs. 2009. “Subjecting the catalog to tagging.” Library Hi Tech 27, no. 1.
Sinclair, James and Michael Cardew-Hall. 2008. “The folksonomy cloud: When is it useful?” Journal of Information Science 34, no. 1: 15-29.
Spiteri, Louise F. 2007. “The structure and form of folksonomy tags: The road to the public library catalog.” Information Technology and Libraries 26, no. 3: 13-25.

The Nuovo Soggettario Thesaurus: Structural Features and Web Application Projects*

Luciana Franci, Anna Lucarelli, Marta Motta, Massimo Rolle

Abstract

The main component of the new Italian subject indexing tool (Nuovo soggettario), edited by the National Central Library of Florence (Italy), is a general Thesaurus available on the web since 2007. The Thesaurus, developed in compliance with the standards and regularly updated, now comprises approximately 37,000 terms. It can be used with both pre- and post-coordinated indexing. The coherent development of hierarchies is based on the ‘facet analysis’ method. The terms are linked with those included in old indexing tools and no longer accepted. This Thesaurus supports the new subject indexing practices and manages terminology deriving from collaboration between the BNCF and other libraries. It is evolving in many directions and supporting interoperability. Its structure will be expanded to facilitate the creation of semantic metadata and social tagging.

1. Introduction

This paper presents the Nuovo soggettario Thesaurus, a controlled vocabulary for the new Italian subject indexing language, explains the structural and functional characteristics of this tool, and discusses aspects dealt with at Session A of this Satellite meeting, dedicated to “Systems, tools and standards in subject indexing”.

The Nuovo soggettario project was much anticipated by the Italian library community. It is a “system” consisting of codified rules (concerning terminology and syntax) and a general Thesaurus; it is edited by the National Central Library of Florence (BNCF).1 The BNCF has always had a role in the elaboration and updating of indexing tools and is an important node in the National Library Service (SBN), the Italian library network, in which 4,000 government, university, public and private libraries are currently involved.2

The project evolved with the aim of renewing the preceding subject indexing tool, Soggettario per i cataloghi delle biblioteche italiane, which was published in 1956.3 It was necessary to adapt it to international standards and principles, as well as to the recent developments in the field of subject indexing and to the ever growing needs posed by the new generation of users “confronted with an ever growing amount and choice of information and scholarly resources, both in print and electronic formats”.4

The Nuovo soggettario benefited from the essential participation of the Italian National Bibliography (BNI), based at the BNCF, as well as from the development work and input of Italian and foreign experts. It also has the scientific support of GRIS (the Italian Library Association’s Research Group on Subject Indexing). In 2007, a volume was published containing the rules of subject indexing and a prototype of the online Thesaurus was made accessible on the web.5 The Thesaurus, which is continuously evolving, is now much more than a prototype, and while it was originally available on the web by annual subscription, since July 2010 it has been completely free.6 The Thesaurus is steadily growing thanks to the work done by the BNCF and, since 2009, also to external collaboration with several institutions.7 The BNI began to use the Nuovo soggettario in 2007 and now other libraries are gradually moving to the new system, creating a great change in Italian conventional subject cataloguing.

We therefore would like to take this opportunity to explain the features, functionality and structural components of the Thesaurus. More general aspects of the project and subject indexing in Italy will be presented at the Classification and Indexing Open Session in Milan on August 27, 2009.8

* The paper was updated in October 2010 according to the development of the project.
1 Nuovo soggettario System .
2 The libraries participating in SBN are organized in nodes distributed throughout the national territory, connected to a central system (the Index), which then sets up the general catalogue of the libraries in the network, .
3 Soggettario per i cataloghi delle biblioteche italiane, a cura della Biblioteca nazionale centrale di Firenze. Firenze: Stamperia Il cenacolo, 1956.

2. Fundamental Aspects of Nuovo Soggettario 2.1 The System The Nuovo soggettario system: a. has been conceived as a system to be applied in both pre-coordinated and post-coordinated indexing environments. For example the editorial office of “Liber”, a private company with whom the BNCF has created a partnership

4 5 6 7

8

P. Landry, From traditional subject indexing to web indexing: the challenges of libraries and standards, in Il mondo in biblioteca, la biblioteca nel mondo: verso una dimensione internazionale del servizio e della professione. Milano: Editrice Bibliografica, 2010, p. 182-190. Biblioteca Nazionale Centrale di Firenze, Nuovo soggettario: guida al sistema italiano di indicizzazione per soggetto: prototipo del Thesaurus. Milano: Editrice bibliografica, 2006 © (stampa 2007). The Nuovo soggettario Thesaurus, . The institutions that collaborate in the project are: ICCU (Istituto centrale per il catalogo unico delle biblioteche italiane e per le informazioni bibliografiche); Biblioteca nazionale centrale di Roma; Università di Pisa; Biblioteca Mario Rostoni dell’Università Carlo Cattaneo; Università Bocconi; Università degli studi di Milano; Idest s.r.l.; CoBis (Coordinamento biblioteche speciali e specialistiche di Torino); Biblioteche della Conferenza episcopale italiana; Consiglio Nazionale delle Ricerche: ITTIG; and Istituto della Enciclopedia Italiana. A. Cheti, A. Lucarelli, F. Paradisi, Subject indexing in Italy: recent advances and future perspectives, “International Cataloguing and Bibliographic Control”, vol. 39, n. 3 (July-September 2010), p. 47-52.

The Nuovo Soggettario Thesaurus 157

in order to increase the coverage of the BNI Children’s book series,9 has adopted the terms from the Nuovo soggettario Thesaurus for indexing by keywords. This feature is an important advantage for libraries, especially for those that lack the resources necessary to sustain a pre-coordinated indexing system;

b. is based on the analytico-synthetic model, illustrated by Pino Buizza in his paper, Subject analysis and indexing: an "Italian version" of the Analytico-synthetic model. This model allows subject strings to be created using controlled vocabulary terms which are interconnected according to the rules of a conventional syntax;

c. is a flexible and modular system for application in both general and specialized information environments, characterized by different types of resources (graphic material, audiovisual, etc.) or by different domains;

d. is made up of four interactive components: the Thesaurus, the volume with the rules for vocabulary control and the construction of subject strings, the application apparatus (Manual for users and syntactic notes), and the OPAC subject headings or subject strings.

2.2 The Thesaurus

The Thesaurus is the core of the whole system. The search interface (Fig. 1) allows consultation of the digital version of the old indexing tool (Soggettario and its updates), thus providing complete access to old and new terms.

Figure 1.

9 La bibliografia nazionale dei libri per ragazzi. Campi Bisenzio: Idest, [2006]- . Published every three months as supplement of LiBeR. Edited by the National Central Library of Florence.

The Thesaurus is a general vocabulary which is continuously growing in many directions. It is constantly enriched by terms which:

1. are derived from the Soggettario (1956) and its updates, provided that they are still useful and supported by literary warrant, or are introduced to complete semantic relationships, in which case they are derived from other references;
2. are proposed during document indexing by the users of the Nuovo soggettario in the BNI and in the other libraries that have begun to use it since 2009, where the proposed terminology, both general and specialized, is not yet present in the Thesaurus;
3. are new, both general and specialized, and are proposed by the BNI cataloguers during the daily indexing procedure and by other libraries that since 2009 have begun to use the Nuovo soggettario.

2.3 General Features

It is a universal Thesaurus, both because it covers different domains and because it is applicable to the indexing of different kinds of information resources. For example, it has been used for indexing the photographs of a rare twentieth century collection (Fondo Pannunzio) in the National Central Library of Florence. The importance of the Nuovo soggettario Thesaurus lies in the fact that it is the first Italian general thesaurus. It is true that there are other multi-disciplinary thesauri in Italy, but they mostly cover social sciences, law and economics, and none of them is truly general. Since publication of the new ISO 25964 (see note 10) is still pending, the Thesaurus complies with ISO 2788-1986, but has also taken into account the most recent standards such as ANSI/NISO Z39.19-2005 and the British standard BS 8723:2005-2008.11 The user interface has been available online since January 2007 and is periodically updated by migrating records from the management database. The Thesaurus, when first published, contained 13,000 terms. It currently consists of about 37,000 terms, and is expected to reach at least 60,000 terms by 2012, thus showing a good level of development.

2.4 Structural Features

Analyzing the structural features of the vocabulary, we can say that:

• it is predominantly a monohierarchical Thesaurus, although in some cases it provides two or more broader terms (BT) for the same term. The monohierarchy is constructed by applying facet analysis and previously specified criteria for the order of preference. Polyhierarchy is rarely admitted (according to clearly specified criteria) and always in compliance with the principles governing the hierarchical relationships. For example, it is applied to terms that indicate "literary genres" (e.g., “Italian lyric poetry” has two BTs: Lyric poetry and Italian poetry), or it can be applied to certain types of compound terms (e.g., “Drug addicted workers”, whose BTs are Workers and Drug addicts);

• facet analysis has a fundamental role in organizing the terms inside a classificatory structure based on four main categories and on additional characteristics of division. In this way, the development of hierarchies appears orderly and consistent. Facet analysis gives organic unity to the construction of the vocabulary and, as mentioned above, establishes orders of precedence, thus preventing the proliferation of polyhierarchies;12

• this rich Thesaurus, as we shall see, is able to easily manage the transition between the old and new forms of terms. It has the characteristic of being "transparent", meaning that it presents data that clarify for the user the complex task of semantic analysis performed by those who compiled it. One of its strengths is undoubtedly the rich information apparatus associated with each term, as well as the links to other electronic resources.

10 ISO 25964 Official Website .
11 ANSI/NISO Z39.19-2005: Guidelines for the Construction, Format and Management of Monolingual Controlled Vocabularies. Bethesda (USA): NISO Press, ; BS 8723: 2005-2008 - British Standards Institution, Structured vocabularies for information retrieval. London: British Standards Institution.

The semantic relationships are those determined by the standards; in addition to the typical relationships, however, other relationships are also provided, namely those linking formerly used terms to the new ones:

a. the relationship of historical variant links the preferred terms in the Nuovo soggettario to the formerly used terms which are no longer accepted (Fig. 2). This relationship recalls the indication "Older subject headings" used by the Library of Congress. Expressed by the acronym HSF (Historical see for), it will allow automatic loading of links between old and new subject headings in our OPACs, without having to correct each one (for example: Chords );

12 A. Cheti and F. Paradisi, “Facet analysis in the development of a general controlled vocabulary,” Axiomathes (2008), vol. 18, n. 2, p. 223-241.


Figure 2.

b. the splitting relationship is expressed by the “UF+” acronym and its reciprocal “USE+”. This relationship links complex concepts to preferred terms derived from the splitting and vice versa (for example, Parasites, fig. 3).

Figure 3.
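The automatic loading of links that the HSF relationship is meant to enable can be pictured with a minimal sketch. It assumes that HSF data can be exported as a mapping from each preferred term to its formerly used forms; the labels below are placeholders rather than real Thesaurus records, and the function names `build_redirects` and `resolve` are hypothetical.

```python
# Placeholder data: these labels are NOT real Thesaurus records.
# Each preferred term lists the formerly used headings linked to it by HSF.
HSF = {
    "Preferred term A": ["Former heading A1", "Former heading A2"],
    "Preferred term B": ["Former heading B1"],
}


def build_redirects(hsf):
    """Invert 'preferred -> historical variants' into 'historical variant -> preferred'."""
    redirects = {}
    for preferred, variants in hsf.items():
        for old in variants:
            # casefold() makes the lookup insensitive to letter case
            redirects[old.casefold()] = preferred
    return redirects


def resolve(heading, redirects):
    """Return the preferred term for a heading, or the heading itself if no HSF link exists."""
    return redirects.get(heading.casefold(), heading)


REDIRECTS = build_redirects(HSF)
```

With such a table, an OPAC could point records carrying an old heading to the new preferred term at search time, without editing each bibliographic record.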

The records of each term provide several notes, such as: definition notes; scope notes (the latter being an alternative to the former ones); orientation notes, which are within the scope notes; historical notes, which give information on formerly used forms, meanings, and use of terms in Soggettario (1956) and its updates; syntactic


notes, which give instructions to cataloguers on the use and the citation order of terms in the construction of subject strings. The above example of “Parasites”, as well as illustrating the “UF+” relationship, shows that in the search interface users can easily identify, in the Source field of a term record, all the reference resources used to check the meaning, morphology and literary warrant of the term itself. These reference resources are subject heading lists (e.g., LCSH, RAMEAU), general and specialized thesauri (e.g., AAT, EUROVOC, MESH), Italian language dictionaries and encyclopedias (e.g., Treccani online13), and other encyclopedias (e.g., Encyclopaedia Britannica, Enciclopedia italiana Treccani, etc.). The list of references is continuously growing and presently consists of about 400 titles. It is accessible from the Thesaurus Home Page (by clicking on the “Sigle e simboli” button). In addition, online reference resources mentioned in the Source field of the terms are directly linked to the homepage of the corresponding online resource. In this case, a blue pointer precedes the symbol of the bibliographic resource (for example, Amphitheaters, in fig. 4).

Figure 4.

Moreover, the Thesaurus shows - experimentally - other language equivalents in the records of terms. At the moment there are some English equivalents that correspond to terms in the Library of Congress Subject Headings (LCSH).14 It is also important to specify that language equivalent relationships have been established between Italian and English terms, but links between Italian terms and English subject strings are not provided. The language equivalent relationships have been recorded in a specific field that can be seen in the user interface (for example, Tree of life, fig. 5).

13 Treccani.it .
14 Library of Congress Authorities. [Subject authority headings] .


Figure 5.

We are only at the beginning of developing the multilingual structure of Nuovo soggettario. The Nuovo soggettario Thesaurus constitutes a “separate” and, in some ways, “autonomous” component of the whole system. Nevertheless it is closely connected to the other side of the language: the syntactical side of the system (explained in the 4th section of Nuovo soggettario. Guida al sistema italiano di indicizzazione per soggetto, see note 5). In fact, many Thesaurus terms present a syntactic note giving instructions on how to use them to build subject strings, according to the syntactic rules provided by the indexing language (a pre-coordinate language). The purpose of the syntactic notes is both to facilitate, when necessary, the application of the general syntax rules, and to assist cataloguers in the subject string procedure. They consist of a body of instructions, drawn up in compliance with the “role analysis” method, including all the necessary indications for a given term, such as its syntactical value, logical function, syntactical role and the proper citation order. These provisions build a sort of "bridge" between the semantic and syntactical sphere, offering guidance to indexers for the citation order (for example, Conflict of interests, fig. 6).


Figure 6.

A User Manual for cataloguers - a set of practical instructions to help in the application of the syntactical rules governing full categories of terms and/or specific bibliographic cases - is available online through the Thesaurus home page.15 Some terms are linked within the User Manual when they are mentioned as examples. As the figures show, the button “Notizie bibliografiche” links the terms to bibliographic records in the BNCF OPAC where those terms are used as subjects.

2.5 Semantic Interoperability

The Nuovo soggettario Thesaurus is based on an "open" structure, in that it grows regularly in size; is able to integrate both specialized and sectorial terminology, although it works within a general context; and is able to communicate and interact with other specialized thesauri and subject indexing tools, through structural models of various kinds, as recent standards outline. We believe that two features are particularly important: semantic interoperability and technical interoperability. Regarding the former, we are currently working on several fronts.

One important way in which we would like to develop semantic interoperability is through the mapping between the Thesaurus terms and the corresponding numbers of the Dewey Decimal Classification (DDC). This work, which has been presented at other conferences and workshops, is based on criteria and procedures whose testing is now underway. At the moment, 7,300 terms in the Thesaurus are associated with DDC numbers, assigned through 2008 according to the 21st Italian edition, and from 2009 according to the 22nd Italian edition.16 The mapping refers only to Thesaurus terms, unlike what happens in "Web Dewey" between Dewey numbers, Library of Congress Subject Headings (LCSH) and LCSH authority records.17

It was not easy to define the criteria for the mapping process, as the point of view of DDC (based on domains) does not coincide with that of a thesaurus. In most cases, the DDC number assigned to each Thesaurus term is the interdisciplinary one, independent of the semantic structure to which the term belongs. In the absence of an interdisciplinary number, the reference is to the specific semantic context of the term in the Thesaurus. See Table 1 for examples.

Table 1.

Italian subject heading    LCSH                   DDC 22nd ed.
Viaggi                     Voyages and travels    910 (Interdisciplinary number)
Donne intellettuali        Women intellectuals    305.489631 (Built number)

15 Nuovo Soggettario User Manual .
16 M. Dewey, Classificazione decimale Dewey e Indice relativo, 22nd Italian edition edited by Biblioteca nazionale centrale di Firenze. Roma: Associazione Italiana Biblioteche, 2009.

The Italian version of DDC 22 is not yet available online and therefore the technical interoperability between the two tools is still virtual. The assignment of DDC numbers to our terms has however greatly contributed to better management and control of the Thesaurus, for example in extracting terms belonging to the same discipline, in drawing up statistics, etc.
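As an illustration of how DDC numbers can support the extraction of terms by discipline and the drawing up of statistics, the following is a minimal sketch. It assumes the mapping is available as simple term/number pairs; the two pairs are those of Table 1, while the function names are hypothetical and do not describe the BNCF management database.

```python
from collections import defaultdict

# Term -> DDC number pairs taken from Table 1; a real export would hold thousands.
TERM_TO_DDC = {
    "Viaggi": "910",
    "Donne intellettuali": "305.489631",
}


def ddc_main_class(number):
    """Return the DDC main class (e.g. '900') for a DDC number given as a string."""
    return number[0] + "00"


def group_by_main_class(term_to_ddc):
    """Group Thesaurus terms under the DDC main class of their assigned number."""
    groups = defaultdict(list)
    for term, number in term_to_ddc.items():
        groups[ddc_main_class(number)].append(term)
    return dict(groups)


def counts_by_main_class(term_to_ddc):
    """Count how many terms fall under each DDC main class."""
    return {cls: len(terms) for cls, terms in group_by_main_class(term_to_ddc).items()}
```

Grouping by the first digit is only the coarsest possible cut; the same approach works with longer prefixes when a finer disciplinary breakdown is needed.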

3. Prospects for Web Indexing and Social Indexing for the Nuovo Soggettario Thesaurus

3.1 Software, Exchange Formats and Protocols

The Nuovo soggettario Thesaurus adopted the open source software produced for Agrovoc, the multilingual agricultural vocabulary of the Food and Agriculture Organization of the United Nations. Naturally, arrangements were made to optimize its functions according to our specific needs and requirements. The software is based on the Zthes protocol, which offers a model for the implementation of thesauri accessible via SRW (Search/Retrieve Web Service), an evolution of the Z39.50 protocol. The model is quite general, allowing implementations with other protocols and formats. The Nuovo soggettario Thesaurus is not only available through a web browser but, through the Zthes protocol, can also interoperate with other applications. The syntax used is SRU (Search/Retrieve via URL), a syntax oriented to the world of Web applications. With this syntax, an application can encode its request as a URL, for example, and the reply will be returned in XML format.18 The types of semantic relationships established are essentially the same as those provided by the standards.

17 Web Dewey
18 G. Bergamin, Speciale Nuovo soggettario: applicazioni web, “Biblioteche oggi”, 25(2007), 6, p. 94-95; available on-line: .
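To show what encoding a request as a URL looks like in SRU, here is a minimal sketch. The parameter names (operation, version, query, maximumRecords) are standard SRU request parameters; the endpoint address, the SRU version and the function name are assumptions made for illustration, since the text does not give the service details. The sketch only builds the URL and does not contact any server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: the real address of the Thesaurus service is not given in the text.
BASE_URL = "https://thesaurus.example.org/sru"


def build_sru_url(term, max_records=10):
    """Encode an SRU searchRetrieve request for `term` as a URL."""
    params = {
        "operation": "searchRetrieve",       # standard SRU operation name
        "version": "1.1",                    # SRU version; an assumption for this sketch
        "query": f'"{term}"',                # CQL query: the search term in double quotes
        "maximumRecords": str(max_records),  # upper limit on records in the XML reply
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_sru_url("Tree of life")
```

A client would then fetch this URL and parse the XML reply, for example with the standard library module xml.etree.ElementTree.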


At this time Zthes is a suitable model for our needs; however, we have tested a conversion into SKOS (Simple Knowledge Organization System), as has already been done by the Library of Congress (for LCSH) and by the National Library of France (for RAMEAU), in the framework of the TELplus European project.19 In the SKOS conversion of the Nuovo soggettario Thesaurus, each concept is expressed by the corresponding term. Each term is identified by an identification number, which serves as the basis of its Uniform Resource Identifier (URI) and thus of its identifier in RDF. The results of this conversion into SKOS have been presented at the 4th Italian Summit of Information Architecture.20
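A rough idea of what such a conversion produces can be given with a minimal sketch that writes SKOS in Turtle syntax using plain string formatting. The base URI and the identification numbers are invented for illustration, and the example terms are the polyhierarchy case from section 2.4; skos:Concept, skos:prefLabel and skos:broader are standard SKOS vocabulary. This is not the conversion procedure actually tested at the BNCF.

```python
# Hypothetical base URI and identification numbers, chosen only for illustration.
BASE_URI = "http://thesaurus.example.org/term/"

TERMS = {
    101: {"label": "Lyric poetry", "broader": []},
    102: {"label": "Italian poetry", "broader": []},
    103: {"label": "Italian lyric poetry", "broader": [101, 102]},
}


def to_turtle(terms, lang="en"):
    """Serialize terms as SKOS concepts in Turtle, one block per term."""
    lines = ["@prefix skos: <http://www.w3.org/2004/02/skos/core#> .", ""]
    for term_id, data in sorted(terms.items()):
        # The identification number is appended to the base URI to form the concept URI.
        lines.append(f"<{BASE_URI}{term_id}> a skos:Concept ;")
        props = [f'    skos:prefLabel "{data["label"]}"@{lang}']
        # Each BT relationship becomes one skos:broader statement.
        props += [f"    skos:broader <{BASE_URI}{bt}>" for bt in data["broader"]]
        lines.append(" ;\n".join(props) + " .")
        lines.append("")
    return "\n".join(lines)


turtle = to_turtle(TERMS)
```

Because a concept may carry several skos:broader statements, the rare polyhierarchical terms of the Thesaurus need no special treatment in this representation.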

3.2 Management of Software Library System of SBN Nodes and SBN Index (Union Catalog)

Not all Italian libraries that send bibliographical records to the Union Catalog of the National Library Service21 fully cooperate in subject indexing. In fact, the subject headings have been completely shared only in the databases of specific nodes. The descriptors used in indexing do not yet communicate directly with the Thesaurus, but work is underway to make this possible. Shortly, some semantic relationships of the Thesaurus (UF, HSF, UF+) will be visible in the SBN Union catalog. This will be useful for the users as well as for the libraries that are now gradually adopting the Nuovo soggettario. When the Thesaurus is integrated with the SBN OPACs, users will navigate from the controlled vocabulary to the bibliographic records, as well as from the search term to the subject strings containing it (two-step search). As for the BNCF OPAC, we are studying other ways of strengthening and enhancing the subject search; for example, through categorizations of the results, such as faceted browsing.

3.3 Interoperability Experience of the Nuovo soggettario

Many other thesauri and encyclopedias have been used during the semantic analysis of terms, as a prerequisite to activating links between the Thesaurus and those reference resources, when available online. This is naturally a “basic” interoperability, but it is advanced and efficient enough to create deep links, or rather equivalent links, between the terms of the Thesaurus and the same terms in other databases, through their citation in the Source field of our terms. There are already equivalent links between Nuovo soggettario terms and AGROVOC terms, LIUC (Thesaurus of the Rostoni Library at the University of Castellanza) terms, the DoGi Database classification system (Legal Literature Abstract of articles published in Italian law journals, edited by CNR), and the database of the most prestigious Italian encyclopedia: Treccani.it. In the future we would like to create links with the Italian version of MESH edited by Istituto Superiore di Sanità. Regarding the other references used as sources for the semantic control of terms, surface links to the homepage of each source have been guaranteed.

19 RAMEAU .
20 M. Motta, D. Rodighiero, Il Thesaurus del Nuovo soggettario interpreta SKOS, presented at IV Summit italiano di Architettura dell’informazione Pisa, 7-8 Maggio 2010, .
21 SBN Index

3.4 Web Indexing

We are currently evaluating the possibility of using the Nuovo soggettario Thesaurus as a reference tool for the creation of semantic metadata in web indexing. Authoritative scholars affirm that human-created controlled vocabularies have at their base a structure to which metadata can be easily associated.22 As we know, metadata standards (e.g., the Dublin Core Metadata Element Set)23 have no precise rules regarding the “subject” element. The “metatagger” can enter whatever they consider appropriate without necessarily using standards, classification schemes or subject heading lists. Advocates of this practice believe that it is possible to add semantic metadata in a simple and sufficiently standardized way to the greater part of primary online information. The metadata can also be managed and aggregated in a useful and functional way through the “intelligent” use of appropriate software. But in reality, it is also widely known that not all web users are able to provide and use metadata, let alone semantically describe documents they have produced, catalogued and consulted. Therefore, there is no doubt that traditional library tools, such as regularly maintained and updated general thesauri, continue to play an important role because they are able to guarantee a high level of uniformity in information treatment and retrieval.

3.5 Testing at the BNCF

A project is underway at the BNCF to carry out automated indexing of doctoral theses presented at Italian universities, using the Nuovo soggettario Thesaurus. Theses in digital format are deposited in the BNCF institutional repository, according to recent legal deposit laws.24 Matching will be done between the text of the abstract, the keywords chosen by the authors and the Nuovo soggettario Thesaurus terms, producing, in the case of exact matching, automated and post-coordinated indexing. We will take into account analogous consolidated international experiences in order to evaluate parameters such as the cost-effectiveness of this kind of indexing.25

22 V. Broughton, Essential thesaurus construction. London: Facet, [2006], p. 30-31.
23 Dublin Core Metadata Element Set, Version 1.1, .
24 This test refers to the Magazzini digitali project: G. Bergamin, M. Messina, Magazzini digitali: dal prototipo al servizio, “Digitalia”, 5 (2010), 1, p. 115-132; available online: .
25 We refer to one of the most important experiences: Networked Digital Library of Theses and Dissertations (NDLTD), an international organization dedicated to promoting the adoption, creation, use, dissemination and preservation of electronic analogues to the traditional paper-based theses and dissertations, .
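The matching step described above can be sketched as follows. This is only one possible reading of "exact matching", assumed here to mean a case-insensitive whole-word match of a Thesaurus term against the author keywords or the abstract; the sample vocabulary reuses terms cited in this paper, and the function names are hypothetical.

```python
import re

# A tiny sample of Thesaurus terms mentioned in this paper; the real vocabulary is far larger.
THESAURUS = {"Amphitheaters", "Parasites", "Tree of life", "Conflict of interests"}


def normalize(text):
    """Collapse whitespace and casefold so that matching ignores case and spacing."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def index_thesis(abstract, author_keywords, thesaurus=THESAURUS):
    """Return the terms that equal an author keyword or occur as whole words in the abstract."""
    norm_abstract = normalize(abstract)
    norm_keywords = {normalize(k) for k in author_keywords}
    matched = []
    for term in sorted(thesaurus):
        norm_term = normalize(term)
        in_keywords = norm_term in norm_keywords
        # \b anchors prevent matches inside longer words
        in_abstract = re.search(rf"\b{re.escape(norm_term)}\b", norm_abstract) is not None
        if in_keywords or in_abstract:
            matched.append(term)
    return matched


descriptors = index_thesis(
    "A study of parasites in Roman amphitheaters.",
    ["tree of life", "ecology"],
)
```

The resulting descriptors are independent keywords rather than subject strings, which is why the indexing obtained in this way is post-coordinated.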


3.6 Social Tagging

Web 2.0 is designed to allow users to add value to online services themselves. This process is called social tagging (ethnoclassification26, collaborative tagging) and is a system of content classification (content categorization) generated by users through a bottom-up consensus. What is important to establish is the value of the vocabulary used in the tagging, given that the linguistic and semantic characteristics related to the various disciplinary or cultural contexts are not indicated. Moreover, tag vocabulary does not have a basic structure (as do thesauri) that allows it to be updated and applied with regard to interoperability and the organization of federated search.

The strength of traditional subject indexing tools lies in their hierarchical structure and in their organization of semantic relationships between terms, particularly regarding the control of polysemous terms, synonyms, and graphic variants. Searching by means of folksonomies is less effective in terms of the precision and recall of relevant documents. It neither recognizes the relationships between terms (given that it lacks semantic relationships), nor disambiguates concepts according to context. Nor does it eliminate the synonyms, homophones, homographs and homonyms which are very common in natural language. The search results for the user are often not very precise or pertinent, even when the search is carried out in an open and inclusive context.

Social tagging allows for the expression of personal assessments of a particular document. This is completely different from traditional subject indexing, where the librarian’s mediation is carried out without subjective evaluations and judgments, making it more likely that the information content intended by the author is indexed. An unquestionable advantage of social tagging is that it provides new terms expressing new concepts and meanings to those who are involved in the updating of controlled vocabularies.

It will be necessary to work hard to make these two spheres work together rather than separately, while ensuring that the adoption of social tagging and folksonomies maintains the benefits offered by the use of controlled vocabularies. Software already exists (e.g., Aquabrowser Mydiscoveries) which gathers terms proposed by users; these are then evaluated by librarians and made available to the public.

4. Conclusion

Our intention is that the Nuovo soggettario Thesaurus will create a kind of “community” where the users (indexers and researchers who access the Thesaurus) can write general or specific comments, propose new terms and suggest changes to those already present. By following this approach, the Nuovo soggettario Thesaurus will become a kind of "social thesaurus", but with essential characteristics that cannot be renounced, rather than a vocabulary that self-feeds without control. We will have a tool with a strong semantic structure, in which the socialization and the sharing of content become an important factor in the process of enrichment.

26 P. Merholz, Metadata for the masses, “Adaptive path”, October 19, 2004, .


References

ANSI/NISO Z39.19-2005: Guidelines for the Construction, Format and Management of Monolingual Controlled Vocabularies. Bethesda (USA): NISO Press .
Bergamin, Giovanni, Maurizio Messina. 2010. “Magazzini digitali: dal prototipo al servizio.” Digitalia 5, no. 1:115-132; available online: .
Bergamin, Giovanni. 2007. “Speciale Nuovo soggettario: applicazioni web.” Biblioteche oggi 25, no. 6:94-95; available online: .
La bibliografia nazionale dei libri per ragazzi. Campi Bisenzio: Idest, [2006]- . Published every three months as supplement of LiBeR. Edited by the National Central Library of Florence.
Biblioteca Nazionale Centrale di Firenze, Nuovo soggettario: guida al sistema italiano di indicizzazione per soggetto: prototipo del Thesaurus. Milano: Editrice Bibliografica, 2006 © (stampa 2007).
Broughton, Vanda. 2006. Essential Thesaurus Construction. London: Facet, 30-31.
BS 8723:2005-2008 - British Standards Institution, Structured vocabularies for information retrieval. London: British Standards Institution.
Cheti, Alberto, Federica Paradisi. 2008. “Facet analysis in the development of a general controlled vocabulary.” Axiomathes 18, no. 2:223–241.
Cheti, Alberto, Anna Lucarelli, and Federica Paradisi. 2010. “Subject indexing in Italy: recent advances and future perspectives.” International Cataloguing and Bibliographic Control 39, no. 3:47-52.
Dewey, Melville. 2009. Classificazione decimale Dewey e Indice relativo, 22nd Italian edition edited by Biblioteca nazionale centrale di Firenze. Roma: Associazione Italiana Biblioteche.
Dublin Core Metadata Element Set, Version 1.1, .
ISO 25964 Official Website .
Landry, Patrice. 2010. “From traditional subject indexing to web indexing: the challenges of libraries and standards” in Il mondo in biblioteca, la biblioteca nel mondo: verso una dimensione internazionale del servizio e della professione. Milano: Editrice Bibliografica, 182-190.
Library of Congress Authorities. [Subject authority headings] .
MACS
Merholz, Peter. 2004. “Metadata for the masses.” Adaptive path, October 19, .
Motta, Marta and Dario Rodighiero. Il Thesaurus del Nuovo soggettario interpreta SKOS, presented at IV Summit italiano di Architettura dell’informazione Pisa, 7-8 Maggio 2010, .
Nuovo soggettario System .
Nuovo soggettario Thesaurus .
Nuovo soggettario User Manual .
RAMEAU .
SBN Index .
Soggettario per i cataloghi delle biblioteche italiane, a cura della Biblioteca nazionale centrale di Firenze. Firenze: Stamperia Il cenacolo, 1956.
Web Dewey .

Język Haseł Przedmiotowych Biblioteki Narodowej (National Library of Poland Subject Headings) - From Card Catalogs to Digital Library: Some Questions About the Future of a Local Subject Headings System in the Changing World of Information Retrieval

Wanda Klenczon

Abstract

JHP BN (National Library of Poland Subject Headings) is a structured controlled vocabulary developed by the National Library of Poland since the 1950s and used in the Polish current bibliography and in the majority of Polish libraries. In order to make our indexing tool easier to use, understand and maintain, subject headings will be applied with a simpler syntax and simpler application rules. One of the goals is to develop a subject headings scheme compatible with the metadata scheme used in the national digital library.

1. Introduction

The National Library of Poland Subject Headings (JHP BN) is an indexing language developed and maintained by the National Library of Poland, used in the Polish current bibliography, in the national library (NL) catalogues, in the OPACs of the majority of Polish public and educational libraries, and also in many other libraries (research, academic, school, etc.) and in other memory institutions. It is also used as an indexing and retrieval tool in numerous bibliographic databases available online, such as regional bibliographies and periodical indexes1. Subject cataloging has a long tradition in Poland. The theoretical background and guidelines for subject cataloguing were established in Poland for the first time in the 1920s, and the first subject catalogue was also created at that time2. These guidelines, consistent with international standards and instructions accepted today, were adopted in the early 1950s when the staff of the National Library of Poland made a decision to build its own subject heading system with the goal of indexing the publications registered in the Polish national bibliography and in NL’s catalogues. The subject catalogue – card-based and later automated – became the National Library’s central descriptive catalog in 1969. Two decades later, in 1989, the National Library published the first edition of Słownik języka haseł przedmiotowych Biblioteki Narodowej [Subject

1 Wanda Klenczon and Anna Stolarczyk. “Subject headings of the Polish National Library (JHP BN)”, Polish Libraries Today vol. 7:2007 p. 60-64.
2 Adam Łysakowski. Katalog przedmiotowy [The Subject catalogue], vol. 1 Teoria, Wilno 1928.

Headings Thesaurus of the National Library – referred to below as Thesaurus], which contained ca. 9,000 headings (topics and subdivisions). JHP BN is a poly-hierarchical system based on main headings and subdivisions organized in a network of semantic relationships among terms (hierarchical and associative references). The vocabulary covers all subject areas, and pre-coordinated indexing enables the expression of complex subjects. The printed Thesaurus was published every four years until 2005, and contained topical headings and subdivisions3. Names of persons, corporate body names, uniform titles and geographic names are maintained and controlled in an authority file in the MARC21 format. This authority file has been increasing in size since the mid-1990s. All types of headings are created in compliance with the relevant international or national standards and instructions4. In 1995 the list of subject headings was converted into machine-readable form and was transferred into the integrated library system (INNOPAC/Millennium). Since then the JHP BN authority file in the MARC21 format has been available online, free of charge and without restriction5. The publicly accessible authority file is updated weekly. Currently it contains over 123,000 terms (including approximately 40,000 topics, 25,000 geographical names and 1,500 topic and form subdivisions). About 14,000 new records are created annually (including approximately 3,500 topics). The authority file of JHP BN strings contains more than 550,000 records and has been growing by about 55,000 annually. The JHP BN subject headings system was primarily designed as a controlled vocabulary for indexing the books from the National Library of Poland collections.
Gradually its usage has become broader, and it is now used for all types of documents: monographs, periodicals and other continuing resources, articles from serial publications (newspapers and scientific magazines), sound and video recordings, motion pictures, printed music, cartographic material, two-dimensional graphic representations and artifacts, electronic documents and ephemera. The National Library of Poland OPAC6 provides access to 1.3 million bibliographic records related to various types of documents (as of June 25, 2009), and this entire collection is cataloged using JHP BN. Last year more than 100,000 documents in our NL were indexed with JHP BN headings (including the systematic retroconversion of card catalogues).

3 Słownik Języka Haseł Przedmiotowych Biblioteki Narodowej, oprac. Joanna Kędzielska, Wanda Klenczon, Anna Stolarczyk, 5th edition, Warszawa, 2005.
4 Polish standards related to bibliographic description consistent with IFLA guidelines, http://www.bn.org.pl/download/document/1277731657.pdf [consulted June 20, 2009].
5 http://mak.bn.org.pl/w5.htm [consulted June 20, 2009]. Since 2005, all JHP BN headings and controlled subject headings strings are also available in the National Union Catalogue NUKAT, http://www.nukat.edu.pl/ [consulted June 20, 2009].
6 http://alpha.bn.org.pl/ [online] [consulted June 20, 2009].


2. What is the Future of Subject Cataloging in the National Library of Poland?

Although JHP BN is well-known and used by Polish librarians, we must be aware of changing user needs, of the impact of “googlization” on their search strategies and of the diminishing role of subject indexes. Subject searching is the most problematic and controversial of all search types in the OPAC: subject headings reflect the individual knowledge, attitudes and preferences of indexers and can therefore be subjective; the rules are complicated and sometimes ambiguous; subject strings are often incomprehensible to end-users; and the complex rules are also troublesome for reference librarians. Good subject analysis is difficult, time-consuming and expensive; it requires highly qualified specialists ready for continuous learning. Many are of the opinion that the cost of their work is unjustified in times of tight library budgets.

Thanks to the popularity of search engines such as Google, keyword searching predominates. The common opinion is that keyword searching is easy, quick and effective. As indexing specialists we know that its effectiveness is sometimes illusory; nevertheless we should try to recognize user needs and establish subject access points that can be reached as simply as possible. If we don’t react, we will lose contact with reality and lose our users.

For indexers who nowadays have no contact with the public, the analysis of transaction logs from a library’s catalogue would be especially useful and instructive. The systematic examination of the use of indexes and search results in our NL bibliographic database (Fig. 1) shows that an estimated 20-25% of library catalog searches are subject searches7.

Fig. 1. National Library of Poland. Indexes used from June 16-25, 2009.

7 The most popular index in the NL OPAC is the title index. The other two, the author and subject indexes, are used with comparable frequency, up to 25-30% depending on the time of year.


Fig. 2. National Library of Poland. Search results in Subject index from June 16-25, 2009.

End-users retrieve zero results for 15-20% of their searches. The high rate of successful searching is probably caused by a long tradition of subject indexing in our NL. Most librarians in Poland know this indexing tool and are aware that the entire NL collection available via OPAC is indexed using JHP BN. Unfortunately, transaction logs cannot answer the questions about OPAC users’ satisfaction or dissatisfaction with subject headings, subject indexes and information retrieval. The JHP BN team is examining transaction logs to identify subject searches that yielded no hits. These are an important source of information about unsatisfied user needs and about missing equivalents of terms. What else can we observe? The users rarely utilize complete and correctly constructed subject terms or strings. They are unaware of available tools such as an alphabetical list of suggested subject terms, references or advanced searching and they usually use separate words or a combination of two words.
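The zero-hit analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log layout (a query string paired with a hit count) and the sample queries are hypothetical, and a real OPAC transaction log will be structured differently.

```python
from collections import Counter

# Hypothetical log rows: (query typed into the subject index, number of hits).
# The layout and the queries are assumptions made for this sketch.
SAMPLE_LOG = [
    ("historia polski", 120),
    ("muzyka", 45),
    ("psychologia dzieci", 0),
    ("muzyka", 45),
    ("Psychologia dzieci ", 0),
    ("ekonomia", 12),
    ("ogrody japonskie", 0),
    ("warszawa", 300),
    ("prawo", 80),
    ("literatura", 210),
]


def zero_hit_report(log):
    """Return (share of searches with no hits, Counter of the failed queries)."""
    # Normalize case and surrounding whitespace so repeated queries are counted together.
    failed = Counter(query.strip().lower() for query, hits in log if hits == 0)
    total = len(log)
    share = sum(failed.values()) / total if total else 0.0
    return share, failed


share, failed = zero_hit_report(SAMPLE_LOG)
print(f"zero-hit share: {share:.0%}")
for query, count in failed.most_common():
    print(count, query)
```

The most frequent failed queries are the natural candidates for new headings or new references from missing equivalent terms.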


Fig. 3. National Library of Poland. Search results in Subject index from June 16-25, 2009 - examples of user queries.

The subject index is being used as a simple keyword search tool. In this context we should discuss the usefulness of pre-coordinated languages. The future of indexing based on pre-coordinated subject heading strings is currently a matter of discussion among LIS specialists in Poland. We followed the discussion about the future of LCSH in the context of the pros and cons of pre- versus post-coordination8. The JHP BN team observed with interest the FAST project, which is based on the existing LCSH vocabulary but adapted to a faceted schema with a simplified syntax. In our opinion it could be an interesting solution and a way to satisfy the information needs of our contemporaries. The subject heading syntax is too complex and not flexible enough, which is one reason that users prefer keyword searching, even though combining many separate concepts could result in incorrect or irrelevant queries.

We are analyzing obstacles and disadvantages of subject searches in our NL OPAC. The set of JHP BN strings is growing at a tremendous rate, and very often a new string is used only once, for indexing a single work, and never again. This is a disadvantage for subject index users. Nevertheless the pre-coordinated string ensures better relevance, tends to result in high precision and can be useful for both browsing and keyword searching. It has to be taken into consideration that millions of library documents in numerous Polish libraries have been indexed by applying the rules of pre-coordinated syntax.

8 Library of Congress Subject Headings. Pre- vs. Post-Coordination and Related Issues, http://www.loc.gov/catdir/cpso/pre_vs_post.pdf [consulted June 20, 2009].

After discussing these pros and cons, the JHP BN Development Consulting Team has decided to continue the current indexing policy and at the same time to make our indexing tool easier to use and maintain.

3. What Do We Intend To Do?

First, we are going to make a general revision of the JHP BN vocabulary, including a systematic review and evaluation of all topics and subdivisions, and make modifications where necessary. The terminology should be more current and consistent and, for inexperienced users, more intuitive. Later we will supplement the alphabetical list of terms with a hierarchical list of preferred terms. This will facilitate the control of subject vocabulary and the management of relationships. So far we have prepared hierarchical lists of terms for religion, law and education.

Next we will simplify the syntax and application rules. A review of the rules for assigning chronological and form subdivisions seems to be a crucial first step toward an easier syntax. The aspect of time is an important element of subject analysis. To express it, JHP BN uses years and periods. Headings and subdivisions consist of either a single date or a starting and ending date; chronological subdivisions for the 20th and 21st centuries mostly reflect political and economic reality (1914-1918, 1918-1939, 1939-1945, 1945-1989).

Examples:
651 |a Polska |y 16-18 w [=16th-18th century]
651 |a Warszawa |y od 1989 r. [=since 1989]
651 |a Polska |x kultura |y 1918-1939 r.
650 |a Bitwa 1561 r. na Równinie Kawanakajima

The introduction of relevant dates seems to be useful for providing greater accuracy and making subject retrieval more consistent. We are discussing the use of free-form chronological terms and subdivisions that may accurately reflect the time period covered by the content of the item. The JHP BN team decided to continue the current practice of assigning chronological subdivisions, but the use of free-form chronological terms as supplementary subject access points in 648 fields (which may consist of a single date or a date range) will be analyzed.

For example, a document about manners and customs in Warsaw in 1993 could be indexed as:
651 |a Warszawa |x życie codzienne |y od 1989 r.
648 |a 1993

A document about manners and customs in Warsaw in the period 1989-2000 could be indexed as:
651 |a Warszawa |x życie codzienne |y od 1989 r.
648 |a 1989-2000
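As a minimal sketch of how the supplementary 648 access point in the two examples above might be generated, the following Python function formats a single year or a year range. The function name and its validation rule are assumptions made for illustration; they are not part of JHP BN or of any library system.

```python
def free_form_648(start_year, end_year=None):
    """Format a free-form chronological term in the notation of the examples.

    A single year gives '648 |a 1993'; a range gives '648 |a 1989-2000'.
    This helper is hypothetical and only mirrors the two examples in the text.
    """
    if end_year is None or end_year == start_year:
        return f"648 |a {start_year}"
    if end_year < start_year:
        raise ValueError("end_year must not precede start_year")
    return f"648 |a {start_year}-{end_year}"


print(free_form_648(1993))
print(free_form_648(1989, 2000))
```

The pre-coordinated 651 string stays unchanged in both cases; only the free-form term narrows the period.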


Another solution being discussed is the complementary use of a standardized code in the 045 field (Time Period of Content), in which a specific time period may be recorded in the pattern year-month-day (month and day are optional), preceded by a code for the era. This solution cannot currently be used because the library system does not yet support the retrieval of information from coded fields. Regarding form subdivisions, we are discussing their usefulness in subject strings. Are they really needed and exploited in information retrieval? These are examples of subject heading strings concerning the history of Polish music:

Is all this information about the specific kind or genre of material important and useful for anyone searching for documents about Polish music in the 19th-20th centuries? In our opinion it distracts the user from the other bibliographic information in the OPAC. We intend to identify the form headings and form subdivisions that currently exist in the JHP BN authority file and move them to form/genre index terms in the 655 field (authority records with MARC21 tag 155). This could simplify the subject heading strings, reduce their number and make the subject index more consistent. All important information about the form, type or genre of a document will remain accessible through keyword searching. These projects will be discussed with librarians representing different types of libraries using JHP BN.
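The planned move of form subdivisions into 655 fields can be illustrated with a short Python sketch. Everything specific here is an assumption: the list of form terms, the sample string and the helper names are hypothetical, and the real list would be drawn from the JHP BN authority file.

```python
# Hypothetical set of form terms (lower-cased for comparison).
FORM_TERMS = {"bibliografia", "słownik", "podręcznik"}


def split_form_subdivisions(subfields):
    """Separate form subdivisions from a subject string.

    `subfields` is a list of (code, value) pairs, e.g. [("a", "Muzyka polska"), ...].
    Returns (remaining subject subfields, list of form/genre terms).
    """
    kept, forms = [], []
    for code, value in subfields:
        # The main heading (|a) is never moved, even if it matches a form term.
        if code != "a" and value.lower() in FORM_TERMS:
            forms.append(value)
        else:
            kept.append((code, value))
    return kept, forms


def render(tag, subfields):
    """Render subfields in the '|a value |x value' notation used in the examples."""
    return tag + " " + " ".join(f"|{code} {value}" for code, value in subfields)


# Hypothetical subject string, not taken from the JHP BN file.
string = [("a", "Muzyka polska"), ("x", "historia"), ("y", "19-20 w."), ("x", "bibliografia")]
kept, forms = split_form_subdivisions(string)
print(render("650", kept))
for form in forms:
    print(render("655", [("a", form)]))
```

Because two strings that differ only in their form subdivision collapse to the same shortened string, a step like this would also reduce the number of distinct strings in the index.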

4. Digital Library – Subject Headings as a Source of Keywords

Syntax simplification is very important in the context of web indexing and the creation of keyword metadata based on the controlled vocabulary. The NL of Poland collects and archives some domestic public online documents in the library's Internet archive, but at the moment our experience is still modest. We are more advanced in indexing publications in a digital library. The National Digital Library (Cyfrowa Biblioteka Narodowa, CBN) Polona9 was created in 2007. Its mission is to provide broad and easy access to the digital collections of the NL of Poland, including the most important editions of literature and scientific materials, historical documents, journals, graphics, photography, scores and maps. All documents presented in CBN Polona are described primarily in the NL catalogue database and are linked (through field 856 in MARC21 bibliographic records) to a catalogue description that includes the complete information about the publication, as well as the full subject access points.

Retrieval of publications in CBN Polona is a simple search based on keywords used as metadata in the Dublin Core Metadata Element Set. These keywords are selected by librarians from the JHP BN subject heading strings used in bibliographic records in the NL OPAC (the most significant and unique words are chosen) and automatically moved to the DC element Subject.

9 http://www.polona.pl/dlibra [consulted June 20, 2009].

Fig. 4. Bibliographic record as source of metadata.


We are going to implement the DCMI Metadata Terms scheme (recently translated into Polish)10 in the digital library. It offers more possibilities for describing digital documents, and also for indexing and creating cross-references. Subject headings and subdivisions will move into the following DCMI elements:
Topic → Subject
Geographic name or subdivision → Spatial Coverage
Date from chronological topic or subdivision → Temporal Coverage
Genre/Form → Type
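The four correspondences above can be sketched as a small mapping routine in Python. The property names `subject`, `spatial`, `temporal` and `type` are the DCMI Metadata Terms names for Subject, Spatial Coverage, Temporal Coverage and Type. The interpretation of subfield codes follows general MARC 21 conventions (|x topical, |z geographic, |y chronological, |v form) and is an assumption here, as is the function itself; the actual conversion used for CBN Polona is not described in the text.

```python
# Heading type -> DCMI Metadata Terms property, per the list above.
DC_PROPERTY = {"topic": "subject", "place": "spatial", "time": "temporal", "form": "type"}

# Assumed meaning of subdivision subfields (MARC 21 conventions).
SUBDIVISION_KIND = {"x": "topic", "z": "place", "y": "time", "v": "form"}

# Assumed meaning of the main heading (|a) by field tag.
MAIN_KIND = {"650": "topic", "651": "place", "648": "time", "655": "form"}


def to_dc(tag, subfields):
    """Map one subject field, given as (code, value) pairs, to DC properties."""
    record = {}
    for code, value in subfields:
        kind = MAIN_KIND.get(tag) if code == "a" else SUBDIVISION_KIND.get(code)
        if kind is None:
            continue  # unknown tag or subfield: skipped in this sketch
        record.setdefault(DC_PROPERTY[kind], []).append(value)
    return record


print(to_dc("651", [("a", "Polska"), ("x", "kultura"), ("y", "1918-1939 r.")]))
```

Applied to the string 651 |a Polska |x kultura |y 1918-1939 r., the sketch places Polska under Spatial Coverage, kultura under Subject and the date range under Temporal Coverage.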

Fig. 5. JHP BN subject headings used as keyword in digital library.

Other aspects may be included in elements such as Audience Education Level or Medium. In the field of digitization the principal aim of the JHP BN team is to propose terms to describe objects of cultural heritage, such as manuscripts, rare books, graphics or photographs. Until now some NL special collections have not been indexed or classified. First, in cooperation with potential end-users such as historians, art historians and students, we will provide terms for indexing works of art and related images. We have reviewed and analyzed the most important thesauri and classifications used for indexing cultural heritage resources. We found that all these indexing tools were too complex for indexing our collection and could not easily be adapted. We have decided to add new terms to the JHP BN authority file and to supplement them with equivalent terms from other indexing tools such as the Art & Architecture Thesaurus, Thesaurus for Use in Rare Book and Special Collections, Thesaurus of Graphic Materials, or Iconclass. Thanks to these activities, more varied, multilingual access to Polish digitized cultural resources will gradually be provided.

10 http://www.bn.org.pl/download/document/1261049421.pdf, [consulted June 20, 2009].

Examples:
150 Zmartwychwstanie |2 jhpbn
750 Resurrection |2 tgm
750 Resurrection of Christ |2 iconclass

155 Misteria i mirakle |2 jhpbn
755 Miracle plays |2 rbgenr
755 Mystery plays |2 rbgenr

155 Drzeworyt chiaroscuro |2 jhpbn
755 Chiaroscuro woodcuts |2 aat

155 Fotografia lotnicza |2 jhpbn
755 Aerial photography |2 aat

155 Godzinki (księgi) |2 jhpbn
755 Books of hours |2 rbgenr

155 Zielnik |2 jhpbn
755 Pflanzenbuch |2 VD17

Multilingual access has never been our principal aim. JHP BN was created mostly for indexing documents acquired as legal deposit and written or spoken in Polish, which is not a widely used language. Digitization of library documents changes our point of view on this matter. The digital library contains and provides access to non-textual cultural objects such as graphics, printed and recorded music, or music manuscripts, which can be discovered without language obstacles. Multilingual access points seem to be extremely important for creating useful metadata for non-textual resources.

Many digital libraries enable users to describe resources with non-controlled vocabulary. Although the Polish Digital Library has not offered this possibility yet, we are interested in the potential value of social tagging. As we can observe in several databases, tags can complement controlled vocabulary and may become very useful additional subject access points. The potential of “common knowledge” may be enormous and helpful in many cases, especially in indexing collections of images, sound recordings or moving pictures that are sometimes extremely difficult to index with subject headings. We do not believe that social tagging can replace indexing by librarians (too many tags are of very low semantic value and in the absence of context are quite useless). However, our attitude toward a folksonomy in our digital library is open.
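Returning to the linked authority records shown in the examples earlier in this section, a minimal Python sketch shows how equivalent terms from other vocabularies could serve as multilingual entry points. The record structure and the function names are assumptions made for illustration; only the headings and source codes come from the examples.

```python
# Three of the example records: JHP BN heading plus (equivalent term, source code).
AUTHORITY = [
    {"heading": "Zmartwychwstanie",
     "equivalents": [("Resurrection", "tgm"), ("Resurrection of Christ", "iconclass")]},
    {"heading": "Misteria i mirakle",
     "equivalents": [("Miracle plays", "rbgenr"), ("Mystery plays", "rbgenr")]},
    {"heading": "Fotografia lotnicza",
     "equivalents": [("Aerial photography", "aat")]},
]


def build_index(records):
    """Map every heading and every equivalent term to the JHP BN heading."""
    index = {}
    for record in records:
        index.setdefault(record["heading"].casefold(), record["heading"])
        for term, _source in record["equivalents"]:
            index.setdefault(term.casefold(), record["heading"])
    return index


def lookup(index, query):
    """Return the JHP BN heading for a query in any indexed language, or None."""
    return index.get(query.strip().casefold())


index = build_index(AUTHORITY)
print(lookup(index, "Mystery plays"))
print(lookup(index, "aerial photography"))
```

A user searching in English would thus reach the records indexed under the Polish heading without knowing the Polish term.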

5. Conclusion

JHP BN is a local indexing tool used in Polish libraries. In thinking about its future development we must take into account not only the international, multilingual context but, first of all, the technical and financial conditions in Polish libraries, both in the National Library and in small local libraries. How do we deal with existing databases that are indexed with subject heading strings? All decisions that change the vocabulary, the application rules, or the indexing policy must be taken with caution and after wide consultation. Any change causes trouble for many librarians, and recataloging is time-consuming and expensive. Nevertheless, we are trying to make our indexing tool easier to use and maintain and able to satisfy the information needs of our contemporaries.

FAST Headings as Tags for WorldCat

Diane Vizine-Goetz

Abstract

This paper reports on an investigation to use Faceted Application of Subject Terminology (FAST) as a surrogate for tags in WorldCat, a global catalog of bibliographic records and location information for books, videos, music, and other types of materials found in libraries. FAST is a controlled vocabulary based on the Library of Congress Subject Headings (LCSH). FAST is applied to a copy of WorldCat to explore the potential of generating tag-like information for bibliographic records. The paper provides sample visualizations of FAST headings inspired by social tagging applications.

1. Introduction

The growing body of research on social tagging includes a few studies that compare user-entered tags and subject headings. Spiteri (2007) investigates the structure and content of tags, evaluating them against guidelines for thesaurus construction and LCSH. She finds that most tags represent things (e.g., concepts, people, places, and organizations) and that nouns are the most common form of term. She reports a high degree of overlap between the concepts expressed in tags and LCSH, but notes that tags differ from LCSH in vocabulary and specificity of terms. Rolla (2009) compares the quantity and nature of user-entered tags in LibraryThing to LCSH in bibliographic records for a sample of 45 titles. He finds that there are many more tags for a title than subject headings. The tags express concepts not found in subject headings for each of the 45 titles; subject headings express unique concepts not expressed in tags for a little more than half of the titles. He also reports significant overlap between the concepts in tags and LCSH and that user tags are often either broader or narrower than LCSH. Both investigators suggest that subject headings can be improved by adopting some of the qualities of tags, e.g., use of more natural language and popular terminology.

This paper reports on an investigation to use Faceted Application of Subject Terminology (FAST) as a surrogate for tags in WorldCat. It does not compare user tags and subject headings, but instead seeks to generate tag-like data elements from a controlled vocabulary. In this study, bibliographic records containing FAST headings are clustered into work sets. The headings are then aggregated at the work level and treated like tags.

The paper begins with an overview of FAST followed by a discussion of the application of FAST to a copy of WorldCat that is clustered into work sets according to principles of the Functional Requirements for Bibliographic Records (FRBR) model. Examples of headings aggregated at the work level are provided, followed by sample visualizations of FAST headings. The paper concludes with observations on the potential application of FAST in cataloging and end user environments and suggests directions for further research.

2. Overview of FAST

FAST is a controlled vocabulary based on the terminology of LCSH. FAST retains the rich vocabulary of LCSH, the synonym and homograph control of its parent vocabulary, and the linkages among terms provided by LCSH’s cross reference structure. FAST consists of headings which are categorized into seven subject facets and one form/genre facet:

Topic
Place
Time
Event
Person
Corporate body
Title of work
Form/Genre

Each heading belongs to only one facet, and facets may be used independently. All headings are enumerated in FAST except time,1 making FAST headings easier to apply and validate than headings from schemes that involve synthesis. FAST headings are established by faceting the Library of Congress Subject Authority File and LCSH in WorldCat records. A single LCSH may be broken up into multiple FAST headings:

LCSH  Psychiatric hospital patients—Massachusetts—Biography
FAST  Psychiatric hospital patients [Topic]
      Massachusetts [Place]
      Biography [Form/Genre]

LCSH  Women college students—Suicidal behavior—Fiction
FAST  Women college students—Suicidal behavior [Topic]
      Fiction [Form/Genre]

1 Authority records for chronological headings are only made when they are necessary to create a reference. See O’Neill and Chan (2003) for more information on chronological headings.
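The two faceting examples above can be reproduced with a toy Python sketch. It is not the FAST conversion procedure: the lookup tables are hypothetical stand-ins for the authority data that real faceting relies on, and `--` is used as an ASCII stand-in for the dash printed between subdivisions.

```python
# Hypothetical lookup tables; real faceting uses the LCSH authority file.
PLACES = {"Massachusetts", "United States", "Idaho"}
FORMS = {"Biography", "Fiction"}
DELIM = "--"  # ASCII stand-in for the printed subdivision dash


def facet_lcsh(lcsh):
    """Split one LCSH string into a list of (heading, facet) pairs."""
    topic_parts, result = [], []
    for part in lcsh.split(DELIM):
        part = part.strip()
        if part in PLACES:
            result.append((part, "Place"))
        elif part in FORMS:
            result.append((part, "Form/Genre"))
        else:
            # Topical subdivisions stay attached to the topical main heading,
            # because a subdivision must share the facet of its main heading.
            topic_parts.append(part)
    if topic_parts:
        result.insert(0, (DELIM.join(topic_parts), "Topic"))
    return result


print(facet_lcsh("Psychiatric hospital patients--Massachusetts--Biography"))
print(facet_lcsh("Women college students--Suicidal behavior--Fiction"))
```

The sketch only covers the Topic, Place and Form/Genre facets that appear in the two examples; names, events, titles and time periods would need their own rules.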


Faceting does not always result in more headings. Chan et al. (2001) give an example in which 47 different geographic LCSH entries are faceted into 19 FAST headings. FAST headings typically have a simpler syntax than LCSH even though FAST retains the use of subdivisions and the hierarchical structure of LCSH. A subdivision in FAST must belong to the same facet as the main heading. Headings are added to FAST according to their usage in WorldCat. Before a heading is added to the list, it is validated to eliminate errors and inconsistencies. In June 2009, the FAST file consisted of 1.6 million headings. For a detailed discussion of the history, structure and development of FAST, see Chan and O’Neill (2010).

3. FAST in WorldCat

This study reports on the application of FAST to a copy of WorldCat.2 Approximately 136 million LCSH in 62 million (46%) WorldCat records were faceted into 176 million FAST headings. The average number of LCSH per record is 2.19; the average number of FAST per bibliographic record is 2.84. The entire bibliographic file of 135 million records was then clustered into work sets using the OCLC FRBR Work-Set Algorithm.3 The algorithm collects bibliographic records into groups based on author and title information from bibliographic and authority records. Author names and titles are normalized according to the NACO Authority File Comparison Rules to construct a key for each bibliographic record (e.g., plath, sylvia/bell jar is the key for The Bell Jar by Sylvia Plath). All records with the same key are grouped together in a work set.

The clustering process produced 98,960,368 work sets; 44% (43,105,432) had FAST headings. About 1.5 million different FAST headings are represented in the clustered bibliographic file. The distribution of unique headings over the six major subject facets is shown in Table 1. Ninety percent of topical headings and 77% of geographic headings are applied to multiple work sets. Nearly twice as many unique personal name headings as topical headings are applied; however, more than 50% of personal names are applied to a single work set. Of the top 50 headings, 26 are topics, 23 are place names, and one is a corporate body. Less than 2% of the headings account for 80% of the use.
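The key-based clustering described above can be illustrated with a simplified Python sketch. It is not the OCLC FRBR Work-Set Algorithm and does not implement the NACO comparison rules: the normalization here (lower-casing, dropping punctuation except the first comma of a name, removing a leading English article) is an assumption chosen only so that the Plath example yields the key quoted in the text, and the record identifiers are hypothetical.

```python
import re
from collections import defaultdict

LEADING_ARTICLES = ("the ", "a ", "an ")  # assumption: English titles only


def _clean(text):
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def normalize_author(name):
    """Keep the first comma (surname, forename); drop all other punctuation."""
    surname, _, rest = name.partition(",")
    return _clean(surname) + (", " + _clean(rest) if rest.strip() else "")


def normalize_title(title):
    """Clean the title and remove one leading article, if present."""
    title = _clean(title)
    for article in LEADING_ARTICLES:
        if title.startswith(article):
            return title[len(article):]
    return title


def work_key(author, title):
    return normalize_author(author) + "/" + normalize_title(title)


def cluster(records):
    """Group (author, title, record_id) tuples into work sets by key."""
    work_sets = defaultdict(list)
    for author, title, record_id in records:
        work_sets[work_key(author, title)].append(record_id)
    return dict(work_sets)


print(work_key("Plath, Sylvia", "The Bell Jar"))
```

Records whose author and title differ only in capitalization, punctuation or a leading article end up in the same work set, which is what allows headings from different editions to be pooled.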

2 FAST headings were applied to a copy of OCLC WorldCat through the end of March 2009.
3 The algorithm is available for download at http://www.oclc.org/research/projects/frbr/algorithm.htm.

Table 1

Facet            Unique FAST   % of FAST   Multiple Uses     %   Single Use     %
Person               635,088         42%         310,766   49%      324,322   51%
Corporate body       346,535         23%         196,857   57%      149,678   43%
Event                  9,964          1%           6,308   63%        3,656   37%
Title of Work         47,654          3%          30,138   63%       17,516   37%
Topic                343,748         23%         310,765   90%       32,983   10%
Place                136,936          9%         105,269   77%       31,667   23%
Total              1,519,925        100%         960,103            559,822

The average number of FAST headings per work set is 2.55; work sets consisting of 3 or more bibliographic records average 3.2 headings per work. Work sets for fiction have slightly more headings than other work sets; they average 2.65 FAST headings per work, and 3.78 FAST headings for works with three or more bibliographic records. Records in work sets with three or more records are the most likely to be used by libraries; they average 25 holding codes, or holdings,4 per record and 133 holdings per work set, compared to 11 and 14, respectively, for the entire file.

4. Aggregating FAST Headings

This section presents information on the FAST headings in six work sets. Table 2 provides the following information for each work set:

• Title of work and work key (column 1).
• Total holdings count for the work set (column 2).
• Total number of languages of publication in WorldCat (column 3). This number includes the language of the original plus languages of translations.
• Total number of bibliographic records in the work set, i.e., all records with the same author/title key (column 4).
• Total number of LCSH (column 5). This number is a count of all unique headings coded as LCSH in the bibliographic records. This count was not adjusted for variations in capitalization, punctuation, or subfield coding, or for spelling and typographical errors.
• Total number of unique FAST headings that resulted from faceting LCSH (column 6).

4 A holding code identifies a library and indicates that the library owns an item. A holding code is associated with a bibliographic record when a holding is set or an item is cataloged in WorldCat.

Table 2

Title / Work key                                           Holdings  Languages  Records  Unique LCSH  Unique FAST
1. The Bell Jar                                              10,474         23      188           28           17
   plath, sylvia/bell jar
2. A Brief History of Time                                   11,402         21      162           22           14
   hawking, s w\stephen w/brief history of time
3. Fast Food Nation                                           7,297         14       60           20           16
   schlosser, eric/fast food nation
4. Housekeeping                                               3,757          6       53           20           14
   robinson, marilynne/housekeeping
5. Girl, Interrupted                                          4,363         10       30           16           10
   kaysen, susanna\1948/girl interrupted
6. A Savage War of Peace                                      2,450          3       28            2            2
   horne, alistair/savage war of peace algeria 1954 1962

Titles 2, 3, and 6 are titles studied by Rolla (2009, 184). They are included here so that the reader can compare results from the two investigations. All of the example work sets include records for translations of the original title, and all are widely held by libraries; they average more than 6,600 holdings per work. The number of records per work set ranges from 188 records for The Bell Jar, a work of fiction, to 28 records for A Savage War of Peace, a non-fiction work. The Bell Jar has the greatest number of LCSH with 28; A Savage War of Peace has the fewest with 2. The number of unique LCSH tends to increase as the number of records in the work set increases. The average number of LCSH per work set is 18 and the average number of FAST headings is 12.17, much greater than the average of 2.55 FAST headings for the complete file.

The FAST headings for each work set are given below. The top eight headings by holdings count are shown in bold type; individual headings are delimited by a semicolon.

The Bell Jar
American fiction; Autobiographical fiction; College students; College students—Suicidal behavior; Depression, Mental; Mental illness; Plath, Sylvia; Psychological fiction; Suicidal behavior; Suicidal behavior—Treatment; United States; Women authors; Women college students; Women college students—Suicidal behavior; Young adult fiction; Young women; Young women—Psychology

A Brief History of Time
Astronomy; Big bang theory; Black holes (Astronomy); Cosmography; Cosmology; Einstein, Albert, 1879-1955; Galilei, Galileo, 1564-1642; Hawking, S. W. (Stephen W.); Korean language; Newton, Isaac, Sir, 1642-1727; Physics; Planets—Origin; Space and time; Time

Fast Food Nation
Book clubs (Discussion groups); Consumer behavior; Convenience foods; Convenience foods—Social aspects; Cookery, American; Diet; Diet—Health aspects; Fast food restaurants; Food habits; Food industry and trade; Great Britain; Group reading; Packing-houses; Restaurateurs; Social history; United States

Housekeeping
1900 – 1999; American fiction; Aunts; Authors, American; Domestic fiction; Eccentrics and eccentricities; Girls; Idaho; Mothers; Mothers—Death; Pacific Northwest; Psychological fiction; Robinson, Marilynne; United States—Northwestern States

Girl, Interrupted
1900 – 1999; Authors, American; Kaysen, Susanna, 1948-; Massachusetts; Mental health; Mental illness; Mentally ill; Mentally ill, Writings of the; Psychiatric hospital care; Psychiatric hospital patients

A Savage War of Peace
Algeria; Revolution (Algeria : 1954-1962)

Each set of headings is an aggregation of the unique FAST headings for a given work set. For example, for Fast Food Nation, 2 records with 26 holdings have the heading Book clubs (Discussion groups); 9 records with 50 holdings have the heading Cookery, American; 36 records with 6,797 holdings have the heading Fast food restaurants; and so on. Aggregating the headings at the work level brings together all of the headings for a work, including correctly assigned headings and headings that may be inaccurately applied, e.g., Cookery, American.5 Headings that pertain only to translations or particular editions are also part of the set, e.g., Book clubs (Discussion groups), for a book club edition.
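The work-level aggregation described above amounts to counting, for each heading, the records that carry it and the holdings attached to those records. The Python sketch below shows one way this could be done; the three sample records and their holdings counts are invented for illustration and are not WorldCat data.

```python
from collections import defaultdict

# Hypothetical records in one work set: (holdings count, FAST headings on the record).
SAMPLE_WORK_SET = [
    (5000, ["Fast food restaurants", "Convenience foods", "United States"]),
    (1500, ["Fast food restaurants", "Food industry and trade"]),
    (50, ["Cookery, American", "Convenience foods"]),
]


def aggregate(records):
    """Return (heading, record count, holdings) tuples, highest holdings first."""
    totals = defaultdict(lambda: [0, 0])  # heading -> [record count, holdings]
    for holdings, headings in records:
        for heading in set(headings):  # count a heading once per record
            totals[heading][0] += 1
            totals[heading][1] += holdings
    rows = [(heading, n, held) for heading, (n, held) in totals.items()]
    # Sort by holdings (descending), then alphabetically for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


for heading, n_records, holdings in aggregate(SAMPLE_WORK_SET):
    print(f"{heading}: {n_records} records, {holdings} holdings")
```

Weighting by holdings rather than by record count means that a heading found on one widely held record can outrank a heading found on several rarely held ones.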

5. FAST as Tags

A tag cloud is a common way of presenting tags where more frequently used tags are emphasized using different font sizes or colors. The headings for Fast Food Nation are presented in Figure 1 as a tag cloud using holdings counts to weight the headings. The resulting cloud makes it easy to see that Convenience foods and Fast food restaurants are the primary subjects applied to this work and that Packing-houses and Consumer behavior are applied less frequently.

[Tag cloud: Book clubs (Discussion groups); Consumer behavior; Convenience foods; Convenience foods—Social aspects; Cookery, American; Diet; Diet—Health aspects; Fast food restaurants; Food habits; Food industry and trade; Great Britain; Group reading; Packing-houses; Restaurateurs; Social history; United States]

Figure 1

5 Rolla (2009, 182) cites Cookery, American as an example of a heading that does not accurately describe the subject of the work.
6 See the project, Work Records in WorldCat, for more information on OCLC Research’s ongoing experimentation with the FRBR model and FAST. http://www.oclc.org/research/activities/

Lists and charts are other ways of presenting terms and their frequencies or weights. An alternate graphical visualization of the headings for Fast Food Nation is shown in Figure 2. It consists of a pie chart and a list of the top eight headings ranked by holdings count; only the top five headings are presented in the chart. The value eight was selected based on a study of user tagging by Shirky (2005). He finds that the top eight tags in social tagging applications represent a consensus view of a resource. Research is needed to determine if there is a similar threshold for subject headings and bibliographic records.

1. United States (7235)
2. Convenience foods (7225)
3. Fast food restaurants (6795)
4. Food industry and trade (6789)
5. Food habits (444)
6. Cookery, American (50)
7. Diet—Health aspects (34)
8. Book clubs (Discussion groups) (26)

Figure 2
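To illustrate how holdings counts might be turned into tag-cloud weights, the following Python sketch maps the eight counts listed in Figure 2 to font sizes. The logarithmic scaling and the point-size range are assumptions made for this sketch; the paper does not state how the cloud in Figure 1 was sized.

```python
import math

# Holdings counts for the top eight headings, as listed in Figure 2.
HOLDINGS = {
    "United States": 7235,
    "Convenience foods": 7225,
    "Fast food restaurants": 6795,
    "Food industry and trade": 6789,
    "Food habits": 444,
    "Cookery, American": 50,
    "Diet--Health aspects": 34,
    "Book clubs (Discussion groups)": 26,
}


def font_sizes(weights, min_pt=10, max_pt=32):
    """Map positive weights to font sizes on a logarithmic scale."""
    logs = {term: math.log(weight) for term, weight in weights.items()}
    low, high = min(logs.values()), max(logs.values())
    span = high - low
    if span == 0:
        # All weights are equal: give every term the same middle size.
        return {term: (min_pt + max_pt) // 2 for term in weights}
    return {
        term: round(min_pt + (value - low) / span * (max_pt - min_pt))
        for term, value in logs.items()
    }


for term, size in font_sizes(HOLDINGS).items():
    print(f"{size:>2} pt  {term}")
```

A logarithmic scale is used here because the counts span more than two orders of magnitude; with linear scaling the four low-count headings would all collapse to nearly the minimum size.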

6. Conclusion and Future Directions

FAST headings, aggregated in the manner described in this study, can give librarians and other users access to a wider variety of headings than is available in individual bibliographic records. When FAST headings are presented as tag clouds, users can quickly see what subjects have been applied to a work. In a cataloging environment, access to aggregations of headings could lead to more efficient workflows, especially for the cataloging of new editions of existing works. Library staff members might also be more likely to notice and correct erroneous headings since they stand out among the correctly assigned ones. In end user environments, FAST could be used to improve browsing and navigation.

The next phase of this research will involve prototyping one or more of the scenarios described above and testing FAST with users. Future research may include exploring ways to increase the number and type of headings associated with work sets that have only a few headings, e.g., A Savage War of Peace, and experimenting with techniques to supplement controlled headings with additional entry vocabulary, including words and phrases from user tags.6

Acknowledgements

The author is grateful to Kerre Kammerer, Jenny Toves, and Harry Wagner for providing statistics on FRBR work sets and FAST headings, and to Roger Thompson for computer programming support. The author also thanks Terry Butterworth, Carol Hickey, Joan Mitchell, Larry Olszewski, Ed O’Neill, and Melissa Renspie for their assistance in preparing this paper.

References

Chan, Lois Mai, Eric Childress, Rebecca Dean, Edward T. O’Neill, and Diane Vizine-Goetz. 2001. “A Faceted Approach to Subject Data in the Dublin Core Metadata Record.” Journal of Internet Cataloging 4 (1/2): 35-47.
Chan, Lois Mai and Edward T. O’Neill. 2010. FAST: Faceted Application of Subject Terminology, Principles and Application. Westport, Connecticut: Libraries Unlimited.
O’Neill, Edward T. and Lois Mai Chan. 2003. “FAST (Faceted Application of Subject Terminology): A Simplified LCSH-Based Vocabulary FAST.” Paper presented at the World Library and Information Congress (69th IFLA General Conference and Council), Berlin, Germany, August 1-9. http://archive.ifla.org/IV/ifla69/papers/010e-ONeill_Mai-Chan.pdf.
Rolla, Peter J. 2009. “User Tags versus Subject Headings: Can User-Supplied Data Improve Subject Access to Library Collections?” Library Resources & Technical Services 53 (3): 174-184.
Shirky, Clay. 2005. “Dynamic Growth of Tag Clouds.” Weblog entry. You’re It! a blog on tagging, May 24. http://tagsonomy.com/index.php/dynamic-growth-of-tag-clouds/.
Spiteri, Louise F. 2007. “The Use of Collaborative Tagging in Public Library Catalogs.” Final report: 2006 OCLC/ALISE Library and Information Science Research Grant Project, February 26, 2007. http://library.oclc.org/cdm4/item_viewer.php?CISOROOT=/p267701coll27&CISOPTR=271.

Contributors

Aagaard, Harriet

Cataloguing and systems librarian, Stockholm Public Library, Stockholm, Sweden

Balíková, Marie

Head of National Subject Authorities and Subject Cataloguing Department, National Library of the Czech Republic, Prague

Buizza, Giuseppe

Head of cataloguing, classification and indexing, Biblioteca Queriniana, Brescia, Italy

Bultrini, Leda

Division manager/Planning, development and internal control division – Education and training division, ARPA Lazio (Regional Agency for Environment Protection – Lazio), Roma, Italy

Casson, Emanuela

Librarian, Library of Seminario Matematico, University of Padua, Padua, Italy

Chan, Lois Mai, Ph.D

Professor, School of Library and Information Science, University of Kentucky, USA

Dunsire, Gordon

Consultant

Fabbrizzi, Andrea

Librarian, Library of Social Sciences, University of Florence, Florence, Italy

Franci, Luciana

Librarian, National Central Library of Florence, Florence, Italy

Gnoli, Claudio

Cataloguer, Science and Technology Library, University of Pavia, Italy

Jahns, Yvonne

Subject librarian / Law specialist, Department of Classification and Indexing, Deutsche Nationalbibliothek, Leipzig, Germany

Klenczon, Wanda

Head, Bibliographic Institute, National Library of Poland, Warsaw, Poland

Landry, Patrice

Head, Subject Indexing, Swiss National Library, Bern, Switzerland

Lucarelli, Anna

Librarian, National Central Library of Florence, Florence, Italy

Mitchell, Joan S.

Editor in Chief, Dewey Decimal Classification, Dublin, Ohio, USA

Motta, Marta

Nuovo soggettario Project member, National Central Library of Florence, Florence, Italy

Nilbe, Sirje

Head, Authority Control Department, National Library of Estonia, Tallinn, Estonia

O’Neill, Edward T.

Senior Research Scientist, OCLC Research, OCLC Online Computer Library Center, Inc., Dublin, Ohio, USA

Roe, Sandy

Head, Cataloging & Metadata Services Unit, Illinois State University

Rolle, Massimo

Director, Giunta Regionale Toscana Library, Italy

Rype, Ingebjørg

Senior Librarian, Norwegian National Library, Oslo, Norway

Slavic, Aida, Dr.

UDC Editor-in-Chief, UDC Consortium, The Hague, Netherlands

Svanberg, Magdalena

Executive officer / Librarian, National Library of Sweden, Stockholm, Sweden

Vizine-Goetz, Diane

Senior Research Scientist, OCLC Research, OCLC Online Computer Library Center, Inc., Dublin, Ohio, USA