English Pages 123 [128] Year 1998
International Federation of Library Associations and Institutions
Fédération Internationale des Associations de Bibliothécaires et des Bibliothèques
Internationaler Verband der bibliothekarischen Vereine und Institutionen
Международная Федерация Библиотечных Ассоциаций и Учреждений
Federación Internacional de Asociaciones de Bibliotecarios y Bibliotecas
IFLA Publications 85
Multi-script, Multilingual, Multi-character Issues for the Online Environment
Proceedings of a Workshop Sponsored by the IFLA Section on Cataloguing, Istanbul, Turkey, August 24, 1995
Edited by John D. Byrum, Jr. and Olivia Madison
K · G · Saur
München 1998
IFLA Publications edited by Carol Henry
Recommended catalogue entry: Multi-script, multilingual, multi-character issues for the online environment : proceedings of a workshop sponsored by the IFLA Section on Cataloguing, Istanbul, Turkey, August 24, 1995 / Ed. by John D. Byrum, Jr. and Olivia Madison. - München : Saur, 1998, IV, 123 p. 21 cm (IFLA publications ; 85) ISBN 3-598-21814-1
Die Deutsche Bibliothek - CIP Einheitsaufnahme Multi-script, multilingual, multi-character issues for the online environment : proceedings of a workshop sponsored by the IFLA Section on Cataloguing, Istanbul, Turkey, August 24, 1995 / Ed. by John D. Byrum, Jr. and Olivia Madison. - München : Saur, 1998 (IFLA publications ; 85) ISBN 3-598-21814-1
∞ Printed on acid-free paper. The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences - Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
© 1998 by International Federation of Library Associations and Institutions, The Hague, The Netherlands
Alle Rechte vorbehalten / All Rights Strictly Reserved
K.G. Saur Verlag GmbH & Co. KG, München 1998. Part of Reed Elsevier
Printed in the Federal Republic of Germany
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system of any nature, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher.
Printed / Bound by Strauss Offsetdruck GmbH, Mürlenbach
ISBN 3-598-21814-1
ISSN 0344-6891 (IFLA Publications)
CONTENTS

Introduction
  John D. Byrum and Olivia Madison
Turkish Experiences with Multi-Script, Multi-Character, and Multilingual Cataloguing Issues
  Ahmet Çelenkoglu
Commentary
  Colleen Hansen
Nine Problems Concerning Arabic
  Charlotte Wien
Commentary
  John Eilts
Cataloguing of Documents for Multilingual Catalogues of Libraries in Russia: Analysis of the Problem Situation
  Natalia N. Kasparova
System of Multilingual Catalogues and Problems Arising at the Initial Stage of Electronic Database Creation
  Ludmila A. Terekhova
Commentary
  Helen F. Schmierer
Multilingual and Multi-Character Set Data in Library Systems and Networks: Experiences and Perspectives from Switzerland and Finland
  Riitta Lehtinen and Genevieve Clavel-Merrin
Commentary
  Eeva Murtomaa
The Unicode™ Standard: An Overview with Emphasis on Bidirectionality
  Joan M. Aliprand
Commentary
  James E. Agenbroad
Selected Bibliography
INTRODUCTION John D. Byrum Library of Congress and Olivia Madison Iowa State University Library
IFLA's interest in the problems of multilingual and multi-script publications extends well into the past. More than a decade ago, in fact, the first IFLA pre-conference on this topic was held in Tokyo and resulted in publication in 1987 of Automated Systems for Access to Multilingual and Multiscript Library Materials: Problems and Solutions.1 Although the conference emphasized oriental scripts since it was held in the Pacific region, the issues identified were mostly applicable to other non-roman scripts as well. At the conclusion of the pre-conference, Cynthia Durance summarized several points that had emerged during the meeting as topics requiring priority attention, including:
• The need for more influence on technical systems development and more participation in standards development from countries and areas for which various languages and scripts are in the vernacular.
• The need for greater standardization of existing transliteration schemes for effective interchange of data.
• Development of character set standards.
• Need for standards for sorting and retrieval of all scripts, but in particular ideographic scripts.
• Adjustment of the ISBDs and of UNIMARC to better accommodate non-roman scripts.
• Discovery of incentives to persuade vendors to develop software and hardware to better support multi-script applications.2
As a follow-up to this session, the IFLA Sections on Information Technology and on Library Services to Multicultural Populations joined with the IFLA Division on Bibliographic Control in sponsoring a satellite meeting in August 1993, just prior to the annual IFLA conference in Barcelona, which dealt with issues related to automated systems for access to multilingual and multi-script library materials. This meeting was designed to respond to increasing international interest in, and concern over, the growing and potentially divergent technological responses to display and access issues for multilingual and multi-script library materials. The satellite meeting's official goal was to "focus on multilingual and multi-script problems in the library service arena, especially in the automated environment and to facilitate the access to this sort of material." The 1993 programme covered such topics as the design of multilingual and multi-script catalogues, multi-script issues and graphic displays for Online Public Access Catalogues (OPACs), and automation of various scripts within online systems. Ms. Eeva Murtomaa wrote a full report of this satellite meeting for International Cataloguing and Bibliographic Control.3 The satellite meeting yielded fourteen resolutions and action items which were approved by the IFLA Section on Information Technology and the IFLA Section on Library Services to Multicultural Populations. The IFLA Section on Cataloguing adopted four of the action items as part of its action plan for its Medium Term Programme:
• Encourage changes to cataloguing rules to include all scripts and languages;
• Develop recommendations on sorting arrangements across languages and scripts;
• Develop recommendations for treatment of word division in appropriate non-roman scripts; and,
• Develop recommendations for treatment of abbreviations in bibliographic records to reduce the difficulty of comprehension of data in multilingual databases.
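The recommendation on sorting arrangements can be illustrated with a small sketch. The Python below is not drawn from the workshop papers; it assumes, purely for illustration, that headings are filed by the 29-letter modern Turkish alphabet, and the names TURKISH_ALPHABET, turkish_lower, and turkish_sort_key are hypothetical helpers. It shows how plain code-point order misfiles headings that begin with letters such as Ç, İ, and Ş.

```python
# Illustrative sketch only: file headings by the 29-letter Turkish alphabet.
# All names here are hypothetical helpers, not part of any library system.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
RANK = {letter: position for position, letter in enumerate(TURKISH_ALPHABET)}


def turkish_lower(text):
    """Lowercase text using the Turkish pairs I/ı and İ/i."""
    # str.lower() alone would turn "I" into "i", which is wrong for Turkish.
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_sort_key(text):
    """Build a sort key; characters outside the alphabet file last."""
    return [(RANK.get(ch, len(RANK)), ch) for ch in turkish_lower(text)]


headings = ["Zonguldak", "Şırnak", "Sivas", "İzmir", "Iğdır", "Çorum", "Cizre"]

code_point_order = sorted(headings)
turkish_order = sorted(headings, key=turkish_sort_key)

print(code_point_order)
print(turkish_order)
```

In code-point order the headings beginning with Ç, İ, and Ş all fall after Z, whereas the Turkish key files Çorum after Cizre and Iğdır before İzmir. A production catalogue would more likely rely on a full collation standard than on a hand-built table like this one.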
The Cataloguing Section's 1992-1997 Medium Term Programme had already featured two relevant goals which provide the essential framework for these action items:
• Propose methods to improve the handling of multi-script records and the integration of such records in various types of bibliographic databases.
• Assess the problems and prospects of linking various single language and/or multilingual name authority files.
As part of its response to these action items, the Section on Cataloguing's Standing Committee encouraged the work of Ms. Eeva Murtomaa and Ms. Eugenie Grieg who in 1994 published an article in International Cataloguing and Bibliographic Control that discussed the problems and prospects of linking various single-language and/or multi-language name authority files.4 This article was based on discussions held from 1991 to 1993 within meetings of the IFLA Standing Committee for the Section on Cataloguing. The Section's Standing Committee also followed up by arranging, as a featured part of its open programme at the 61st IFLA Conference in 1995, for Mr. John Eilts to present a paper on how North American libraries are dealing with non-roman script materials for automation and international exchange.5 Mr. Eilts concluded his paper with the recommendation that we should begin developing necessary programs to translate relevant standards (e.g., character sets, data formats, subject classification, etc.) "to and from those in use in countries where scripts are used, and in areas where the direct translation of formats would cause a loss of data."4
Workshop on Multi-script, Multilingual, Multi-character Issues for the Online Environment
As a further contribution to consideration of issues raised at the Satellite Meeting held at Madrid, the Section on Cataloguing sponsored the Workshop which is the subject of this publication. At the gathering in Istanbul, six participants presented major papers covering key issues, with commentaries by other participants.
Following is a brief synopsis of the papers:

The first speaker, Mr. Ahmet Çelenkoglu, begins his paper by noting that Turkey was not prepared for the advent of automation, having neither agreed-upon national cataloguing or format standards nor established groups to promulgate and maintain them. The National Library of Turkey (NLT) attempted to provide leadership with its TURKMARC project launched in 1981, but academic and public libraries were not willing to follow at first. Nevertheless, the National Library did translate and adapt the 20th Edition of the Dewey Decimal Classification in 1993 and Anglo-American Cataloguing Rules, 2nd edition (AACR2) (1988 revision) in 1994. Recounting experiences of NLT from its inception in 1946, Mr. Çelenkoglu traces its development into the leading institution in the field of Turkish library services. NLT initiated automation in 1982, electing USMARC and AACR1, and by the mid-1990's had developed a bibliographic database containing a half-million records. Mr. Çelenkoglu briefly discusses the ALEPH system and then explains the complications of the Turkish alphabet, which contains 29 letters and is completely phonetic. He concludes by tracing efforts to gain recognition of the alphabet's special characteristics in ISO standards and identifying the hardships resulting from ISO/IEC 10646-1, which complicated the hardware and software needed to process and display bibliographic records in Turkish.

Next, Ms. Charlotte Wien's paper focuses on the challenges involved in machine processing of bibliographic records which contain data written in the Arabic script. She notes that there is not one, but several character sets for this language, and that several integrated library systems exist which are capable of displaying records in Arabic.
Nevertheless, she believes that several problems impede the widespread incorporation of cataloguing in Arabic, and in her paper she has identified nine specific barriers which need to be overcome. Ms. Wien divides these problems into technical and linguistic categories. The first group includes unique features of the Arabic alphabet: the fact that the script is written from right to left; the multiple nature of the graphical expressions of the individual letters; and, complications related to Arabic numerals. In the linguistic area, the problems she cites are: lack of vowels in most written texts; the roots and patterns that structure the language; complexities arising from "weak radicals"; the lack of case endings except in formal written Arabic and indications of pronouns as suffixes of the words to which they relate; and, the orthography of Arabic. Ms. Wien's paper then reviews research related to these problems and finds that despite this work
most difficulties pertaining to records in Arabic script in automated databases remain. She concludes with suggestions for future investigations to resolve them. Ms. Natalia Kasparova addresses in her paper the issues involved in creating and maintaining catalogues in which different languages and character sets are represented. She pragmatically discusses them in relation to cataloguing rules and what their relationships should be towards international standards, such as the ISBDs, and respect for the original languages of the cataloguing agencies. Ms. Kasparova reviews three strategies currently under discussion by librarians and bibliographic agencies. The first strategy is to collocate in one alphabetic file all works of an author regardless of language. The second strategy is to adapt headings to what she refers to as "conditions of Russia." The third strategy is to facilitate the compilation and retrieval of bibliographic records created with multicharacter sets. She then discusses some of the difficulties facing libraries and bibliographic agencies because of the wide variety of their bibliographic practices (including standards for transcription and transliteration) and the various languages used throughout Russia. Ms. Kasparova concludes her paper with five guidelines that could facilitate national and international bibliographic cooperative efforts. Ms. Ludmila Terekhova's paper provides a case study to illustrate the experience of a major Russian research library's effort to automate its multilingual catalogue, that of the Rudomino Library for Foreign Literature (RLFL) in Moscow. She begins by pointing out how the recent social and political changes in Russia have influenced librarianship and the nature of cataloguing itself. The variety of publications issued in numerous languages and various scripts which a library such as RLFL collects as well as the weight of past cataloguing practices greatly complicate automation. 
Regarding the latter point, she indicates that RLFL's alphabetic catalogues are arranged differently with three different approaches simultaneously implemented: some publications represented by records arranged alphabetically by a single alphabet, some in a file which cumulates in single alphabets all entries in languages using the same script, others in separate alphabetical catalogues for different languages. In addition there is a subject periodical catalogue and various departmental catalogues. She notes that cataloguing practices represented in these files do not necessarily match well with format specifications. Nor has ISBD been rigorously applied. Beyond the technical and technological issues analyzed, automation presents other challenges of an administrative, professional, and human nature.
In their paper Riitta Lehtinen and Genevieve Clavel-Merrin discuss in depth the problems associated with multilingual and multi-character access/display in national bibliographic databases. The paper's primary purpose is to provide examples that illustrate some of the questions related to multilingual subject access and multi-character access to bibliographic data that were raised by Dr. Vinod Chachra, VTLS Inc., in his paper given at the Second IFLA Satellite Meeting on Automated Systems for Access to Multilingual and Multi-Script Library Materials that was held in Madrid in August 1993. While the authors draw on experiences encountered in Switzerland and Finland, the questions they raise and the solutions they suggest could be used in a wider context of international networking and access. Ms. Clavel-Merrin first discusses multilingual access in Switzerland, where there are four official languages (German, French, Italian, and Romantsch) and English-language materials are commonplace. She then concentrates on a strategy developed by the Swiss National Library (SNL) and VTLS Inc. for multilingual subject access. The proposed system requires that a user either accept the system-supplied default interface language or initially set it to a different default language. As a result, the language of subject searching would be the selected default language and the screens would only display subject headings in that language. It is important to note that bibliographic records containing headings in other languages that are linked to the default language within the authority structure would also be displayed. Ms. Clavel-Merrin then describes a number of technical issues related to this proposal, including the authority record structure and the need for displays of MARC authority records that contain all language types.
She concludes that while this represents a technical solution, considerable work is necessary for implementation, particularly in the creation of a multilingual subject heading list. Ms. Lehtinen next discusses access issues involving multilingual and multicharacter set databases in Finland. Finland has two official languages, Finnish and Swedish, and English is commonly used for international cooperation. German, Estonian, and Somali language materials are also represented in Finnish bibliographic databases. She describes the LINNEA Network, which includes all Finnish university libraries, the Library of the Finnish Parliament, and the National Repository Library. As such, they all contribute to the union catalogue database, LINDA, which contains shared bibliographic and authority records. Primary cataloguing is done in LINDA, and bibliographic and authority records are copied
into local databases. Ms. Lehtinen next discusses the possibility of multilingual subject access as described earlier by Clavel-Merrin, as well as another option involving the creation of one authority record per concept per language and linking them together via a separate link record. The Finnish section concludes with a discussion of the access issues that have arisen for multi-character set data, and of the solutions found for them, in order to provide adequate access to large collections of literature held by LINNEA libraries.

Ms. Joan Aliprand's paper, "The Unicode Standard: An Overview With Emphasis on Bidirectionality," strongly endorses the need for an international standard that would enable worldwide distribution of applications, thereby affording data portability between system platforms. The Unicode Standard, while not perfect at this stage, may provide this data portability. Ms. Aliprand defines the critical issues illustrating why our current multilingual and multi-script character sets cannot meet the demand of universal sharing of data and why they require major encoding conversion programs. She argues that there should be one universal version that would cover all major scripts as well as be adaptable for a particular cultural need. In her paper, Ms. Aliprand strongly endorses the Unicode Standard as being the most viable international standard because of its room for over 65,000 characters, accommodation for all major international scripts, and governing principles. She then discusses the importance of these governing principles, which include the distinction between character and glyph; the unification of characters across languages; and, the dynamic composition of accented forms from component characters.
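The principle of dynamic composition can be shown in a few lines. The following Python sketch is an illustration added here, not an example from Ms. Aliprand's paper; it uses the standard library's unicodedata module, whose normalization forms come from later versions of the Unicode Standard than the one discussed at the workshop. The helper name same_text is hypothetical.

```python
import unicodedata

# The same accented letter can be stored two ways:
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both render as the same glyph, yet the stored character sequences differ.
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# Normalization converts between the two representations.
composed_form = unicodedata.normalize("NFC", decomposed)
decomposed_form = unicodedata.normalize("NFD", precomposed)


def same_text(left, right):
    """Hypothetical helper: compare strings after composing both (NFC)."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(same_text("Bibliothe\u0301caires", "Biblioth\u00e9caires"))
```

A catalogue that compares headings without some such normalization step could treat two visually identical names as different access points.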
The importance of the Unicode Standard for cataloguing is reviewed, particularly relating to the cataloguer's need for fonts that permit the transcription of information "from the item itself as faithfully as possible, in the language and script (wherever practical)." Ms. Aliprand reports that major computer companies are now building serious multilingual and multi-script systems as integral parts of their software. She concludes her paper with some pragmatic recommendations for moving ahead in constructing the Unicode Standard.

Recommendations from the Workshop Papers

While the authors dealt with a variety of different national and international issues of concern, several recurring themes emerged from their papers. The themes involve the increasing need for standards and cooperative sharing of bibliographic and authority data as the only viable methods to solve the seemingly
impossible problems associated with multilingual and multi-script electronic bibliographic environments. Within several papers, the authors suggest ways of solving some of the issues related to display and access on national bases that may have international ramifications. Other authors voice the undeniable need for expanded standards, such as UNICODE, that offer the potential to standardize a font that could cover all scripts. Above all, the role of the national bibliographic agency is seen as even more critical than ever in both national and international arenas. While the authors of the papers and commentaries suggest many more specific recommendations related to the individual topics, they do agree on several over-arching themes. In many cases the authors look to IFLA, national bibliographic agencies, developers of standards such as ISO and MARC formats, and the commercial electronic industry to respond to these recommendations. Also, individual librarians and other colleagues in the bibliographic environment have essential roles to play in and outside of official organizations as we make progress on national and international levels. There is broad responsibility for pursuing the following recommendations:
Bibliographic and Authority Files
• Include as essential components of our national and international bibliographic databases the abilities to include and display records with original language scripts. These databases should provide links between different forms of records and should include and display original language scripts and links between different forms of language and scripts of name and subject headings.
• Jointly create and maintain multilingual and multi-script authority files to be used nationally and internationally. Within these authority files, recognition should be given to different authorized forms of names

Role of National Bibliographic Agencies/IFLA/ISO
• Determine on national and international levels the roles that centralized agencies, such as national bibliographic agencies, should play in the development of shared bibliographic and authority data files (including the verification of uniform headings).
• IFLA and ISO have crucial and continuing roles in the development and promotion of international standards.

Importance of Standards
• Promote national and international creation/use of standards and problem solving related to issues of access, display and maintenance of bibliographic and authority data files.
• Create a universal character set that would cover all major scripts and promote its general international use.
• Develop an agreed-upon mapping between current character sets and new universal character sets.
• Develop a default sorting order for the full character repertoire of ISO 10646.
• Standardize "filing rules" in multilingual and multi-script environments.
• Create standards to facilitate resource discovery on the Internet and the use of mark-up languages such as SGML and HTML.
• Reconsider the MARC format and current cataloguing codes and their applicability to UNICODE and multi-character set conventions and requirements.
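
The recommendation for an agreed-upon mapping between current character sets and a universal one can be sketched briefly. The Python below is an illustration under stated assumptions rather than a method proposed in the papers: it assumes a legacy record stored in ISO 8859-9 (the Latin alphabet variant that covers Turkish), and the names LEGACY_RECORD, to_unicode, and can_round_trip are hypothetical.

```python
# Illustrative sketch: map bytes from a legacy 8-bit character set to Unicode.
# The record and helper names are hypothetical.

# "Iğdır" as stored by a system that uses ISO 8859-9.
LEGACY_RECORD = b"I\xf0d\xfdr"


def to_unicode(raw, source_encoding):
    """Decode legacy bytes using the mapping table for the declared set."""
    return raw.decode(source_encoding)


def can_round_trip(text, encoding):
    """Return True if text survives conversion to the target set and back."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The target set has no code for at least one character.
        return False


correct = to_unicode(LEGACY_RECORD, "iso8859_9")
misread = to_unicode(LEGACY_RECORD, "iso8859_1")  # wrong mapping assumed

print(correct)   # decoded with the Turkish mapping
print(misread)   # same bytes read with the Western European mapping

print(can_round_trip(correct, "iso8859_9"))
print(can_round_trip("كتاب", "iso8859_9"))  # Arabic is outside this set
```

The same bytes yield different text depending on which mapping is assumed, which is why the mapping needs to be agreed upon and recorded with the data. The round-trip check shows the loss of data that occurs when a record is exported to a character set lacking the script.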

Education and Research
• Promote within individual countries and regions cooperative bibliographic efforts and the importance of bibliographic standards while recognizing and balancing legitimate needs for local practices.
• Increase awareness of how technology can be used for resource sharing and cooperative bibliographic control.
• Support additional research on sophisticated search engines that can cope with multiple scripts and languages and improve recall for natural language queries. In particular continue solving problems associated with the innate differences among major international scripts. (The papers included in this volume provide pertinent examples for Arabic and Cyrillic scripts.)
• Increase emphasis on the need to search and display original scripts.

Many of these issues are ongoing and require much more attention as priority issues for IFLA. As a result, the Section on Cataloguing is continuing its emphases on the multilingual and multi-script issues. The Section's 1998-2001 Medium Term Programme contains an important goal related to the previous recommendations from the satellite meeting: Develop approaches, standards, rules, lists for information that provide access to bibliographic data in all languages and in all scripts. Furthermore, the Section's current action plan includes an action item to "hold a third satellite meeting to review further developments in relation to automated access to multilingual/multi-script library materials in cooperation with appropriate sections and standards organizations (e.g., ISO, Unicode Consortium), and system vendors as well as language specialists."
Acknowledgements The editors take this opportunity to express their gratitude to Anthony Franks, senior cataloguer, and Robert August, automation operations coordinator, Library of Congress, for their exceptional assistance in the preparation of the final text of these proceedings.
Footnotes
1. Automated Systems for Access to Multilingual and Multiscript Library Materials: Problems and Solutions: Papers From the Pre-conference Held at Nihon Daigaku Kaikan Tokyo, Japan, August 21-22, 1986, edited for the Section on Library Services to Multicultural Populations and the Section on Information Technology by Christine Bossmeyer and Stephen W. Massil. (West Germany: Saur, 1987).
2. Ibid. pp. 185-191.
3. Eeva Murtomaa, "Second IFLA Satellite Meeting on Automated Systems for Access to Multilingual and Multiscript Library Materials, Madrid, 18 & 19 August 1993: Report," International Cataloguing and Bibliographic Control 23, no. 1 (January/March 1994): 9-11.
4. Eeva Murtomaa and Eugenie Grieg, "Problems and Prospects of Linking Various Single-Language and/or Multi-Language Name Authority Files," International Cataloguing and Bibliographic Control 23, no. 3 (July/Sept. 1994): 55-58.
5. John Eilts, "Non-Roman Script Materials in North American Libraries: Automation and International Exchange," International Cataloguing and Bibliographic Control 25, no. 3 (July/Sept. 1996): 51-53.
TURKISH EXPERIENCES WITH MULTI-SCRIPT, MULTI-CHARACTER, AND MULTILINGUAL CATALOGUING ISSUES
Ahmet Çelenkoglu
National Library of Turkey

Brief Account of Cataloguing Experiences in Turkey

The first product of standardized cataloguing rules and practices in Turkey was the work entitled Basma Eserler Alfabetik Katalog Kaideleri-Enstruksiyon (in translation known as Alphabetical Cataloguing Rules for Printed Materials-Instruction (ACRPM-I)). This code was prepared under the leadership of the National Library of Turkey (NLT) in 1957 and incorporates material developed by organizations such as the American Library Association, the (British) Library Association and the British Museum. Because it was so detailed, the ACRPM-I was used only by the NLT and by a few larger university and research libraries. Another set of rules, Kitap Kataloglama Kurallari = Cataloguing Rules for Books (CRB), prepared to meet the needs of public libraries and other small libraries, did not achieve standardization. Publication of the Turkish translation of the First Edition of the Anglo-American Cataloguing Rules: North American Text (AACR1) in 1980 was the most significant indicator that the effort to achieve national standardization would also be directed at achieving international standardization. Many libraries have used AACR1 since it was published.

During the 1980's, when computers were introduced for use in the libraries, Turkey had to face new dimensions of the prevailing problems. Because cataloguing standards were still developing, centralized cataloguing could not be implemented, and national bibliographic control could not be fully realized. The most important problem was the absence of standards and practices to achieve machine-readable cataloguing (MARC). Turkey had not established working groups for national standardization and was not participating in the international standardization work. As a result, its library community was unprepared in this area.
In the absence of a central institution responsible for coordination of activities, libraries started to implement various MARC formats such as USMARC, UNIMARC, and OCLC/MARC.
The TURKMARC project was launched by the NLT in 1981 with the aim of solving this problem by providing a common format for Turkish libraries to use. However, it did not receive sufficient support and participation on the national level, and it has attracted little interest from those teaching library sciences. Therefore, TURKMARC has not yet yielded any major results. Failure to attract the interest, participation, and support of the Turkish library community in general can be observed in every project introduced in the field of library service. For example, an endeavor to translate and adapt the Dewey Decimal Classification, 20th edition (DDC20) in 1991 was one the NLT had to start and complete on its own. By publishing and distributing this translation in 1993, the NLT met a significant need, especially for those who do not know English and have had to use a translation of the 15th edition dating from 1962. Another project being pursued by the NLT, preparation of a Turkish translation and adaptation of the Library of Congress Subject Headings (LCSH), is also proceeding under these adverse circumstances. In Turkey, many libraries (including the NLT) do not use subject headings, while others follow individual or local practices that are not compatible with international standards. Conscious of the Library's leadership function as the national library, staff at the NLT recognize the importance of applying national and international standards for furthering the universal sharing of information. Therefore, in 1994 the NLT launched another significant initiative - translation and adaptation of the 1988 revision of AACR2 (AACR2R) - despite the unfavorable national and institutional conditions. The existing Turkish translations of AACR2R were not complete and had been prepared by individuals independent of the NLT.
With the completion of NLT's translation and adaptation, it will no longer be necessary to rely on incomplete and outdated translations; in addition, differences in interpretation and implementation resulting from cataloguers' insufficient knowledge of English will diminish.

Experience of the National Library of Turkey (NLT)

Activities to establish the NLT started in 1946, and the institution began to provide services in 1948. The Library became a legal entity in 1950. With a collection of nearly 1.5 million items and a staff numbering 200, the NLT stands as the largest library in Turkey. Although the NLT has tried to fulfill its leadership role by sponsoring a variety of projects, the Library unfortunately cannot provide full national bibliographic control, coordination, and standardization due to legal obstacles and problems. Assessing the situation in Turkey in the light of the NLT's experience as the leading institution in the field of Turkish library sciences would not be misleading.

1. Experiences before Automation

Until 1957, NLT staff processed material using traditional cataloguing methods. During the 29 years between 1957 and 1985, NLT followed the provisions of ACRPM-I which had been prepared under its leadership. Aware of its role and importance in promoting national and international standardization, the NLT began to implement AACR1 in 1985. Because of the many similarities between ACRPM-I and AACR1 and of the extraordinary efforts by the NLT cataloguers, the transitional period did not entail great difficulty; indeed, compatibility and standardization were achieved in a short period of time. Thereafter, books, periodicals, and non-book materials such as posters and post cards were catalogued according to AACR1. The biggest challenge encountered at this stage resulted from authority control issues. Since the authority and bibliographic records in the form of cards prepared according to AACR1 were filed in boxes arranged according to ACRPM-I, problems pertaining to order and use arose and they still exist. The situation was complicated further, because, even before the NLT had fully implemented AACR1, the 1988 revision of AACR2 had been published. Despite problems arising from English, cataloguers at the NLT strove to comprehend and implement the necessary changes introduced by this revision. Unfortunately, however, this has led in some cases to differences in interpretation and implementation.

2. Experiences after Automation

The NLT started inputting data pertaining to its collections in 1990, after a long period of planning and preparatory work.
The Library gave priority to using its computer facility for the preparation and publication of its national bibliographies. For this application, special software was developed. In making the transition from a manual to an automated system, the NLT had to decide on a format and on standards for encoding bibliographic records. This
need caused NLT's cataloguing specialists to undertake intensive, short-term research and development. These investigations concentrated on UNIMARC and USMARC and resulted in a preference for OCLC/MARC, with which the NLT has online links and from which it can easily procure formats and CD collections. During the implementation carried out later, however, USMARC took precedence over OCLC/MARC (which had itself been developed largely on the basis of USMARC).

The NLT wanted to accomplish its automation in a single process that would include the pre-existing database, but it did not have sufficiently trained staff to achieve this goal by itself. Consequently, the Library contracted the retrospective conversion out to a private firm. Since the NLT had also decided that the automated database should reflect full compatibility and standardization in terms of cataloguing rules, the catalogue cards and the records in the bibliographies that had been prepared according to ACRPM-I were not used. Despite the evident difficulties, efforts were made to catalogue all material according to the 1988 revision of AACR2.

An additional problem developed when approximately 300,000 records that had been input with the library software Ultimated Library System (ULS) could not be used because of software incompatibility. This necessitated a new search for software. Digital, the firm that had provided the NLT with its new hardware, a VAX 6610, recommended that the Library use the software known as Automated Library Expandable Programme (ALEPH), which it could provide. As a result, cataloguing and data input were restarted; and, despite a few defects, records for more than 500,000 books, periodicals, and non-book materials have been completed.

ALEPH

ALEPH is software that was designed and developed by the Hebrew University in Israel for use in libraries and other data centers.
It is a fully integrated and comprehensive program for library applications, made universal by parameter tables that the user can arrange for the desired implementation. ALEPH can, therefore, be adapted to the specific needs of various institutions such as libraries, museums, and archive centers, and is potentially applicable to records for all kinds of material. ALEPH supports not only scripts written from left to right but also those written from right to left (as in Hebrew and Arabic), although the NLT is not yet able to take advantage of this feature. The software also enables input and display of characters in Cyrillic and Greek scripts
as well as those in the Latin alphabet. ALEPH is a software package that runs on VAX/VMS series computers, and it can support inter-library networks on one or more computers. ALEPH's OPAC functionality can serve the library public, and the ISO standard interface CCL (Common Command Language) can be used. It supports MARC and non-MARC (special) records. ALEPH enables each library to make arrangements on separate systems and to adapt its registration forms, definitions of access and index fields, screen images, messages, and various sets of character and command codes to meet its needs.

Turks and Turkish

Following this brief evaluation of multi-script, multi-character, and multilingual issues from the cataloguing perspective, it is appropriate to evaluate them in terms of the special circumstances created by the Turkish language. Such an evaluation has to take into account both the letters specific to Turkish and the alphabets that Turks have used and are using. The written sources of Turkish culture are found in many different alphabets with different letters. In manual cataloguing the problem of multiple alphabets appears to be solved through transliteration, but the issue becomes more complex when machine-readable processing is under consideration.

According to different views, the name Turk is assumed to derive from the words "türe-mek" (derivation), "töre" (custom, rule) or "erk" (power) and means "derived", "one who has customs", or "powerful". It is the name given to a people who share a common language and history with communities that were united within a sixth-century federation called Göktürkler. Presently, the total number of Turks and of people of Turkish origin living in a large geographic area extending from China to Europe and from Siberia to Afghanistan exceeds 200 million.
The impact of past interactions of Turks with numerous countries, civilizations, and cultures shows itself especially in the fields of language and religion. The alphabets of Turkish people who belong to Moslem, Christian, Jewish, Buddhist, or Shaman religions display great variations. There are more than 20 Turkish alphabets based on Latin, Cyrillic and Arabic scripts. These are the alphabets of: Azerbaijan Turkish (North and South); Turkoman Turkish
(Turkoman Republic, Iran and Afghanistan); Uzbek Turkish (Uzbek Republic and Afghanistan); Uygur Turkish (China); Crimean Turkish; Kazan Turkish (Tatary); Bashkurt Turkish; Nogay Turkish; Karachay-Malkar Turkish; Kumuk Turkish; Karakalpak Turkish; Kazak Turkish (Kazakhstan and China); Kirghiz Turkish; Altai Turkish; Tuva Turkish; Abakan/Hakas Turkish; Chuvash Turkish; Yakut Turkish; and Gagauz Turkish.

The alphabet used by the 60 million Turks living in Anatolia today is based on the Latin alphabet. It was approved in 1928 by legislation that resulted from reforms initiated by Mustafa Kemal Atatürk, who founded the contemporary Republic of Turkey in 1923. This Turkish alphabet is based on the principles of supplying a different letter for every sound and of providing a sound for every letter. The alphabet is completely phonetic and contains 29 letters, 21 of which are consonants and eight of which are vowels. The letters Q, W and X, found in some words imported from foreign languages, are listed in their places as they appear in the Latin alphabet:

A B C Ç D E F G Ğ H I İ J K L M N O Ö P Q R S Ş T U Ü V W X Y Z
a b c ç d e f g ğ h ı i j k l m n o ö p q r s ş t u ü v w x y z

International Code Standards and Turkish Characters

Considering the extensive work that has gone into international standardization in the area of language, it cannot be said that the Turkish language, despite its wide geographic spread and large number of speakers, has received adequate attention. Starting in 1984, the International Organization for Standardization (ISO) divided countries into groups. For each group, ISO developed common code tables and standards. When the ISO 8859 series (covering information processing and featuring 8-bit single-byte coded graphic character sets) was prepared, Turkish was placed in ISO 8859-3 (Latin 3), which covers Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, and Turkish. It was not included in ISO 8859-1 (Latin 1), which covers the 14 languages and 44 countries that constitute 95% of the computer market in the Western world. As a result, objections were raised on the grounds of the communication difficulties that would arise if the Turkish alphabet were not handled together with the characters of the most commonly used languages. Subsequently, a code table was also registered as ISO/IEC (the International Electrotechnical Commission) 8859-9 (Latin 5); this table includes Danish, Dutch,
English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, and Turkish. Almost all the software and hardware presently produced support Latin 1. Although Latin 1 contains most letters found in Turkish, including Ç ç, Ö ö, and Ü ü, it does not contain Ş ş, Ğ ğ, or İ ı.

Later, ISO and IEC developed ISO/IEC 10646-1 (Information technology - Universal Multiple-Octet Coded Character Set (UCS)), a standard which is regarded as the code for the 21st century. This standard aims at covering all the languages of the world within a single code table. But, since Latin 1 was chosen as the basic group instead of Latin 5, the Turkish characters indicated above as not contained in Latin 1 are dispersed in various places within the code table. It would have been more favorable to Turkish and Turkey if Latin 5 had been made the basic group of ISO/IEC 10646-1.

In any case, the 8-bit characters presently used enter this new standard with a length of 16 bits. At the very least, each character will have to change by taking X'00' in front of it. This in turn will lead to a situation where much hardware presently in place will cease to be functional. It will be necessary to create new hardware series, develop separate screen and printer character generators, rearrange the length of the acceptable image of the processing systems and collectors as 16 bits, and reconsider the character type parameters and message fields in all ready orders on the computer, especially the listing software. Further, individual and expensive solutions will be needed for numerous implementations of data processing such as Teletext, ATM (Automatic Teller Machine), and EFT (Electronic Fund Transfer), as well as inter-library intra-circular data communication.
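
The coverage gap described above can be checked mechanically. The following is a minimal sketch, not part of the original paper: it assumes Python's standard codecs `latin_1` and `iso8859_9` as stand-ins for ISO 8859-1 and ISO 8859-9, and UTF-16 big-endian as a stand-in for the 16-bit form of ISO/IEC 10646-1. The helper names `encodable`, `coverage`, and `widen` are hypothetical and exist only for this illustration.

```python
# Letters the text lists as present in Latin 1 and as missing from it.
PRESENT_IN_LATIN1 = "ÇçÖöÜü"
MISSING_FROM_LATIN1 = "ŞşĞğİı"


def encodable(ch, codec):
    """Return True if `ch` has a code position in the named codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def coverage(chars):
    """List (letter, in Latin 1?, in Latin 5?, universal code point)."""
    return [
        (ch, encodable(ch, "latin_1"), encodable(ch, "iso8859_9"), f"U+{ord(ch):04X}")
        for ch in chars
    ]


def widen(data):
    """Prefix every 8-bit code with X'00' to form 16-bit big-endian units."""
    return b"".join(bytes([0x00, byte]) for byte in data)


for row in coverage(PRESENT_IN_LATIN1 + MISSING_FROM_LATIN1):
    print(row)

# A Latin 1 byte keeps its meaning when X'00' is placed in front of it ...
print(widen("Ç".encode("latin_1")) == "Ç".encode("utf-16-be"))
# ... but the Latin 5 byte of a Turkish-only letter does not.
print(widen("Ş".encode("iso8859_9")) == "Ş".encode("utf-16-be"))
```

The first comparison holds and the second does not: data held in Latin 1 can be widened by prefixing alone, whereas Turkish data held in Latin 5 needs a table lookup for the six letters that were assigned positions elsewhere in the universal code table.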
TURKISH EXPERIENCES WITH MULTI-SCRIPT, MULTICHARACTER, AND MULTILINGUAL CATALOGUING ISSUES

Ahmet Çelenkoglu
National Library of Turkey

COMMENTARY

Colleen Hansen
Linda Hall Library

In 1995, I was privileged to attend the 61st IFLA Conference and the Workshop on Multi-Script, Multi-Character, and Multilingual Cataloguing Issues for the Online Environment. During the workshop I was the facilitator for discussion of Ahmet Çelenkoglu's paper "Turkish Experiences with Multi-Script, Multi-Character, and Multilingual Cataloguing Issues". As a continuation of that role, I am writing this response to his paper. I will summarize the main points of Mr. Çelenkoglu's paper, raise additional issues, and offer some avenues to explore in resolving the problems presented in this paper.

Mr. Çelenkoglu writes of the leadership position that the National Library of Turkey has assumed in attempting to establish national standardization of cataloguing policies. The National Library of Turkey was organized in 1946, and throughout its history has encouraged the use of standard cataloguing tools. In 1981, to promote standardization, the Library published the Turkish translation of the Anglo-American Cataloguing Rules, 1st Edition (AACR1) and in 1994 began the translation and adaptation of the 1988 revision of AACR2. The Library has also published a Turkish translation of the Dewey Decimal Classification, 20th Edition, and is working on a Turkish translation of the Library of Congress Subject Headings. Other Turkish libraries were asked to assist with the preparation of these tools, but the help was not forthcoming; as a result, the publication of these cataloguing tools was accomplished without the assistance of other Turkish libraries. The lack of interest in standardization of cataloguing tools is found not only in other Turkish libraries but also among the academic staff in the field of library science.

Mr.
Çelenkoglu writes about the automation of the National Library of Turkey and the problems presented by the conversions from Alphabetical
Cataloguing Rules for Printed Materials-Instruction (ACRPM-I) to AACR to AACR2. When these conversions were implemented, there were problems with authority lists and card catalogues. In the first attempt at retrospective conversion, the library was unable to use its printed cataloguing records and hired a vendor to create new records. After 300,000 machine-readable records were created, it was discovered that the information could not be processed and a new system, ALEPH, was selected. Retrospective conversion is now progressing with 500,000 records converted.

Mr. Çelenkoglu concludes his paper with an examination of the Turkish language and the difficulties presented by not having Turkish included in ISO 8859-1 (Latin 1). Since ISO 8859-1 has been selected as the subset for ISO/IEC 10646-1, there could be grave problems in the future not only for Turkish libraries but also for other aspects of modern Turkish life, such as automatic teller machines and electronic fund transfers.

Throughout this paper a dominant theme appears: the lack of interest on the part of Turkish libraries in adopting standardized cataloguing practices. The National Library of Turkey is attempting to lead the way to standard cataloguing practices and to standard formats. We must ask why it has not been successful. Is it because many of the other Turkish libraries are older than the National Library of Turkey and have established practices that they wish to continue? Is it because the need for standardization is not recognized? Is it an issue that is avoided due to lack of personnel and time to devote to the standardization projects? Or is it assumed, by Turkish libraries, that it is the function of the National Library to provide the cataloguing tools to the country? Whatever the reason for the lack of interest, what is the solution?
How can Turkish librarians be convinced that standard practices are advantageous, and how can they be formed into a common interest group to solve their standardization problems?

Educating librarians on the advantages of standardization is one possible solution. Standard cataloguing practices must be emphasized in the schools of library science. Library schools should teach cataloguing practices based on AACR2, introduce students to MARC formats, and demonstrate the effectiveness of authority-controlled subject headings. Graduating librarians should implement this knowledge in their new positions. Workshops, summer institutes and continuing education programs should be developed to meet the educational needs of current library staff. The National Library of Turkey could sponsor seminars in
standard cataloguing practices and instruction in MARC formats.

Another approach would be to demonstrate some of the technology available to libraries that use standard procedures. The Turkish libraries that use OCLC for interlibrary loan could offer workshops and demonstrate their procedures. Observing such a large database, with its powerful retrieval capabilities, should convince many librarians to accept standard practices. The advantages of having a national bibliographic database should be enumerated. A national database would not only enable libraries to retrieve copy cataloguing for many records; it could also be used by reference departments to locate material and verify citations for faculty and students.

Once libraries have accepted standardized practices, it should be determined whether access to an international bibliographic database would aid automation in Turkey. Librarians could form a committee to study the profitability of accessing an existing international database. Is there a sufficient number of records in the database to make joining the bibliographic utility financially beneficial? Should a library decide to join a utility, additional software such as a cataloguing microenhancer could also be considered. The cataloguing microenhancer reduces connect time by letting search keys be entered offline. A library can then connect to a utility, send the search keys, and rapidly retrieve records.

The Turkish libraries which are automated or in the process of automating should form working groups to study the problems that arise as a result of moving from a paper to an electronic environment. Guidelines for proceeding with automation and solutions to problems should be shared and published. An informational handbook listing the various software systems used could be published and made available to other libraries.
At the present time, libraries may be able to provide services using their current procedures, but the library world is changing; there is great interest in being connected to the Internet. Scholars, students, and patrons using the Internet will be exposed to the numerous online catalogues available and will soon demand the same services from their own libraries. To join the international library community and to share their material, Turkish libraries will need to meet international standards. The solutions are not easy, but they are within the reach of the Turkish library
community. Turkish librarians should continue the cooperation that was necessary to host the 61st IFLA Conference. They should continue to meet and discuss library issues and involve library school students in their meetings and workshops. Turkish libraries contain a wealth of material from manuscripts to contemporary publications. However, without unified national standards, much of this material remains unavailable to other Turkish libraries and to the international world of scholars.
NINE PROBLEMS CONCERNING ARABIC

Charlotte Wien
Odense University

Introduction

Currently we have a number of computer systems that work with the Arabic script. We have various character sets for Arabic: for example, the Arab Organization for Standardization and Metrology's ASMO 449 (an Arabic 7-bit coded character set for information interchange) and ASMO 708 (dealing with the right-hand part of the Latin/Arabic alphabet), Unicode, and the International Organization for Standardization's ISO 10646 (Universal Character Set). There is Windows™ in Arabic, there are Arabic language extensions for the Macintosh, and there is even advanced database and word-processing software, including spell-checkers, at the microcomputer level. As a result of these developments, we can work with the written Arabic language in databases, word-processing, and other automated contexts as though it were written with Latin letters. Of course, we also have integrated library systems which are able to write, save, print, retrieve and display records in the Arabic script. This means that we can catalogue and index Arabic documents in the library's databases as if they were just like the other documents represented there - at least, that's what we believe!

To those who might not know ASMO, Unicode and ISO 10646, I owe an explanation. ASMO is basically the counterpart of ASCII, that is, a unified character set designed especially for Arabic script. Unicode and ISO 10646 are character sets that contain all the different scripts of the world. Unified character sets are the basis of all data exchange. And, in principle, if everyone used the same character set, all of us would be able to exchange records.

When I stated that we have integrated library systems capable of handling Arabic script, I was referring mainly to five bibliographic database systems.
These are the Canadian MINISIS (Beauvais, 1984), the Dutch DOBIS/LIBIS (Khurshid, 1987), the Israeli ALEPH (Panzer, 1992), and the American system RLIN (Aliprand, 1992 (a)); reportedly, TINLIB is also capable of handling Arabic script (although this has not yet been proved). So really - what is the problem? Unfortunately
there are several. First, not all libraries are using one of these systems. Secondly, we cannot just add records in the Arabic script to existing online facilities and expect them to behave like records that are written in the Latin, Greek, or Cyrillic alphabets. My task in this paper is to outline what makes the Arabic language work differently in bibliographic databases.1

In my introduction, I have implicitly raised the question of whether Arabic bibliographic data can be treated just like records in any other language. The problems involved in using vernacular Arabic can be divided into two categories: 1. Technical problems, and 2. "Linguistic" problems. Problems in the first category are directly linked with the technical side of the computer environment (discussed below as problems 1-4). Problems in the second category are only indirectly linked to the computer environment: they arise once the technical problems have been solved and records in the vernacular Arabic script are added to bibliographic databases (problems 5-9).

In order to clarify the terminology and nature of the problems, I find it necessary first to give a short outline of the differences between the written Arabic language and Indo-European languages (e.g. English). Secondly, I will give an outline of previous research on this topic. Next, I will analyze which of the problems concerning the Arabic script in bibliographic databases still need to be solved. Finally, I will discuss whether research and development have brought the retrievability of records in the Arabic script up to the same level as the retrievability of records in the Latin script.

Written Arabic Language

Arabic is considered one of the "large" languages. It is the official language of 18 countries, which have a total population of about 300 million people.
Although Arabic is spoken in so many countries, there are huge differences between the written and the spoken language throughout the Arab world. In principle, the written language is the same for all countries.2 This paper will discuss problems related to the written language.3

The first noticeable difference between written English and written Arabic is the alphabet, which consists of 28 letters (29 if the hamzah is considered a letter). Of these letters only one is considered a pure vowel (alif), with the value of a long "ā". Two additional letters should be mentioned in this context: the yā' and the wāw, the so-called semivowels. Depending on the word involved, these letters can be
used as either vowels or consonants. The wāw can have the value of either a "w" or a long "ū", and the yā' can have the value of either a "y" or a long "ī". Knowing which of the values to assign depends on the basic meaning of the word or on the grammar. This problem becomes evident only when one tries to read aloud or to transliterate the Arabic alphabet. Furthermore, the written language needs more vowel values than the three long ones: these are the short "a", "i", and "u", which can be expressed in writing by the use of diacritic marks.4 Therefore, the written Arabic language consists of 28 or 29 real letters5 and five diacritic marks. Although some of the grammatical endings are not expressed by real letters, they can be expressed by diacritic marks.6 This adds three more symbols to the list of those needed to express written Arabic.

The hamzah has a somewhat different position. It can be considered either a consonant or a diacritic mark: it is best explained as a "stop-sound" and must be attached to one of the real letters. One can therefore regard a letter that bears a hamzah as having a different value from the same letter without it. If one considers the various combinations of hamzah and its different chairs as having separate values, this adds five more characters to the Arabic alphabet.

Finally, it is possible to express the long "ā" in two more ways. First, an alif can be lengthened by adding a maddah, which can also be considered a diacritic mark; the maddah, however, cannot be used in combination with any other letter. Secondly, the alif can be expressed (graphically) as the alif maqsūrah.

In terms of computers, 43 character values are needed in order to express the different values of the Arabic alphabet. By comparison, the basic Latin alphabet needs 26 values if one ignores the distinction between upper- and lowercase, and 52 if one observes it.
The second difference to note is the writing direction. Arabic is written from right to left, which is the opposite of the Latin alphabet. In terms of databases, this creates a series of problems: as pointed out by Eilts,7 records are stored in "logical order", that is, the order in which they are typed. Every character is thereby saved according to when it was typed in the entire character string. In order to display the records, it becomes the task of the display device to generate the correct writing direction: right to left for some alphabets, and left to right for others. As stated by Eilts, "This is one of the most head-aching problems involved
in multi-scriptual computing."8

The third difference is the graphical expression of the letters. Arabic script is highly calligraphic (Kreipke, 1993) and most of the Arabic letters have three or four different graphical expressions.9 This means that they change their appearance according to their position, i.e., whether they are placed initially, medially, or finally in the word or are isolated. However, they all maintain their phoneme value. This means that the Arabic alphabet consists of more than 100 graphic expressions. To this number one must add the different combinations of hamzah and its chair and the fact that the combination of the letters lām-alif has a specific graphical expression.

The fourth difference discussed in this section concerns the numerals. There are two different graphical expressions for numerals: the "Arabic" numerals (1, 2, 3, 4, etc.) and the "Hindi" numerals (١, ٢, ٣, ٤, etc.). Both graphic expressions are used in Arab countries. The numerals, however, are mostly written and should be read from left to right, which is the opposite of the reading direction for words. "Mostly" refers to the fact that the same writing direction for numerals as for text can also be observed.10 So A.D. 1995 can be expressed as "1995" or as "١٩٩٥" in written Arabic, and occasionally with the digits in the reverse order.
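
The points about logical order and the two numeral sets can be made concrete with a short sketch. It is an illustration only, not taken from the systems discussed in this paper: it assumes Python's standard `unicodedata` module and the Unicode code positions for the Arabic letters and for the digits this paper calls "Hindi" (named ARABIC-INDIC DIGITS in Unicode). The names `RECORD`, `bidi_classes`, and `fold_digits` are hypothetical.

```python
import unicodedata

# "kitab 1995" in Arabic script, held in logical (typing) order:
# kaf, ta', alif, ba', a space, then the digits of the year.
RECORD = "\u0643\u062A\u0627\u0628 1995"

WESTERN_YEAR = "1995"
HINDI_YEAR = "\u0661\u0669\u0669\u0665"  # the same year in the second numeral set


def bidi_classes(text):
    """Return the directional class a display device would consult per character."""
    return [unicodedata.bidirectional(ch) for ch in text]


def fold_digits(text):
    """Replace every decimal digit, whatever its script, by its ASCII form."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return "".join(out)


# The stored string never changes; only the display layer reorders it.
print(bidi_classes(RECORD))
print(bidi_classes(HINDI_YEAR))

# The two spellings of the year are different strings until they are folded.
print(WESTERN_YEAR == HINDI_YEAR)
print(fold_digits(HINDI_YEAR) == WESTERN_YEAR)
```

The letters carry a right-to-left class while both digit sets carry number classes, which is why a year stays readable left to right inside right-to-left text. For retrieval, the last two lines suggest why a search for one numeral form will miss records typed in the other unless the index folds them together.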
The fifth difference is the lack of vowels in most written texts. In written Arabic, only the long vowels are stated in writing.11 The short vowels and other diacritic marks12 are mostly ignored. To a non-native reader, it seems difficult to understand how Arabic can be read without the presence of vowels. The technique consists of a mixture of guessing from the context and grammar and of recognizing the graphic expressions of the words. Only when it is essential to understand the very specific meaning of the text (as in the Qur'ān) are the vowels (or diacritics) stated.

The sixth difference is the basic structure of the Arabic language: the so-called roots and patterns. In a large percentage of cases, Arabic words consist of a three-letter root (Al-Sadoun, 1990). Each root has a basic meaning. In theory, verbs, nouns, adjectives, and adverbs can be derived from each root.13 To some extent, all forms of words derived from the same root will be semantically related. To give an example, the basic meaning of the root "k-t-b" has to do with "writing".14 Within the framework of the basic meaning of a word one can vary its meaning by modifying the pattern to change the vowels, by deleting the vowels,
by doubling the root letters or by adding a prefix, a suffix, or an infix. This is done in accordance with very strict grammatical rules. The verb "kataba" means "he wrote", while the noun "kitāb" means "book" and the noun "maktabat" means "library". To exemplify this point further, most of the patterns, like the roots, have a specific meaning: the prefix "ma-" (in "maktabat") has the basic meaning of "the place where..." In "maktabat", the meaning "library" becomes evident from the combination of the basic meaning of the root ("k-t-b": something that has to do with writing) and that of the prefix ("ma-": something that has to do with a place). As another example, by doubling15 the root letters, the meaning is intensified: "kattaba" means "to keep up correspondence" or "to exchange letters". The system is fairly simple to use, that is, if you know it! A native speaker is able to guess the meaning of unfamiliar words in a text from an often subconscious knowledge of the basic meaning of the roots, the patterns, the syntax, and the context.

This method works fine with humans, but is problematic when it comes to computers. As the short vowels are not always written in Arabic, some of the written words in Arabic script have exactly the same appearance. For example, the third person masculine singular perfect tense of the verb "to write" and the plural form of the word for "book" are both spelled "ktb". When vowels are added, the words become "kataba" and "kutub" respectively, and the only way of knowing which is which is by guessing from the context and grammar of the text. If the words are given in isolation, there is no way of knowing the exact meaning.16 Because automated bibliographic databases store character strings and not meanings, this condition will necessarily affect retrievability.
This might lead to the conclusion that the structures of these roots and patterns could be useful tools for retrieving bibliographic records.17 But in every language there are exceptions to the rules, and this leads to the next problem.

The seventh problem is that of the so-called weak radicals. The k-t-b example is one of a strong root: the three-letter combination ktb is stable and will be recognized in exactly this form in all words and grammatical variations derived from it. Such, however, is not the case for all Arabic words. If any of the letters comprising the root is a hamzah, a semivowel, or an alif, the root letters themselves can change during conjugation. A percentage of these weak radicals are considered irregular. Grammatical rules for changing the root values within words exist, but they are many and quite complicated.
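
The effect of unwritten short vowels on retrieval, described above for "kataba" and "kutub", can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any existing library system: it uses Python's standard `unicodedata` module, assumes the vocalized spellings given in the comments, and the names `strip_marks` and `build_index` are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Fully vocalized forms (assumed spellings for this illustration).
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"          # kutub, "books"


def strip_marks(word):
    """Drop the combining vowel marks, leaving the letters as usually written."""
    return "".join(ch for ch in word if unicodedata.combining(ch) == 0)


def build_index(words):
    """Index words the way a string-matching database sees them: unvocalized."""
    index = defaultdict(set)
    for word in words:
        index[strip_marks(word)].add(word)
    return index


index = build_index([KATABA, KUTUB])
for key, readings in index.items():
    # One stored string, several possible readings.
    print(key, len(readings))
```

Both words collapse to the same three stored letters, so a database that matches character strings cannot tell the verb from the noun. The same folding is also what lets a search typed without vowels find a record entered with them, so it cuts both ways.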
The eighth problem relates to the case endings and most of the pronouns. Usually case endings are not written in a modern text,18 because they are indicated by diacritic marks. They do, however, exist in very formal written Arabic. Only when a case ending affects the spelling of the word will it always be specified.19 Pronouns, too, are written as suffixes of the words to which they refer, except for the demonstrative pronoun. If one wants to write the phrase "my book", he or she must attach the personal pronoun "ī" to the word for "book", i.e., "kitābī", while "her book" will be written as "kitābhā", and so on.

The ninth problem concerns the orthography of Arabic. This problem arises from the fact that in Arabic some particles as well as the definite article are written without separators. If one wants to express "he came and he left in his car", seven separators (or spaces) are used in English, while only two will be used in Arabic: "yasul wayatrik wabisayyārtihi".20 In transliterated versions,21 as would appear in many bibliographic databases, separators would have to be added: e.g., "yasul wa-yatrik wa-bi-sayyaratihi". This would also apply sometimes to the vocative particle "ya" as well as to several prepositions, the definite article, and the two different ways of expressing "and" (wa and fa).

Review of Previous Research

Scientific work to establish a character set for Arabic script was conducted as long ago as 1970 (Aman, 1984). It was not until 1982, however, that the first Arabic character set for bibliographic data processing became available (Musa, 1986). The result was ASMO 449, a 7-bit character set, considered the "Arabic ASCII", which has formed the basis of all character sets developed thereafter. The next step came with the development of ASMO 708, which provides an 8-bit character set (Ashoor, 1989).
For Japanese and other East Asian languages, neither the 7-bit character set nor the 8-bit character set was large enough to include all characters. Therefore, work was initiated in Japan to develop a 16-bit character set, as reported by Sakai (1986). On a test basis, Arabic too was included in this first 16-bit character set. Meanwhile, ISO and the Unicode Consortium had begun work simultaneously to provide for computer representation of all the scripts of the world (Peruginelli, 1992). Fortunately, the resulting two different standards were merged, and in 1993 the first version of Unicode was released (Ksar, 1993). As for the Arabic script, Unicode - which has become recognized as the international character set standard - contains all the values needed to process data on computers. This standard, in theory, allows for an exchange of bibliographic
records from different hardware and software platforms ~ thereby solving the first problem described above concerning the Arabic alphabet. Although Unicode resolves character set problem, the second obstacle concerning the writing direction to a certain extent remains. A report on a project in Egypt to "Arabize" CDS/ISIS stated that this undertaking did not solve the problem of writing direction. Musa later confirmed that most of the so-called Arabized systems still did not enable the correct writing direction and were only able to handle the transcription in the opposite direction during the process of typing (Musa, 1986). This problem still seemed to be present in 1988 (Anees, 1988). As indicated, the problem of writing direction is not one of saving records but of displaying them, and therefore does not affect command systems, but depends on display devices. Aliprand provides a fuller description of the issue and how the Research Libraries Group (RLG) managed to achieve displays of text written in the Arabic script in the appropriate direction in two excellent articles (Aliprand, 1992(a); Aliprand, 1992(b)). As a result, the solution chosen by RLG for the RLIN bibliographic database has solved this problem and thereby enabled displays of vernacular information for materials written in Arabic. The third problem concerning the graphic expression of the Arabic script continues to be a hindrance in some systems today, although to some extent it has been solved through the so-called "Hydriyya method" described by Aman (Aman, 1984). By this method, the basic form of the letters (that is, the graphic expression of the letter in its isolated form) is stored as a code value. An interface generates the correct graphical expression, so that the graphical expression is not actually stored, but is formed by the computer when used. 
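The principle of storing one code value per letter and deriving the positional shape only for display can be illustrated with the presentation forms that Unicode carries for compatibility. The sketch below is not the method described by Aman; it is a minimal illustration, assuming Python's standard `unicodedata` module, showing that the four positional glyphs of the letter beh all map back to a single base code value. The function names are hypothetical.

```python
import unicodedata

# Presentation forms of the letter beh, in the order the Unicode
# code chart lists them: isolated, final, initial, medial.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def base_letter(glyph: str) -> str:
    """Return the base letter that a positional presentation form stands for."""
    # NFKC normalization folds compatibility characters, including the
    # Arabic presentation forms, back to the letter they are a variant of.
    return unicodedata.normalize("NFKC", glyph)


def positional_tag(glyph: str) -> str:
    """Return the position tag (for example '<initial>') from the Unicode data."""
    return unicodedata.decomposition(glyph).split()[0]


for form in BEH_FORMS:
    print(f"U+{ord(form):04X} {positional_tag(form)} -> U+{ord(base_letter(form)):04X}")
```

Only the base value (here U+0628) would need to be stored in a record; choosing among the four shapes is left to the display layer.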
This technique has resolved the problem of needing more than a hundred different keys for Arabic characters and made it possible to work within the framework of a 7-bit code page to produce the correct graphical expression of the Arabic script. Nevertheless, the problem is not yet fully addressed: the system named TINLIB, which is produced by IME and has been sold to several libraries for its ability to handle Arabic script in accordance with Unicode, is not able to generate the correct graphical expression of the Arabic script.22

Very little work has been done in relation to the fifth and sixth problems, regarding the lack of explicit statement of vowels in modern written Arabic and how to handle this situation in an automated bibliographic environment. In the most recent analysis on this topic, Aliprand has explained how the diacritic marks of the Arabic script can be considered non-spacing characters (Aliprand, 1992(b)), while Al-Sadoun presented the possibility of using the structure of root and pattern for compression of records in Arabic script (Al-Sadoun, 1990).

As for the ninth problem, concerning the orthography of the Arabic language, Khurshid acknowledges the problem regarding the definite article, which is written as part of the word to which it attaches, although he does not mention problems related to the particles (Khurshid, 1992). To solve this problem, work was done in Saudi Arabia to list all Arabic words starting with the letter combination alif-lam (the same combination as the definite article) and to program the database so that, except for the words listed, the automated system ignores an initial alif-lam when retrieving data. As far as I am aware, no research has been done to solve the problems mentioned above as the eighth (case endings and pronouns) and the fourth (numerals).

Analysis

In summary, six problems pertaining to records in Arabic script in automated bibliographic databases remain. First and second are the lack of vowels (the fifth problem), which leads to the inability to ascertain the exact meaning of many words in Arabic when seen in isolation (the sixth problem).23 Nor has anyone yet solved the problems concerning the case endings and the pronouns written as part of the word to which they refer (the eighth problem). In addition, neither the problem concerning the orthography of Arabic (the ninth problem) nor that regarding numerals (the fourth problem) has been resolved. Some of these concerns could be addressed relatively easily and only need an organization like IFLA to make a recommendation or decision. Some of them could be solved locally as well. But the fifth, sixth, and seventh problems will require careful consideration and intensive research.
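The exception-list approach reported from Saudi Arabia in the review above can be sketched in a few lines. This is only an illustration under stated assumptions: the exception list here holds a single word (iltizām, whose own spelling begins with alif-lam), whereas a working list would have to be compiled exhaustively, and the function name `search_key` is hypothetical.

```python
ALIF_LAM = "\u0627\u0644"  # alif + lam, the written form of the definite article

# Hypothetical exception list: words whose own spelling starts with alif-lam.
EXCEPTIONS = {
    "\u0627\u0644\u062A\u0632\u0627\u0645",  # iltizam ("commitment")
}


def search_key(word: str) -> str:
    """Return the form of a word used for retrieval.

    An initial alif-lam is ignored unless the word is on the exception list.
    """
    if word in EXCEPTIONS:
        return word
    if word.startswith(ALIF_LAM) and len(word) > len(ALIF_LAM):
        return word[len(ALIF_LAM):]
    return word


AL_KITAB = "\u0627\u0644\u0643\u062A\u0627\u0628"  # al-kitab ("the book")
KITAB = "\u0643\u062A\u0627\u0628"                 # kitab ("book")
ILTIZAM = "\u0627\u0644\u062A\u0632\u0627\u0645"

print(search_key(AL_KITAB) == KITAB)    # article stripped
print(search_key(ILTIZAM) == ILTIZAM)   # listed word left intact
```

The sketch also shows the limitation noted in the text: it handles only the definite article and says nothing about prefixed particles such as wa or bi.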
In order to treat these remaining six problems, I shall start with the one which seems to be the easiest to solve. There are several ways of dealing with the challenges presented by Arabic numeric orthography. The problem is clearly illustrated by the case of an end user who wishes to conduct a search in natural language for literature in Arabic from all the Arab countries about a specific event, for example, "The War of 1973". In the search profile, should the year statement in the user's query be given as "1973", or as "١٩٧٣" or "٣٧٩١", or as all of these possibilities? Following the practice of the Anglo-American Cataloguing Rules (AACR2) would involve all three possible ways of expressing "1973". As the main rule of AACR2 states, information presented in the publication should always form the basis of the bibliographic description. Typing the year statement in any way other than that stated on the chief source of the item would be inconsistent with this basic cataloguing principle. Therefore, the best way to deal with this problem would be to add vernacular scripts to existing bibliographic databases. This step would first of all make the code values of Arabic and Hindi numerals equal for searching purposes. In addition, including them would identify records where the numerals are written from right to left, so that the underlying system would also retrieve records for a query in which the year statement is given as "1973" or in Hindi numerals.

In relation to the problem regarding Arabic orthography, we are presently working on a solution at Odense University. We are not applying the approach recommended by Khurshid (Khurshid, 1992), because it deals only with the definite article and because the development of word lists is quite labor intensive. Instead, we have developed a specific structure for MARC fields in bibliographic records for Arabic publications. The basic idea is that during initial cataloguing, a separator is placed between the content-bearing words and particles or definite articles in all fields containing natural (Arabic) language data. We developed two scripts. The first deletes the separator and generates the correct form of the word and its particles and/or definite article, while the second deletes all characters prior to the separator. As a result, we generate two versions of all natural language fields for each Arabic record created. The first field includes the particles, and the second excludes them.
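The two scripts described above might look roughly like the following sketch. It is not the Odense implementation: the separator character `|`, the function names, and the use of the romanized example sentence instead of Arabic script are all assumptions made for readability. Where a word carries more than one particle, the sketch treats everything before the last separator as prefix material.

```python
SEPARATOR = "|"  # hypothetical marker keyed by the cataloguer


def with_particles(field: str) -> str:
    """First script: delete the separator, restoring the normal written form."""
    return field.replace(SEPARATOR, "")


def without_particles(field: str) -> str:
    """Second script: delete everything before the separator in each word."""
    words = field.split()
    # rsplit keeps only the text after the last separator in a word;
    # a word without a separator is returned unchanged.
    return " ".join(word.rsplit(SEPARATOR, 1)[-1] for word in words)


keyed = "yasul wa|yatrik wa|bi|sayyaratihi"
print(with_particles(keyed))     # yasul wayatrik wabisayyaratihi
print(without_particles(keyed))  # yasul yatrik sayyaratihi
```

Storing both outputs as separate fields is what allows the end user to choose between searching with or without particles.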
For searching purposes, we let the end user choose whether to conduct the queries with or without particles or the definite article; this approach avoids initial truncation for every search.24 Whether this solution to the problem of Arabic orthography proves to be the best one is as yet unclear, as the work is still in its initial phase.

As for the case endings and the personal pronouns (the eighth problem), we have not developed a similar procedure, mainly because the database at Odense is part of the University's Online Public Access Catalogue (OPAC) and records do not include abstracts. Therefore, natural language statements in our database are mainly found in the title and in the author information. Examination has shown that the basic grammatical and syntactical structure of these statements is fairly simple and that case endings or personal pronouns seldom occur. As a result, we have decided to handle personal pronouns and case endings for words in the Arabic script in the same way that case endings are managed for Indo-European languages like German and Greek. This means that end users will need to use truncation in order to strip the case endings and/or personal pronouns from their search profiles.

As for the problems related to the lack of vowels (the fifth problem), the structure of roots and patterns (the sixth), and weak radicals (the seventh), they raise a host of questions that need to be dealt with through future research. First of all, work needs to be done to analyze recall and precision in the existing databases which contain records in the Arabic script. This research should clarify whether the present systems produce the same degree of success in recall and precision for natural language queries in Arabic as they do for queries in English or other Indo-European languages. If not, research should be conducted to determine the possibilities for improving recall and precision for natural language queries in Arabic. This research should involve a closer look at the root and pattern structure of Arabic and an investigation of whether this structure can be used to improve the searching capabilities of automated systems. From an overall point of view, using the root and pattern structures for queries should improve recall.25 But would this approach also affect precision, and if so, would it produce greater or reduced precision? Whatever the results show, we must also clarify how to deal with the weak radicals26 and with the lack of vowels27 (i.e., should we add them and thereby compromise the main rule of AACR2?). These are some of the questions which urgently need answers.

Conclusion

Research in the use of Arabic script in a computer environment has advanced significantly during the last two decades.
It seems as if the basic problems concerning representation of the letters, the writing direction, and the storing direction have been solved. However, for the specific use of the Arabic script in bibliographic databases for retrieval purposes, considerable research remains to be done. This research needs to answer the basic question of how well or poorly existing automated systems handle the retrieval process. An assessment could be made by measuring recall and precision and comparing the results with similar data from studies of the retrievability of records in English-language databases.
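For readers less familiar with the two measures, the following sketch shows how recall and precision would be computed for a single query, using their standard definitions. The record identifiers are invented purely for illustration and do not come from any study.

```python
from typing import Set, Tuple


def recall_precision(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (recall, precision) for one query.

    recall    = relevant records retrieved / all relevant records
    precision = relevant records retrieved / all records retrieved
    """
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Invented example: four relevant records exist, three records are retrieved,
# and two of the retrieved records are relevant.
relevant_records = {"r1", "r2", "r3", "r4"}
retrieved_records = {"r2", "r3", "r9"}
recall, precision = recall_precision(retrieved_records, relevant_records)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Running the same set of queries against an Arabic-script database and an English-language database, and comparing the averaged figures, would give the kind of assessment proposed above.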
Works Cited

Al-Sadoun, Sabah S., and Al-Fedaghi, Humoud B. 1990: "Morphological compression of Arabic text," Information processing & management 26: 303-316.
Aliprand, Joan M. 1992(a): "Arabic script on RLIN," Library hi tech 10: 59-80.
Aliprand, Joan M. 1992(b): "Nonroman scripts in a bibliographic environment," Information technology in libraries 10, no. 2: 105-117.
Aman, Mohammed M. 1984: "Use of Arabic in computerized information interchange," Journal of the American Society for Information Science 35, no. 4: 204-210.
Anees, Munawar A. 1988: "Computers - Writing the right way?" Pakistan library bulletin 19, no. 1: 1-9.
Ashoor, Mohammad Saleh. 1989: "Arabization of automated library systems in the Arab World: Need for compatibility," Libri 39, no. 4: 294-302.
Beauvais, Francois. 1984: "MINISIS à l'Institut du monde arabe," Documentaliste 21: 150.
Khurshid, Zahiruddin. 1987: "DOBIS/LIBIS acquisition subsystem in operation at King Fahd University of Petroleum and Minerals," Library acquisitions: practice & theory 11: 325-334.
Khurshid, Zahiruddin. 1992: "Arabic online catalogue," Information technology and libraries September: 244-251.
Kreipke, Helle. 1993: "Islamisk kalligrafi og arabisk skrift," Mellemøstinformation, Månedsoversigt 10, no. 5: 10-14.
Ksar, Michael Y. 1993: "Untying tongues," Consensus 4: 14-17.
Madkour, M. A. K. 1980: "Information processing and retrieval in Arab countries: traditional approaches and modern potentials," Unesco journal of information science, librarianship and archives administration 2: 97-104.
Musa, F. A. 1986: "A system for processing bilingual Arabic/English text," Journal of the American Society for Information Science 37, no. 5: 288-293.
Panzer, Cecile. 1992: "ALEPH - A multi-scriptual, multilingual library management system." In Conference papers of the 3rd International Conference on Multilingual Computing, University of Durham, Center for Middle Eastern and Islamic Studies 5: 5.
Peruginelli, Susanna. 1992: "Character sets: towards a standard solution," Program 26, no. 3: 215-223.
Sakai, Yasushi. 1986: "An experimental system for creating and managing Arabic bibliographic database - a step toward effective international information exchange," Libri 36, no. 4: 259-275.
Wellisch, Hans Hanan. 1976: "Script conversion practices in the world's libraries," International library review 8: 55-84.
Wien, Charlotte. 1994: "Arabic books in the Danish research library system." In Proceedings of the International Conference and Exhibition on Multilingual Computing, London, April 7-9.
Footnotes

1. In presenting basic examples in the Arabic script, I will also use the Library of Congress romanization standard to make them comprehensible.
2. There can be minor variations in the use of single terms and the use of vowels.
3. The following list of problems concerning the written Arabic language in bibliographic databases only applies to natural language and not to controlled vocabulary, although some of these problems should be considered in the future work of establishing such vocabularies.
4. The kasrah, fathah, and dammah. It is also possible to express the lack of a vowel between two consonants by using the diacritic mark called sukun and to express the doubling of a consonant by using the diacritic mark called shaddah.
5. 29 if the hamza is considered a letter.
6. For the definite nunation, the fathah, kasrah, and dammah are used to express the three case endings; for the indefinite case endings, three additional signs must be added to the so-called tanwin.
7. John Eilts, personal communication, 24 August 1995.
8. Ibid.
9. Twenty of the letters have four different graphic expressions and six of them have three different graphic expressions.
10. This is only the case when the Hindi numerals are used.
11. Alif and the semi-vowels.
12. The sukun and the shaddah.
13. Not all roots can be expressed in all forms or patterns, and this rule only applies partially in the cases of names of persons and places and to certain "foreign" words, e.g. "kumbiyūtir" (computer).
14. The basic meaning here is considered third person masculine singular of the verb in the perfect tense.
15. Adding the shaddah.
16. Sometimes an Arabic author might state one of the vowels in order to help the reader.
17. That is, if there were a way of using the semantic relationship between the words derived from the same root for retrieval purposes.
18. The exception to this rule is in the indefinite accusative and in standard expressions.
19. This is the case of the indefinite noun in the accusative, where an alif must be written after the last root letter (and, in very formal Arabic, a diacritic mark will show that this alif is a case ending).
20. Notice that the subject is expressed implicitly in the verb through the conjugation.
21. I shall use the space-bar in accordance with how it is used in Arabic and "-" in order to express the division of words and particles.
22. Unfortunately, this is not documented in the literature. The information related here is based on conversations with Mr. Gene Smith from the Library of Congress's Field Office in Cairo during October 1994 and with Mrs. Benedikte Krag Schwarz from the Immigrants' Library in Denmark as well as on an e-mail message received from Mr. Anton R. Pierce, Library of Congress, dated March 15, 1995.
23. I shall set aside the problem of the weak radicals until later in this paper.
24. As content-bearing words in Arabic can also be generated by adding prefixes to the root, an initial truncation will make the search profile too broad. For example, searching for publications containing the root ktb in the title will retrieve records for publications containing the words for "library", "office", "typewriter", etc.
25. A search statement for the k-t-b root should retrieve titles containing all variants of words derived from this root. This means several semantically related terms, the plurals, grammatical variations, etc. It is still unclear whether or not this result is an advantage.
26. The question is: can scripts be made for the weak radicals which enable the automated database to identify the basic root of the words?
27. If it turns out that using the root-and-pattern structure in searches would result in reduced precision, how would we deal with the fact that, due to the lack of vowel statements in modern written Arabic, many Arabic words look the same?
NINE PROBLEMS CONCERNING ARABIC
Charlotte Wien
Odense University

COMMENTARY
John Eilts
The Research Libraries Group, Inc.

Ms. Wien has adequately described the characteristics of the Arabic language which pose challenges to the automation of Arabic bibliographic data (although they also apply to other languages using the Arabic script). She has kindly given order to these and fully explained the issues each characteristic raises in automating Arabic. I shall follow her lead and the order presented.

1. The issue of the number of letters and other characters is a technical problem which has been solved, many times over. Not only do we have the various Arab standards (CODAR, ASMO 449, ASMO 708), United States standards (USMARC), and international standards (ISO 8859-6, ISO 9036, ISO 10646/Unicode), but we also have a number of computer industry "standards" (IBM and Microsoft codepages). This is a perfect illustration of the old saying: "The wonderful thing about standards is that there are so many of them to choose from." When one is using an application for a single user (a closed system) this is a matter of little importance. The only criterion needs to be internal consistency. But most of us do not live or work in a world of isolation. In order for us to exchange data, we must be concerned with character sets. Although it is true that any character set may be translated into another, this is not always a process which faithfully recreates the original data when run in reverse. For true, reliable data exchange, all partners to the exchange must agree to use a common standard. This may be on the horizon with the acceptance of ISO 10646.

2. The writing (and reading, not surprisingly) direction of Arabic script (right-to-left) is more a psychological problem than a technical one. One only needs to accept two basic principles to see the solution. The first is that all scripts have an inherent directionality that is an integral characteristic, be it Latin, Cyrillic, Arabic, or Mongolian. Second, one must accept that data (mere electronic impulses) are not right-to-left, or left-to-right, or top-to-bottom. They are oriented in the logical order of a data stream which can best be described as the order keyed. Any other method of data storage is not only unduly complicated, but also falsely based on the analogy of the fixed width of older output media (e.g., CRTs and punched cards). Output formatting must be completely divorced from the storage order of data.

3. The variant graphical representations of Arabic letters based on adjacency to other characters are another technical problem. The positional forms can be supplied algorithmically; to include this information in the character encoding causes excess, unnecessary overhead. This is more properly handled in software applications.

4. The problem of differing use of Latin-style and eastern-style numerals is largely solved by allowing the mixing of Latin and Arabic scripts wherever needed, without artificial barriers such as alternate display lines or pre-defined fields valid for one script or the other. Both of these methods are again retrograde, and oriented to the output mechanism.

5. The common practice of omitting vowels in written Arabic is not a problem bibliographically when representing the data in the original script. The problem arises when representing Arabic in romanization. Anglo-American cataloguing practice requires that a cataloguer supply the unwritten vowels in romanization. This often demands a very advanced knowledge of Arabic. Even then, regional and dialectal differences create inconsistencies. Thus, we have the very real need to represent Arabic bibliographic entities in the original script.

6. Normally I would dismiss the issue of the "lexical structure" of Arabic as irrelevant in a bibliographic sense. But the recent changes in user sophistication and expectations lead me to believe that it is an area that deserves further research. If a search engine can find lexically related terms, would that not be an enhancement to mere keyword searching? I look forward to further research in this area from Ms. Wien.

7. The matter of weak radicals is actually a major complication of the "lexical structure" of Arabic and needs to be considered with it.

8. The case endings are similar to problem 5, in that they are a problem only in romanization, which is yet more reason to use the original script in bibliographic representations.

9. The issue of the particles in Arabic is, in my opinion, the most important one when we look at implementing retrieval systems. Keyword searching has proved to be the most valuable method of access when searching for unknown items. Keyword searching is often used to compensate for the lack of fit between the commonly used terminology of the very Western-centric Library of Congress Subject Headings and materials representing another culture. Arabic (and any other language with integral particles) makes keyword searching impossible, or extremely cumbersome with multiple boolean modifiers. This problem can be solved in a number of ways, including requiring the creator to supply a visible (or invisible) demarcation between the particle(s) and the significant words. This is done in the romanization of Arabic. But there are better ways to handle this in software that will not require extra effort on the part of the originators or the consumers of the data.

Charlotte Wien has done an outstanding analysis of the problems associated with automating Arabic in the bibliographic environment and laid the groundwork for solving them. Her further work in this area should prove beneficial for the just developing field of Arabic language search engines.
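Points 2 and 4 of the commentary can be made concrete with a small sketch. It assumes that a year keyed in eastern-style digits is stored in logical order (most significant digit first), as point 2 recommends, and it uses Python's standard `unicodedata` module to give both numeral styles the same value for searching. The function name `fold_digits` is hypothetical, and folding is applied only to the search key; the stored record keeps the digits as they appear on the item.

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Map every decimal digit, whatever its script, to its Latin-style equivalent."""
    folded = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        folded.append(str(value) if value is not None else ch)
    return "".join(folded)


YEAR_EASTERN = "\u0661\u0669\u0667\u0663"  # the digits 1, 9, 7, 3 in eastern style
print(fold_digits(YEAR_EASTERN))            # 1973
print(fold_digits("1973"))                  # 1973
```

With such a fold applied to both the index and the query, a user could key the year in either style and retrieve the same records.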
CATALOGUING OF DOCUMENTS FOR MULTILINGUAL CATALOGUES OF LIBRARIES IN RUSSIA: ANALYSIS OF THE PROBLEM SITUATION
Natalia N. Kasparova
Russian State Library

Libraries, particularly large ones, which catalogue and maintain collections and catalogues representing different languages, are bound to face the issue of what language to use in their bibliographic records. The analysis of practical bibliographic activities of large libraries and science and technical information agencies in Russia shows that the rules for selection of the language of the bibliographic record and its conversion represent one of the most difficult problems in the field. For the subsequent retrieval of a document in a record file, the selection of the language for the heading of the bibliographic record or the title is a matter of principle.

National rules governing the bibliographic description of documents,1 which are based on the International Standard Bibliographic Descriptions (ISBDs), mainly orient us towards the reproduction in the bibliographic description of the original language and characters used on the document's title page. Problems related to the selection of the language for the bibliographic description arise in the following cases:

• Titles in a number of languages and a text in one of these languages, and discrepancies between the language of the text of the document and the language on its title page; and,
• The translated title of the document and the original title of the work from which the translation was made.

Russian libraries and bibliographic agencies have extensive experience in the conversion of the language of the bibliographic record. Their general bibliographic goals for description and access are pursued through the following strategies:

1. To find in one alphabetic array all the works of an author, regardless of the language they were published in. For example, the Russian State Library uses the Russian form of the author's name as the heading for the bibliographic record for works of domestic authors when published in foreign languages;
2. To adapt foreign language texts of bibliographic records to the conditions of Russia (i.e., to convey foreign names, denominations, and geographical names through Russian orthography); and,
3. To facilitate the compilation and retrieval of bibliographic records in languages which use complicated graphics (e.g., Arabic, Chinese, and Japanese).
National rules in Russia do not make strict demands on the way the languages of bibliographic records will be converted nor on the language into which they will be converted. Therefore, the result of the transcription or transliteration of a foreign language proper name into Russian graphics may differ among records (see Table below). Converting the language of the heading or the title contained in a bibliographic record is becoming more complicated. This is because, among other things, it is difficult to give the same recommendations to all the institutions concerned. Russian libraries and bibliographic agencies have diverse tasks in providing information service to various categories of users, who make different demands on catalogues. In addition, the ways in which libraries and bibliographic agencies organize their catalogues often vary. Thus, it is possible for Russian libraries and bibliographic agencies to create incompatible bibliographic records, a major limitation in the compilation of national and international union catalogues. This incompatibility may also impede information exchange and the retrieval of documents in bibliographic information systems.

Foreign-language materials are important to the collections held by Russian libraries. The method libraries use to convert the language part of the data in foreign language bibliographic records is to transcribe the text into Russian characters. This is done in conformity with the norms of a specific foreign language and the various languages of the peoples of Russia. Unfortunately, there is no uniform standard for transcription in existence and, in principle, transcription cannot be unified because the reproduction of foreign proper names, geographical names, etc., depends upon a specific language and its graphics. Of course, not every cataloguer could cope with this task, even with the help of manuals.

The most precise method of converting the language of the bibliographic record would be with the aid of an internationally accepted series of transliteration standards for different language graphics. It can be said that today a uniform international transliteration system is taking shape; it is a system that aims to romanize Slavic and oriental languages. Russian librarians continue to gain valuable experience in transliteration. It is obvious that the romanization of the Russian language Cyrillic characters, when used in the text of bibliographic records of Russian language documents, has no prospects within Russia. As for foreign language records and Russian language bibliographic records intended for international exchange, their transliteration into roman graphics does make practical sense. Thus, the Russian Book Chamber (which in Russia performs the function of a national bibliographic agency) sends to foreign bibliographic agencies Russian language records that have been transliterated into roman characters.

When referring to the specific features of individual information retrieval systems which are oriented towards the Russian user, it is necessary to stress again that the method of transliteration of foreign bibliographic records into the Russian Cyrillic alphabet is not supported by any standard. Therefore, conditions do not ensure retrieval adequate to specific requests, because the Russian user normally formulates his or her request either in the original graphics of the language of the document or in a transcribed Russian language form. The problems associated with multilingual catalogues, as raised by IFLA, are urgent and complex, and they call for further investigation.
It is necessary to find an approach which would, without depriving any country of its national specificity, make it possible to formulate lucidly and unequivocally the purposes and methods for selection and conversion of the languages of bibliographic records in order to ensure their compatibility on all levels. Specialists at the largest Russian libraries lean towards the following solutions to the problems of the multilingual catalogue and of national and international bibliographic cooperation:

• On the national level (particularly for multi-national countries), multi-language catalogues that include records with original language graphics in accordance with the language of the text of the documents should exist;
• The functions of the international interface should be taken on by multilingual authority files that are capable of ensuring the connectivity between the different forms of the author's name which are cited in the file;
• Another multi-language base of national or international union catalogues (that unite bibliographic records in different languages and/or graphical characters) may be created on parallel lines;
• When choosing the language or graphics used for the bibliographic record, priority should be given either to the language of the document or to the national language of the country in which the bibliographic agency is located; and,
• Responsibility for the establishment of the unified form of the heading for the bibliographic record, including those converted into the graphics of another language, should be given to the national cataloguing agencies or services.

IFLA and ISO Technical Committee 46 should collaborate in solving many of the problems involving multilingual catalogues. In working jointly, they should use the professional national experience and expertise of those countries where work is currently in progress to reconcile access and description issues within their local and national multilingual union catalogues.
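The multilingual authority file proposed in the list above, one that connects the different forms of an author's name, can be pictured as a simple data structure. The sketch below is a hypothetical illustration, not a description of any existing system: the class names, the record identifier, and the script codes are assumptions, and the two name forms are given only as an example of a Latin-script form and a Russian transcription of the same name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AuthorityRecord:
    record_id: str
    # Maps a script code (here "Latn", "Cyrl") to the name forms in that script.
    forms: Dict[str, List[str]] = field(default_factory=dict)


class AuthorityFile:
    """Links every recorded form of a name to one authority record."""

    def __init__(self) -> None:
        self._records: Dict[str, AuthorityRecord] = {}
        self._index: Dict[str, str] = {}

    def add(self, record: AuthorityRecord) -> None:
        self._records[record.record_id] = record
        for name_forms in record.forms.values():
            for name in name_forms:
                self._index[name.casefold()] = record.record_id

    def lookup(self, name: str) -> Optional[AuthorityRecord]:
        record_id = self._index.get(name.casefold())
        return self._records.get(record_id) if record_id else None


authority = AuthorityFile()
authority.add(AuthorityRecord(
    record_id="a0001",  # invented identifier
    forms={
        "Latn": ["Giovanni Boccaccio"],
        "Cyrl": ["Джованни Боккаччо"],
    },
))

found = authority.lookup("джованни боккаччо")
print(found.forms["Latn"] if found else None)
```

A query in either script reaches the same record, so each catalogue can keep its own preferred heading while the file supplies the connection between forms.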
Table

Original form in Latin graphic | Conversion form: transcription in Russian language graphic | Conversion form: transliteration in ISO 9
Giovanni Boccaccio | Джованни Боккаччо | Гиованни Боццаццио
Van Cliburn | Вэн Клайберн | Ван Цлибурн
Isadora Duncan | Айседора Дункан; Айседора Данкан | Исадора Дунцан
Arthur Guinness | Артур Гиннесс | Артур Гуиннесс
Rabindranath Tagore | Рабиндранат Тагор | Рабиндранатх Тагоре
Mohandas Karamchand Gandhi | Мохандас Карамчанд Ганди | Мохандас Карамцханд Гандхи
George Sand | Жорж Санд; Жорж Занд | Георге Санд
Francis Willughby | Френсис Уиллоби; Фрэнсис Виллоби | Францис Виллугхби
Jean Auguste Dominique Ingres | Жан Огюст-Доминик Энгр | Йеан Аугусте Доминикуе Ингрес
Mihail Eminescu | Михаил Эминеску; Михай Эминеску | Михаил Еминесцу
Josquin des Pres | Жоскен Депре; Жоскин де Пре | Йоскуин дес Прес
Mitchell Wilson | Митчел Уилсон | Митцхелл Вилсон
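The paper's point that standardized transliteration is the most precise conversion method rests on reversibility: a letter-for-letter scheme can be undone without loss, whereas transcription cannot. The sketch below illustrates this for the direction used in international exchange (Cyrillic to roman). It assumes the one-to-one correspondences of ISO 9:1995 for the modern Russian alphabet only; other Cyrillic alphabets are not covered, and the function names are hypothetical.

```python
# ISO 9:1995 correspondences for the Russian alphabet (lower case).
# Letters with diacritics are written as escapes so that each is a
# single code point.
ISO9_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "\u0451": "\u00EB",  # yo -> e with diaeresis
    "ж": "\u017E",       # zhe -> z with caron
    "з": "z", "и": "i",
    "\u0439": "j",       # short i -> j
    "к": "k", "л": "l", "м": "m", "н": "n", "о": "o", "п": "p",
    "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f", "х": "h", "ц": "c",
    "ч": "\u010D",       # che -> c with caron
    "ш": "\u0161",       # sha -> s with caron
    "щ": "\u015D",       # shcha -> s with circumflex
    "ъ": "\u02BA",       # hard sign -> modifier letter double prime
    "ы": "y",
    "ь": "\u02B9",       # soft sign -> modifier letter prime
    "э": "\u00E8",       # e -> e with grave
    "ю": "\u00FB",       # yu -> u with circumflex
    "я": "\u00E2",       # ya -> a with circumflex
}


def _with_capitals(table):
    """Add capital letters wherever the Latin counterpart has a capital form."""
    full = dict(table)
    for cyrillic, latin in table.items():
        if latin.upper() != latin:  # the prime marks have no capital form
            full[cyrillic.upper()] = latin.upper()
    return full


_FORWARD = _with_capitals(ISO9_LOWER)
TO_LATIN = str.maketrans(_FORWARD)
TO_CYRILLIC = str.maketrans({latin: cyrillic for cyrillic, latin in _FORWARD.items()})


def to_latin(text: str) -> str:
    return text.translate(TO_LATIN)


def to_cyrillic(text: str) -> str:
    return text.translate(TO_CYRILLIC)


print(to_latin("Москва"))
print(to_cyrillic(to_latin("Пушкин")) == "Пушкин")
```

Because every Cyrillic letter has exactly one Latin counterpart, the round trip restores the original spelling, which is what makes records converted this way suitable for exchange.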
Footnote

1. Mezhduvedomstvennaia katalogizatsionnaia komissiia pri Gosudarstvennoi biblioteke SSSR imeni V. I. Lenina, Pravila sostavleniia bibliograficheskogo opisaniia (Moskva: Kniga, 1986-1993). Knigi i serial'nye izdaniia (1986). 527 p. (Interdepartmental Cataloguing Commission at the V. I. Lenin State Library of the USSR, Rules for the Compilation of the Bibliographical Description (Moscow: Kniga, 1986-1993). Books and Serial Publications (1986)).
SYSTEM OF MULTILINGUAL CATALOGUES AND PROBLEMS ARISING AT THE INITIAL STAGE OF ELECTRONIC DATABASE CREATION

Ludmila A. Terekhova
Rudomino Library for Foreign Literature

Before initiating a project to computerize the catalogues of the Rudomino State Library for Foreign Literature (RLFL) in Moscow, it was necessary to identify the main range of questions:

• RLFL's system of catalogues and files;
• the Library's holdings (monographs, serials, audio-visual materials, microforms, diskettes, tapes, etc.);
• history of creating catalogues organized by language;
• choosing the proper system for the project;
• automation of other library processes;
• database development; bibliographic record format; compatibility of national and international standards, such as ISBD, GOST 7.1-84, AACR2;
• general aims and priorities;
• experimental stage and its duration;
• ensuring system protection, discontinuance of catalogue cards and card catalogues;
• contents analysis of the system: retrieval technique (classification, subject headings, indices, thesaurus);
• transition to automated retrieval: its losses and benefits;
• use of transcription and/or transliteration;
• electronic catalogue and traditional card catalogue;
• users' training in the use of the electronic catalogue; and,
• maintaining catalogues of both types.
Before turning to the main subject of this paper, the author would like to say a few words about the Russian Library for Foreign Literature. It was named after Margarita Rudomino, who was its founder and director for 52 years and who also served as IFLA's vice-president from 1967 to 1973. In 1922 the Library's collection numbered about 200 books, which belonged to her. It has grown to comprise about five million items in 140 languages. RLFL's catalogues include more than ten million bibliographic cards - certainly an immense file for conversion to electronic form.

Bibliographic description of foreign books does not differ from that used for materials in Russian; in both cases, the Russian state standard GOST 7.1-84 (Bibliographic Description for Documents. General Requirements and Rules) is used. This standard is generally based on the ISBDs, but it does have some distinctions - for example, more detailed specification for multi-volume editions. The Russian state standards do not always accommodate materials in foreign languages, however, so the Library's cataloguing experts have developed guidelines and instructions for use at RLFL. They have also assisted in the formulation of state standards, since RLFL is a member of the Interregional Committee on Cataloguing, the body which is responsible for development of Russian cataloguing rules as well as for translation of documents used in formulating these rules. Undoubtedly, this involvement has also contributed to the development of Russian national cataloguing theory.

Generally, Russian cataloguers are all practicing librarians, but they cannot help being involved in library research, especially in the present circumstances. National librarianship in Russia is now changing. Librarians are joining international associations and participating in their activities. More and more Russian libraries have become IFLA members, raising awareness of problems in Russia.
"Perestroika" has brought social, economic and ideological changes; some of these changes are directly applicable to libraries. For example, in 1990 RLFL opened a department and reading room for religious literature. Shortly thereafter, the cataloguing staff discovered that existing bibliographic standards and rules did not provide for works by ecclesiastical authors, since religious literature had never been allowed within public or general research libraries. As a result, Russian librarians now are facing serious problems in preparing headings for ecclesiastical authors of all confessions in all languages of the world. Because there are no theology experts in RLFL, the staff has had to establish close contacts with the Synodal Library of the Moscow Patriarchate. Major changes have also occurred in the system of state bodies and authorities. Such bodies as Supreme Soviets, Soviets of People's Deputies, regional, territorial, and village Soviets are now abandoned in favor of new names such as "Parliaments," "Municipalities," "Prefectures," "Federations," "Legislative Assemblies," "Dumas", etc. This has complicated the cataloguing of government documents and official publications. The main problem is lack of uniformity. It would help if all documents issued by the same corporate author were grouped together under a single heading. Authority files are needed to ensure consistency of the headings for these bodies for use in Russian catalogues, indexes, and databases. In Russia, geographical names are changing too: towns, regions, republics, and regional organizations are being renamed, and the new names are reflected in their publications. It is sometimes quite difficult to collect all the new titles in one file; care must be taken not to confuse the reader by too many references from one title to another. Such problems as these are encountered every day. In addition, radical changes have taken place in the publishing industry. 
Numerous new private publishing houses have been established. Many issue new translations, reprints, facsimile editions, and second editions of titles published before the revolution. Often these publishers do not observe - or perhaps they just do not know - requirements and standards. Motivated by purely commercial interests, they publish low-standard books, trash literature, crime stories, erotica, thrillers, etc. They fail to register their firms and editions. They often appropriate translations and ISBNs, issuing them under different titles. This results in disorder in the book trade. These negative factors have adversely affected library work. In fact, there were no library legislation or legal deposit provisions until January 17, 1995, when the Library Law and Legal Deposit Acts were enacted.
This survey has covered only some of the problems which are encountered in maintaining today's catalogues, particularly those which are multilingual in nature. In RLFL, catalogues are arranged by language, with separate catalogues for English, French, German, etc. For literature that is written in oriental languages, the description is transliterated into Russian and the original title is added. Separate catalogues by language are very convenient for users, as they generally require literature in the languages they know. Separate language catalogues have another advantage: they allow the user to develop a clearer idea of the status of the library's collection within each language. Separation also makes it easier to maintain an appropriate balance between the language sections that comprise the total collection. Thus, the visitor to RLFL will find that the major files of the alphabetic catalogue cover literature in English, French, German, and Russian. The Russian catalogue includes foreign literature in Russian translation as well as manuals and textbooks for students, dictionaries, encyclopedias, bibliographies, and various documents of state and international institutions. This structure has been in place since 1922.

The cataloguers compile descriptions of books in foreign languages for readers whose native language is Russian. When an entry for an original edition is prepared, the author's name is given as it appears on the title page. Usually, the cataloguers experience difficulty in establishing headings for translations. In these cases, the cataloguer uses the headings as they appear in the author's native language. When providing entry elements for publications which are translations of Russian works, transliterated forms of the name are used, and they are determined according to Documentation - Transliteration of Slavic Cyrillic Characters into Latin Characters (ISO 9:1986 (E), ISO/TC 46).
No doubt, a catalogue structure which separates files by language has its disadvantages. The most significant drawback is its inability to collocate the works of one author in different languages under a single heading. In unified catalogues all the works by a particular author are grouped under the uniform heading for the author. This feature facilitates indexing and classification. About 15 years ago, RLFL started a unified catalogue and maintained it for 10 years; however, it had to be abandoned, and the Library has returned to its traditional separate language catalogues. The ten-year experiment established that RLFL's users preferred the separate catalogues. Moreover, the arrangement of entries in these catalogues was different from that in the unified catalogue, causing confusion to users. In the unified catalogue, the access points are given in the original language
with the descriptive elements appearing in the language of translation. In the separate catalogues the entries are given in the language of translation. Of course, the cataloguers provide cross references to help the users in searching the catalogues. When dealing with translations, the cataloguers are expected to have some knowledge of the languages involved and to be experienced in using a wide range of reference books. While the Library currently employs such experts, its staff is unfortunately aging. As a result of today's economic situation in Russia, library professionals are poorly paid. Young specialists with diplomas in languages seek more remunerative positions and do not want to work in libraries. In time, this trend will inevitably affect the effectiveness of library services as well as the quality of catalogues whether card-based, printed, in microform, or online. Another problem facing cataloguers is selecting which language to choose in the case of items with bibliographic statements in more than one language. According to prevailing rules, the language of the text should serve as the language for the bibliographic description. When other languages are involved, they are mentioned in notes. This practice is not applicable to language study manuals which are catalogued in the native language of the students for whom the publication is intended. National cataloguing rules have undergone considerable change over the years. Soviet party ideology sometimes intruded to affect particular rules. Works with parallel titles and with the text in several languages - for example, multilingual dictionaries - were to be described first in the languages of the European Socialist states and then in the languages of capitalist states, in English, French, German, Italian, etc. 
But, even in those times of complicated political situations, librarians felt that this criterion with its emphasis on the existing social system could not be justified because it discriminated in favor of Eastern languages and languages of minorities. Fortunately, the librarians' professionalism prevailed, and the rules were simplified and brought into closer alignment with international standards. Nevertheless, one can still trace in Russian catalogues not only the development of cataloguing rules, but also the political history of the state. With this background, it becomes quite clear that Russian libraries will face considerable difficulties in planning for computerization of card catalogues, since database structure, access points, and retrieval possibilities in an electronic
environment are quite different - especially when multilingual catalogues are involved.

Before further considering problems related to automation, it is important to understand the system of catalogues now in place at RLFL. This system comprises several parts:

First, there is an alphabetic book catalogue (general and public) which is organized by language.

Second is an alphabetic serial catalogue (general and public) consisting of several sections: one for serials in the roman script, one for serials in Cyrillic, another for serials in the Russian language, and one for serials in oriental languages in Russian transcription.

Third, the library has a subject catalogue, based on a domestic classification scheme with some elements of Universal Decimal Classification (UDC) and some aspects of the Bibliothecal-Bibliographic Classification (BBC) system. Recently, RLFL decided to switch wholly to UDC, as it is in use in Russia and has been approved by the Russian Committee on Standards. However, adapting the Library's catalogues to the new classification and computerizing at the same time will be complicated.

Fourth, there is a subject periodical catalogue which is based on a classification system developed in-house.

In addition there are catalogues maintained by various departments for their own collections. These local catalogues will be merged into RLFL's integrated computer network.

Now, for a few words about the arrangement of the alphabetic catalogues mentioned above. Foreign languages use different alphabets and written forms. Most European and some Indo-European languages use the roman alphabet, which also forms the basis for the alphabets of certain Asian languages, such as Vietnamese, Indonesian, and Turkish. The Cyrillic script has been applied in the creation of alphabets for some peoples of the former Soviet Union and is also used in Bulgaria, Macedonia, and Serbia.
Many oriental languages utilize the Arabic alphabet while other languages, such as Greek, Burmese, and Tibetan have their own national scripts. Chinese and
Japanese use hieroglyphic writing.

To recapitulate, there are several variants by which RLFL's alphabetic catalogues are arranged:

The first approach is to arrange all the entries alphabetically using a single alphabet. For this method, translation, transliteration, and transcription are used in order to arrange all the entries by headings in this single alphabet. Transliteration is necessary in a multi-script situation to convert data from one writing system to another; this is accomplished by giving each character of the source language an equivalent character in the target language. ISO transliteration standards are well known, and in many countries they are followed in organizing alphabetic catalogues which include records for works of foreign literature. Transcription can also be used to organize a single alphabetic catalogue. It is the process used to represent the speech sounds of a source language by writing them phonetically in the target language. At RLFL, this is the method used to arrange the catalogue for orientalia. Generally, such conversion is a national or regional phenomenon, and only transliteration schemes are appropriate for international standardization.

The second approach is to cumulate in a single alphabetical arrangement all the entries in languages using the same script. This is the way RLFL's serial catalogue is organized. For example, records for periodicals in European languages which employ the roman script are grouped together.

The third variation is to maintain separate alphabetic catalogues for different languages. This approach is followed in organizing the catalogue for monographic publications.

Because of these different arrangements, library staff have produced many manuals and other aids which are intended to help users orient themselves in RLFL's catalogues.
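The character-for-character principle of transliteration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table CYR_TO_LAT is a small hypothetical subset of one-to-one letter pairs, not the full ISO 9 table, and the function names are invented for the example. Because each source character has exactly one target character, the conversion can be reversed mechanically, which is not true of transcription, where the result depends on the pronunciation of the source language.

```python
# Illustrative subset of one-to-one Cyrillic/Latin letter pairs.
# Assumption: this is NOT the complete ISO 9 table, only enough for the example.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f",
}
# The reverse table exists only because the mapping is one-to-one.
LAT_TO_CYR = {latin: cyrillic for cyrillic, latin in CYR_TO_LAT.items()}


def convert(text, table):
    """Replace each character by its equivalent; keep unknown characters."""
    result = []
    for char in text:
        mapped = table.get(char.lower())
        if mapped is None:
            result.append(char)  # spaces, punctuation, unmapped letters
        else:
            result.append(mapped.upper() if char.isupper() else mapped)
    return "".join(result)


def transliterate(text):
    """Cyrillic to Latin, one character at a time."""
    return convert(text, CYR_TO_LAT)


def retransliterate(text):
    """Latin back to Cyrillic using the reversed table."""
    return convert(text, LAT_TO_CYR)


latin_form = transliterate("Толстой")
original_form = retransliterate(latin_form)
```

A transcription routine could not be written this way: it would need language-specific pronunciation rules or a lookup of established forms, which is one reason why only transliteration schemes lend themselves to international standardization.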
Whereas in performing cataloguing and classification, in organizing catalogues, and in pursuing other similar library processes, libraries in Russia have had the opportunity to follow international standards (such as IFLA's ISBDs and ISO's standards), they have had nothing similar upon which to rely when initiating automation projects. Of course, major libraries in Russia and in the former USSR gained considerable experience, but each library pursued an independent course.
Before "Perestroika", Russia was an "outsider" as far as applying technological advances to automate library processes was concerned. Strange as it may seem, centralization in librarianship aggravated the situation, as computerization of major libraries was considered a financial priority. Thus, the primary experimental automation projects were those undertaken by the national libraries in Moscow and St. Petersburg. But each went its own way, with some establishing cooperative arrangements with foreign libraries and others engaging in joint programs internally.

Acting on recommendations of the Ministry of Culture, the Russian State Library initiated a general library automation program for the entire Soviet Union. The Rudomino Library was intended to be a part of this system and was provided with telecommunication access to the Russian State Library's local database containing entries of foreign publications. RLFL was offered a program for database searching that had been prepared for the use of the membership of the Council for Mutual Economic Assistance (COMECON). This was Abbreviation of Words and Word Combinations in Foreign European Languages for Bibliographical Entries (COMECON 2012-72), intended for search by words in abbreviated form. But, after the Council was dissolved, this standard fell into disuse, because it was not compatible with the state standard 7.113-78 (Information and bibliographical documentation system. Abbreviations of words in foreign European languages for bibliographical entries of publications).

As a result, this automation experiment did not prove successful, although to a small extent automation projects did work in some other large libraries, e.g., the All-Russian Institute for Scientific and Technical Information of the State Committee for Science and Technology and of the Russian Academy of Sciences (Vserossijskij Institut naucnoj i techniceskoj informacii Gosudarstvennogo Komiteta po nauke i technike i RAN (VINITI)); the Russian National Public Library for Science and Technology (Gosudarstvennaja publicnaja naucno-techniceskaja biblioteka Rossii (GPNTB)); and the Institute for Scientific Information on Social Sciences of the Russian Academy of Sciences (Institut naucnoj informacii po obscestvennym naukam RAN (INION)).
After the disintegration of the Soviet Union, the library situation in Russia changed dramatically. Libraries themselves have had to survive. Throughout Russia, library associations and societies are being established - for example, the Moscow Library Association, the Russian Library Association, the Association of Research Libraries and Libraries for Science and Technology in the CIS.
This latter has become an international organization and proved its vitality, efficiency and good prospects for the future. Its membership is not limited by any barrier, whether geographic, professional or other. This body - including members in some republics of the former Soviet Union, such as the Ukraine, Kazakhstan, Byelorussia, and the Baltic States - holds annual conferences. Professionals from foreign countries are regular participants. A recurring feature of these conferences is a seminar series for libraries that use the ISIS system. These annual meetings provide a basis for discussions about library automation projects in Russia and thus have helped to identify libraries interested in establishing electronic database networks.

As a result of these initiatives, it has become clear that a common format is needed for exchange of records. In Russia, however, such a format does not now exist. Working under the auspices of the Association of Research Libraries and Libraries for Science and Technology in the CIS, the Interregional Cataloguing Committee has begun creation of a national exchange format known as RUS/MARC, which is based on that in use at the Russian State Library. Nevertheless, experts from some major Russian libraries, including both cataloguers and computer engineers, have concluded that this format does not satisfy the needs of a national standard. Translated by experts from the Russian National State Public Library for Science and Technology, the international exchange format UNIMARC has been adopted as a national standard in Russia - although UNIMARC has not received the final approval of the appropriate government bodies. The Rudomino Library is also interested in using UNIMARC for exchange of data. However, despite many advantages, UNIMARC's appeal is limited by some conflicts with RLFL's cataloguing practices.
UNIMARC is compatible with the ISBD, but Rudomino is required to apply the Russian state standard GOST 7.1-84, which, although mostly accommodating the ISBD, does feature some differences. Staff at RLFL are now developing an electronic database designed for use with the USMARC format as applied in the TINLIB system, which is a product of the British firm IME. Even so, some difficulties have been encountered: USMARC conforms with the Anglo-American Cataloguing Rules while RLFL's staff must follow Russian national cataloguing rules. The TINLIB format which is applied by RLFL offers some advantages that are considered superior. For example, it features a multi-character set based on Unicode under Windows, an indispensable feature in forming different authority files, which extend bibliographic searching possibilities in the database. Everyone
can use the author file, the publishers file, series-subseries files, etc. Thus TINLIB demonstrates the propriety of its acronym: TIN stands for The Information Navigator. But, despite its strengths, this system presents some negative features: it will not display bibliographic information in the catalogue card format so well known to Russian readers, nor will it yield such a product as a bibliographic card conforming to the national standard GOST 7.1-84. Internationally the catalogue card may have become an anachronism, but in Russian libraries it will continue to be used for many years. The Rudomino Library functions as a bibliographic information agency for many libraries that have departments for foreign literature, and they receive catalogue cards from RLFL by subscription. Card catalogues will persist in Russian libraries as part of the national information system due to the slowness of technical development, and there is some sentiment in the professional literature in favour of card catalogues. As a result, experimentation with electronic systems will occur while the manual card catalogue environment is maintained.

Preparing for computerization has required attention to several key issues, including administrative, technical, technological, financial, and normative perspectives. Of these, some are of sufficient importance to merit elaboration:

First is the human, or psychological, aspect. As already mentioned, serious demographic shifts have resulted from the drastic political and economic changes in Russia; these changes have affected such spheres of human activity as culture, education, medicine, and science. The professional staff of institutions and organizations has grown considerably older, with younger specialists leaving for commercial enterprises to assume higher-paying positions. In libraries, museums, academic institutions, schools, and hospitals, there remain devoted and well-qualified personnel - but, unfortunately, staffing overall is not young.
It is much to their credit that those who continue have not lost interest in their work; indeed, they are very receptive to innovations. With clear enthusiasm Russia's cataloguers are learning to handle new equipment, and fears that new technology would encounter psychological barriers are proving unfounded. Nevertheless, low salaries in cultural institutions, including libraries, constitute the most serious obstacle to attracting younger people to work for them. Computerization seems a remedy to this situation. The second aspect is related to professional work. Until recently, the nature
of a cataloguer's duties was quite clear: to produce bibliographic descriptions in conformity with national cataloguing rules and standards. Now, however, the cataloguer's functions have changed radically. Computers have produced an operational environment in which cataloguers' options have expanded; they occasionally have to break rules and discard normal instructions. For example, rules on standard abbreviations do not work well when the cataloguer is creating authority records. A further challenge of computerization is that it will necessitate letting go of the traditional practice of separate files for each language in order to devise and maintain a single multilingual automated database. With materials in 140 languages, preserving the traditional approach by continuing separate local language databases within an integrated network presents a very complicated technical problem.

Particular difficulties are caused by the databases for serials. In this area, it is necessary to start by reassigning responsibility among organizational units. With computerization at RLFL, the Acquisition Department has become responsible for serials registration, while the Cataloguing Department prepares bibliographic descriptions for periodicals. Before, the Cataloguing Department performed both functions.

The introduction of an Online Public Access Catalogue (OPAC) will also have its specific distinctions and characteristics:

1. difference in the structure of the alphabetic part of the catalogues;
2. choice of the Universal Decimal Classification (UDC) for the Library's electronic catalogue. As this scheme was not previously in use, this choice will inevitably complicate the conversion of the entire database of about 10 million catalogue cards.

In general, exchange of bibliographic records on both national and international levels will also be complicated by the specific features of multilingual databases.
While in many countries the Cyrillic script is transliterated into roman characters, at the Rudomino Library bibliographic files are arranged by original languages. If RLFL follows this transliteration practice, it will contradict Russian national standards.
This paper has addressed just some of the problems arising at the inception of our automation project. Beyond these challenges are many more. But, we must prepare to resolve them all in order to enter the global information highway on the way to "the libraries of future".
CATALOGUING OF DOCUMENTS FOR MULTILINGUAL CATALOGUES OF LIBRARIES IN RUSSIA
ANALYSIS OF THE PROBLEM SITUATION
Natalia N. Kasparova
Russian State Library

SYSTEM OF MULTILINGUAL CATALOGUES AND THE PROBLEMS ARISING AT THE INITIAL STAGE OF ELECTRONIC DATABASE CREATION
Ludmila A. Terekhova
Rudomino Library for Foreign Literature

COMMENTARY

Helen F. Schmierer
Brown University Library

The wider utilization of online catalogues worldwide and reports on the cataloguing of two large libraries with substantial holdings in a wide variety of languages and scripts invite us to revisit the need to maintain the integrity of the language and script of an item when preparing its bibliographic description. As the papers of Kasparova and Terekhova indicate, when a library prepares bibliographic descriptions a tension regularly exists among at least three potential uses for the records, be they machine-readable or in some other form. The three uses,

1. service to library clientele (e.g., record content including the languages and alphabets used in the library's public catalogue taking into account the preferences of the library's primary users);
2. contribution to a shared bibliographic database (e.g., a national database); and,
3. international exchange
may be accompanied by other uses (e.g., machine-generated products such as catalogue cards) to be supported by each record. When the records in question are machine-readable, the first distinction made is between the data carrier - a MARC format - and the content of the record: the information in the record, including language(s) and script(s).

MARC Formats

For some thirty years, MARC formats have been in development and in use. Currently a number of MARC communications formats are available (e.g., Canadian MARC communications formats for bibliographic data and for authority data, IBERMARC formato para monografías, INTERMARC(M): format bibliographique d'échange pour les monographies, Formato CALCO, Japan/MARC, MAB-1) as well as the IFLA-maintained common communications format, UNIMARC. As MARC communications formats were developed, systems were created or modified to use MARC records received on disk, on tape, and, more recently, via FTP. Just as it is expected that a library data management system will accept records in at least one MARC communications format, it is now commonly expected that a system will be capable of outputting records in at least one MARC communications format. With the wide availability and use of machine-readable cataloguing records, the expression "MARC format" is often used to comprehend both system-specific implementations of internal processing formats and communications formats.

Although the work on MARC communications formats cannot be asserted to be completed, a number of communications formats have been in use for some time and have successfully transmitted millions of records in various languages and scripts. This suggests that the need for further development and refinement of communications formats, per se, does not represent a stumbling block to the provision of multilingual and multi-script records.
In fact, techniques are available and have been used for some time to transmit multilingual and multi-script records in MARC communications formats.1 There remain some technical issues to be settled, but as more than one commentator has observed, the issues to be addressed are political more than technical2 and related to the creation of bibliographic descriptions.
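The distinction between the data carrier and the record content can be pictured with a small Python sketch. It is an illustration only and does not reproduce any actual MARC communications format: the class names, the field label "title", and the use of four-letter script codes ("Cyrl", "Latn") are assumptions made for the example. The point is that the carrier structure stays the same whatever languages and scripts the content uses, so one record can hold the same data element in both original and transliterated form.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    label: str   # hypothetical field label, not a real MARC tag
    script: str  # script of the value (assumed four-letter code)
    value: str   # the content itself, in any language or script


@dataclass
class Record:
    fields: list = field(default_factory=list)

    def add(self, label, script, value):
        """Append one field; the carrier does not care about the script."""
        self.fields.append(Field(label, script, value))

    def values(self, label, script=None):
        """Return values for a label, optionally limited to one script."""
        return [
            f.value
            for f in self.fields
            if f.label == label and (script is None or f.script == script)
        ]


record = Record()
record.add("title", "Cyrl", "Война и мир")   # title in the script of the item
record.add("title", "Latn", "Vojna i mir")   # transliterated parallel form

all_titles = record.values("title")
latin_titles = record.values("title", script="Latn")
```

With such a structure, a system serving readers of Cyrillic and an exchange partner that expects roman script could both be supplied from the same record, which is consistent with the observation that the remaining obstacles lie less in the formats than in decisions about record content.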
Record Content

Since the publication of the first proposed ISBD in 1971, there has been systematic development of ISBDs and worldwide incorporation of ISBD principles and specifications in rules for descriptive cataloguing. As a result, catalogue records created today in all parts of the world are remarkably similar in the content of bibliographic description and often consistent in choice of descriptive access points. A major factor in the differences that do occur in catalogue records prepared for the same item in different countries is the character sets used to create the records. Pertinent to our consideration is ISBD guidance regarding language and script of the description in ISBD(G) at section 0.6 (Language and script of the description):

Information in areas 1, 2, 3, 4, and 6 is normally taken from the item and is, therefore, in the language and/or script in which it appears there.

Information in areas 5, 7, and 8 is given in the language and/or script of the national bibliographic agency, other than the "key-title", original title, and quotations in notes.

Prescribed Latin abbreviations are given as instructed in 0.7.

Texts in scripts not used by the national bibliographic agency may be transliterated or transcribed into the script of the agency.3

The fourth paragraph of this section is a surrender to the practicable. Is this surrender necessary?
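The first two paragraphs of section 0.6 amount to a simple rule that assigns a language to each area of the description. The following Python sketch restates that rule under stated assumptions: the function name and the three-letter language codes are invented for the example, and the exceptions named in the second paragraph (key-title, original title, quotations in notes) are deliberately not modelled.

```python
# Areas whose data is normally taken from the item itself (ISBD(G) 0.6, first paragraph).
ITEM_LANGUAGE_AREAS = {1, 2, 3, 4, 6}
# Areas given in the language of the national bibliographic agency (second paragraph).
AGENCY_LANGUAGE_AREAS = {5, 7, 8}


def description_language(area, item_language, agency_language):
    """Return the language expected for one ISBD area.

    Simplification: exceptions such as the key-title, the original title,
    and quotations in notes are not handled here.
    """
    if area in ITEM_LANGUAGE_AREAS:
        return item_language
    if area in AGENCY_LANGUAGE_AREAS:
        return agency_language
    raise ValueError(f"ISBD areas are numbered 1 to 8, got {area}")


# A French book described by an agency whose language is Russian
# (the language codes are illustrative labels only).
plan = {area: description_language(area, "fre", "rus") for area in range(1, 9)}
```

Seen this way, the fourth paragraph of 0.6 is an optional extra step applied to the item-language areas, replacing the script of the item with the script of the agency.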
In the days of handwritten, typewritten, and typeset cards, any number of scripts could be and were included in catalogue records. In the heyday of the card catalogue in the United States (which is in the not too distant past), many libraries followed the practice of including the title in transliterated form when the bibliographic description was prepared using the non-roman script of the item.
Often instructions to transliterate4 are without explanation; it is apparent, however, from sources as early as Cutter's rules of 1876 that transliteration was used to regularize name forms5 and to aid filing. In fact, one suspects that a powerful motivation for transliteration and transcription is to make possible processing by staff who are unfamiliar with the original language or script. In the early history of machine-readable cataloguing (which is also in the not too distant past) devices were not readily available to print or display many scripts, let alone multiple scripts. Now such devices are increasingly available and international standards are in place that make possible consistent approaches to input script worldwide.1,6

The continued use of transliteration or transcription as one mode, or the only mode, of catalogue record creation appears to be largely justified on the basis of precedent and the costs of changing to a more enlightened approach that acknowledges the primacy of the language and script of the item over systems of transliteration and transcription. In the interests of universal bibliographic control, national bibliographic agencies (and other providers of bibliographic data) should be encouraged to follow the first paragraph of ISBD(G) section 0.6 and ignore the fourth.

Transition to the Multi-Script Environment

As we move to the regular and routine creation of multi-script bibliographic records, it seems axiomatic that we must be willing to share information about successes and failures. We must also be prepared to consider new approaches to known problems and to work toward solutions that will be accepted internationally. Where there are deficiencies we must work to correct them. Several concerns raised by Kasparova and Terekhova point to topics that will benefit from wider discussion:

• authority file creation and maintenance in the multilingual and multi-script environment, including names, uniform titles, and subjects, with attention to the role of national bibliographic agencies in the preparation and maintenance of authority data;
• catalogue architecture (a single catalogue for all languages and scripts versus separate catalogues for different languages and/or scripts, or, in a machine-readable environment, both?);
• filing rules in a multi-language and multi-script environment;
• production of printed products from multi-language and multi-script files of machine-readable data (e.g., bibliographies of various types, catalogue cards);
• abbreviations (where is the use of abbreviations appropriate in bibliographic records, with particular attention to the growing use of "word search" strategies?);
• adoption and use of the international standard for the universal coded character set within the library community and by others involved in data exchange, such as text encoders and those preparing documents for the World Wide Web; and
• strategies for planning and implementation of multilingual and multi-script records and systems.
References and Footnotes

1. Sally H. McCallum, "Tokyo to Barcelona: Progress in Multiscript Automation." In Automated Systems for Access to Multilingual and Multiscript Library Materials: Proceedings of the Second IFLA Satellite Meeting, Madrid, August 18-19, 1993, ed. for the Section on Information Technology, the Section on Library Services to Multicultural Populations, and the Section on Cataloguing by Sally McCallum and Monica Ertel, IFLA Publications, 70 (München: K.G. Saur, 1994), 13-22.

2. See, for example, James Edward Agenbroad, Nonromanization: Prospects for Improving Automated Cataloging Systems of Items in Other Writing Systems, Opinion Papers, no. 3 (Washington, D.C.: Library of Congress, 1992), which includes a useful list of titles for further reading.

3. ISBD(G): General International Standard Bibliographic Description: Annotated Text, prepared by the Working Group on the General International Standard Bibliographic Description set up by the IFLA Committee on Cataloguing (London: IFLA International Office for UBC, 1977), 5.

4. Kasparova, Terekhova, and ISBD(G) properly distinguish, as do linguists, between "transliteration" (the conversion from one writing system to another where each character in the source language is given an equivalent character in the target language) and "transcription" (the conversion from one writing system to another where the sounds in the source word are conveyed by letters in the target language). The word "transcription" is used in the Anglo-American Cataloguing Rules, second edition, at rule 1.1B and throughout with the general meaning of "copy." In general use, the word "transliterate" is often used to comprehend both of the more precise terms "transliteration" and "transcription."

5. Cf. Charles A. Cutter, Rules for a Printed Dictionary Catalogue, issued as Public Libraries in the United States, Their History, Condition, and Management: Special Report, Department of the Interior, Bureau of Education, Part II (Washington: Government Printing Office, 1876), which at rule 25 conveys instructions for transliterating several languages including those written in the Latin alphabet, e.g., "In German names used as headings, for ä, ö, ü, write ae, oe, ue, and arrange accordingly."

6. Alan M. Tucker, "Non-Roman and Multi-Script Bibliographic Databases: Basic Issues in Design and Implementation." In Automated Systems for Access to Multilingual and Multiscript Library Materials: Problems and Solutions: Papers from the Pre-Conference held at Nihon Daigaku Kaikan Tokyo, Japan, August 21-22, 1986, ed. for the Section on Library Services to Multicultural Populations and the Section on Information Technology by Christine Boßmeyer and Stephen W. Massil, IFLA Publications, 38 (München: K.G. Saur, 1987), 34-48, provides an instructive summary of the issues and considerations including searching and sorting.
MULTILINGUAL AND MULTI-CHARACTER SET DATA IN LIBRARY SYSTEMS AND NETWORKS: EXPERIENCES AND PERSPECTIVES FROM SWITZERLAND AND FINLAND

Riitta Lehtinen
Helsinki University Library
and
Genevieve Clavel-Merrin
Swiss National Library

The problem of multilingual access to bibliographic databases affects not only searchers in countries in which several languages are spoken, but also all those who search material in databases containing material in more than one language. This is the case in the majority of scientific or research databases. In addition, the growth of networks means that we can easily access catalogues outside our own immediate circle: in another town, another country, another continent. In doing so, we encounter problems concerning not only search interfaces, but also subject access or even author access in another language.

At the Second IFLA Satellite Meeting on Automated Systems for Access to Multilingual and Multi-script Library Materials (held in Madrid in August 1993), Dr. Vinod Chachra of VTLS Inc. gave a paper that provided an excellent overview of the questions related to thesaurus management in a multilingual and multi-thesaurus environment. It is not our intention today to repeat his summary, but rather to provide some examples that illustrate some of the questions related to multilingual subject access and multi-character access to bibliographic data encountered in Finland and Switzerland. The questions we are raising and the solutions we are introducing can provide experience not only in the context of our own countries but also in today's wider context of networking and international access. We will discuss multilingual questions in subject indexing with special reference to the Swiss situation, then multi-character questions with reference to the Finnish experience.
The emphasis will be on the structures necessary to provide such access and not on questions such as the choice of multilingual vocabulary or the management and translation of headings.
Case 1: Multilingual Access in Switzerland

Languages in Switzerland

Switzerland is unique in its linguistic diversity: a country of nearly seven million people with four national languages (see Figure A, below):

    German       75%
    French       20%
    Italian       4%
    Romantsch     1%
The first three mentioned are the main languages of the country, and it is these, plus English, with which we are chiefly concerned. Switzerland is a confederation of states, with a federal government and individual state governments that have a large degree of autonomy in all areas, including those of education and libraries. The three main languages are taught in the different federal states, for example, German as the first additional language in the French and Italian speaking states (plus English), French in the German speaking states, and both French and German in the Italian speaking states. Consequently, by the end of high school, Swiss students, and particularly those who intend to carry on their education at university or technical high school, are expected to have a working knowledge of at least one of the other official languages and English.

In meetings at a federal level it is generally accepted that each person present may speak his or her own language (provided it is one of the national languages), and it is assumed that those present have at least a passive knowledge of the other participants' language(s). In practice, one language may dominate the meeting, and gradually all speakers will move to converse in that language.

The degree to which the country may be said to be multilingual varies from area to area. While some parts of the country, such as Fribourg or Biel, may be said to be bilingual, other areas are to all intents and purposes monolingual, such as Lausanne and St. Gallen (although in these and other towns there are multilingual communities using languages other than the official ones). Respect for the minority
official languages (i.e., all those except German) requires that official publications be translated into all languages and that each language have its own television station.

In the library context, the small size of the country and its local institutions means that libraries, especially in the research and university sectors, have long cooperated in sharing resources through interlibrary loan. Researchers frequently need to find and use materials in other languages, and also need to extend their search for material outside their own library and linguistic context. There are currently three main university and research library networks (and a number of smaller ones), all of which use different subject headings, thus complicating searching for the user. Within a library or library network, the proportion of material in languages other than the "main" language may be very high. In the Swiss-French network, for example, French-language material accounts for only just over one third of the documents catalogued. The printed production of the country largely reflects the linguistic balance: German language material predominates (60%), followed by French (21%) and, surprisingly, English (10%). Italian accounts for only 2%. A university student must have, as a minimum, a passive reading knowledge of another language in order to make use of material found in Swiss libraries or in libraries abroad.

What Is Multilingual Access?

Multilingual access is frequently used to cover a variety of topics, and it is necessary to define our use of the term "multilingual." We frequently hear talk of multilingual systems in reference to search environments (e.g., display screens, user dialogue, help screens, etc.), whereas the actual access points themselves are in fact monolingual. Multilingual user environments are now standard in the majority of systems. The ISBD is independent of the cataloguing language or the user environment and reflects the language of the item catalogued.
Once again, we cannot talk about multilingualism in this context. However, it is interesting to note that the European project CANAL/LS is working on a linguistic server which will translate a user's keyword search (from ISBD and other fields) into one or more languages, thus extending the search.
Notes fields may be drawn up in the language of the cataloguing agency or in the language of the documents. For example, in the Swiss National Library (SNL) notes are written in French for documents in French, or in the Romance languages (except for Italian and Romantsch); in Italian for documents in Italian; and in German for documents in all other languages (including Romantsch and English).

The main areas affected by the multiplicity of languages used in the country are the access points: authors, both personal names and corporate names, and above all, subjects. The problems posed by multilingual access to names should not be underestimated (e.g., Greek and Latin classical authors, international organizations, federal institutions in Switzerland that have three official names, one in each language). There are certain differences, however, when comparing needs with those of subject indexing, in particular the fact that the majority of author names will never need to be translated. Unfortunately, time constraints prevent us from considering multilingual access to authors in detail here. We shall concentrate on multilingual subject access.

What Is Multilingual Subject Access?

In any discussion about subject indexing and searching in a multilingual environment, it soon becomes clear that many different interpretations are possible in terms of what is indexed, in which language, and how the searcher may use the resulting terms or headings. (See Figure 2 below, Subject indexing and searching in a multilingual environment.) The following description presents the viewpoint in the Swiss context and serves as a basis for the proposals that will follow. Individuals from other countries and dealing with different contexts and needs may find this view too "strict" in some ways.
Ideally the indexer should be able to analyze a document and assign subject headings in his or her native language, while the user should be able to enter subject search terms in his or her native language. The language of the document should have no influence on the indexing language nor on the language used for searching. Practically speaking, there will be restrictions. There is a limit to the number of languages in which a subject headings list will be maintained, and thus in which the user may search. For our purposes here, we can imagine that it exists in four languages (German, French, Italian, English). Therefore, a German speaking indexer would use the German-language subject headings to index documents in all languages. In this instance, whether the document in question is
in English, French, German, Italian, or Greek, the subject headings assigned by this indexer will be in German. The same would then be the case for the other language-based indexers. The user may search for a document using subject headings in French, German, Italian, or English. The language of the document will influence the ease with which an indexer may assign subject terms. If the document is in Turkish and the indexer does not understand the language, there will be a problem. Likewise, a user may retrieve a document in a language she or he does not understand and therefore will not use the document in question.

In the light of the above needs for multilingual subject access, the SNL had lengthy discussions with VTLS Inc. Our goal was to provide the necessary system structure in which a multilingual subject heading list may be loaded and used as an indexing and searching tool. Together, SNL and VTLS Inc. have agreed on the following strategy.

Multilingual Subject Access: User Searching

The following describes the planned multilingual subject access from the user's point of view during searching. Technical requirements and cataloguing rules will be examined in another section. Note that many of the display options presented are parameterised and represent the SNL's choice of display. The system development will be designed so that different sites in different countries may choose to set other display parameters or profiles.

a) Searching in the subject headings lists

The system will be pre-set to a default interface language. However, when carrying out a subject search in the system, the user may change the default interface language of the subject search to another language offered within the system. For example, if I log on and the initial system interface is in German, the language of subject searching will be in German. If I change my interface language to French, my subject searching will be in French.
This option assumes that users will choose the user interface language with which they are most at ease, and that this becomes the primary search language for subject searching. However, the user may also choose to specify another subject search language. Help screens will indicate the possibility to change subject search languages. The headings
displayed in response to the user's search request are in the chosen language, and different languages are not interfiled in the same list. The user may enter a subject heading in the appropriate language and browse the subject heading list. Once the user has selected a subject from the list, the system displays the bibliographic records associated with that heading (or linked to its equivalent in another language via an authority file), or any existing cross references in the language of the selected heading. The user may then request to see the bibliographic records or alter the search strategy in response to the system's suggestions. When a bibliographic record is displayed, the subject headings in that record are displayed in the search language used. If the user has carried out an author, title, or keyword search and then displays bibliographic records, the subject headings displayed will be in the default interface language or in a different user-selected language if this has been specified. If the user gives a command to see tracings for a bibliographic record, subject heading tracings will be displayed in the language of the subject search (or the language of the interface when the search was not a subject search).

b) Keyword and Boolean searching

By default, keyword searches apply to all indexed fields in the database (e.g., author, title, subject, and notes fields). However, keyword and Boolean searching using subject terms is possible using any of the languages. Users may specify that they are carrying out a keyword search using only subject terms.
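The language-selection behaviour described under a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the VTLS implementation: the `Session` class, the `browse` function, the German starting default, and the three-letter language codes are hypothetical, and the sample headings are borrowed from the hypothetical authority record given later in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (heading text, language code) pairs; codes are assumed for illustration.
HEADINGS: List[Tuple[str, str]] = [
    ("Bibliothèque, magasins", "fre"),
    ("Bibliotheksmagazine", "ger"),
    ("Library stacks", "eng"),
    ("Library shelving", "eng"),
]


@dataclass
class Session:
    """Hypothetical user session: interface language plus optional override."""
    interface_language: str = "ger"          # assumed system default
    subject_language: Optional[str] = None   # set only on explicit user request

    def search_language(self) -> str:
        # Subject searching follows the interface language unless overridden.
        return self.subject_language or self.interface_language


def browse(headings: List[Tuple[str, str]], session: Session, prefix: str) -> List[str]:
    """Return the browse list for one language only; languages are not interfiled."""
    lang = session.search_language()
    wanted = prefix.casefold()
    hits = [text for text, code in headings
            if code == lang and text.casefold().startswith(wanted)]
    return sorted(hits, key=str.casefold)


session = Session()                               # interface and search language: German
german_list = browse(HEADINGS, session, "bibl")
session.interface_language = "fre"                # user switches the interface to French
french_list = browse(HEADINGS, session, "bibl")
session.subject_language = "eng"                  # explicit subject-language override
english_list = browse(HEADINGS, session, "lib")
```

The one design point the sketch tries to capture is that the subject search language is derived from the interface language rather than stored separately, so a user who never sets an override always searches in the language of the screens.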
Why Choose the Above Options?

The options chosen by the SNL and described above reflect our perception of our users' needs. It is clear that, in other contexts, other choices may be made. Below, we explain some of our choices:

a) Why not interfile subject headings in different languages?

This is a very difficult question to resolve, and it provided a topic for much debate when planning the system. One could argue that by interfiling all languages, we would simplify searching for the user who wouldn't have to specify
a search language if the default interface search language is unsatisfactory. On the other hand, the SNL believed that the mixture of up to four different languages in one filing sequence would result in broken sequences of headings and subheadings, thereby confusing the user, especially in cases concerning terms with different meanings in different languages. The following examples show a mixed language subject sequence, and the same sequence in French only. The examples are taken from the ETHICS system in Switzerland, which offers multilingual subject access based on UDC.

Subject headings in English (E), French (F) and German (D) interfiled in the ETHICS system.

    REGISTRE-MATIERES ALPHABETIQUE: LANGUE REG-MAT.: A

     1 ADAPTATION/BEWEGUNGSADAPTATION (ANATOMIE U.PHYSIOLOGIE)      D  0,Q
     2 ADAPTATION/BOTANY                                            E  0,Q
     3 ADAPTATION/BRIGHT TO DARK ADAPTATION (VISION)                E  0,Q
     4 ADAPTATION/CELLULAIRE A L'ENVIRONNEMENT (CYTOLOGIE)          F  0,Q
     5 ADAPTATION/CELLULAR ADAPTATION                               E  0,Q
     6 ADAPTATION/CLIMAT (ANATOMIE ET PHYSIOL.)                     F  0,Q
     7 ADAPTATION/CLIMATIQUE ET EDAPHIQUE (PHYTOGENETIQUE)          F  0
     8 ADAPTATION/COLORATION (ANIMAL ETHOLOGY)                      E  0,Q
     9 ADAPTATION/CULTIVATED PLANTS                                 E  0
    10 ADAPTATION/CULTURAL                                          E  0
    11 ADAPTATION/DARK ADAPTATION (VISION)                          E  0,Q
    12 ADAPTATION/DE L AGRICULTURE                                  F  0
    13 ADAPTATION/DUNKELADAPTATION (PHYSIOLOGISCHE OPTIK)           D  0,Q
    14 ADAPTATION/ECOLOGIE VEGETALE                                 F  0,Q,E,U
    15 ADAPTATION/EVOLUTIONARY FACTORS (BIOLOGICAL EVOLUTION)       E  0,Q,U
The same sequence showing only subject headings in French (F)

    CLE D'ACCES: ADAPTATION
    REGISTRE-MATIERES ALPHABETIQUE: LANGUE REG-MAT.: F

     1 ADAPTATION (BIOLOGIE)                                        0,Q,U
     2 ADAPTATION (EVOLUTION BIOL.)                                 0,Q,U
     3 ADAPTATION/ANIMAUX TERRESTRES                                0,Q,U
     4 ADAPTATION/AU TYPE D EXPLOITATION(ECONOMIE D'ENTREPRISE)     0
     5 ADAPTATION/CELLULAIRE A L'ENVIRONNEMENT (CYTOLOGIE)          0,Q
     6 ADAPTATION/CLIMAT (ANATOMIE ET PHYSIOL.)                     0,Q
     7 ADAPTATION/CLIMATIQUE ET EDAPHIQUE (PHYTOGENETIQUE)          0
     8 ADAPTATION/DE L'AGRICULTURE                                  0
     9 ADAPTATION/ECOLOGIE VEGETALE                                 0,Q,E,U
    10 ADAPTATION/PHYSIOLOGIE GENERALE                              0,Q
    11 ADAPTATION/PHYSIOLOGIE VEGETALE                              0,Q
    12 ADAPTATION PHYSIOLOGIQUE (ECOLOGIE ANIMALE)                  0,Q
    13 ADAPTATION/QUADRIPOLES D (TECHN. OSCILLAT. ELECTR.)          0,U
    14 ADAPTATION/ZOOLOGIE                                          0,Q
    15 ADAPTATIONS/CINEMATOGRAPHIQUES D'OEUVRES LITTERAIRES         0
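The difference between the two ETHICS listings can be reproduced with a small Python sketch. It uses six entries from the interfiled listing and a plain character-by-character sort, which is an assumption made for illustration; the actual ETHICS filing rules are not described here, and the function names are hypothetical.

```python
from typing import List, Tuple

Entry = Tuple[str, str]  # (heading, language code: D, E, or F)

# A handful of headings taken from the interfiled ETHICS listing.
ENTRIES: List[Entry] = [
    ("ADAPTATION/ECOLOGIE VEGETALE", "F"),
    ("ADAPTATION/BOTANY", "E"),
    ("ADAPTATION/DUNKELADAPTATION (PHYSIOLOGISCHE OPTIK)", "D"),
    ("ADAPTATION/CELLULAR ADAPTATION", "E"),
    ("ADAPTATION/CLIMAT (ANATOMIE ET PHYSIOL.)", "F"),
    ("ADAPTATION/CELLULAIRE A L'ENVIRONNEMENT (CYTOLOGIE)", "F"),
]


def interfiled(entries: List[Entry]) -> List[Entry]:
    """One filing sequence for all languages (simple string order, an assumption)."""
    return sorted(entries, key=lambda e: e[0])


def by_language(entries: List[Entry], lang: str) -> List[Entry]:
    """A separate filing sequence restricted to one language."""
    return sorted((e for e in entries if e[1] == lang), key=lambda e: e[0])


def language_switches(listing: List[Entry]) -> int:
    """Count how often adjacent headings in a listing differ in language."""
    return sum(1 for a, b in zip(listing, listing[1:]) if a[1] != b[1])


mixed = interfiled(ENTRIES)
french_only = by_language(ENTRIES, "F")
mixed_switches = language_switches(mixed)         # language changes at almost every line
french_switches = language_switches(french_only)  # no changes within a single language
```

Counting language changes between neighbouring lines is one rough way to quantify the "broken sequences" that worried the SNL: in the mixed list the reader's eye has to change language at nearly every entry, whereas the single-language list has none.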
b) If cross references do not exist in the language used for searching but do exist in another language, why not display them?

Whichever subject heading list is chosen, it is inevitable that at one time or another one language will have more fully developed subject headings and cross reference structures than the others. Common reasons for this are time lags in the translation and authorization of vocabulary. We could imagine that when this happens, the user might be interested in seeing cross references in other languages. However, at SNL we decided not to display this information (except in the MARC format, see below), since we concluded that displaying references in another language would only serve to confuse the user. In addition, the options we chose would result in the user who selects a cross reference in another language moving to a browse list in that language and thus leaving the language of the initial search. We considered this situation to be even more difficult for the user to understand than the former. It is clear that if, in the initial stages of the system, one vocabulary is clearly more developed than the others, we would need to indicate this to the user via help screens and initial messages so that she or he realizes that choosing a particular language may significantly reduce search results.

c) When displaying bibliographic records or tracings, why not display headings in all languages?

Once again, the SNL prefers to reduce the number of headings in different languages for the sake of clarity for the user. If we have subject headings in four different languages and have assigned three headings to a document, we would have to display twelve subject headings if we chose to display all language variations. Users would need to scroll through screens of mainly redundant data. Again though, we recognize that this option implies that we have translated the majority of our subject headings.
Multilingual Subject Access: Technical Aspects

a) Authority record structure

According to the SNL's plans, each concept or subject heading assigned to a bibliographic record will need to be expressed in up to four languages (French, German, Italian, and English). In each case there may be one or more non-preferred terms (see references), and one or more related terms (see also, broader, narrower terms). After exploring different options, such as creating one authority record per concept and per language with links, we decided to extend the MARC authority record by defining multiple 1xx tags. We realize that this does not conform to current MARC standards, but it presents the best way to manage the data during cataloguing, searching, and record exchange and extraction. This will be discussed later.

As a result, all subject authority records are held in one subject authority file. Each concept has one authority record, which contains the headings for that concept in the languages supported by the system. The language of each heading is identified by a language code in a separate subfield. One heading in each language is considered the preferred form for that language and as such is encoded in a 1xx field. Non-preferred forms are encoded in 4xx fields and are identified by language using the same code as for the preferred terms.

Hypothetical example (note: subfield codes and indicators are not included here):

    150   Bibliothèque, magasins          (fre)
    150   Bibliotheksmagazine             (ger)
    150   Library stacks                  (eng)
    450   Magasins, bibliothèque          (fre)
    450   Bibliothèque, rayonnage         (fre)
    450   Rayonnage de bibliothèque       (fre)
    450   Magazine, Bibliothek            (ger)
    450   Library shelving                (eng)
    550   Signature (bibliothéconomie)    (fre)
    550   Signatur (Bibliothekswesen)     (ger)
Note that there is more than one non-preferred term in French and also that the structure of headings (heading, sub-heading, or simply heading) may differ
according to the language. This question of structure can be a major difficulty in managing a multilingual vocabulary, but will not be addressed here. No attempt will be made to link non-preferred terms (or related terms) to another language equivalent; the only linked terms are those in the 1xx fields. Any links between 5xx terms and their language equivalents must be made in the appropriate authority record in which they appear as 1xx terms.

b) MARC display of authority records

Contrary to the user display, in which headings will be displayed in only one language, the MARC display of an authority record will always show all headings in all languages regardless of the interface language chosen. We believe that this display is of primary importance to librarians, although some end users may also find it of value. Because the MARC display is the main thesaurus management tool, all information should be available.

c) Creating a bibliographic record

Ideally indexers should always copy and paste subject headings from authority records when creating new bibliographic records. If they copy and paste a subject heading in any language, links are automatically created from all other language forms of that heading in the authority record. If an indexer types in a new subject heading, the system will check that heading against the authority file and create a new heading as appropriate. If the cataloguer does not add a language code, the system will assume that the heading is in the default language. (See below for the definition of default language.)

d) Adding new languages to authority records

If a new language heading is added to an authority record, the system will update the browse indexes and the bibliographic word search so that all bibliographic records that have been indexed with that heading in another language are immediately accessible.
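The record structure in a) and the linking behaviour in c) and d) can be made concrete with a short Python sketch. It is a simplified model under stated assumptions, not the SNL or VTLS design: the class and function names are hypothetical, bibliographic records are reduced to identifiers linked to an authority record id, and the Italian form is a placeholder string rather than a real heading.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Heading:
    tag: str    # "150" preferred, "450" non-preferred, "550" related
    text: str
    lang: str   # language code carried in a separate subfield


@dataclass
class AuthorityRecord:
    record_id: int
    headings: List[Heading] = field(default_factory=list)


def build_browse_index(records: List[AuthorityRecord]) -> Dict[str, Dict[str, int]]:
    """Map language -> preferred heading -> authority record id."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for record in records:
        for h in record.headings:
            if h.tag.startswith("1"):   # only 1xx headings are linked across languages
                index[h.lang][h.text] = record.record_id
    return index


def find_bibs(index: Dict[str, Dict[str, int]], links: Dict[int, List[str]],
              lang: str, heading: str) -> List[str]:
    """Bibliographic records reachable from a preferred heading in one language."""
    record_id = index.get(lang, {}).get(heading)
    return links.get(record_id, [])


ITALIAN_FORM = "ITALIAN-FORM (placeholder)"   # not a real heading; illustration only

stacks = AuthorityRecord(1, [
    Heading("150", "Bibliothèque, magasins", "fre"),
    Heading("150", "Bibliotheksmagazine", "ger"),
    Heading("150", "Library stacks", "eng"),
    Heading("450", "Library shelving", "eng"),
])
links = {1: ["bib-0001", "bib-0002"]}   # bibliographic records indexed with concept 1

index = build_browse_index([stacks])
before = find_bibs(index, links, "ita", ITALIAN_FORM)   # no Italian heading yet

stacks.headings.append(Heading("150", ITALIAN_FORM, "ita"))
index = build_browse_index([stacks])                    # index refreshed after the addition
after = find_bibs(index, links, "ita", ITALIAN_FORM)
```

Because bibliographic records point at the concept rather than at a heading string, adding a heading in a new language requires no change to the bibliographic records themselves; only the browse index needs to be refreshed.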
e) Importing authority and/or bibliographic records

When importing records, it will be necessary to specify the language of headings so that the appropriate language codes may be added.
f) Exporting authority and/or bibliographic records (to other databases or for printed products)

When exporting records, the language of the headings to be exported must be indicated. In the case of authority records, the system will take only the appropriate 1xx, 4xx and 5xx headings and scope notes according to the language code specified. In this way, our authority records can become internationally compatible on export despite the multiple 1xx fields for internal use.

g) Scope notes in authority records

While notes for internal use may be in any language, it is important that scope notes in authority records should also be coded by language, so that only the appropriate notes are displayed in response to a user request for authority record note display. Once again, the use of a scope note in one language does not mean that an equivalent note must exist in the other languages used in the authority record. What may be appropriate and necessary to explain for a heading in one language is not necessarily so for the same heading in another language.

h) Default language

While the ideal situation may be a multilingual list of subject headings with an equivalent in each language for each heading, in the initial phases of such projects there will often be uneven coverage of translations. Any library installing a multilingual option by using an existing subject heading list will have hundreds or thousands of authority records in only one language. By declaring a default language it is not necessary to assign a language code to all these records. A default language is also necessary for database management. This default serves as the language of reference for headings, and will be the preferred language if there are conflicts about headings, translations, etc. Our experience has shown us that it is easier to maintain a certain coherence in headings if one language is declared default or predominant and serves as the control from which all others are translated.
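The export rule in f) and the default-language rule in h) combine naturally, as the following Python sketch suggests. It is illustrative only: the `Field` type and function names are hypothetical, and German is chosen as the default language purely for the example, since the paper does not say which language the SNL would declare.

```python
from typing import List, NamedTuple, Optional, Tuple

DEFAULT_LANGUAGE = "ger"   # assumption for this sketch only


class Field(NamedTuple):
    tag: str
    text: str
    lang: Optional[str] = None   # None models a legacy heading with no language code


AUTHORITY_FIELDS: List[Field] = [
    Field("150", "Bibliothèque, magasins", "fre"),
    Field("150", "Bibliotheksmagazine"),            # uncoded: falls back to the default
    Field("150", "Library stacks", "eng"),
    Field("450", "Magasins, bibliothèque", "fre"),
    Field("450", "Magazine, Bibliothek"),           # uncoded: falls back to the default
    Field("450", "Library shelving", "eng"),
    Field("550", "Signature (bibliothéconomie)", "fre"),
]


def heading_language(f: Field) -> str:
    """A heading without a language code is assumed to be in the default language."""
    return f.lang or DEFAULT_LANGUAGE


def export_record(fields: List[Field], lang: str) -> List[Tuple[str, str]]:
    """Keep only the 1xx, 4xx and 5xx headings in the requested language."""
    return [(f.tag, f.text) for f in fields if heading_language(f) == lang]


english_export = export_record(AUTHORITY_FIELDS, "eng")
german_export = export_record(AUTHORITY_FIELDS, "ger")
italian_export = export_record(AUTHORITY_FIELDS, "ita")
```

Each exported record then contains a single 1xx heading, which is what allows a record with multiple internal 1xx fields to look conventional to a receiving system.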
Conclusion

There are of course many other points that need to be taken into consideration when installing a multilingual subject vocabulary. Unfortunately, we cannot
undertake a more detailed study here. It is also possible to create a multilingual list using a classification number (e.g., a UDC number) as a "pivot". In this case, the classification number would be the source concept and the subject headings translations of the numerically expressed concept. The multilingual and predominantly scientific subject vocabulary in the ETHICS system (a network based in Zurich, CH) is structured this way. The structure of the authority file at the SNL would not exclude the use of such a vocabulary. We are considering the question of facilitating keyword subject (and author) access via non-preferred terms by indexing them. Questions of recall, precision, and clarity for the user will have to be considered before any decision is taken.

As discussed above, we have a technical solution to the management of our multilingual subject heading list. We must admit, though, that at the SNL we have not yet chosen a list, for a variety of reasons. Creating and maintaining a multilingual list demands far more resources than we have available, even if we were to pool resources in the country. Ideally, we wish to share a list with other libraries in Europe or elsewhere in the world, but as yet such a list does not exist. The ETHICS list in our own country is multilingual (German, French, English) but the vocabulary itself is not controlled, and it is poorly coordinated among the different branches due to financial constraints. A first step within Switzerland might be to conduct a cooperative clean-up project for the ETHICS list. In the longer term, we believe that multilingual lists should be jointly created and maintained on an international level. Therefore, we await with interest the results of a proposal made to the European Commission under the fourth call for proposals to create a multilingual subject headings list for use by national and university libraries.
Case 2: Multilingual and Multi-Character Set Databases in Finland

Languages in Finland

Finland has two official languages: Finnish and Swedish. The most commonly used language in international cooperation is English. The majority of schools have Finnish as the teaching language. However, there are schools on all levels with Swedish as the teaching language, and two of the twenty-one universities in Finland have Swedish as the primary language. The other languages taught in the schools are English, French, German, and Russian. About 5 million people were living in Finland in 1993, and their native languages were:

    Finnish     93%
    Swedish      6%
    Other        1%   (Russian, Estonian, English, Somali, etc.)

In 1993 books were published in Finland in the following languages:

    Finnish     78%
    Swedish      5%
    English     15%
    Other        2%

LINNEA Network

All the Finnish university libraries are part of the LINNEA network (see Figures 3 and 4 below for LINDA 1998-93 and List of member libraries). The Library of the Parliament and the National Repository Library are also members. All these libraries use the same system (VTLS), and each library has its own local database with local data (for OPAC/Keyword searching, Circulation, Periodicals check-in and routing, Statistics, etc.). A union catalogue database, LINDA, was created from the local catalogues. Primary cataloguing is done in LINDA, and the bibliographic and authority records already in the database are used as much as possible. The bibliographic record is then copied into the local database, and the data needed for local purposes is added. In this way each title is catalogued only
once, and maximum benefit is received from the resources available. Libraries have reported that between 30% and 80% of the new records are copied from LINDA and need only minor editing. Legal deposit libraries benefit especially from the National Bibliography data, which is updated daily into LINDA.

The host organization for the LINNEA network is the Automation Unit of Finnish Research Libraries (TKAY), which is part of the Helsinki University Library (also the National Library of Finland). TKAY coordinates library automation in Finland and is the host for national databases including LINDA. The other national databases are MANDA (the union catalogue for public libraries), ARTO (national database for articles), and VIOLA (national bibliography and the union catalogue database for music).

User Environment in the Library Systems

In the VTLS system the user can select the language for user dialogue and help screens from those provided by the library. Files for three languages are provided centrally: the English version received from VTLS Inc. and used as the basis for translation; the Finnish version maintained by TKAY; and the Swedish version maintained by the Swedish School of Economics and Business in Helsinki. Each library can edit the files to reflect its own environment (e.g., library policies, opening hours, etc., in the help texts). The same language files with small modifications are used in the union catalogue databases hosted by TKAY (LINDA, MANDA, ARTO, and VIOLA). In addition, TKAY's WWW server (http://linnea.helsinki.fi/) contains general information about the LINNEA network, libraries, etc.

Multilingual Data in the Databases

Individual library databases contain bibliographic records of publications in a variety of languages. Before the creation of the union catalogue, each library used its primary language as the basis for its catalogue since almost all cataloguing records were created locally. Today, bibliographic records are copied from several sources.
While the Union Catalogue database (LINDA) is used extensively, databases such as the Swedish National Bibliography and the Library of Congress CD-ROM are also used. The cataloguing language of the "foreign" data is Swedish or English. In addition, the standards used for classical Greek and Latin author names, as well as for transliterating names in non-Latin characters (Cyrillic, Chinese, Japanese, etc.), are different from the ones used in Finland. Cataloguers accept headings in other languages because they devote minimal resources to editing. Normally only subject headings in a library's primary language are added. Swedish or English subject headings are kept, although cataloguers add their own headings for variant transliterations. The subject headings in the union catalogue database appear in different tags depending on the language so that libraries may filter out languages not needed in their local databases. However, most libraries retain all headings. Each library can also decide which tags for non-controlled terms it wants to index for subject searching -- tags for controlled subject headings are indexed automatically. Index entries are displayed together regardless of language. Most subject headings are in Finnish, with Swedish and English headings the dominant "minority" languages. Finnish cataloguers use the United States National Library of Medicine's MeSH authority files as the basis for medical subject headings. While all subject headings are currently interfiled in the subject index, users have few difficulties because the languages are so different.
Authority Files

The Helsinki University Library maintains authority records for Finnish personal and corporate authors as a part of the cataloguing of the National Bibliography database, FENNICA. The authority records for Finnish corporate authors contain "see" and "see also" references for other forms, and the official translations of the names in several languages, including Swedish, English, French, and German. YSA, the Finnish general thesaurus, is part of the FENNICA database and is maintained by Helsinki University Library. YSA contains approximately 14,000 subject headings with "see" references for non-preferred forms and "see also" references for broader, narrower, and parallel terms. A Swedish translation of YSA will be available later this year. In addition to YSA, individual libraries maintain twenty lists of subject headings on various areas (e.g., history, linguistics, education, sports and physical education, law, social sciences, and library and information science). Some lists have a thesaurus structure and some are created for multi-national databases and are therefore multilingual. Most libraries coordinate their lists with YSA to ensure no conflicts between terms. Also, these terms supplement the more general ones in the YSA. Libraries use non-controlled terms for those subject areas where no thesaurus or list of terms exists. Finnish academic libraries began using subject headings in their databases in the 1980s, and, due to a number of budgetary and staffing reasons, have not added subject headings to their older records. Earlier, the Universal Decimal Classification System (UDC) was the chief method for subject description, and in some libraries it was the only method used. Libraries still add UDC classifications to records because UDC remains an important search method: it is the only means of searching the entire database.

Multilingual Subject Access

SNL's concept for multilingual subject access described in the first part of this paper could also be adopted in Finland. In future (when all the necessary work with authority files has been done), it could support the searching of multilingual databases necessary for Finnish searchers. It should be a relatively straightforward procedure to load the Finnish and Swedish versions of the general Finnish subject headings (YSA). During the load, languages could be added and links created between the two forms. The same model could also be used for other available authority record files for subject headings. Finnish libraries have considered the alternative concept of creating one authority record per concept per language and linking them together via a separate link record. It might also be suitable to link UDC classification numbers to corresponding subject headings. This would improve the searching of older records with only UDC numbers because only a few end users are familiar enough with UDC to be successful searchers. However, creating links between UDC numbers and subject headings would require extensive resources, resources also needed for subject headings analysis. Decisions on these library policy questions will be made later this year.
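The alternative mentioned above (one authority record per concept per language, tied together by a separate link record that may also carry UDC numbers) can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, record identifiers, sample headings, and the UDC number are hypothetical and are not taken from YSA, LINDA, or VTLS.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Heading:
    """One authority record: a single heading in a single language."""
    record_id: str   # hypothetical identifier
    language: str    # language code such as "fi", "sv", "en"
    term: str


@dataclass
class LinkRecord:
    """Separate link record tying together the headings for one concept."""
    concept_id: str
    heading_ids: list[str] = field(default_factory=list)
    udc_numbers: list[str] = field(default_factory=list)


# Hypothetical sample data, for illustration only.
headings = {
    "fi-001": Heading("fi-001", "fi", "kirjastot"),
    "sv-001": Heading("sv-001", "sv", "bibliotek"),
    "en-001": Heading("en-001", "en", "libraries"),
}
links = [LinkRecord("c-001", ["fi-001", "sv-001", "en-001"], ["02"])]


def expand_search(term: str) -> dict:
    """Return all linked headings and UDC numbers for a term in any language."""
    wanted = term.casefold()
    for link in links:
        linked = [headings[h] for h in link.heading_ids]
        if any(h.term.casefold() == wanted for h in linked):
            return {
                "terms": {h.language: h.term for h in linked},
                "udc": list(link.udc_numbers),
            }
    return {"terms": {}, "udc": []}


print(expand_search("bibliotek"))
```

With links of this kind, a search entered in Swedish could be expanded to the Finnish heading and to the UDC number carried by older records, which is the benefit the text anticipates.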
Multi-Character Set Data in Library Databases

Finnish libraries also maintain large collections of Russian literature. For example, the Helsinki University Library received a legal deposit copy of everything printed in Russia from 1828 until 1917 and thereafter has actively collected Russian literature. The Slavonic library collections in the Helsinki University Library are probably the best outside the former Soviet Union. In the libraries' old card catalogues these Russian books were catalogued using Cyrillic characters. When the catalogues were converted into OPACs, the Cyrillic characters had to be transliterated because the old automated systems could not cope with Cyrillic. The card catalogue of the Slavonic library's large collection has not yet been converted, and therefore researchers from all over the world come to the library to consult the catalogue.

Cataloguing and Searching Cyrillic Data in LINNEA Network: The Current Solution

In the Finnish VTLS databases, the ISO 6937/2 character set is used as the internal character set until UNICODE is available. Cyrillic characters have been entered in Latin characters transliterated according to the ISO R 9 standard. The Finnish name forms (transliterated according to an old national standard) of some authors have been used in parallel to the ISO-based form of the name, with "see also" references from one to the other. In November 1994 the internal character set of the Finnish VTLS databases was changed to ISO 6937/2 + Modern Cyrillic. Data entered as Cyrillic is also stored as Cyrillic. Now it is possible to catalogue and search using Cyrillic characters (see Figure 5, Authority record). The only limitation is that at present this works only with so-called HP Roman-8 terminals (HP 700/92 - HP 700/98). This is the terminal originally selected for cataloguing in Finnish VTLS libraries. No date has been set for a PC version, but one will be available later. The data needed for keyboard control is downloaded from the HP3000 to the terminal. The first three function keys are used to switch between Latin and two Cyrillic values for the keys.
Therefore, there are three values for each key and additional combinations using the Ext key. Cyrillic characters can be used together with Latin characters in the same record, tag, subfield, or even word. The escape sequences stored in the data mark the character set used. Depending on the terminal type selected, the Cyrillic data is either displayed as Cyrillic (for the Roman-8 terminal types) or mapped into the transliterated form according to ISO 9. Author, title, subject, and classification indexes are sorted by
the transliterated value, which means that in most cases the Latin and Cyrillic form of the word are sorted after each other (See Figure 6, Search for title beginning with "sovetskaâ"). Searching for "Cehov" in Latin or Cyrillic will therefore take the user to the same place in the index. Keyword searching (e.g., "W/Russian" ) finds entries with the word "Russian" in Latin and in Cyrillic and produces only one set of entries. A user familiar with Cyrillic can also enter the search term in Cyrillic - the command (A/, T/ etc.) has to be in Latin, but the actual search term can be in Cyrillic - the result is the same as for Latin characters. For extraction and loading bibliographic and authority records, the choices are to keep the ISO 6937/2 + Modern Cyrillic or map everything into ISO 6937/2. If mapping is selected, again a transliteration according to ISO 9 is done. When records are copied online and both databases have the Cyrillic option turned on, the internal character set is used regardless of the terminal type selected during log-in. Due to changes in cataloguing practices and different standards used in different countries, there are various transliterated forms plus the Cyrillic form in authority records (especially for personal and corporate authors) (See Figure 7, Changes in transliteration). When a non-Cyrillic terminal type is used for searching, even the Cyrillic form is transliterated and generally looks exactly the same as the one entered as transliterated. This is confusing to users who cannot understand why there are two entries for the same heading, and therefore think that the Cyrillic form should be marked somehow in the index. Another problem can occur in the case of names of non-Russian origin that have been written in Cyrillic phonetically and then transliterated back into Latin characters: they may file differently from the original Latin form (See Figure 8, Names of non-Russian origin). 
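The index behaviour described here (Cyrillic and Latin forms filing together because entries are sorted by their transliterated value) can be sketched as follows. This is not the VTLS implementation: the table is an illustrative subset of ISO 9, the function names are hypothetical, and ignoring diacritics when filing is an assumption made so that the order matches the "sovetskaâ" listing.

```python
import unicodedata

# Illustrative subset of the ISO 9 Cyrillic-to-Latin table (lowercase letters).
ISO9 = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "ž",
    "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "h", "ц": "c", "ч": "č", "ш": "š", "ы": "y", "э": "è", "ю": "û",
    "я": "â",
}


def transliterate(text):
    """Map Cyrillic letters to Latin; leave every other character unchanged."""
    out = []
    for ch in text:
        latin = ISO9.get(ch.lower())
        if latin is None:
            out.append(ch)                  # already Latin, or not in the subset
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


def filing_key(heading):
    """Sort key: transliterated value, diacritics ignored (an assumption)."""
    decomposed = unicodedata.normalize("NFD", transliterate(heading))
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


index = ["Soviet education", "Sovetskaja justicija",
         "Sovetskaâ arhitektura", "Советская архитектура"]
for entry in sorted(index, key=filing_key):
    print(entry)
```

Under these assumptions the Cyrillic and Latin forms of "Sovetskaâ arhitektura" receive the same key and file next to each other. Applied to Шекспир, `transliterate` gives a form beginning "Šeks..." rather than "Shakespeare", so a name of non-Russian origin files away from its original Latin form.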
In these cases we need to provide "see also" references to assist searching. The concept planned for multilingual subject access would probably not be used for the Cyrillic data because the various forms resulting from different transliterations are necessary for searching in any language.

Conclusion

Searching for answers to questions related to multilingual and multi-character set databases is an interesting experience. It has proved again that cooperation with other libraries and library system providers is the best way to find viable solutions. It has also shown us that in both cases the technical solution, while complex, is feasible, but the wider questions of data management still need solving.
Bibliography

Clavel-Merrin, Genevieve. "Multilingual Subject Access in Switzerland." In Library Automation and Networking = L'automatisation et les réseaux de bibliothèques, ed. Herman Liebaers and Marc Walckiers, 215-228. München; London: Saur, 1991.

Goossens, Paula. "Multilingual Bibliographic Access Via Subject in Europe." In Library Automation and Networking = L'automatisation et les réseaux de bibliothèques, ed. Herman Liebaers and Marc Walckiers, 202-214. München; London: Saur, 1991.

Soini, Antti. "LINNEA - Library Information Network for Finnish Academic Libraries." In European Library Automation Group, 14th Library Systems Seminar, Brussels, 7-8 May 1990, ed. Paula Goossens, 39-40. Brussels: Koninklijke Bibliotheek Albert I, 1991.
Figure 1
LANGUAGES IN SWITZERLAND German French Italian Romanisch
Figure 2

Subject indexing and searching in a multilingual environment

[Diagram: the indexer assigns headings in German; the headings are linked in four languages; the user searches using English headings and retrieves the document.]
Figure 3

LINDA 1988-93

"sovetskaâ"

1.  1  Sovetskaâ arhitektura
2.  1  Sovetskaâ arhitektura
3.  1  Sovetskaja justicija
4.  1  Sovetski entsiklopeditseski slovar
5.  1  Sovetskoe gosudarstvo i pravo
6.  1  Soviet, East European and Slavonic studies in Britain
7.  1  Soviet education
8.  1  Soviet geography
9.  1  Soviet law and government

When a non-Cyrillic terminal type is used, there seems to be no difference between entries on line 1 and line 2, but if a Cyrillic terminal type is selected, it looks like this:

1>  1  Советская архитектура
2.  1  Sovetskaâ arhitektura
3.  1  Sovetskaja justicija
4.  1  Sovetski entsiklopeditseski slovar
5.  1  Sovetskoe gosudarstvo i pravo
6.  1  Soviet, East European and Slavonic studies in Britain
7.  1  Soviet education
8.  1  Soviet geography
9.  1  Soviet law and government
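Figure 7

3. CHANGES IN TRANSLITERATION

[Table with the columns Cyrillic, ISO 9, and ISO R9 (National form). Most cells are illegible in this copy; the legible entries include the ISO 9 form "â" and the ISO R9 forms "ja" and "ju".]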
Figure 8

4. NAMES OF NON-RUSSIAN ORIGIN

Cyrillic      ISO 9         In original language

Шекспир       Sekspir       Shakespeare
Джованьоли    Dzovan'oli    Giovannoli
Олдридж       Oldridz       Aldridge
Вулф          Vulf          Woolf
MULTILINGUAL AND MULTI-CHARACTER SET DATA IN LIBRARY SYSTEMS AND NETWORKS: EXPERIENCES AND PERSPECTIVES FROM SWITZERLAND AND FINLAND

Riitta Lehtinen
Helsinki University Library
and
Genevieve Clavel-Merrin
Swiss National Library

COMMENTARY

Eeva Murtomaa
Helsinki University Library

The paper presented above describes how libraries can effectively respond to different user needs in their native languages and scripts within multilingual and multi-character environments. The two authors promote the idea of a multilingual thesaurus as a solution for meeting the information needs of different users irrespective of their language background. The conventional hierarchical and linear structure of the MARC authority format has been "broken" to provide equal access for all variant forms. The Swiss example deals with many languages within one cultural environment. However, new problems arise when creating multilingual access within a multicultural environment. This is because one language may have more linguistic expressions for one single concept than another, such as different colour scales. There are problems even in a monolingual environment; for example, homonyms and synonyms in the German language used in Austria, Germany, and Switzerland.

There are some important issues in the multilingual environment that require further investigation. One is the structure of the subject authority file and the reasonable management of relationships. How can this be realized? For some languages, such as Finnish, truncation used in keyword and Boolean searching is very important for obtaining relevant search results. The use of truncation and sorting mechanisms in multilingual environments also needs investigation.
The paper also describes a user-friendly approach to searching multi-character set data. The authors cite an interesting example of using automated transliteration of Cyrillic script and giving the user the option of the original character sets or transliterated forms. There are differences in cataloguing practices and authority files when national or international transliteration tables are used. Besides variant name forms, differences may arise among standards when switching back to the original form. In the future, Unicode may offer us solutions in a multilingual, multi-character context. Today the impact of technology allows multilingual or multi-script access to bibliographic and authority files nationally and internationally through online networks. Users now search for relevant information from an enormous mass of data available on the Internet. There is a need for description and organization of the networked electronic resources. That is why the new standards created to facilitate resource discovery on the Internet (e.g., Dublin Core) and the markup languages, such as SGML and HTML, should be investigated. Great efforts have been taken for creating national authority files based on existing standards and tools. However, another question arises: what is the value of creating and maintaining one preferred form to which all non-preferred and related forms are linked? When we talk about names, the referent is unique, but the visual form is dependent on the script. Today there is a tendency to move towards coordinated national and international services. Moreover, efforts are being made to adjust standards and technology to the new environment. In a multilingual, multi-script and multi-cultural environment, why not link existing variant forms of the languages to each other without any hierarchies? This would mean that all forms in the authority record are equally valued. 
The goal is that the user should get all records, irrespective of whether a preferred or a non-preferred form is used. This would eliminate the problems and difficulties caused by different cultural environments and cataloguing policies, as well as problems of different scripts or romanized forms. It would also facilitate the exchange of authority data and the integration of retrospective material in the authority files. The major challenge of creating this kind of authority structure is to find a common referent describing the idea and semantic content of the referent, to which all related variants should be linked. Could it be a word, a group of words, a symbol or an identification number? All this sounds very idealistic. However, modern technology gives us the opportunity for this, although not without enormous human effort. Realizing this goal in a global environment requires the pooling of all cooperative resources.
THE UNICODE™ STANDARD
AN OVERVIEW WITH EMPHASIS ON BIDIRECTIONALITY

Joan M. Aliprand
Research Libraries Group, Inc.

To answer the question What is the Unicode Standard?1, let me quote from the Unicode® Consortium's invitation to membership:2

    The Unicode Standard is a 16-bit encoding that enables worldwide distribution of applications. Encompassing the principal scripts of the world, it provides the foundation for the internationalization and localization of software. With its simple and efficient approach to special and modified characters, the Unicode Standard defines a new degree of data portability between platforms and across borders.

You may then wonder: Why is the Unicode Standard needed? Don't we do multilingual and even multi-script computing perfectly well today? One of the problems with current multi-script implementations is that the same multi-script text can be encoded in different ways. Figure 1 shows text in multiple languages and scripts: English, French, and Spanish written in Latin script; Russian written in Cyrillic script; Chinese written in simplified ideographs; and Arabic written in Arabic script. The sentence contains six languages and four scripts.

Examples of text in the six official languages of the United Nations:
English, français, español, русский, 中文, [Arabic-script text]

Figure 1: Text in Multiple Languages and Scripts
One way to encode this sentence for computer processing is by using the ISO 8859 series of character sets, developed by the International Organization for Standardization (ISO). These character sets, especially ISO 8859-1 (Latin alphabet no. 1),3 are widely used in commercial software applications. The sentence can be encoded using four different character sets: three from the ISO 8859 series (Latin alphabet no. 1, Latin/Cyrillic4 and Latin/Arabic5), plus one for the ideographs. There are several possible options to encode the ideographs, for example, GB 2312,6 a national character set standard of China. Another way to encode the sentence is to use character sets developed by libraries to meet their needs. These include character sets developed by ISO Technical Committee 46, and sets developed in the United States and specified by the Library of Congress for use in USMARC records.7 In some cases, the ISO standard and its USMARC equivalent are identical. However, this is not always true. ISO 5426,8 the Extended Latin set for bibliographic use, has many characters in common with the USMARC Latin set, but the two sets are not the same in all respects. Four of the character sets approved by the Library of Congress for USMARC can be used to encode the sentence: the Latin character set, the Basic Cyrillic character set, the East Asian Character Code,9 and the Basic Arabic character set. Note that this sentence could not be encoded using the UNIMARC character set repertoire because no East Asian or Arabic character set has been approved for UNIMARC.
Figure 2 compares alternative methods of encoding the text from Figure 1. For each piece of text, the column on the right shows the text's encoding in an ISO 8859 character set (or in GB 2312 in the case of Chinese) followed by its encoding in a USMARC character set on the next line.
Text        Character set                Encoded text (in hexadecimal notation)

English     ISO 8859-1                   45 6E 67 6C 69 73 68
            USMARC Latin                 45 6E 67 6C 69 73 68
français    ISO 8859-1                   66 72 61 6E E7 61 69 73
            USMARC Latin                 66 72 61 6E F0 63 61 69 73
español     ISO 8859-1                   65 73 70 61 F1 6F 6C
            USMARC Latin                 65 73 70 61 E4 6E 6F 6C
русский     ISO 8859-5                   70 73 71 71 6A 68 69
            USMARC Cyrillic              52 55 53 53 4B 49 4A
中文        GB 2312                      5448 4636
            USMARC EACC (ANSI Z39.64)    213034 214258
[Arabic]    ISO 8859-6                   C7 E4 E4 DA C9
            USMARC Arabic                47 64 64 5A 49

(In the printed figure, differences between encodings are shown in bold face.)

Figure 2: Encoding the Text - Comparison of Options
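Some rows of Figure 2 can be reproduced with the codecs shipped in Python's standard library. The sketch below is illustrative only: the helper names are hypothetical, the Chinese text is assumed to be 中文, and the USMARC character sets are left out because the standard library has no codecs for them. It also assumes that the GB 2312 values in the figure are row and cell numbers rather than raw bytes.

```python
def hex_bytes(text, codec):
    """Bytes of `text` in the given legacy character set, as spaced hex."""
    return text.encode(codec).hex(" ").upper()


def gb2312_row_cell(ch):
    """Row and cell numbers of one ideograph in GB 2312 (decimal digits)."""
    hi, lo = ch.encode("gb2312")      # Python's codec emits the EUC-CN form
    return f"{hi - 0xA0:02d}{lo - 0xA0:02d}"


def code_points(text):
    """The same text as a sequence of Unicode code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


samples = [("English", "iso8859_1"), ("français", "iso8859_1"),
           ("español", "iso8859_1"), ("中文", "gb2312")]
for text, codec in samples:
    print(f"{text:10} {codec:10} {hex_bytes(text, codec)}")
    print(f"{'':10} {'Unicode':10} {code_points(text)}")

print(gb2312_row_cell("中"), gb2312_row_cell("文"))
```

Each legacy codec yields its own byte values and needs to be named before the bytes can be interpreted, whereas the code point listing uses one numbering for every script, which is the point the comparison in Figure 2 is making.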
The only encodings in common in the two examples are those for unaccented letters in the English, French, and Spanish text. This is because the first half of each ISO 8859 standard corresponds to ISO 646,10 the International Reference Version of ASCII, and the first half of the USMARC Latin character set is ASCII. The abbreviation ASCII stands for American National Standard Code for Information Interchange.11 All other encodings are different, even though the text that is encoded is the same. It costs time and effort to convert between different encodings. There is real benefit in having one worldwide character set containing all major scripts for all applications. Instead of multiple versions of software for various world markets, there could be one universal version, which could be easily adapted for a particular cultural need. The Unicode Standard is this character set for the world.

Overview of the Unicode Standard

A common question is Does the Unicode Standard include my script? The Unicode Standard has room for over 65,000 characters, and includes the scripts listed in Figure 3. The Unicode Standard is consistent with International Standard ISO/IEC 10646.12 In 1995, Tibetan script and a complete set of Korean hangul were accepted for addition to the Unicode Standard by the Unicode Technical Committee. They are included in The Unicode Standard, Version 2.0. In 1996, the Unicode Technical Committee accepted the Ethiopic script and has proposed its addition to the standard.
[Diagram of the Basic Multilingual Plane. Left: General Scripts, Symbols, CJK Auxiliary, East Asian Ideographs, Hangul, Private Use Area, Compatibility, Specials. Right (General Scripts in detail): Latin, IPA/Modifiers, Diacritics, Greek, Cyrillic, Armenian, Hebrew, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Korean Jamos, and extended Latin and Greek.]

Figure 3: General Layout of the Unicode Code Space

The assignment of all scripts within the code space of the Unicode Standard, Version 2.0, is shown on the left. Alphabetic scripts are shown in more detail on the right.

The developers of the Unicode Standard wanted it to be universal, efficient, uniform, and unambiguous. They also established a number of principles, including:

• the distinction between character and glyph;
• unification of characters across languages;
• dynamic composition of accented forms from component characters; and
• logical order for characters internally.
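Two of the principles listed above, dynamic composition and logical order, can be illustrated with Python's standard `unicodedata` module. This is a sketch for orientation only: the normalization forms it uses were defined in later versions of the Unicode Standard than Version 2.0, and the variable and function names are hypothetical.

```python
import unicodedata

composed = "\u00E7"   # LATIN SMALL LETTER C WITH CEDILLA as a single character
dynamic = "c\u0327"   # letter c followed by COMBINING CEDILLA


def same_text(a, b):
    """Treat precomposed and dynamically composed spellings as equal."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def stored_order(text):
    """Character names and bidirectional classes in stored (logical) order."""
    return [(unicodedata.name(c), unicodedata.bidirectional(c)) for c in text]


print(len(composed), len(dynamic), same_text(composed, dynamic))

# A Latin letter followed by two Hebrew letters, stored in reading order.
# The Hebrew letters carry the right-to-left class "R", so display software
# reverses them on screen while the stored order stays unchanged.
for name, bidi_class in stored_order("a\u05D0\u05D1"):
    print(bidi_class, name)
```

The first line of output shows that one element of text may be held as one character or as two, and the loop shows that direction is a property attached to each character and is not expressed by the order in which the characters are stored.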
The Distinction Between Character and Glyph

The distinction between character and glyph is very well illustrated by Arabic script. Arabic letters assume positional forms. They have different shapes depending on where they are in a word, and what the neighboring letters are. Figure 4 shows the various positional shapes that individual Arabic letters can have.

[Table of Arabic letters in their Initial, Medial, Final, and Isolated forms]

Figure 4: Character versus Glyph as shown by Arabic Script
The positional shapes that we see on paper or a computer screen are called "glyphs" in Unicode terminology. The underlying letters themselves are a concept. For example, the Arabic letter jeem can be rendered by any of four possible glyphs (depending on where the letter jeem is positioned in the text). The first row in Figure 4 shows the positional forms of the letter jeem. Some Latin script characters also illustrate the difference between the character and its rendering as a particular glyph. Figure 5 shows different ways of rendering the Latin script characters lowercase g and lowercase a. Although the images are different, we recognize that lowercase g and lowercase a are unique conceptual constructs. Other examples of variant glyphs include a few numeric digits, such as four and seven, and the dollar sign. For example, the number seven, whether it is written with or without a horizontal stroke across the stem, represents the same numeric quantity.

[Variant renderings of Latin letter lowercase g and Latin letter lowercase a]

Figure 5: Character versus Glyph as shown by certain Latin Script Letters
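The jeem example can be checked against the character data that ships with Python. The Unicode Standard encodes the letter once, at U+062C, and also contains compatibility "presentation form" characters for its four positional shapes. The sketch below assumes that compatibility normalization (NFKC, a feature of later versions of the standard) is an acceptable stand-in for the mapping from glyph to character; the names `JEEM_FORMS` and `underlying_character` are hypothetical.

```python
import unicodedata

JEEM = "\u062C"   # the character: ARABIC LETTER JEEM

# Compatibility code points for the four positional shapes of jeem.
JEEM_FORMS = {
    "isolated": "\uFE9D",
    "final": "\uFE9E",
    "initial": "\uFE9F",
    "medial": "\uFEA0",
}


def underlying_character(presentation_form):
    """Fold a positional presentation form back to the letter it depicts."""
    return unicodedata.normalize("NFKC", presentation_form)


for position, form in JEEM_FORMS.items():
    folded = underlying_character(form)
    print(f"{position:9} U+{ord(form):04X} -> U+{ord(folded):04X}",
          unicodedata.name(folded))
```

All four shapes fold to the same character, which is why text is normally stored with U+062C alone and the choice of shape is left to the rendering software.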
A Unicode character encodes the underlying intellectual concept (represented for computer processing by a pattern of bits). When one or more Unicode characters representing an element of text must be rendered for reading, we use glyphs. The encoding of a single element of text by multiple Unicode characters is discussed below in the section Dynamic Composition from Component Characters. Paired, or mirrored, punctuation, such as parentheses, also embodies the character/glyph distinction. The Unicode Standard does not encode a left parenthesis and a right parenthesis (although these names are used for consistency within international standards, specifically, ISO/IEC 10646). The parentheses are conceptual characters with the functions of opening parenthesis and closing parenthesis respectively. The shape (or glyph) that each parenthesis takes at display depends on its environment when it is rendered, as shown in Figure 6. English, written in Latin script and read from left to right, is contrasted with Yiddish, written in Hebraic script and read from right to left.

[Figure: the word "English" in parentheses, with the left-hand parenthesis labelled "opening" and the right-hand one "closing"; a Yiddish word in parentheses, with the left-hand parenthesis labelled "closing" and the right-hand one "opening".]

Figure 6: Mirrored Punctuation

Unification of Characters Across Languages

Arabic script in the Unicode Standard (shown in Figure 7) also illustrates the principle of unification across languages. Many languages use Arabic script. Languages in current use include Arabic itself, Farsi (also known as Persian), Urdu, and Pashto. Ottoman Turkish is the best known historic language written in Arabic script. The alphabets of many of these languages contain the same letters (although they may not have the same name or pronunciation). Is there any need to encode these letters repeatedly because they occur in different alphabets, or should they be encoded only once, with unification across languages? Unification across languages applies to most of the scripts in the Unicode Standard, including the "unified Han" set of ideographs for Chinese, Japanese, Korean, and historical Vietnamese.
[Code chart: Arabic script in the Unicode Standard; the chart cells are not legible in this copy.]